pangeo-forge / pangeo-forge-gcs-bakery

A repo for building out a pangeo forge bakery in Google Cloud Platform
1 stars 0 forks source link

Can `KubernetesRun.image` and `DaskExecutor...image` differ from Prefect Agent image? #40

Open cisaacstern opened 2 years ago

cisaacstern commented 2 years ago

Dynamically selecting an versioned Docker image for a given recipe execution seems quite important. We currently have multiple released versions of pangeo-forge-recipes, and even if current recipe developers only use the latest version, that doesn't solve the problem of released feedstocks having been built with specific older versions. We always need to be able re-run all released feedstocks.

As a concrete example, our released NOAA OISST feedstock was released with pangeo-forge-recipes==0.5.0, which is already many versions behind the bakery image (pinned to pangeo-forge-recipes==0.6.1) that was used to deploy our current bakery:

BAKERY_IMAGE="pangeo/pangeo-forge-bakery-images:pangeonotebook-2021.07.17_prefect-0.14.22_pangeoforgerecipes-0.6.1"

My question for @tracetechnical and @sharkinsspatial is, should we expect that, if at the time of registration we change the value of image in the KubernetesRun and DaskExecutor...

https://github.com/pangeo-forge/pangeo-forge-gcs-bakery/blob/2d164b4dacb0353d1df11ebf996493e5108bc7bf/test/recipes/oisst_recipe.py#L69-L71

https://github.com/pangeo-forge/pangeo-forge-gcs-bakery/blob/2d164b4dacb0353d1df11ebf996493e5108bc7bf/test/recipes/oisst_recipe.py#L76-L80

...that this will create workers which can execute recipes with a different environment from the Prefect Agent?

As far as I can tell, the only place that the BAKERY_IMAGE used to deploy the bakery is actually used is in the Prefect Agent deployment spec

https://github.com/pangeo-forge/pangeo-forge-gcs-bakery/blob/2d164b4dacb0353d1df11ebf996493e5108bc7bf/kubernetes/prefect-agent.deployment.yaml#L50

...so dynamically setting the image for the KubernetesRun and DaskExecutor should be possible? I've been experimenting with this for ... a few hours now 🙃 ... and not quite able to get it to work yet, so just wanted to check that I wasn't overlooking something obvious.

cc @rabernat @jhamman

sharkinsspatial commented 2 years ago

@cisaacstern I have been searching for the issue which details this but can't seem to locate it. In our discussions with the Prefect team there are 2 versioning compatibility issues to account for. The first is the Python version used for flow serialization, this must match the Python version used on the container where the Prefect agent is running for flow deserialization to function properly. The second is the Prefect Agent version. According to the Prefect support team this should be backward compatible so the Prefect Agent will support flows created with any =< version.

I'd like to find a canonical link to Prefect documentation or an issue discussion which confirms this so we can reference it here.

Dynamically setting the image for KubernetesRun and the DaskExecutor is what happens in pangeo-forge-prefect currently for AKS bakeries so I can confirm this should be working, what is the error you are seeing in your environment. See here and here for examples of this dynamic registration.

sharkinsspatial commented 2 years ago

@cisaacstern For reference this is a previous comment where I discussed some of our image versioning concerns https://github.com/pangeo-forge/roadmap/issues/18#issuecomment-840901574

cisaacstern commented 2 years ago

Thanks for weighing in, @sharkinsspatial. I now have a version of this idea working in https://github.com/pangeo-forge/registrar/pull/12.

what is the error you are seeing in your environment

I was seeing an version incompatibility problem in the worker container (the specifics might not matter, but it was an s3fs + aiobotocore mismatch). Upgrading fsspec/s3fs/gcsfs in the 0.5.0 worker image seems to have solved this; xref https://github.com/pangeo-forge/pangeo-forge-bakery-images/pull/28.

I am now unsure if this was actually an image incompatibility problem, or rather that the prior release of the 0.5.0 image was itself broken, and required an fsspec/s3fs/gcsfs upgrade just to work at all. These two scenarios are hard to disambiguate now that I have the following three images

graph LR;
    registrar -->  agent --> worker;

all using fsspec/s3fs/gcsfs == 2011.11.0.

What I'm not clear on, however, is what would happen if one or multiple of the above images had different fsspec/s3fs/gcsfs versions than the others. We will encounter this scenario quite soon, though, because recent releases of pangeo-forge-recipes use later versions of fsspec et al than 2011.11.0.