pangeo-forge / staged-recipes

A place to submit pangeo-forge recipes before they become fully fledged pangeo-forge feedstocks
https://pangeo-forge.readthedocs.io/en/latest/

Finalize GitHub workflow logic for running recipes in feedstocks #67


sharkinsspatial commented 3 years ago

The proposed workflow logic for running recipes in feedstock repos is described in https://github.com/pangeo-forge/roadmap/blob/master/doc/adr/0001-github-workflows.md, but we should review it for a consensus decision and address some open questions around versioning and updating existing Zarr archives.

The initial naive GitHub Action will register and run the full recipe (replacing an existing Zarr store if it exists) for each recipe in the repository when a new release tag is created. While functional, this approach has several issues that should be addressed:

  1. The action should really work in a similar fashion to the existing staged-recipes PR testing workflow, which only proceeds when the PR includes changes to a meta.yaml file. I'm unsure how diff should be used with a release tag to determine which meta.yaml files were updated in a release (see the sketch after this list). There is also the possibility that the underlying recipe code can change with no corresponding change to the meta.yaml, so I would suggest adding a root-level version property to the meta.yaml spec to facilitate flagging with diff.

  2. The current approach will naively replace the entire existing Zarr archive when the recipe is re-run. (Additionally, I would like clarification on how pangeo-forge-recipes will handle writing to object storage when the target archive already exists.) There has been previous discussion about incrementally updating existing archives, but this poses a few questions for workflows: how should underlying changes to source data (notably from reprocessing efforts) be handled, and how should invalidation of intermediate cache files be handled when source data changes?

  3. Should our Zarr target structure patterns include some sort of versioning scheme tied to our release tagging? This would simplify many aspects of the workflow process but may be confusing for end users when they encounter multiple versions of the archive.

  4. Another open question around feedstock workflows concerns how we should manage recipes for different chunking strategies for the same data. I believe it makes sense to maintain a single feedstock per source dataset and to treat different chunking strategies as unique recipes within the feedstock. With this approach, users could submit PRs for new chunking strategies, and using the diff method discussed above, we would run only the recipes that changed in the release.
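
To make the diff idea in (1) concrete, here is a minimal sketch, assuming the workflow compares the new release tag against the previous one with plain git. The tag names and helper below are illustrative, not a settled convention:

```python
import subprocess

def changed_meta_files(prev_tag: str, new_tag: str) -> list[str]:
    """Return the meta.yaml paths that changed between two release tags."""
    out = subprocess.run(
        ["git", "diff", "--name-only", prev_tag, new_tag],
        capture_output=True, text=True, check=True,
    ).stdout
    return [path for path in out.splitlines() if path.endswith("meta.yaml")]

# e.g. changed_meta_files("v0.1", "v0.2")
# -> ["recipes/noaa_oisst/meta.yaml"]   (hypothetical output)
```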

As discussed in the coordination meeting today, I'll push forward with the naive approach initially so that we can close https://github.com/pangeo-forge/staged-recipes/issues/58. Then @rabernat can schedule a sprint where we can address these open questions and update the workflows.

rabernat commented 3 years ago
> 1. There is also the possibility that the underlying recipe code can change with no corresponding change to the meta.yaml, so I would suggest adding a root-level version property to the meta.yaml spec to facilitate flagging with diff.

Conda-forge's recipe.yaml (example) distinguishes between "version" (the actual software version, same as on PyPI) and "build". To rebuild a recipe, you increment "build" but do not touch "version". We could consider doing the same.
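
As a rough illustration, assuming hypothetical root-level version and build keys in meta.yaml (neither is in the spec yet), the workflow could treat a bump to either as a rebuild trigger:

```python
import yaml

# Both documents are invented examples; "version" and "build" are
# hypothetical keys, not part of the current meta.yaml spec.
old_meta = yaml.safe_load("""
title: NOAA OISST
version: "2.1"   # version of the source dataset
build: 0         # bump to force a rebuild without a data change
""")
new_meta = yaml.safe_load("""
title: NOAA OISST
version: "2.1"
build: 1
""")

needs_rerun = (old_meta["version"], old_meta["build"]) != (new_meta["version"], new_meta["build"])
print(needs_rerun)  # True: build was incremented, so the recipe should re-run
```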

> 2. The current approach will naively replace the entire existing Zarr archive when the recipe is re-run. (Additionally, I would like clarification on how pangeo-forge-recipes will handle writing to object storage when the target archive already exists.)

For the XarrayZarr recipe, the original dataset will not be deleted. It will follow this code path, trying to open the target dataset. Then it will overwrite all the data variables, but not change the metadata. This is probably not the ideal behavior in all cases. It was done this way to facilitate "append" workflows.
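
This is not the actual code path, just a minimal xarray sketch of the two behaviors in play, rewriting variable data in an existing store without touching its metadata versus appending along a dimension. The store path and variable names are made up:

```python
import numpy as np
import xarray as xr

store = "target.zarr"  # hypothetical target

# Initial run: creates the store, including all metadata.
ds = xr.Dataset(
    {"sst": (("time", "lon"), np.zeros((4, 3)))},
    coords={"time": np.arange(4)},
)
ds.to_zarr(store, mode="w")

# Re-run analogue: rewrite the data for an existing region,
# leaving the store-level metadata untouched.
update = xr.Dataset({"sst": (("time", "lon"), np.ones((4, 3)))})
update.to_zarr(store, mode="a", region={"time": slice(0, 4), "lon": slice(0, 3)})

# Append analogue: extend the store along the time dimension.
extra = xr.Dataset(
    {"sst": (("time", "lon"), np.ones((2, 3)))},
    coords={"time": np.arange(4, 6)},
)
extra.to_zarr(store, append_dim="time")
```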

> 3. Should our Zarr target structure patterns include some sort of versioning scheme tied to our release tagging? This would simplify many aspects of the workflow process but may be confusing for end users when they encounter multiple versions of the archive.

I think this would make a lot of sense. It would require an update to ADR 3. Keep in mind that there is a layer of abstraction between the raw object store and the user: the catalog.
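
Purely as a hypothetical shape of the idea (the real layout is whatever ADR 3 lands on, and none of these names are settled):

```python
# Illustrative only: a target path that embeds the feedstock release tag.
bucket = "s3://pangeo-forge"
feedstock = "noaa-oisst-feedstock"
recipe = "noaa_oisst"
release_tag = "1.2"

target = f"{bucket}/{feedstock}/{recipe}/v{release_tag}.zarr"
print(target)  # s3://pangeo-forge/noaa-oisst-feedstock/noaa_oisst/v1.2.zarr
```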

> 4. Another open question around feedstock workflows concerns how we should manage recipes for different chunking strategies for the same data.

For now, I think we can handle this simply as distinct recipes. The recipe names would include a reference to the chunking scheme, e.g. noaa_oisst_space_chunks vs noaa_oisst_time_chunks. This is not super elegant, but it would work and allow us to move forward. We can revisit this if it becomes a very common pattern.
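
A rough sketch of how that could look in a feedstock's recipe module, assuming the XarrayZarrRecipe API with pattern_from_file_sequence and target_chunks (the URLs and names are invented):

```python
from pangeo_forge_recipes.patterns import pattern_from_file_sequence
from pangeo_forge_recipes.recipes import XarrayZarrRecipe

# Hypothetical source files, one per day.
urls = [f"https://example.com/oisst/sst.day.{d:03d}.nc" for d in range(365)]
pattern = pattern_from_file_sequence(urls, "time", nitems_per_file=1)

# Each chunking strategy is its own recipe, so a diff-based workflow
# can re-run one without touching the other.
recipes = {
    "noaa_oisst_time_chunks": XarrayZarrRecipe(pattern, target_chunks={"time": 100}),
    "noaa_oisst_space_chunks": XarrayZarrRecipe(pattern, target_chunks={"lat": 180, "lon": 360}),
}
```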

I am eager to organize this sprint, but I'm not sure when would be the best time. My week is once again a bit scattered, without a lot of consistent work time. I could possibly do Wednesday (tomorrow) 6-7 pm CET. Otherwise next week would probably be better.

cisaacstern commented 3 years ago

> Should our Zarr target structure patterns include some sort of versioning scheme tied to our release tagging? This would simplify many aspects of the workflow process but may be confusing for end users when they encounter multiple versions of the archive.

> I think this would make a lot of sense. It would require an update to ADR 3. Keep in mind that there is a layer of abstraction between the raw object store and the user: the catalog.

šŸ‘ and seems like this would also require an update to ADR 3:

> The recipe names would include a reference to the chunking scheme, e.g. noaa_oisst_space_chunks vs noaa_oisst_time_chunks

Seems like if we're ever going to want it there, then it's easiest to always include chunking in the target path name.

> I could possibly do Wednesday (tomorrow) 6-7 pm CET. Otherwise next week would probably be better.

Available Wednesday (tomorrow) 6-7 pm CET, and also flexible next week.

rabernat commented 3 years ago

> then it's easiest to always include chunking in the target path name.

Not so sure about this. First of all, the chunking information is trivially accessible from the zarr metadata, so it is easy to discover when building the catalog. Second, I'm not sure how you would "encode" chunking into the path name. But we could try to work this out in a whiteboard session.
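
As a minimal sketch against a hypothetical store, a catalog builder could read the chunking straight from the Zarr metadata:

```python
import zarr

# Chunking is recorded in each array's own metadata, so there is no
# need to encode it in the path.
group = zarr.open_group("target.zarr", mode="r")
for name, arr in group.arrays():
    print(name, "shape:", arr.shape, "chunks:", arr.chunks)
```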

rabernat commented 3 years ago

I am not sure if this hack session is happening, but I'll join https://whereby.com/pangeo now.