pangeo-forge / user-stories

User stories to guide PF development

Append-only production runs #5

Open cisaacstern opened 2 years ago

cisaacstern commented 2 years ago

User Profile

As a recipe maintainer

User Action

I want to re-run recipes in my feedstock (either manually or on a schedule) to append newly released data to my dataset

User Goal

So that I can keep the dataset built by my feedstock up-to-date with the latest releases from the data provider without needing to re-run the entire recipe

Acceptance Criteria

The ability to trigger append-only production runs (manually or on a schedule) from a feedstock. This might be inferred from the recipe itself, or perhaps specified by a new property in the meta.yaml
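One possible shape for such a property, purely as a sketch: nothing under `scheduled_runs` below exists in meta.yaml today; the key names are hypothetical illustrations of what a feedstock-level append configuration could look like.

```yaml
# Hypothetical meta.yaml excerpt. The `scheduled_runs` block is NOT an
# existing pangeo-forge field; it sketches one way a feedstock could
# request append-only re-runs on a schedule.
recipes:
  - id: my-dataset
    object: "recipe:recipe"
scheduled_runs:
  mode: append        # only process inputs not already in the target
  cron: "0 6 * * *"   # re-run daily at 06:00 UTC
```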

Linked Issues

cisaacstern commented 2 years ago

I believe there are only two remaining prerequisites before the actual work of this feature can begin in pangeo-forge-recipes:

  1. Merge https://github.com/pangeo-forge/pangeo-forge-recipes/pull/359 (I believe this is ready to go, but I haven't looked at it in a week or so, so one last review is probably worthwhile)
  2. We'll probably want to revisit the way the recipe hash is generated. As it stands, the file pattern hash is included in the recipe hash calculation. On further thought, I realize that if we do not isolate the recipe and pattern hashes, we have no way to confirm that a recipe, when used to append to an existing dataset, shares the same non-file-pattern attributes (e.g. target chunks) as the recipe that created that dataset. This is less straightforward than simply separating the two hashes, because the recipe classes currently contain certain attributes that relate purely to execution concerns, which could vary without affecting the "append compatibility" of two recipe instances.

Once these two items are addressed, work can begin on an appending feature in pangeo-forge-recipes.
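The hash isolation described in item 2 could be sketched roughly as follows. All class and attribute names here are hypothetical stand-ins, not pangeo-forge-recipes' actual hashing API; the point is only that the pattern hash and execution-only settings are excluded from the append-compatibility hash.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Any


def _sha256(obj: Any) -> str:
    """Deterministic hash of any JSON-serializable object."""
    return hashlib.sha256(json.dumps(obj, sort_keys=True).encode()).hexdigest()


@dataclass
class Recipe:
    # Hypothetical simplified recipe: `pattern_urls` stands in for the
    # FilePattern, `target_chunks` affects append compatibility, and
    # `num_workers` is an execution-only concern.
    pattern_urls: list
    target_chunks: dict
    num_workers: int = 4

    def pattern_hash(self) -> str:
        # Hashed separately, so appending can change the pattern freely.
        return _sha256(self.pattern_urls)

    def recipe_hash(self) -> str:
        # Only attributes that must match for two recipes to be
        # append-compatible; pattern and execution settings are excluded.
        return _sha256({"target_chunks": self.target_chunks})


# Two recipes with different patterns and worker counts but identical
# chunking are append-compatible: recipe hashes match, pattern hashes differ.
a = Recipe(["2020-01.nc"], {"time": 100}, num_workers=4)
b = Recipe(["2020-01.nc", "2020-02.nc"], {"time": 100}, num_workers=16)
assert a.recipe_hash() == b.recipe_hash()
assert a.pattern_hash() != b.pattern_hash()
```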

sharkinsspatial commented 1 year ago

@rabernat Per our discussion in the call yesterday, I'm including some more details on our AWS-ASDI specific use cases here rather than in a new ticket on pangeo-forge-recipes. For many of the reference indexes we are generating for data in the AWS PDS buckets (https://github.com/pangeo-forge/staged-recipes/issues/208), we'll need to periodically update the index as new data becomes available. In almost all of our cases, this will involve expanding the index's time dimension.

I think our use case is a bit atypical in that most of the buckets where these datasets live have event notifications configured for new keys, which allow us to monitor data being added. Originally I had envisioned us queuing these event notifications and periodically sending a block of new files to pangeo-forge for appending to the target archive. This will be great for our use case, but I don't think it generalizes as well for most users. Instead I think we'll likely need a process that

This assumes that the recipe's concat dim is temporal, and we'd likely need to restrict the append-only cron configuration to recipes where this is true.
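The "find new inputs along a temporal concat dim" step could be sketched as below. The function is hypothetical, not part of pangeo-forge-recipes; it assumes daily granularity and shows one conservative policy: only append a contiguous run of dates, so a gap in the provider's releases halts the append instead of silently skipping missing inputs.

```python
from datetime import date, timedelta


def dates_to_append(last_in_target: date, available: list[date]) -> list[date]:
    """Given the last date already present in the target dataset and the
    dates currently available from the provider, return the contiguous
    block of new dates to append along the temporal concat dim."""
    new = sorted(d for d in available if d > last_in_target)
    block: list[date] = []
    expected = last_in_target + timedelta(days=1)
    for d in new:
        if d != expected:
            # Gap in the provider's releases: stop here rather than
            # append a dataset with missing inputs in the middle.
            break
        block.append(d)
        expected += timedelta(days=1)
    return block


# Jan 5 is missing from the provider, so only Jan 3-4 are appended and
# Jan 6 is held back until the gap is filled.
avail = [date(2023, 1, d) for d in (1, 2, 3, 4, 6)]
print(dates_to_append(date(2023, 1, 2), avail))
```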

@cisaacstern has linked most of the related issues above, but I'll include the more recent Beam-specific issue here for tracking as well: https://github.com/pangeo-forge/pangeo-forge-recipes/issues/447