pangeo-forge / roadmap

Pangeo Forge public roadmap
Creative Commons Attribution 4.0 International

Brainstorming about overwriting / appending / etc #29

Open rabernat opened 3 years ago

rabernat commented 3 years ago

As noted in https://github.com/pangeo-forge/staged-recipes/issues/67, we need to resolve some details about how bakeries should handle the scenario where data already exists in a target location.

An important principle is that the recipe object, as provided by the recipe box (GitHub repo), should not explicitly know about its target. The recipe describes the source data and how to transform it into an ARCO (analysis-ready, cloud-optimized) format. Therefore, the recipe should not care whether the target data already exists. That's the bakery's problem.

I made this flow chart to start thinking through the logic.

[Figure: Pangeo Forge Target Update Flow (flow chart)]

Edit link: https://lucid.app/lucidchart/invitations/accept/inv_2d95a404-c359-457e-b472-ae10f268e662?viewport_loc=-153%2C372%2C1889%2C1012%2C0_0

Consequences

If we accept this flow, it raises several important points that we will have to address in the implementation.

cc @cisaacstern @sharkinsspatial

cisaacstern commented 3 years ago

Noting that in today's (2021-08-16) Pangeo Forge Coordination Meeting, we decided to start with a default overwrite/rewrite policy. This may require bakery operators to manually delete existing data, or could perhaps entail targets being versioned, to allow for rewriting rather than overwriting. Further details in the meeting minutes.
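To make the two options concrete, here is a minimal sketch of how a bakery might choose where to write when data already exists. Everything in it is an assumption for illustration, not part of Pangeo Forge: the `Policy` and `WritePlan` names, the `plan_write` helper, and the `{base}/v{n}` path layout are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Sequence


class Policy(Enum):
    # Hypothetical policy names, not a Pangeo Forge API.
    OVERWRITE = "overwrite"  # replace the latest version in place
    VERSIONED = "versioned"  # write a new version next to the old ones


@dataclass(frozen=True)
class WritePlan:
    path: str           # where the bakery should write
    delete_first: bool  # whether existing data at `path` must be deleted first


def plan_write(base: str, existing_versions: Sequence[int], policy: Policy) -> WritePlan:
    """Decide the write location, assuming targets live at `{base}/v{n}`."""
    if not existing_versions:
        # Nothing exists yet, so both policies behave the same way.
        return WritePlan(path=f"{base}/v1", delete_first=False)
    latest = max(existing_versions)
    if policy is Policy.VERSIONED:
        # Rewrite: leave old data untouched and write the next version.
        return WritePlan(path=f"{base}/v{latest + 1}", delete_first=False)
    # Overwrite: reuse the latest location, which must be cleared first.
    return WritePlan(path=f"{base}/v{latest}", delete_first=True)
```

Under the overwrite policy, the `delete_first` flag is where an operator's manual deletion (or an automated equivalent) would come in; under the versioned policy nothing is ever deleted.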

simonff commented 3 years ago

Drive-by comment based on the Earth Engine experience: many datasets have a formal notion of data generations. E.g., early / provisional / permanent climate data that get recomputed as more precise inputs arrive, or the RT tier vs. the T1/T2 tiers for Landsat. We have an internal system for annotating such datasets to support overwriting only when the new data belong to a newer generation.
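The generation rule can be sketched roughly as follows. This is only an illustration of the idea, not the Earth Engine system mentioned above: the `Generation` enum, its ordering, and the `should_overwrite` helper are hypothetical names chosen for the example.

```python
from enum import IntEnum
from typing import Optional


class Generation(IntEnum):
    # Hypothetical ordering: a larger value means a more final product.
    EARLY = 1
    PROVISIONAL = 2
    PERMANENT = 3


def should_overwrite(existing: Optional[Generation], incoming: Generation) -> bool:
    """Overwrite only when the incoming data belong to a newer generation."""
    if existing is None:
        # No data at the target yet, so writing is always allowed.
        return True
    return incoming > existing
```

With this rule, a provisional product replaces an early one, but a re-run of the same generation (or an older one) leaves the target alone.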

Another consideration is file mtime, if available. If you put the max mtime of input files into the metadata of the ingestion result, you can make an intelligent decision whether or not to overwrite during the next run.
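A minimal sketch of the mtime check, assuming local input files and a plain dictionary standing in for the target's metadata. The `max_input_mtime` metadata key and both helper names are hypothetical; for remote inputs, the object store's last-modified timestamp would have to play the role of `os.path.getmtime`.

```python
import os
from typing import Iterable, Mapping, Optional

# Hypothetical metadata key under which the previous run stored its value.
MTIME_KEY = "max_input_mtime"


def max_input_mtime(paths: Iterable[str]) -> float:
    """Return the newest modification time (seconds since epoch) among the inputs."""
    return max(os.path.getmtime(p) for p in paths)


def needs_rerun(target_metadata: Optional[Mapping[str, float]], current_max_mtime: float) -> bool:
    """Decide whether to overwrite, given the metadata left by the previous run."""
    if target_metadata is None or MTIME_KEY not in target_metadata:
        # No target, or no record of the previous inputs: rebuild to be safe.
        return True
    # Rebuild only if some input changed after the previous ingestion.
    return current_max_mtime > target_metadata[MTIME_KEY]
```

After a successful run, the bakery would store `max_input_mtime(inputs)` under `MTIME_KEY` in the result's metadata so that the next run can make the comparison.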