Closed cisaacstern closed 2 months ago
Thanks for opening this discussion @cisaacstern. This issue is becoming more important as we add more and more datasets to the library.
> I'm pretty sure we don't want to automatically delete paths without human approval
I think this is true for a certain time after running, but I would be fine if data gets forcefully removed after some set period. If people drop the development and want to pick it back up later, it's fine IMO to rerun a recipe.
A similar point actually stands for successful reruns of a given dataset: do we want to keep those and retain a naive history? Or do we want to delete all previous runs after a while?
> I think this is true for a certain time after running, but I would be fine if data gets forcefully removed after some set period. If people drop the development and want to pick it back up later, it's fine IMO to rerun a recipe.
Agreed that after a time interval elapses, removing development datasets is okay.
But how do we distinguish between development and "final" datasets? It seems paths need to be cross-referenced against a catalog of some sort, because all production build "attempts" land in the same bucket. So the ones we want to delete would be the ones that fulfill at least two criteria:
And I think there are at least two catalogs something could be in at this point:
Correct?
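To make the cross-referencing idea concrete, here is a minimal sketch combining the two criteria: a path is a deletion candidate only if it is absent from the catalog *and* older than some retention window. The names (`deletion_candidates`, `TTL_DAYS`), the listing shape, and the example paths are illustrative assumptions, not the repo's actual API.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retention window for uncatalogued build attempts.
TTL_DAYS = 30

def deletion_candidates(
    listing: dict[str, datetime],   # path -> last-modified time of its newest object
    catalog_urls: set[str],         # URLs recorded in catalog.yaml (assumed shape)
    now: datetime,
    ttl_days: int = TTL_DAYS,
) -> set[str]:
    """Paths in the bucket that are both uncatalogued and older than the TTL."""
    # Normalize trailing slashes so 'gs://b/x' and 'gs://b/x/' compare equal.
    catalogued = {u.rstrip("/") for u in catalog_urls}
    cutoff = now - timedelta(days=ttl_days)
    return {
        p.rstrip("/")
        for p, mtime in listing.items()
        if p.rstrip("/") not in catalogued and mtime < cutoff
    }

now = datetime(2024, 6, 1, tzinfo=timezone.utc)
listing = {
    "gs://leap-persistent-ro/climsim/attempt-1/": datetime(2024, 3, 1, tzinfo=timezone.utc),
    "gs://leap-persistent-ro/climsim/attempt-2/": datetime(2024, 5, 30, tzinfo=timezone.utc),
    "gs://leap-persistent-ro/climsim/final/": datetime(2024, 3, 1, tzinfo=timezone.utc),
}
catalog = {"gs://leap-persistent-ro/climsim/final"}
print(sorted(deletion_candidates(listing, catalog, now)))
# → ['gs://leap-persistent-ro/climsim/attempt-1']
```

The recent uncatalogued attempt survives (still inside the TTL), and the catalogued final dataset is never touched regardless of age.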
That seems like the right way to go about it.
Alternatively, we could consider a 'moving' stage? E.g. a stage that takes a store and just moves it to another bucket? Then the whole output could be dumped into scratch and we would not have to worry about any of this?
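A local-filesystem sketch of such a 'moving' stage, assuming the semantics suggested above: the store is written to a scratch area, then promoted to the final location (overwriting any previous run) and removed from scratch. The `promote` name and the stand-in paths are hypothetical; real bucket-to-bucket moves would go through a GCS client rather than `shutil`.

```python
import shutil
import tempfile
from pathlib import Path

def promote(scratch_store: Path, final_url: Path) -> None:
    """Overwrite `final_url` with the contents of `scratch_store`, then drop scratch."""
    if final_url.exists():
        shutil.rmtree(final_url)              # overwrite the previous run
    shutil.copytree(scratch_store, final_url)  # creates parent dirs as needed
    shutil.rmtree(scratch_store)              # scratch never accumulates state

# Stand-ins for leap-scratch and leap-persistent-ro bucket paths.
root = Path(tempfile.mkdtemp())
scratch = root / "leap-scratch" / "store.zarr"
scratch.mkdir(parents=True)
(scratch / ".zattrs").write_text("{}")
final = root / "leap-persistent-ro" / "store.zarr"

promote(scratch, final)
print(final.exists(), scratch.exists())  # True False
```

With this shape, failed jobs only ever leave debris in scratch, which can be cleaned aggressively, while the final bucket holds exactly one copy per dataset.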
The current design avoids this by building/writing everything in leap-scratch until the last copy (which overwrites each time to the 'final' url defined in catalog.yaml).
We output datasets to leap-persistent-ro so that users outside LEAP can access them: https://github.com/leap-stc/data-management/blob/a87c3216d9e84afe703a2da90fc97152ecd8bd38/.github/workflows/deploy.yaml#L77
This is a good idea.
But when a Dataflow job fails, we end up with an incomplete dataset in that bucket that we don't need to keep forever. I've run the climsim recipes a number of times, so for example:
(Most of these are pruned tests, so it doesn't represent a lot of data.)
I'm pretty sure we don't want to automatically delete paths without human approval, because even if they are failed jobs, we may want to review them as part of the debugging process.
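One way to square that with automated cleanup is a dry-run step that only *proposes* deletions for human sign-off (e.g. posted as an issue comment) rather than executing them. A minimal sketch, where `review_report` and the path-to-reason mapping are hypothetical:

```python
def review_report(candidates: dict[str, str]) -> str:
    """Format deletion candidates as a checklist for human approval; deletes nothing."""
    if not candidates:
        return "No deletion candidates."
    lines = ["The following paths are proposed for deletion (no action taken):"]
    for path, reason in sorted(candidates.items()):
        lines.append(f"- [ ] {path} ({reason})")
    return "\n".join(lines)

report = review_report({
    "gs://leap-persistent-ro/climsim/attempt-1": "failed Dataflow job, 92 days old",
})
print(report)
```

Anyone debugging a failed job can then uncheck a path to keep it around, and actual deletion happens only after the checklist is approved.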