leap-stc / data-management

Collection of code to manually populate the persistent cloud bucket with data
https://catalog.leap.columbia.edu/
Apache License 2.0

Deletion system for failed/incomplete datasets in `leap-persistent-ro` #47

Closed by cisaacstern 2 months ago

cisaacstern commented 10 months ago

We output datasets to `leap-persistent-ro` so that users outside LEAP can access them:

https://github.com/leap-stc/data-management/blob/a87c3216d9e84afe703a2da90fc97152ecd8bd38/.github/workflows/deploy.yaml#L77

This is a good idea.

But when a Dataflow job fails, we end up with an incomplete dataset in that bucket, which we don't need to keep forever. I've run the climsim recipes many times, so for example:

```python
import gcsfs

gcs = gcsfs.GCSFileSystem()
paths = gcs.ls("gs://leap-persistent-ro/data-library/")
paths
```

```
['leap-persistent-ro/data-library/climsim-highres-mli-595733423-5686203778-1',
 'leap-persistent-ro/data-library/climsim-highres-mli-595733423-5731029632-1',
 'leap-persistent-ro/data-library/climsim-highres-mli-595733423-5731317955-1',
 'leap-persistent-ro/data-library/climsim-highres-mli-595733423-5733330020-1',
 'leap-persistent-ro/data-library/climsim-highres-mli-595733423-5733330026-1',
 'leap-persistent-ro/data-library/climsim-highres-mli-595733423-5733330028-1',
 'leap-persistent-ro/data-library/climsim-highres-mli-595733423-5733330056-1',
 'leap-persistent-ro/data-library/climsim-highres-mli-595733423-5744860208-1',
 'leap-persistent-ro/data-library/climsim-highres-mli-595733423-5803713446-1',
 'leap-persistent-ro/data-library/climsim-highres-mli-595733423-5803713466-1',
 'leap-persistent-ro/data-library/climsim-highres-mli-595733423-5803713492-1',
 'leap-persistent-ro/data-library/climsim-highres-mli-595733423-5803756587-1',
 'leap-persistent-ro/data-library/climsim-highres-mli-595733423-5869580856-1',
 'leap-persistent-ro/data-library/climsim-highres-mli-595733423-5869580857-1',
 'leap-persistent-ro/data-library/climsim-highres-mli-595733423-5869580881-1',
 'leap-persistent-ro/data-library/climsim-highres-mli-595733423-5871053468-1',
 'leap-persistent-ro/data-library/climsim-highres-mli-595733423-5882088909-1',
 'leap-persistent-ro/data-library/climsim-highres-mli-595733423-5882088910-1',
 'leap-persistent-ro/data-library/climsim-highres-mli-595733423-5882088928-1',
 'leap-persistent-ro/data-library/climsim-highres-mlo-595733423-5686102662-1',
 'leap-persistent-ro/data-library/climsim-highres-mlo-595733423-5719546748-1',
 'leap-persistent-ro/data-library/climsim-highres-mlo-595733423-5719548964-1',
 'leap-persistent-ro/data-library/climsim-highres-mlo-595733423-5719740052-1',
 'leap-persistent-ro/data-library/climsim-highres-mlo-595733423-5720516328-1',
 'leap-persistent-ro/data-library/climsim-highres-mlo-595733423-5722088926-1',
 'leap-persistent-ro/data-library/climsim-highres-mlo-595733423-5731029632-1',
 'leap-persistent-ro/data-library/climsim-highres-mlo-595733423-5731317955-1',
 'leap-persistent-ro/data-library/climsim-highres-mlo-595733423-5733330020-1',
 'leap-persistent-ro/data-library/climsim-highres-mlo-595733423-5733330026-1',
 'leap-persistent-ro/data-library/climsim-highres-mlo-595733423-5733330028-1',
 'leap-persistent-ro/data-library/climsim-highres-mlo-595733423-5733330056-1',
 'leap-persistent-ro/data-library/climsim-highres-mlo-595733423-5744860208-1',
 'leap-persistent-ro/data-library/climsim-highres-mlo-595733423-5803713446-1',
 'leap-persistent-ro/data-library/climsim-highres-mlo-595733423-5803713466-1',
 'leap-persistent-ro/data-library/climsim-highres-mlo-595733423-5803713492-1',
 'leap-persistent-ro/data-library/climsim-highres-mlo-595733423-5803756587-1',
 'leap-persistent-ro/data-library/climsim-highres-mlo-595733423-5804522442-1',
 'leap-persistent-ro/data-library/climsim-highres-mlo-595733423-5869580856-1',
 'leap-persistent-ro/data-library/climsim-highres-mlo-595733423-5869580857-1',
 'leap-persistent-ro/data-library/climsim-highres-mlo-595733423-5869580881-1',
 'leap-persistent-ro/data-library/climsim-highres-mlo-595733423-5871053468-1',
 'leap-persistent-ro/data-library/climsim-highres-mlo-595733423-5882088909-1',
 'leap-persistent-ro/data-library/climsim-highres-mlo-595733423-5882088910-1',
 'leap-persistent-ro/data-library/climsim-highres-mlo-595733423-5882088928-1',
 'leap-persistent-ro/data-library/climsim-highres-mlo-595733423-5882522942-1',
 'leap-persistent-ro/data-library/cmip6-testing',
 'leap-persistent-ro/data-library/data-library-cmip']
```

(Most of these are pruned tests, so they don't represent a lot of data.)

I'm pretty sure we don't want to automatically delete paths without human approval, because even if they are failed jobs, we may want to review them as part of the debugging process.

jbusecke commented 10 months ago

Thanks for opening this discussion @cisaacstern. This issue is becoming more important as we add more and more datasets to the library.

> I'm pretty sure we don't want to automatically delete paths without human approval

I think this is true for a certain time after a run, but I would be fine with data being forcibly removed after some set period. If people drop development and want to pick it back up later, it's fine IMO to rerun the recipe.
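As a rough illustration of period-based cleanup, here is a hedged sketch that flags stores whose newest object is older than an assumed retention window. The 90-day window and the reliance on the `updated` timestamp gcsfs reports for GCS objects are illustrative assumptions, and deletion would still sit behind a manual confirmation:

```python
# Hedged sketch (not the repo's actual tooling): flag stores under the
# data-library prefix whose contents have not been written for N days.
from datetime import datetime, timedelta, timezone

import gcsfs

PREFIX = "gs://leap-persistent-ro/data-library/"
MAX_AGE = timedelta(days=90)  # assumed retention window

gcs = gcsfs.GCSFileSystem()
now = datetime.now(timezone.utc)

for store in gcs.ls(PREFIX):
    # GCS object metadata includes an "updated" RFC 3339 timestamp; take the
    # newest object under the store as the store's last-modified time.
    mtimes = [
        datetime.fromisoformat(info["updated"].replace("Z", "+00:00"))
        for info in gcs.find(store, detail=True).values()
        if "updated" in info
    ]
    if mtimes and now - max(mtimes) > MAX_AGE:
        print(f"stale candidate: {store} (last write {max(mtimes):%Y-%m-%d})")
        # a human would confirm before running: gcs.rm(store, recursive=True)
```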

A similar point actually applies to successful reruns of a given dataset: do we want to keep those and retain a naive version history, or delete all previous runs after a while?

cisaacstern commented 10 months ago

> I think this is true for a certain time after a run, but I would be fine with data being forcibly removed after some set period. If people drop development and want to pick it back up later, it's fine IMO to rerun the recipe.

Agreed that after a time interval elapses, removing development datasets is okay.

But how do we distinguish between development and "final" datasets? It seems paths need to be cross-referenced against a catalog of some sort (sketched below), because all production build "attempts" land in the same bucket. So the ones we want to delete would be the ones that fulfill at least two criteria:

  1. They are older than some agreed-upon retention period
  2. They are not referenced in any catalog

And I think there are at least two catalogs something could be in at this point:

  1. The data library catalog in this repo
  2. The cmip6 big query catalog

Correct?
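A minimal sketch of that cross-check, assuming a hypothetical `catalog.yaml` whose entries carry a `url` field (the real catalog schema may differ):

```python
# Hypothetical cross-check: the catalog file name and "url" field are
# illustrative assumptions, not the repo's actual schema.
import gcsfs
import yaml  # pyyaml

with open("catalog.yaml") as f:
    entries = yaml.safe_load(f)  # assumed: a list of entries with store URLs

cataloged = {e["url"].rstrip("/") for e in entries}

gcs = gcsfs.GCSFileSystem()
candidates = [
    p for p in gcs.ls("gs://leap-persistent-ro/data-library/")
    if f"gs://{p.rstrip('/')}" not in cataloged
]
print(candidates)  # in the bucket, but in no catalog -> deletion candidates
```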

jbusecke commented 10 months ago

That seems like the right way to go about it.

Alternatively, we could consider a 'moving' stage: e.g., a stage that takes a store and just moves it to another bucket. Then the whole output could be dumped on scratch, and we would not have to worry about any of this.
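A minimal sketch of what such a stage could look like, assuming gcsfs can see both buckets; the paths are illustrative:

```python
# Illustrative "move" step: build the store in the scratch bucket, then move
# the finished result into the read-only bucket. fsspec's mv is a recursive
# copy followed by deletion of the source.
import gcsfs

gcs = gcsfs.GCSFileSystem()
src = "gs://leap-scratch/data-library/my-dataset.zarr"        # hypothetical
dst = "gs://leap-persistent-ro/data-library/my-dataset.zarr"  # hypothetical

gcs.mv(src, dst, recursive=True)
```

Failed Dataflow jobs would then leave debris only in scratch, which could be wiped on a schedule without cross-referencing anything.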

jbusecke commented 2 months ago

The current design avoids this by building/writing everything in `leap-scratch` until a final copy step, which on each run overwrites the 'final' URL defined in `catalog.yaml`.