pangeo-data / pangeo-cmip6-cloud

Documentation for Pangeo CMIP6 data stored in GCP/AWS cloud
https://pangeo-data.github.io/pangeo-cmip6-cloud/

Identifying retracted CMIP6 data in the Zarr holdings and removing them #30

Closed (matthew-mizielinski closed this issue 9 months ago)

matthew-mizielinski commented 2 years ago

Hi,

As mentioned at the last meeting there is a certain amount of Met Office data that I have retracted from the CMIP6 project on ESGF that is listed as available in the Pangeo cloud Zarr holdings on AWS (I have not checked the Google cloud listings). There are a range of reasons why we have retracted data, from faulty coordinate information to issues with the setup of the models, e.g. faulty ancillary input files in some UKESM1 simulations.

While I could provide a list of all datasets with the institution id "MOHC" that I have retracted, I know this could miss data published by other institutions, so I've put together an example of how you could iterate over all datasets to (a) look for connected Errata and (b) attempt to look up the status of datasets on ESGF via the handle service.

Link to demonstration notebook

Note that this method of querying the Errata service, and then the handle service, depends on Errata being issued by the modelling groups that own the data, and there are likely some gaps (i.e. retracted datasets for which there is no corresponding errata entry). The existence of an erratum does not by itself indicate that the affected data has been retracted; there are 'wontfix' errata which detail an issue that the modelling group does not intend to address, or that does not require the retraction of data, e.g. minor metadata issues and the extension of datasets in time. The method I've used in the example code linked above is not necessarily the quickest or most efficient. If you have access to the files themselves and the tracking_id field from within the global attributes, then a similar method could be applied that does not rely on the Errata service.
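To make the handle-service half of that concrete, here is a minimal sketch of looking up a tracking_id through the public Handle System REST API; the helper names are mine, and the exact layout of the returned CMIP6 PID record (including how retraction status is encoded) is an assumption to verify:

```python
import json
from urllib.request import urlopen

HANDLE_API = "https://hdl.handle.net/api/handles/"

def tracking_id_to_handle_url(tracking_id):
    """Map a CMIP6 tracking_id such as 'hdl:21.14100/<uuid>' onto the
    public Handle System REST API, which serves the PID record as JSON."""
    return HANDLE_API + tracking_id.replace("hdl:", "", 1)

def lookup_pid(tracking_id, timeout=30):
    """Fetch and decode one PID record; interpreting the typed values in
    it (e.g. any retraction flag) depends on the CMIP6 PID schema."""
    with urlopen(tracking_id_to_handle_url(tracking_id), timeout=timeout) as resp:
        return json.load(resp)
```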

I am particularly keen, given the popularity of this service, that the status of data presented here is consistent with that in the CMIP6 archives served by ESGF. Let me know if there is any further information that would be helpful here, and please point me in the appropriate direction if this is not the best place for this request.

Matt Mizielinski Climate Data Delivery Manager, Met Office Hadley Centre WGCM Infrastructure Panel co-chair

rabernat commented 2 years ago

Matt this is a great place for the request. Thanks a lot for your helpful comments. I agree completely with everything you say.

We used to have a (semi-manual) process in place for removing retracted data. But the person who managed it has retired. (Pinging @naomi-henderson with no expectation that she will reply.) So we need to transition to a more automated system.

@jbusecke - do you see a way to take @matthew-mizielinski's notebook and turn it into a cron job we can run with github workflows?

matthew-mizielinski commented 2 years ago

Thanks Ryan,

If you do start sending large numbers of queries to the Errata and Handle services through code such as this, do let me know -- I'd like to keep the developers of those systems in the loop so that they can check how well their services cope with the traffic (I've no idea whether what is needed here amounts to a large load or just a drop in the ocean for those services, but it might be worth checking).

rabernat commented 2 years ago

If you have access to the files themselves and the tracking_id field from within the global attributes then a similar method could be applied that does not rely on the Errata service.

Fortunately our archive does keep track of the tracking_ids that went into each Zarr store:

The original datasets, stored in the WCRP/CMIP6 ESGF repositories, consist of netCDF files. Each of these datasets typically corresponds to a single variable saved at specified time intervals for the length of a single model run. For convenience, these datasets were often divided into multiple netCDF files. Our Zarr data stores correspond to the result of concatenating these netCDF files and then storing them as a single Zarr object. All of the metadata in the original netCDF files has been preserved, including the licensing information and the CMIP6 persistent identifiers (tracking_ids) which are unique for each of the original netCDF files.

https://pangeo-data.github.io/pangeo-cmip6-cloud/overview.html#zarr-storage-format
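Since each Zarr store aggregates several netCDF files, its tracking_id attribute holds several handles. A small sketch for splitting such a concatenated attribute into individual PIDs (the separator convention is an assumption; check the actual stores):

```python
def parse_tracking_ids(attr_value):
    """Split a concatenated tracking_id attribute into individual
    'hdl:21.14100/...' handles; tolerates newline, space, or comma
    separators since the joining convention may vary between stores."""
    return [t.strip() for t in attr_value.replace(",", " ").split() if t.strip()]
```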

jbusecke commented 2 years ago

Thank you so much for all this work @matthew-mizielinski. I am looking into this now and will report back.

naomi-henderson commented 2 years ago

@matthew-mizielinski @jbusecke and @rabernat, I thought I would also mention that I had not actually been removing datasets from zarr storage which have issues filed through the Errata Service. They only get removed from the main CSV catalog file. There is a separate catalog file, gs://cmip6/cmip6-zarr-consolidated-stores-noQC.csv (noQC meaning no quality control), which DOES still list the datasets with filed errata issues. This is so that folks who used the datasets prior to the filing of an errata issue can still reproduce and understand their old results. If the datasets disappear, there is no way to do forensics on old results ...

rabernat commented 2 years ago

Hi Naomi! Thanks for chiming in from retirement-world! 😉 Sorry to be pestering you with CMIP6-related stuff, but your experience is so helpful.

Could you explain how you had been removing datasets from the catalog? What was your process for querying the errata and cross referencing it with the catalog? Any links to notebooks or scripts would be so helpful.

naomi-henderson commented 2 years ago

Yes, though it is a little complicated.

In ESDOC there is a notebook which attempts to explain the half-automatic way I used to query the Errata Service and update lists of issues. I was using the esgissue command from the esdoc-errata-client package to get all of the current issues. Then the notebook parses the issues as best it can, producing the csv/ES-DOC_issues.csv file for use in generating the catalog, see MakeCloudCat_GC.ipynb. As with all CMIP6 topics, there are many complicating factors, but that is the basic idea ....

So I generate TWO CSV catalogs: the main catalog, which skips any datasets with serious issues (a very fuzzy definition; it should be in the notebook), and another which has all of the datasets, with 3 extra columns listing the status, severity and issue_url for each dataset. For example, one line in the CSV file is:

HighResMIP,CMCC,CMCC-CM2-HR4,highresSST-present,r1i1p1f1,Amon,tasmin,gn,gs://cmip6/CMIP6/HighResMIP/CMCC/CMCC-CM2-HR4/highresSST-present/r1i1p1f1/Amon/tasmin/gn/v20170706/,,20170706,wontfix,medium,https://errata.es-doc.org/static/view.html?uid=9b40a054-21a7-5ae7-a3eb-8c373c5adddc

which indicates that the zarr store gs://cmip6/CMIP6/HighResMIP/CMCC/CMCC-CM2-HR4/highresSST-present/r1i1p1f1/Amon/tasmin/gn/v20170706/ has an issue with status='wontfix' and severity='medium' which can be checked at https://errata.es-doc.org/static/view.html?uid=9b40a054-21a7-5ae7-a3eb-8c373c5adddc

jbusecke commented 2 years ago

Hi @naomi-henderson thank you so much. I like the approach of having a 'raw' catalog, which then gets 'masked' for the users.

Let me try to give an update on what I have found today, where I might need some help, and my plans going forward.

Some nomenclature:

Central identification of datasets: An important aspect, here and for other work with these datasets, is the ability to uniquely identify a dataset. It seems that a dataset_id (in the form activity_id.institution_id.source_id.experiment_id.variant_label.table_id.variable_id.grid_label.version) does that job. From @matthew-mizielinski's notebook it seems this is sufficient to query errata. This signature is conveniently also our zarr store name. Any objections to using these as unique identifiers of a dataset?
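As a sketch of what using that signature could look like (facet names follow the CMIP6 DRS; the helper is hypothetical):

```python
# Facet order matching the dataset_id template discussed above.
DATASET_ID_FACETS = [
    "activity_id", "institution_id", "source_id", "experiment_id",
    "member_id", "table_id", "variable_id", "grid_label", "version",
]

def parse_dataset_id(dataset_id):
    """Split a dot-separated dataset_id into a facet dict, refusing ids
    whose facet count does not match the template."""
    parts = dataset_id.split(".")
    if len(parts) != len(DATASET_ID_FACETS):
        raise ValueError(f"expected {len(DATASET_ID_FACETS)} facets, got {len(parts)}")
    return dict(zip(DATASET_ID_FACETS, parts))
```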

How to identify retractions: Thanks to @matthew-mizielinski I think I have a good idea of what I need to implement to identify retracted datasets. I have been experimenting with grabbing the handle_id values directly from our zarr stores (instead of querying errata, which is quite slow), and this seems to work well! I was able to identify all the retracted datasets from @matthew-mizielinski's example, and some others! I will clean up that code and run it over the full catalog soon.

How to delete/mask the found datasets: But first I would like to discuss how we go about the 'deactivation' of the datasets identified above. My suggestion (happy to adjust) is the following:

  1. I will set up python code that runs on a github action, scans the full catalog, and produces a list of dataset_ids reflecting the most up-to-date retractions. That file could be committed to the repo and used later on to rebuild a given retracted store, if we end up deleting it.
  2. I would then try to adopt @naomi-henderson's approach and run an action that creates a 'filtered' catalog, where we simply omit the retracted datasets from the updated public_catalog.

We could implement some sort of check, where a maintainer has to 'sign off' on newly added retractions, or hope that everything is going to work out fine 😜.

I think this approach is generally better than deleting the stores (we can talk about that elsewhere); in case something does go wrong, it seems easier to recreate a csv file than to reupload data.
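The filtering in step 2 could be sketched with pandas roughly like this (the function and the zstore-to-id mapping shown are illustrative simplifications, not the real script):

```python
import pandas as pd

def filter_catalog(catalog_csv, retracted_ids, out_csv):
    """Write a user-facing catalog that omits retracted stores, leaving
    the raw input catalog untouched; returns the number of rows dropped."""
    cat = pd.read_csv(catalog_csv)
    # simplistic zstore -> id mapping: strip bucket prefix, join facets with '.'
    ids = (cat["zstore"]
           .str.replace("gs://cmip6/", "", regex=False)
           .str.rstrip("/")
           .str.replace("/", ".", regex=False))
    keep = ~ids.isin(set(retracted_ids))
    cat[keep].to_csv(out_csv, index=False)
    return int((~keep).sum())
```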

Retracted vs issues: A connected question here is how much 'quality control' we want to do for the user. I think removing retractions is obvious, but I believe @naomi-henderson had removed even some other datasets with issues? Any suggestions for a heuristic for what 'goes public' and what doesn't?

I personally would favor an approach where we remove as little as possible and maybe implement some functionality at a higher level (e.g. cmip6_pp) where the user could check for issues on a per-dataset basis (using some of the logic used here). But maybe there is a good reason not to do this.

More general Questions

That's it for today. I'll try to work on this again early next week.

matthew-mizielinski commented 2 years ago

Some thoughts here.

I fully understand the desire to keep data that might have been used -- I still have a copy of all Met Office data that I have retracted from the CMIP6 archive for the same reasons. Note that ESGF does not do this; when a retraction is run, the data files are removed but the records in the index remain (the cost of retaining this data is likely a contributing factor in this policy). Handling retractions through "de-listing" datasets in the catalogues is one way; another might be to relocate such data to a "retracted" directory location, but this almost certainly has additional complications. Not deleting the data also gives you the chance to easily correct any mistakes rather than having to re-download and reprocess.

The Errata service provides valuable information on a range of issues, but the simple existence of one connected with a dataset doesn't mean that the data is faulty (see some of the examples above). In your situation, if you have the flexibility to extend the global attributes on Zarr datasets, I would be tempted to add a field such as "connected_errata": [] with a list of links back to the errata system*. This is something we cannot do with netCDF files on ESGF.

Regarding the first of @jbusecke's questions, in case this helps clarify how the global attributes are intended to be used in CMIP6 (apologies if I'm misunderstanding the question above): in the vast majority of cases the member_id is identical to the variant_label. The one case where this is not true is the ensemble experiments in DCPP. If you search for the experiment dcppA-hindcast in the CMIP6 Experiment id list you'll see that it has a range of valid sub_experiment_id values relating to the start dates of the hindcast. This is expressed in the dataset id / directory / file name structure by using <sub_experiment_id>-<variant_label> instead of <variant_label>. While the default sub_experiment_id of none could have been used in this facet of the nomenclature, I suspect it was felt that this wasn't particularly elegant.
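In code, the member_id convention described above might look like this (a sketch; the function name is mine):

```python
def build_member_id(variant_label, sub_experiment_id="none"):
    """member_id equals variant_label except when a sub-experiment applies
    (e.g. DCPP hindcast start dates), where it becomes
    '<sub_experiment_id>-<variant_label>'."""
    if sub_experiment_id and sub_experiment_id != "none":
        return f"{sub_experiment_id}-{variant_label}"
    return variant_label
```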

If you do manage to produce a list of MOHC datasets from your Zarr store that have been retracted, feel free to point me at it for verification -- I can quickly compare this back to my inventory database as a sanity check.

*should you consider this there might be other valuable attributes that could be added (e.g. DOIs and citation service links), but I think this is a separate conversation

jbusecke commented 2 years ago

The one case where this is not true is for the ensemble experiments in DCPP.

Thanks for the explanation @matthew-mizielinski. I will have to take that into account in cmip6_pp.

A clarification question: What would a unique dataset_id for a dcppA-hindcast run look like? More generally, is this sort of id

activity_id.institution_id.source_id.experiment_id.variant_label.table_id.variable_id.grid_label.version

unique for all CMIP experiments?

matthew-mizielinski commented 2 years ago

A couple of examples, with breakdowns of individual facets:

instance_id = CMIP6.DCPP.NCAR.CESM1-1-CAM5-CMIP5.dcppA-hindcast.s1964-r29i1p1f1.Amon.psl.gn.v20191007
  mip_era = CMIP6
  activity_id = DCPP
  institution_id = NCAR
  source_id = CESM1-1-CAM5-CMIP5
  experiment_id = dcppA-hindcast
  member_id = s1964-r29i1p1f1 (sub_experiment_id = s1964, variant_label = r29i1p1f1)
  table_id = Amon
  variable_id = psl
  grid_label = gn (native grid)
  version = v20191007

instance_id = CMIP6.DCPP.CNRM-CERFACS.CNRM-CM6-1.dcppC-amv-neg.r21i1p1f2.AERmon.toz.gr.v20190620
  mip_era = CMIP6
  activity_id = DCPP
  institution_id = CNRM-CERFACS
  source_id = CNRM-CM6-1
  experiment_id = dcppC-amv-neg
  member_id = variant_label = r21i1p1f2
  table_id = AERmon
  variable_id = toz
  grid_label = gr (implies regridded data)
  version = v20190620

I would suggest using the full id (referred to as the instance_id in several places on ESGF; the dataset_id is the part without the version*) as your key here. Note that there are projects in the works beyond CMIP6 where a different mip_era would likely be used, so keeping the mip_era is valuable.

*Apologies I may have misused dataset_id above.
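A tiny helper for the instance_id / dataset_id distinction (a sketch; it assumes the version is always the final, dot-separated facet):

```python
def split_instance_id(instance_id):
    """Return (dataset_id, version): the dataset_id is the instance_id
    with its trailing '.v<date>' version facet removed."""
    dataset_id, _, version = instance_id.rpartition(".")
    return dataset_id, version
```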

jbusecke commented 2 years ago

Thanks for the clarification, that is super helpful information to have.

jbusecke commented 2 years ago

Hi everyone,

I believe I have made some major progress on this today. Thanks to the help of @agstephens, I was able to query ESGF directly to get all the retracted datasets (it takes only about 4 min to query!). I have written a draft script, which successfully ingests the current 'user facing' catalog at ..., removes the retracted (~9k) stores, and then writes a new csv to google cloud storage (thanks to @cisaacstern for helping me with the authentication).

For a quick sanity check, I looked at the two csv files and filtered for MOHC.UKESM1-0-LL.amip instance ids:

matt_retracted = [
    'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.AERmon.ps.gn.20190404',
 'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.ch4.gn.20190404',
 'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.clt.gn.20190404',
 'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.clwvi.gn.20190404',
 'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.evspsbl.gn.20190623',
 'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.hfls.gn.20190404',
 'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.hfss.gn.20190404',
 'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.hurs.gn.20190404',
 'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.hus.gn.20190404',
 'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.huss.gn.20190404',
 'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.o3.gn.20190404',
 'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.pr.gn.20190404',
 'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.prc.gn.20190404',
 'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.prsn.gn.20190404',
 'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.prw.gn.20190404',
 'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.ps.gn.20190404',
 'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.psl.gn.20190404',
 'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.rlds.gn.20190404',
 'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.rlus.gn.20190404',
 'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.rlut.gn.20190404',
 'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.rlutcs.gn.20190404',
 'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.rsds.gn.20190404',
 'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.rsdt.gn.20190404',
 'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.rsus.gn.20190404',
 'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.rsut.gn.20190623',
 'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.rsutcs.gn.20190404',
 'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.sfcWind.gn.20190404',
 'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.ta.gn.20190404',
 'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.tas.gn.20190404',
 'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.tasmax.gn.20190623',
 'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.tasmin.gn.20190623',
 'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.tauu.gn.20190404',
 'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.tauv.gn.20190404',
 'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.ts.gn.20190404',
 'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.ua.gn.20190404',
 'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.uas.gn.20190404',
 'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.va.gn.20190404',
 'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.vas.gn.20190404',
 'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.zg.gn.20190404',
 'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Emon.orog.gn.20190623',
 'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.LImon.snw.gn.20190404',
 'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Lmon.evspsblsoi.gn.20190404',
 'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Lmon.evspsblveg.gn.20190404',
 'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Lmon.mrro.gn.20190404',
 'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Lmon.mrros.gn.20190404',
 'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Lmon.mrso.gn.20190623',
 'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Lmon.mrsos.gn.20190404',
 'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Lmon.tran.gn.20190404']

# check for retracted datasets based on the cloud csv
import pandas as pd

old_catalog = pd.read_csv("https://cmip6.storage.googleapis.com/pangeo-cmip6.csv")
new_catalog = pd.read_csv("https://cmip6.storage.googleapis.com/test_retracted.csv")

def zstore_to_id(a):
    # map a zarr store path to an instance-id-like string
    # (note: replacing '/v' also rewrites variable ids beginning with 'v',
    # e.g. 'va' -> 'a'; a stricter split-based mapping would avoid this)
    return '.'.join(a.replace('gs://cmip6/','').replace('/v','.').split('/')[0:-1])

merged = old_catalog.merge(new_catalog, on='zstore', how='left', indicator=True)
retracted = merged[merged['_merge']=='left_only']

all_ids = [zstore_to_id(i) for i in merged['zstore'].tolist()]
all_retracted_ids = [zstore_to_id(i) for i in retracted['zstore'].tolist()]
uk_retracted_ids = [i for i in all_retracted_ids if 'UKESM1-0-LL.amip' in i]

for iid in matt_retracted:
    if iid not in uk_retracted_ids:
        print(iid)
        if iid not in all_ids:
            print('fine')
        else:
            print('uh-oh')

which gives

CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.va.gn.20190404
fine
CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.vas.gn.20190404
fine

which suggests that my current method detects all of the retractions identified by @matthew-mizielinski's notebook. The two remaining instance_ids are actually not in our cloud catalog.

Please let me know if anyone here sees any problems with my method, otherwise I would like to relatively quickly move ahead on this issue (to get an initial fixed catalog out before the ocean science meeting).

Moving ahead I see two options:

  1. Actually overwrite https://cmip6.storage.googleapis.com/pangeo-cmip6.csv (after a manual backup in case something goes wrong)
  2. Create a new csv file

They both have some pros and cons, depending on the method users employ to look at the data (e.g. loading the csv directly vs intake-esm json). I think, if carefully backed up, 1) is the more seamless way (nothing changes for the user).

Thoughts?

jbusecke commented 2 years ago

Some additional considerations that I forgot to mention: there seems to be some 'flakiness' to the esgf query. I have found instances where the number of retrieved instance_ids does not equal the count reported in the response header. I currently error out the script when that happens, but it would be nice to give it a couple of retries (which I have found usually resolves the issue).
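The retry idea could be sketched like this (names hypothetical; the consistency check stands in for comparing the retrieved ids against the count in the response header):

```python
import time

def with_retries(func, n_retries=3, delay=5, check=None):
    """Call func(), retrying when it raises or when an optional
    consistency check on its result fails; re-raise after exhausting
    the allowed attempts."""
    last_err = None
    for _ in range(n_retries):
        try:
            result = func()
            if check is None or check(result):
                return result
            last_err = RuntimeError("consistency check failed")
        except Exception as err:  # flaky network, truncated response, ...
            last_err = err
        time.sleep(delay)
    raise last_err
```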

durack1 commented 2 years ago

@jbusecke you may find that a switch out of

url = "https://esgf-data.dkrz.de/esg-search/search"

with

url = "https://esgf-node.llnl.gov/esg-search/search"

may be more reliable. (*Note: I haven't tested the above; I hope my thumbs transcribed it correctly.)

jbusecke commented 2 years ago

Ah thanks @durack1, I'll definitely give that a try!

rabernat commented 2 years ago

Julius huge 👏👏👏 for pushing forward to a simple and effective solution to this important problem. Excellent work!

In the long term, I think we want to move to a scenario where we are storing our authoritative catalog in Google BigQuery and generating the CSV periodically (daily?) from that.

In the short term, you probably want to:

  1. Generate a new catalog csv
  2. Run some sort of sanity checks on it
  3. Replace the old catalog with the new one
jbusecke commented 2 years ago

100% agree that we need something more stable in the long term. I would suggest we talk about this after Ocean Sciences?

Run some sort of sanity checks on it

Any ideas what to do there? Checking that we don't remove more than a certain % per run, maybe?

@durack1 : I am running the action with the updated url here

jbusecke commented 2 years ago

@durack1 : I am running the action with the updated url here

It definitely works, and also seems faster. Time will tell if it's more reliable 😜

jbusecke commented 2 years ago

OH wait. So this is weird:

From the dkrz version I get 8385 stores to remove, but from the llnl version I only get 8264 stores.

Can somebody here help me understand the difference?

durack1 commented 2 years ago

@jbusecke yeah we've had problems with index synchronization, there are plans to centralize into a single cloud index, but that's a wish not a reality ATM.

@sashakames, any ideas why the difference, dkrz > LLNL?

durack1 commented 2 years ago

Any ideas what to do there? Checking that we dont remove more than a certain % per run maybe?

This is a tricky one; I've hit a bunch of issues in the past assuming things would work when they didn't. Polling multiple nodes, or having a tiered failsafe (LLNL -> DKRZ -> CEDA), is a good idea. If there are inconsistencies, then ponder/pause, as you may have uncovered an issue -- any resulting actions may cause you even more headaches and a cleanup job
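The tiered failsafe could be sketched as follows (hypothetical helper; `query` stands in for whatever issues the actual search request):

```python
def query_with_failover(query, nodes):
    """Try each index node in priority order (e.g. LLNL -> DKRZ -> CEDA)
    and return (node, result) from the first one that answers; raise only
    if every node fails."""
    errors = {}
    for node in nodes:
        try:
            return node, query(node)
        except Exception as err:
            errors[node] = err
    raise RuntimeError(f"all nodes failed: {errors}")
```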

jbusecke commented 2 years ago

That comment was more directed towards checking our internal csv, I believe. But it makes me think:

What I could implement relatively easily is this:

sashakames commented 2 years ago

@sashakames, any ideas why the difference, dkrz > LLNL?

Strangely enough, I've noticed in our own retraction checking that the LLNL replica shard struggles to give full results for all the MOHC records. Not sure if that's the case here. My general recommendation is to query the "home" index for a particular institution when performing bulk record retrieval, e.g., for MOHC use the CEDA index with distrib=false.
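A sketch of what such a home-index query could look like; the parameter names follow the ESGF search API as I understand it, so treat them as assumptions to verify against the API docs:

```python
from urllib.parse import urlencode

def build_retraction_query(base_url, institution_id, limit=10000):
    """Build a search URL that asks one index (distrib=false) for its own
    retracted CMIP6 dataset records, returning only instance_id fields."""
    params = {
        "type": "Dataset",
        "project": "CMIP6",
        "institution_id": institution_id,
        "retracted": "true",
        "replica": "false",
        "distrib": "false",
        "fields": "instance_id",
        "format": "application/solr+json",
        "limit": limit,
    }
    return f"{base_url}?{urlencode(params)}"
```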

jbusecke commented 2 years ago

Hmm that makes it quite a bit more difficult. Is there a mapping from institute to index node available?

I could investigate some more which datasets are unique to which index, but not sure if that is helpful.

@sashakames do you think that the brute force approach (querying multiple nodes with distrib=true and finding the union of the results) has any downsides? E.g. do you expect some nodes to give retractions back falsely, or is it more likely that they are missing some retractions?

aradhakrishnanGFDL commented 2 years ago

100% agree that we need something more stable in the long term. I would suggest we talk about this after Ocean Sciences?

@jbusecke Super helpful actions and issue thread. Since this actions workflow you have pioneered is also applicable to CMIP6 cloud data in AWS (zarr and netcdf), I envision that the sanity checkers and other utilities built around it will tremendously help with a uniform (collaborative, as intended) toolset to "lightly" maintain different CMIP6 buckets. I agree that discussing the stable long-term workflow at the next ESGF/Pangeo cloud data working group call would be useful, building on what you have today.

Run some sort of sanity checks on it

Any ideas what to do there? Checking that we dont remove more than a certain % per run maybe?

The issues that Naomi notes as the "serious issues", which I assume may be impediments to generating functioning zarr stores, could be one of those critical checks in the sanity checker utility.

E.g. identifying non-contiguous time in datasets?

(I have also encountered bad DRS paths that messed up the catalog generation process, which I now skip.)

I agree, it's important to document how far we should go with our automated sanity checkers versus expectations from users to run their own (or shared) comprehensive QA checks. I support the latter for the most part.

This may be overkill, but leaving info in the (old) CSV to indicate the issue (if queryable) may help, although one could also query the errata service with the dataset ID or something and get info at a later date. I vaguely recollect Naomi used to document the issues.


jbusecke commented 2 years ago

Thanks for the comments @aradhakrishnanGFDL. Just to clarify something here: my current source catalog is already the 'quality controlled' version created by Naomi. I did it this way so that, for now, I can retract datasets without thinking about that logic, but I agree that point will need some more discussion.

jbusecke commented 2 years ago

So I spent some time finding out which instance_ids differ when querying different index nodes.

The notebook can be found here.

Bottom line of this is that, when comparing DKRZ and LLNL, there are instance_ids that are unique to each node.

So it seems useful to query as many nodes as possible, and then merge the resulting lists (as 'outer'), or is there an issue with this approach?

sashakames commented 2 years ago

@jbusecke Probably you could work with "brute-force" approach for querying multiple and reconciling differences. I don't see the source to the "query_retractions" method so I can't say if there's anything problematic there.

matthew-mizielinski commented 2 years ago

@jbusecke, it looks like good progress is being made here. Should it be useful to have what I think is the full MOHC list, have a look here -- this should be all the MOHC datasets you have that I have retracted.

jbusecke commented 2 years ago

@sashakames: Sorry for that. I somewhat silently refactored that into this module.

jbusecke commented 2 years ago

Thanks @matthew-mizielinski, I'll see if I can compare this to the output I get.

If I am doing this with multiple nodes, I might as well query all of them. Is there a list of urls for all nodes available somewhere? @durack1

jbusecke commented 2 years ago

Is there a known issue with the DKRZ node today? I have been trying to run the same query for several hours without success today.

jbusecke commented 2 years ago

Ok it is done 🤗

~8k stores were removed from our catalog here. Running the action again confirms that we now have a catalog containing 0 retractions.

I will enable the dkrz node now and see if that yields some more retractions.

@matthew-mizielinski I also have manually confirmed that all the instance_ids you provided are marked as retracted!

Thanks everyone for the great help with this! I feel really good going into OSM with a properly maintained catalog.

jbusecke commented 2 years ago

Still on the agenda:

aradhakrishnanGFDL commented 2 years ago

Quick comment about nodes to query, the CMIP6 data citation landing pages point to DKRZ and LLNL nodes. To me, this implies 99% completeness, perhaps. More of a comment only.

durack1 commented 2 years ago

Is there a list of urls for all nodes available somewhere? @durack1

The nodes that are "tier 1" include:

  https://esgf-node.llnl.gov/search/cmip6/
  https://esgf-data.dkrz.de/search/cmip6-dkrz/
  https://esgf-index1.ceda.ac.uk/search/cmip6-ceda/

An additional one which is normally reliable:

  https://esgf-node.ipsl.upmc.fr/search/cmip6-ipsl/

These are all listed on https://esgf-node.llnl.gov/projects/cmip6/

durack1 commented 2 years ago

Is there a known issue with the DKRZ node today? I have been trying to run the same query for several hours without success today.

This site should be of use https://esgf-node.llnl.gov/status/

jbusecke commented 2 years ago

Fantastic. Exactly what I was looking for @durack1.

jbusecke commented 2 years ago

I have just updated the github action in #34 and the script is now successfully querying LLNL, DKRZ, CEDA, and IPSL.

As of now, our Google catalog is up to date with the retraction information. I think the next big ticket item is to do something similar to the AWS catalog. Should we track that progress here or open a new issue and close this?

durack1 commented 2 years ago

@jbusecke nice! Do you have some diagnostic output from the LLNL/DKRZ/CEDA/IPSL index scour that points out inconsistencies? It'd be nice to know about problems appearing before they cause issues.

jbusecke commented 2 years ago

@durack1 at the moment I am only displaying the number of retractions found per node, which could be used as a very rough consistency check each time.

I certainly could implement some more logic here to identify the retractions that are unique to each node (is that the info you would be after?).

If that would be of help, I could also set this sort of stuff up in a separate action, and keep track of how these evolve over time. Happy to implement anything that could help upstream!

durack1 commented 2 years ago

@jbusecke I wonder how flexible github actions are; if there was an index inconsistency (which appears to be a common problem) then maybe you could fling an email with these stats (this might be useful: https://github.com/marketplace/actions/send-email), alerting you to an issue rather than you having to remind yourself to check.

jbusecke commented 2 years ago

@durack1 I think that is definitely possible. Should this be forwarded to someone at ESGF in the end? (I wouldn't quite know what to do about the inconsistency.)

And what information exactly would be useful?

I can think of two ways to distill that information (these would all be instance_ids):

  1. Compare each node's output with the 'outer' union we already use to retract. This would highlight which retractions are 'missing' from each node.
  2. Construct an 'inner' intersection and then see which datasets are uniquely 'additional' to each node.

The information would be similar; I'm just wondering which of these would be of more use. From a practical viewpoint, I think 1) would be a bit easier to implement now, but not by a lot.
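Both variants boil down to set algebra over the per-node results; a sketch (the function name is mine):

```python
def compare_nodes(per_node_ids):
    """per_node_ids maps node name -> set of retracted instance_ids.
    Returns, per node, the retractions it is missing relative to the
    outer union (option 1) and the ones it uniquely adds relative to
    the inner intersection (option 2)."""
    union = set().union(*per_node_ids.values())
    inter = set.intersection(*(set(s) for s in per_node_ids.values()))
    missing = {node: union - set(ids) for node, ids in per_node_ids.items()}
    extra = {node: set(ids) - inter for node, ids in per_node_ids.items()}
    return missing, extra
```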

jbusecke commented 2 years ago

@durack1 I have implemented this in the current script: The logs identify the number of missing instance_ids on each node (see here for example), and also create an artifact that contains a csv per node with only the missing ones in them.

durack1 commented 2 years ago

@jbusecke this is great! Are you running this as a cronjob/repeated service? I wonder if inconsistencies are common, or whether these are more uncommon, knowing the answer to this question would be good before figuring out what to do with it

update: and it seems like you are running this daily and the last couple of times it has bombed, right?

jbusecke commented 2 years ago

@durack1 yes this is run daily at night EST. I just checked and the last 6 times were successful (you somehow mentioned it into working lol). Please check the results out here: https://github.com/pangeo-data/pangeo-cmip6-cloud/actions

jbusecke commented 9 months ago

All of this logic was overhauled and now lives here.