Closed by matthew-mizielinski 9 months ago
Matt this is a great place for the request. Thanks a lot for your helpful comments. I agree completely with everything you say.
We used to have a (semi-manual) process in place for removing retracted data. But the person who managed it has retired. (Pinging @naomi-henderson with no expectation that she will reply.) So we need to transition to a more automated system.
@jbusecke - do you see a way to take @matthew-mizielinski's notebook and turn it into a cron job we can run with github workflows?
Thanks Ryan,
If you do start sending large numbers of queries to the Errata and Handle services through code such as this do let me know -- I'd like to keep the developers of those systems in the loop so that they can look at how well their services are coping with the traffic (I've no idea whether what will be needed here is a large amount of traffic or just a drop in the ocean to those services, but might be worth checking).
If you have access to the files themselves and the `tracking_id` field from within the global attributes then a similar method could be applied that does not rely on the Errata service.
Fortunately our archive does keep track of the `tracking_id`s that went into each Zarr:
The original datasets, stored in the WCRP/CMIP6 ESGF repositories, consist of netCDF files. Each of these datasets typically corresponds to a single variable saved at specified time intervals for the length of a single model run. For convenience, these datasets were often divided into multiple netCDF files. Our Zarr data stores correspond to the result of concatenating these netCDF files and then storing them as a single Zarr object. All of the metadata in the original netCDF files has been preserved, including the licensing information and the CMIP6 persistent identifiers (tracking_ids) which are unique for each of the original netCDF files.
https://pangeo-data.github.io/pangeo-cmip6-cloud/overview.html#zarr-storage-format
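To make the tracking_id route above concrete, resolving one of these ids against the Handle System's public REST API could look like the sketch below. The hdl.handle.net endpoint and the `{"values": [{"type": ..., "data": {"value": ...}}]}` response shape are the documented Handle API; exactly which value types appear on CMIP6 records is an assumption here, so `handle_values` is just an exploratory helper for inspecting a record.

```python
import json
from urllib.request import urlopen

HANDLE_API = "https://hdl.handle.net/api/handles/"

def fetch_handle_record(tracking_id):
    """Resolve a CMIP6 tracking_id (e.g. 'hdl:21.14100/<uuid>') via the
    Handle REST API and return the parsed JSON record."""
    handle = tracking_id.replace("hdl:", "")
    with urlopen(HANDLE_API + handle) as resp:
        return json.load(resp)

def handle_values(record):
    """Flatten the 'values' list of a handle record into a {type: value}
    dict. Which types a CMIP6 record carries (e.g. 'URL', 'IS_PART_OF')
    is an assumption -- inspect a real record to see what is available."""
    return {v["type"]: v["data"]["value"] for v in record.get("values", [])}
```

From there, working out retraction status would mean checking whichever value type the CMIP6 PID service uses for dataset state, which Matt's notebook demonstrates against real records.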
Thank you so much for all this work @matthew-mizielinski. I am looking into this now and will report back.
@matthew-mizielinski @jbusecke and @rabernat, I thought I would also mention that I had not actually been removing datasets from zarr storage which have issues filed through the Errata Service. They only get removed from the main CSV catalog file. There is a separate catalog file, gs://cmip6/cmip6-zarr-consolidated-stores-noQC.csv (noQC meaning no quality control), which DOES still list the datasets with filed errata issues. This is so that folks who have used the datasets prior to the filing of an errata issue can still reproduce and understand their old results. If the datasets disappear there is no way to do forensics on old results ...
Hi Naomi! Thanks for chiming in from retirement-world! 😉 Sorry to be pestering you with CMIP6-related stuff, but your experience is so helpful.
Could you explain how you had been removing datasets from the catalog? What was your process for querying the errata and cross referencing it with the catalog? Any links to notebooks or scripts would be so helpful.
Yes, though it is a little complicated.
In ESDOC there is a notebook which attempts to explain the half-automatic way I used to query the Errata Service and update lists of issues. I was using the `esgissue` command from the esdoc-errata-client package to get all of the current issues. Then the notebook parses the issues as best it can, producing the csv/ES-DOC_issues.csv file for use in generating the catalog, see MakeCloudCat_GC.ipynb. As with all CMIP6 topics, there are many complicating factors, but that is the basic idea ....
So I generate TWO CSV catalogs: one which skips any datasets with serious issues (a very fuzzy definition should be in the notebook) when generating the main catalog, and another which has all of the datasets, with 3 extra columns added listing the status, severity and issue_url for each dataset. For example, one line in the CSV file is:
HighResMIP,CMCC,CMCC-CM2-HR4,highresSST-present,r1i1p1f1,Amon,tasmin,gn,gs://cmip6/CMIP6/HighResMIP/CMCC/CMCC-CM2-HR4/highresSST-present/r1i1p1f1/Amon/tasmin/gn/v20170706/,,20170706,wontfix,medium,https://errata.es-doc.org/static/view.html?uid=9b40a054-21a7-5ae7-a3eb-8c373c5adddc
which indicates that the zarr store gs://cmip6/CMIP6/HighResMIP/CMCC/CMCC-CM2-HR4/highresSST-present/r1i1p1f1/Amon/tasmin/gn/v20170706/ has an issue with status='wontfix' and severity='medium', which can be checked at https://errata.es-doc.org/static/view.html?uid=9b40a054-21a7-5ae7-a3eb-8c373c5adddc
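A sketch of how the noQC catalog with those three extra columns could be masked into a public catalog. Which severities count as "serious" is a placeholder assumption here; the real (fuzzy) definition lives in Naomi's notebook.

```python
import pandas as pd

# Placeholder: which severities are "serious" enough to hide a dataset.
SERIOUS = {"high", "critical"}

def make_public_catalog(noqc):
    """Drop rows whose errata severity is serious; rows with no filed
    issue have NaN in the severity column and are kept as-is."""
    keep = ~noqc["severity"].isin(SERIOUS)
    return (noqc[keep]
            .drop(columns=["status", "severity", "issue_url"])
            .reset_index(drop=True))
```

The main catalog then contains no issue columns at all, while the noQC file retains the full record for forensics.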
Hi @naomi-henderson thank you so much. I like the approach of having a 'raw' catalog, which then gets 'masked' for the users.
Let me try to give an update on what I have found today, where I might need some help, and my plans going forward.
Some nomenclature:

- dataset: a dataset concatenated in time, equivalent to a zarr store in the cloud holdings. The smallest unit of data we have in this catalog.
- raw_catalog: the no_qc catalog @naomi-henderson mentioned, with everything that was ever uploaded.
- public_catalog: the 'masked' catalog which is exposed to the public.
Central identification of datasets
An important aspect, here and for other work with these datasets, is the ability to uniquely identify a dataset. It seems that a dataset_id (in the form activity_id.institution_id.source_id.experiment_id.variant_label.table_id.variable_id.grid_label.version) does that job. From @matthew-mizielinski's notebook it seems like this is sufficient to query errata. This signature is conveniently also our zarr store name. Any objections to using these as unique identifiers of a dataset?
How to identify retractions Thanks to @matthew-mizielinski I think I have a good idea what I need to implement to identify retracted datasets. I have been experimenting with grabbing the handle_id values directly from our zarr stores (instead of querying errata, which is quite slow), and this seems to work well! I was able to identify all the retracted datasets from @matthew-mizielinski's example, and some others! I will clean up that code and run it over the full catalog soon.
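For context, grabbing the handle ids straight from a cloud zarr store can be done by reading its consolidated `.zattrs`, where the CMIP6 `tracking_id` attribute (the handle) is preserved. The gs:// to https:// mapping below relies on the cmip6 bucket being publicly readable, and the assumption that multiple ids in a concatenated store are newline- or space-separated; both are worth verifying against a real store.

```python
import json
from urllib.request import urlopen

def parse_tracking_ids(attrs):
    """Split a CMIP6 tracking_id attribute into individual handle ids,
    assuming newline or space separation in concatenated stores."""
    return attrs.get("tracking_id", "").replace("\n", " ").split()

def store_tracking_ids(zstore):
    """Fetch a store's global attributes without opening the full zarr;
    assumes gs://cmip6/... is served at cmip6.storage.googleapis.com."""
    url = (zstore.replace("gs://cmip6/", "https://cmip6.storage.googleapis.com/")
           .rstrip("/") + "/.zattrs")
    with urlopen(url) as resp:
        return parse_tracking_ids(json.load(resp))
```

This avoids one Errata round-trip per dataset: the ids can be fed to the handle service directly.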
How to delete/mask the found datasets But first I would like to discuss how we go about the 'deactivation' of the datasets that we identified above. My suggestion (happy to adjust) is the following:

- Keep a file of dataset_ids reflecting the most up to date retractions. That file could be committed to the repo and used later on to rebuild a certain retracted store, if we end up deleting it.
- Mask those entries out of the public_catalog.

We could implement some sort of check, where a maintainer has to 'sign off' on newly added retractions, or hope that everything is going to work out fine 😜.
I think this approach is generally better than deleting the stores (we can talk about that elsewhere); in case something does go wrong, it seems easier to recreate a csv file than to reupload data.
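The masking step above could be as small as this sketch: `zstore_to_id` is a hypothetical helper mirroring the store-name convention discussed earlier, and `mask_retracted` drops the committed retraction list from the raw catalog to produce the public one.

```python
import pandas as pd

def zstore_to_id(zstore):
    """Hypothetical helper: turn a store path into a dotted id, relying
    on the zarr store name matching the dataset_id convention above."""
    parts = zstore.replace("gs://cmip6/", "").rstrip("/").split("/")
    return ".".join(parts)

def mask_retracted(raw_catalog, retracted_ids):
    """Drop rows of the raw catalog whose id appears in the committed
    retraction list, yielding the public catalog."""
    ids = raw_catalog["zstore"].map(zstore_to_id)
    return raw_catalog[~ids.isin(set(retracted_ids))].reset_index(drop=True)
```

Because the retraction file is committed to the repo, any over-eager masking can be reverted by regenerating the CSV, which is the point of masking rather than deleting.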
Retracted vs issues A connected issue here is the question of how much 'quality control' we want to do for the user. I think removing retractions is obvious, but I think @naomi-henderson had removed even some other datasets with issues? Any suggestions for a heuristic of what 'goes public' and what doesn't?
I personally would favor an approach where we remove as little as possible and maybe implement some functionality on a higher level (e.g. cmip6_pp) where the user could check for issues on a dataset basis (using some of the logic used here). But maybe there is a good reason not to do this.
More general questions

- In our catalog, variant_label has been renamed to member_id. This has given me quite the headache over at cmip6_preprocessing, and I still have not been able to understand why this was introduced. Going forward, are we going to carry this, or opt for a new catalog which is organized strictly by the facet names that are also used in the dataset attributes?
- … any of them is bad (that is what I have implemented so far). In the future we might want to think about an automatic retrigger mechanism that would try to rebuild that dataset-specific store.

That's it for today. I'll try to work on this again early next week.
Some thoughts here.
I fully understand the desire to keep data that might have been used -- I still have a copy of all Met Office data that I have retracted from the CMIP6 archive for the same reasons. Note that ESGF does not do this; when a retraction is run the data files are removed but the records in the index remain (the cost of retaining this data is likely a contributing factor for this policy). Handling retractions through "de-listing" data sets in the catalogues is one way, another might be to relocate such data to a "retracted" directory location, but this almost certainly has additional complications. Not deleting the data also gives you the chance to easily correct any mistakes rather than having to re-download and reprocess.
The Errata service provides valuable information on a range of issues, but the simple existence of one connected with a dataset doesn't mean that the data is faulty (see some of the examples above). In your situation, if you have the flexibility to extend the global attributes on Zarr datasets, I would be tempted to add a field such as "connected_errata": [] with a list of links back to the errata system*. This is something we cannot do with netCDF files on ESGF.
Regarding the first of @jbusecke's questions:
In case this helps clarify how the global attributes are intended to be used in CMIP6 (apologies if I'm misunderstanding the question above): in the vast majority of cases the member_id is identical to the variant_label. The one case where this is not true is for the ensemble experiments in DCPP. If you search for the experiment dcppA-hindcast in the CMIP6 Experiment id list you'll see that it has a range of valid sub_experiment_id values relating to start dates of the hindcast. This is expressed in the dataset id / directory / file name structure by using <sub_experiment_id>-<variant_label> instead of <variant_label>. While the default sub_experiment_id of none could have been used in this facet of the nomenclature, I suspect it was felt that this wasn't particularly elegant.
If you do manage to produce a list of MOHC datasets from your Zarr store that have been retracted, feel free to point me at it for verification -- I can quickly compare this back to my inventory database as a sanity check.
*should you consider this there might be other valuable attributes that could be added (e.g. DOIs and citation service links), but I think this is a separate conversation
The one case where this is not true is for the ensemble experiments in DCPP.
Thanks for the explanation @matthew-mizielinski. I will have to take that into account in cmip6_pp.
A clarification question: what would a unique dataset_id for a dcppA-hindcast run look like?
More generally, is this sort of id, activity_id.institution_id.source_id.experiment_id.variant_label.table_id.variable_id.grid_label.version, unique for all CMIP experiments?
A couple of examples, with breakdowns of individual facets:
instance_id = CMIP6.DCPP.NCAR.CESM1-1-CAM5-CMIP5.dcppA-hindcast.s1964-r29i1p1f1.Amon.psl.gn.v20191007
mip_era = CMIP6
activity_id = DCPP
institution_id = NCAR
source_id = CESM1-1-CAM5-CMIP5
experiment_id = dcppA-hindcast
member_id = s1964-r29i1p1f1 (sub_experiment_id = s1964, variant_label = r29i1p1f1)
table_id = Amon
variable_id = psl
grid_label = gn (native grid)
version = v20191007
instance_id = CMIP6.DCPP.CNRM-CERFACS.CNRM-CM6-1.dcppC-amv-neg.r21i1p1f2.AERmon.toz.gr.v20190620
mip_era = CMIP6
activity_id = DCPP
institution_id = CNRM-CERFACS
source_id = CNRM-CM6-1
experiment_id = dcppC-amv-neg
member_id = variant_label = r21i1p1f2
table_id = AERmon
variable_id = toz
grid_label = gr (implies regridded data)
version = v20190620
I would suggest using the full id (referred to as the instance_id in several places on ESGF; the dataset_id is the part without the version*) as your key here. Note that there are projects in the works beyond CMIP6 where a different mip_era would likely be used, so keeping the mip_era is valuable.
*Apologies, I may have misused dataset_id above.
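Following the two worked examples above, splitting a full instance_id into its facets is mechanical, since no facet contains a '.' in the CMIP6 DRS (the helper below is illustrative, not from the thread):

```python
# Facet order of a CMIP6 instance_id, per the examples above.
FACETS = ["mip_era", "activity_id", "institution_id", "source_id",
          "experiment_id", "member_id", "table_id", "variable_id",
          "grid_label", "version"]

def parse_instance_id(instance_id):
    """Split a full instance_id into a {facet: value} dict."""
    parts = instance_id.split(".")
    if len(parts) != len(FACETS):
        raise ValueError(f"expected {len(FACETS)} facets, got {len(parts)}")
    return dict(zip(FACETS, parts))
```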
Thanks for the clarification, that is super helpful information to have.
Hi everyone,
I believe I have made some major progress on this today. Thanks to the help of @agstephens, I was able to query ESGF directly to get all the retracted datasets (takes only about 4 min to query!). I have written a draft script, which successfully ingests the current 'user facing' catalog at ... and removes the retracted (~9k) stores, and then writes to a new csv on google cloud storage (thx to @cisaacstern for helping me with the authentication).
For a quick sanity check, I look at the two csv files and filter for MOHC.UKESM1-0-LL.amip instance ids:
```python
import pandas as pd

matt_retracted = [
    'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.AERmon.ps.gn.20190404',
    'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.ch4.gn.20190404',
    'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.clt.gn.20190404',
    'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.clwvi.gn.20190404',
    'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.evspsbl.gn.20190623',
    'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.hfls.gn.20190404',
    'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.hfss.gn.20190404',
    'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.hurs.gn.20190404',
    'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.hus.gn.20190404',
    'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.huss.gn.20190404',
    'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.o3.gn.20190404',
    'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.pr.gn.20190404',
    'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.prc.gn.20190404',
    'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.prsn.gn.20190404',
    'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.prw.gn.20190404',
    'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.ps.gn.20190404',
    'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.psl.gn.20190404',
    'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.rlds.gn.20190404',
    'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.rlus.gn.20190404',
    'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.rlut.gn.20190404',
    'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.rlutcs.gn.20190404',
    'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.rsds.gn.20190404',
    'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.rsdt.gn.20190404',
    'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.rsus.gn.20190404',
    'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.rsut.gn.20190623',
    'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.rsutcs.gn.20190404',
    'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.sfcWind.gn.20190404',
    'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.ta.gn.20190404',
    'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.tas.gn.20190404',
    'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.tasmax.gn.20190623',
    'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.tasmin.gn.20190623',
    'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.tauu.gn.20190404',
    'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.tauv.gn.20190404',
    'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.ts.gn.20190404',
    'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.ua.gn.20190404',
    'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.uas.gn.20190404',
    'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.va.gn.20190404',
    'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.vas.gn.20190404',
    'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.zg.gn.20190404',
    'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Emon.orog.gn.20190623',
    'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.LImon.snw.gn.20190404',
    'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Lmon.evspsblsoi.gn.20190404',
    'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Lmon.evspsblveg.gn.20190404',
    'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Lmon.mrro.gn.20190404',
    'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Lmon.mrros.gn.20190404',
    'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Lmon.mrso.gn.20190623',
    'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Lmon.mrsos.gn.20190404',
    'CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Lmon.tran.gn.20190404',
]

# check for retracted datasets based on the cloud csv
old_catalog = pd.read_csv("https://cmip6.storage.googleapis.com/pangeo-cmip6.csv")
new_catalog = pd.read_csv("https://cmip6.storage.googleapis.com/test_retracted.csv")

def zstore_to_id(a):
    return '.'.join(a.replace('gs://cmip6/', '').replace('/v', '.').split('/')[0:-1])

merged = old_catalog.merge(new_catalog, on='zstore', how='left', indicator=True)
retracted = merged[merged['_merge'] == 'left_only']

all_ids = [zstore_to_id(i) for i in merged['zstore'].tolist()]
all_retracted_ids = [zstore_to_id(i) for i in retracted['zstore'].tolist()]
uk_retracted_ids = [i for i in all_retracted_ids if 'UKESM1-0-LL.amip' in i]

for iid in matt_retracted:
    if iid not in uk_retracted_ids:
        print(iid)
        if iid not in all_ids:
            print('fine')
        else:
            print('uh-oh')
```
which gives

```
CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.va.gn.20190404
fine
CMIP6.CMIP.MOHC.UKESM1-0-LL.amip.r1i1p1f2.Amon.vas.gn.20190404
fine
```
which suggests that my current method detects all of the retractions identified by @matthew-mizielinski's notebook. The two remaining instance_ids are actually not in our cloud catalog.
Please let me know if anyone sees any problems with my method; otherwise I would like to move ahead on this issue relatively quickly (to get an initial fixed catalog out before the Ocean Sciences meeting).
Moving ahead I see two options: 1) actually overwrite https://cmip6.storage.googleapis.com/pangeo-cmip6.csv (after a manual backup, in case something goes wrong), or 2) create a new csv file.
They both have some pros and cons, depending on the method users employ to look at the data (e.g. loading the csv directly vs the intake-esm json). I think, if carefully backed up, 1) is the more seamless way (nothing changes for the user).
Thoughts?
Some additional considerations that I forgot to mention: there seems to be some 'flakiness' to the esgf query. I have found instances where the retrieved number of instance_ids is not equal to the number of items reported in the response header. I currently error out the script if that happens, but it would be nice to give it a couple of retries (which I have found usually resolves the issue).
@jbusecke you may find that switching out
url = "https://esgf-data.dkrz.de/esg-search/search"
with
url = "https://esgf-node.llnl.gov/esg-search/search"
may be more reliable. *Note: I haven't tested the above; I hope my thumbs transcribed it correctly.
Ah thanks @durack1, I'll definitely give that a try!
Julius huge 👏👏👏 for pushing forward to a simple and effective solution to this important problem. Excellent work!
In the long term, I think we want to move to a scenario where we are storing our authoritative catalog in Google BigQuery and generating the CSV periodically (daily?) from that.
In the short term, you probably want to:
100% agree that we need something more stable in the long term. I would suggest we talk about this after Ocean Sciences?
Run some sort of sanity checks on it
Any ideas what to do there? Checking that we don't remove more than a certain % per run, maybe?
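A guard for the percentage idea could be as simple as the sketch below; the 5% threshold is an arbitrary placeholder, not a recommendation from the thread:

```python
def check_removal_fraction(n_before, n_after, max_fraction=0.05):
    """Abort a catalog update if a single run would drop more than
    max_fraction of the rows (placeholder threshold: 5%)."""
    removed = n_before - n_after
    frac = removed / n_before if n_before else 0.0
    if frac > max_fraction:
        raise RuntimeError(
            f"refusing to remove {removed}/{n_before} rows ({frac:.1%})")
    return frac
```

Run inside the action, this turns a runaway retraction (e.g. an index node returning garbage) into a loud failure instead of a silently gutted catalog.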
@durack1 : I am running the action with the updated url here
It definitely works, and also seems faster. Time will tell if it's more reliable 😜
OH wait. So this is weird:
From the dkrz version I get 8385 stores to remove, but from the llnl version I only get 8264 stores.
Can somebody here help me understand the difference?
@jbusecke yeah, we've had problems with index synchronization; there are plans to centralize into a single cloud index, but that's a wish, not a reality ATM.
@sashakames, any ideas why the difference, dkrz > LLNL?
Any ideas what to do there? Checking that we dont remove more than a certain % per run maybe?
This is a tricky one; I've hit a bunch of issues in the past assuming things would work when they didn't. Polling multiple nodes, or having a tiered failsafe (LLNL -> DKRZ -> CEDA), is a good idea. If there are inconsistencies, then ponder/pause, as you may have uncovered an issue - any resulting actions may cause you even more headaches and a cleanup job
That comment was more directed towards checking our internal csv, I believe. But it makes me think:
What I could implement relatively easily is this:
@sashakames, any ideas why the difference, dkrz > LLNL?
Strangely enough, I have noticed in our own retraction checking that the LLNL replica shard struggles to give full results for all the MOHC records. Not sure if that's the case here. My general recommendation is to query the "home" index for a particular institution if performing bulk record retrieval, e.g., for MOHC use the CEDA index with distrib=false.
Hmm, that makes it quite a bit more difficult. Is there a mapping from institution to index node available?
I could investigate some more which datasets are unique to which index, but not sure if that is helpful.
@sashakames do you think that the brute force approach (querying multiple nodes with distrib=true and finding the union of the results) has any downsides? E.g. do you expect some nodes to give retractions back falsely, or is it more likely that they are missing some retractions?
100% agree that we need something more stable in the long term. I would suggest we talk about this after Ocean Sciences?
@jbusecke Super helpful actions and issue thread. Since this actions workflow you have pioneered is also applicable to the CMIP6 cloud data in AWS (zarr and netcdf), I envision that the sanity checkers and other utilities built around it will tremendously help with a uniform (collaborative, as intended) toolset to "lightly" maintain different CMIP6 buckets. I agree discussing the stable long-term workflow at the next ESGF/Pangeo cloud data working group call would be useful, building on what you have today.
Run some sort of sanity checks on it
Any ideas what to do there? Checking that we dont remove more than a certain % per run maybe?
The issues that Naomi notes as the "serious issues", which I assume may be impediments to generating a functioning zarr store, could be one of those critical checks in the sanity checker utility.
E.g. identifying non-contiguous time in datasets?
(I have also encountered bad DRS paths that messed up catalog generation, which I now skip in that process.)
I agree, it's important to document how far we should go with our automated sanity checkers versus expectations from users to run their own (or shared) comprehensive QA checks. I support the latter for the most part.
This may be overkill, but leaving info in the (old) CSV to indicate the issue (if queryable) may help, although one could also query the errata service with the datasetID or something and get info at a later date. I vaguely recollect Naomi used to document the issues.
Thanks for the comments @aradhakrishnanGFDL. Just to clarify something here: my current source catalog is already the 'quality controlled' version created by Naomi. I did it this way so I can, for now, retract datasets without thinking about that logic, but I agree that point will need some more discussion.
So I spent some time finding out which instance_ids differ when querying different index nodes.
The notebook can be found here.
Bottom line: when comparing DKRZ and LLNL, there are instance_ids that are unique to each node.
So it seems useful to query as many nodes as possible and then merge the resulting lists (as an 'outer' union) - or is there an issue with this approach?
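The 'outer' merge across nodes amounts to a set union: a dataset counts as retracted if any queried node reports it (function name and input shape are illustrative):

```python
def merge_node_results(per_node):
    """Outer-merge retraction lists from several index nodes: a dataset
    counts as retracted if any node reports it.

    per_node: dict mapping node name -> list of retracted instance_ids.
    """
    union = set()
    for ids in per_node.values():
        union |= set(ids)
    return sorted(union)
```

The conservative bias here is deliberate: a node missing a retraction (the failure mode Sasha describes) cannot hide it, while a node falsely reporting one would need separate verification.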
@jbusecke Probably you could work with the "brute-force" approach of querying multiple nodes and reconciling differences. I don't see the source to the "query_retractions" method, so I can't say if there's anything problematic there.
@jbusecke, looks like good progress is being made here, should it be useful to have what I think is the full MOHC list then have a look here -- this should be all the MOHC datasets that you have that I have retracted.
@sashakames: Sorry for that. I somewhat silently refactored that into this module.
Thanks @matthew-mizielinski, I'll see if I can compare this to the output I get.
If I am doing this with multiple nodes, I might as well query all of them. Is there a list of urls for all nodes available somewhere? @durack1
Is there a known issue with the DKRZ node today? I have been trying to run the same query for several hours without success today.
Ok it is done 🤗
~8k stores were removed from our catalog here. Running the action again confirms that we now have a catalog containing 0 retractions.
I will enable the dkrz node now and see if that yields some more retractions.
@matthew-mizielinski I also have manually confirmed that all the instance_ids you provided are marked as retracted!
Thanks everyone for the great help with this! I feel really good going into OSM with a properly maintained catalog.
Still on the agenda:
Quick comment about nodes to query, the CMIP6 data citation landing pages point to DKRZ and LLNL nodes. To me, this implies 99% completeness, perhaps. More of a comment only.
Is there a list of urls for all nodes available somewhere? @durack1
The nodes that are "tier 1" include:
- https://esgf-node.llnl.gov/search/cmip6/
- https://esgf-data.dkrz.de/search/cmip6-dkrz/
- https://esgf-index1.ceda.ac.uk/search/cmip6-ceda/

An additional one which is normally reliable:
- https://esgf-node.ipsl.upmc.fr/search/cmip6-ipsl/
These are all listed on https://esgf-node.llnl.gov/projects/cmip6/
Is there a known issue with the DKRZ node today? I have been trying to run the same query for several hours without success today.
This site should be of use https://esgf-node.llnl.gov/status/
Fantastic. Exactly what I was looking for @durack1.
I have just updated the github action in #34 and the script is now successfully querying LLNL, DKRZ, CEDA, and IPSL.
As of now, our Google catalog is up to date with the retraction information. I think the next big ticket item is to do something similar to the AWS catalog. Should we track that progress here or open a new issue and close this?
@jbusecke nice! Do you have some diagnostic output from the LLNL/DKRZ/CEDA/IPSL index scour that points out inconsistencies? It'd be nice to know about problems appearing before they cause issues
@durack1 at the moment I am only displaying the number of retractions found per node, which could be used as a very rough consistency check each time.
I certainly could implement some more logic here to identify the retractions that are unique to each node (is that the info you would be after?).
If that would be of help, I could also set this sort of stuff up in a separate action, and keep track of how these evolve over time. Happy to implement anything that could help upstream!
@jbusecke I wonder how flexible github actions are, if there was an index inconsistency (which appears to be a common problem) then maybe you could fling an email with these stats (this might be useful https://github.com/marketplace/actions/send-email) alerting of an issue rather than having to remind yourself to check
@durack1 I think that is definitely possible. Should this be forwarded to someone at ESGF in the end (I wouldn't quite know what to do about the inconsistency)?
And what information exactly would be useful?
I could think of two ways to distill that information (these would all be instance_ids): 1) compare each node's output with the 'outer' union we already use to retract, which would highlight which retractions are 'missing' from each node; or 2) construct an 'inner' intersection and then see which datasets are uniquely 'additional' to each node.
The information would be similar; I am just wondering which of these would be of more use. From a practical viewpoint, I think 1) would be a bit easier to implement now, but not by a lot.
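Option 1) is a small diff against the outer union (names illustrative, same per-node input shape as the retraction merge):

```python
def missing_per_node(per_node):
    """For each node, list the retractions present in the outer union
    of all nodes but absent from that node's own results (option 1).

    per_node: dict mapping node name -> list of retracted instance_ids.
    """
    union = set().union(*map(set, per_node.values()))
    return {node: sorted(union - set(ids)) for node, ids in per_node.items()}
```

The output directly names which node is behind on which retractions, which seems like the most actionable thing to forward upstream.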
@jbusecke this is great! Are you running this as a cronjob/repeated service? I wonder if inconsistencies are common, or whether these are more uncommon, knowing the answer to this question would be good before figuring out what to do with it
update: and it seems like you are running this daily and the last couple of times it has bombed, right?
@durack1 yes this is run daily at night EST. I just checked and the last 6 times were successful (you somehow mentioned it into working lol). Please check the results out here: https://github.com/pangeo-data/pangeo-cmip6-cloud/actions
Hi,
As mentioned at the last meeting there is a certain amount of Met Office data that I have retracted from the CMIP6 project on ESGF that is listed as available in the Pangeo cloud Zarr holdings on AWS (I have not checked the Google cloud listings). There are a range of reasons why we have retracted data, from faulty coordinate information to issues with the setup of the models, e.g. faulty ancillary input files in some UKESM1 simulations.
While I could provide a list of all datasets with the institution id "MOHC" that I have retracted, I know this could miss data published by other institutions, so I've put together an example of how you could iterate over all datasets to (a) look for connected Errata and (b) attempt to look up the status of datasets on ESGF via the handle service.
Link to demonstration notebook
Note that this method of querying the Errata service, and then the handle service, depends on errata being issued by the modelling groups that own the data, and there are likely some gaps (i.e. retracted datasets for which there is no corresponding errata entry). The existence of an errata entry does not indicate whether affected data has been retracted; there are wont-fix errata which detail an issue that the modelling group does not intend to address, or that does not require the retraction of data, e.g. minor metadata issues and the extension of datasets in time. The method I've used in the example code linked above is not necessarily the quickest or most efficient. If you have access to the files themselves and the tracking_id field from within the global attributes then a similar method could be applied that does not rely on the Errata service.

I am particularly keen, given the popularity of this service, that the status of data presented here is consistent with that in the CMIP6 archives served by ESGF. Let me know if there is any further information that would be helpful here, and please point me in the appropriate direction if this is not the best place for this request.
Matt Mizielinski Climate Data Delivery Manager, Met Office Hadley Centre WGCM Infrastructure Panel co-chair