naomi-henderson opened 3 years ago
@aradhakrishnanGFDL , I just opened an issue on pangeo-forge/cmip6-pipeline
to get our conversation going on the non-contiguous dataset issue. I could provide a listing of the datasets, if that would be useful.
@naomi-henderson, not sure if this is useful, but IPSL have written a nice tool to do time-axis checking: http://prodiguer.github.io/nctime/index.html
@agstephens - very helpful! thanks
Ah, they are reading the netcdf files to get the calendar, but I am trying to use just the file names themselves to see if a file is missing ... Opening the first netcdf file in each dataset would be more reliable, so this may be useful later on
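A minimal sketch of that filename-only check, assuming standard CMIP6 names ending in `_YYYYMM-YYYYMM.nc` for monthly data (the helper names here are mine, not from any existing tool):

```python
import re

def monthly_ranges(filenames):
    """Extract (start, end) month indices (year*12 + month) from filenames."""
    ranges = []
    for name in filenames:
        m = re.search(r"_(\d{4})(\d{2})-(\d{4})(\d{2})\.nc$", name)
        if m:
            y0, m0, y1, m1 = map(int, m.groups())
            ranges.append((y0 * 12 + m0, y1 * 12 + m1))
    return sorted(ranges)

def is_contiguous(filenames):
    """True if each file starts the month right after the previous file ends."""
    ranges = monthly_ranges(filenames)
    return all(start == prev_end + 1
               for (_, prev_end), (start, _) in zip(ranges, ranges[1:]))

files = [
    "thetao_Omon_NorESM2-LM_historical_r2i1p1f1_gr_185001-185912.nc",
    "thetao_Omon_NorESM2-LM_historical_r2i1p1f1_gr_186001-186912.nc",
    "thetao_Omon_NorESM2-LM_historical_r2i1p1f1_gr_188001-188912.nc",  # gap
]
print(is_contiguous(files))  # False: 187001-187912 is missing
```

A calendar-aware check (like nctime's, which reads the files) would handle yearly and daily boundaries and non-standard calendars more robustly.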
This is very helpful and detailed!
Do you have a sense of how best to communicate this to users of these datasets? I am worried that people will make assumptions about what the data should look like and assign blame when it doesn't match their expectations.
For a more concrete example, I care most about ScenarioMIP right now, and not every center/model was run for every ssp. I've sometimes referenced https://pcmdi.llnl.gov/CMIP6/ArchiveStatistics/esgf_data_holdings/ScenarioMIP/index.html for a table on what exists and what doesn't, but it's a little tricky to read. I'm wondering if we should have something like this for the cloud holdings, where Grey = never exists, Blue = Zarr, Green = Netcdf, Purple = Both, Yellow = exists but not on cloud.
Yes, I agree that we do not have very effective ways to communicate to the users! In fact, I even keep forgetting about those tables kept at pcmdi! I like the idea of color coding the cloud holdings - need to keep that in mind!
I think it would be much better to have efficient tools for querying all of the cloud holdings directly! Then we won't have to generate static tables, etc, and worry about keeping them current.
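For instance, a query against the existing Pangeo CMIP6 (zarr) catalog with intake-esm looks like the sketch below; an esgf-world (netcdf) catalog could be opened the same way from its own JSON/CSV:

```python
import intake

# Open the Pangeo CMIP6 zarr catalog.
cat = intake.open_esm_datastore(
    "https://storage.googleapis.com/cmip6/pangeo-cmip6.json"
)

# e.g., which models have monthly near-surface air temperature for ssp585?
subset = cat.search(activity_id="ScenarioMIP", experiment_id="ssp585",
                    table_id="Amon", variable_id="tas")
print(sorted(subset.df["source_id"].unique()))
```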
@aradhakrishnanGFDL, I have put 3 lists of non-contiguous datasets (yearly, monthly and daily) into our S3 bucket:
There is also a python notebook for checking the differences between the current S3 zarr and S3 netcdf buckets:
For example:
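A minimal sketch of the comparison idea (not the notebook itself), assuming both holdings have been crawled into CSV catalogs with the usual facet columns; the file and column names here are illustrative:

```python
import pandas as pd

# Illustrative file names; the real catalogs split each dataset
# identifier into facet columns like these.
facets = ["activity_id", "institution_id", "source_id", "experiment_id",
          "member_id", "table_id", "variable_id", "grid_label"]

zarr = pd.read_csv("pangeo-cmip6.csv")[facets].drop_duplicates()
nc = pd.read_csv("esgf-world.csv.gz")[facets].drop_duplicates()

zarr_ids = set(map(tuple, zarr.values))
nc_ids = set(map(tuple, nc.values))

print("zarr only:  ", len(zarr_ids - nc_ids))
print("netcdf only:", len(nc_ids - zarr_ids))
print("in both:    ", len(zarr_ids & nc_ids))
```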
Hi @naomi-henderson Great. Thank you. I will plug in the esgf-world csv from https://cmip6-nc.s3.us-east-2.amazonaws.com/esgf-world.csv.gz (it will be refreshed again this week). Just to clarify, the three non-contiguous lists you provided are those that I will need to exclude from the esgf-world csv before another round of comparison from my end? Thanks,
Hi @aradhakrishnanGFDL , good. I didn't bother to exclude the non-contiguous datasets since there were not so many. I just thought it might give a better idea of the issues.
The esgf-world csv used in the notebook is fairly recent, March 15, I think, and I had crawled the 'esgf-world' bucket to create it. Is there also a 'cmip6-nc' bucket? Perhaps I used the wrong bucket?
Ok, sounds good @naomi-henderson . You did use the right bucket: esgf-world. cmip6-nc is just the bucket with the intake catalogs and such. It could use a better name! I just updated the CSV https://cmip6-nc.s3.us-east-2.amazonaws.com/esgf-world.csv.gz as well, not sure if the results would change drastically. I used a quick script at https://github.com/aradhakrishnanGFDL/CatalogBuilder/blob/master/gen_intake_s3.py to generate the catalog.
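The crawling idea in that script reduces to something like this sketch (assuming anonymous access and the standard CMIP6 directory layout in the bucket; this is not the actual gen_intake_s3.py logic):

```python
import csv
import s3fs

fs = s3fs.S3FileSystem(anon=True)

# activity/institution/source/experiment/member/table/variable/grid/version/file
paths = fs.glob("esgf-world/CMIP6/*/*/*/*/*/*/*/*/*/*.nc")

facets = ["activity_id", "institution_id", "source_id", "experiment_id",
          "member_id", "table_id", "variable_id", "grid_label", "version"]
with open("esgf-world.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(facets + ["path"])
    for p in paths:
        # Each path encodes the facets of its dataset identifier.
        writer.writerow(p.split("/")[2:-1] + ["s3://" + p])
```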
Quick update and info: here is the slightly modified comparison notebook using the latest esgf-world catalog. The catalog still does not account for the time discontinuity. But we are planning to incorporate the check, to some extent, into our UDA (internal to GFDL) and S3 sanity-checker script, though the details are yet to be determined (e.g., querying the ESGF API or THREDDS to see whether a file is missing there as well, to account for the cases you described).
Each CMIP6 dataset in the ESGF-CoG nodes consists of an identifier (e.g., CMIP6.CMIP.NCC.NorESM2-LM.historical.r2i1p1f1.Omon.thetao.gr) and a version (e.g., 20190920), as seen, for example, here:
When we look at this dataset, we normally start by concatenating the netcdf files in time (here there are 17), using, for example, xarray's 'open_mfdataset'.
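A minimal sketch of that step for the dataset above, assuming anonymous S3 access (the exact version directory is illustrative):

```python
import s3fs
import xarray as xr

fs = s3fs.S3FileSystem(anon=True)
files = sorted(fs.glob(
    "esgf-world/CMIP6/CMIP/NCC/NorESM2-LM/historical/r2i1p1f1"
    "/Omon/thetao/gr/v20190920/*.nc"
))

# Concatenate all 17 files along the time axis; use_cftime avoids
# decoding problems with non-standard model calendars.
ds = xr.open_mfdataset(
    [fs.open(f) for f in files],
    engine="h5netcdf", combine="by_coords", use_cftime=True,
)
```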
The problem comes when the netcdf files are not contiguous, so that the resulting xarray dataset has an incomplete time grid. Some gaps are relatively easy to spot; for example, if just one of five files is missing, it might be obvious that there is a problem.
Example 1: S3 has 4 netcdf files, 5 are needed for continuity
The real problem comes when there are many files and just one is missing.
Example 2: S3 has 85 netcdf files, 86 are needed for continuity
In these two examples, the missing netcdf files do exist; they just have not made it into the bucket. There are many other examples in which the missing files are simply unavailable, by some oversight. For others, the files were never meant to be uploaded: particular experiments are often reported (by some, though not all, modeling centers) for just a subset of the run time. For example, some of the 'abrupt-4xCO2' datasets only report one chunk at the beginning of the experiment (the adjustment phase) and one chunk at the end (equilibrium). So I have allowed discontinuities in the 'abrupt-4xCO2' datasets (legitimate or not). Some datasets seem to have one year of daily data for only a subset of the years, so there are many discontinuities.
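In a checker built on the filename sketch earlier in this thread, that exception amounts to an allow-list (hypothetical names again):

```python
# Experiments where gaps are expected and should not be flagged;
# builds on is_contiguous() from the earlier sketch.
ALLOWED_GAPS = {"abrupt-4xCO2"}

def flag_noncontiguous(experiment_id, filenames):
    """Flag a dataset only if it has gaps and is not an exempt experiment."""
    return experiment_id not in ALLOWED_GAPS and not is_contiguous(filenames)
```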
So here are some questions for opening this issue:
A cursory check of the current contents of the 's3://esgf-world/CMIP6' collection of netcdf files shows the following for the 212,299 datasets (collections of netcdf files) currently in the bucket, where 'total' is the number of datasets at the given frequency and 'non-contiguous' is the number of those datasets whose set of netcdf files is non-contiguous. I didn't check the hourly and sub-hourly datasets, since my crude method of using the netcdf file names to infer missing days is less reliable for sub-daily data.