naomi-henderson opened 3 years ago
@aradhakrishnanGFDL , I just opened an issue on pangeo-forge/cmip6-pipeline
to get our conversation going on the non-contiguous dataset issue. I could provide a listing of the datasets, if that would be useful.
@naomi-henderson, not sure if this is useful, but IPSL have written a nice tool to do time-axis checking: http://prodiguer.github.io/nctime/index.html
@agstephens - very helpful! thanks
Ah, they are reading the netcdf files to get the calendar, but I am trying to use just the file names themselves to see if a file is missing ... Opening the first netcdf file in each dataset would be more reliable, so this may be useful later on
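A minimal sketch of that filename-only check, assuming standard CMIP6 names ending in `_YYYYMM-YYYYMM.nc` for monthly data (the helper names here are mine, not from any existing tool):

```python
import re

def monthly_ranges(filenames):
    """Extract (start, end) month indices (year*12 + month) from filenames."""
    ranges = []
    for name in filenames:
        m = re.search(r"_(\d{4})(\d{2})-(\d{4})(\d{2})\.nc$", name)
        if m:
            y0, m0, y1, m1 = map(int, m.groups())
            ranges.append((y0 * 12 + m0, y1 * 12 + m1))
    return sorted(ranges)

def is_contiguous(filenames):
    """True if each file starts the month right after the previous file ends."""
    ranges = monthly_ranges(filenames)
    return all(start == prev_end + 1
               for (_, prev_end), (start, _) in zip(ranges, ranges[1:]))

files = [
    "thetao_Omon_NorESM2-LM_historical_r2i1p1f1_gr_185001-185912.nc",
    "thetao_Omon_NorESM2-LM_historical_r2i1p1f1_gr_186001-186912.nc",
    "thetao_Omon_NorESM2-LM_historical_r2i1p1f1_gr_188001-188912.nc",  # gap
]
print(is_contiguous(files))  # False: 187001-187912 is missing
```

A calendar-aware check (like nctime's, which reads the files) would handle yearly and daily boundaries and non-standard calendars more robustly.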
This is very helpful and detailed!
Do you have a sense of how best to communicate this to users of these datasets? I am worried that people will make assumptions about what the data should look like and assign blame when it doesn't match their expectations.
For a more concrete example, I care most about ScenarioMIP right now, and not every center/model was run for every ssp. I've sometimes referenced https://pcmdi.llnl.gov/CMIP6/ArchiveStatistics/esgf_data_holdings/ScenarioMIP/index.html for a table on what exists and what doesn't, but it's a little tricky to read. I'm wondering if we should have something like this for the cloud holdings, where Grey = never exists, Blue = Zarr, Green = Netcdf, Purple = Both, Yellow = exists but not on cloud.
Yes, I agree that we do not have very effective ways to communicate to the users! In fact, I even keep forgetting about those tables kept at pcmdi! I like the idea of color coding the cloud holdings - need to keep that in mind!
I think it would be much better to have efficient tools for querying all of the cloud holdings directly! Then we won't have to generate static tables, etc, and worry about keeping them current.
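For instance, a query against the existing Pangeo CMIP6 (zarr) catalog with intake-esm looks like the sketch below; an esgf-world (netcdf) catalog could be opened the same way from its own JSON/CSV:

```python
import intake

# Open the Pangeo CMIP6 zarr catalog.
cat = intake.open_esm_datastore(
    "https://storage.googleapis.com/cmip6/pangeo-cmip6.json"
)

# e.g., which models have monthly near-surface air temperature for ssp585?
subset = cat.search(activity_id="ScenarioMIP", experiment_id="ssp585",
                    table_id="Amon", variable_id="tas")
print(sorted(subset.df["source_id"].unique()))
```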
@aradhakrishnanGFDL, I have put 3 lists of non-contiguous datasets (yearly, monthly and daily) into our S3 bucket:
There is also a python notebook for checking the differences between the current S3 zarr and S3 netcdf buckets:
For example:
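A minimal sketch of the comparison idea (not the notebook itself), assuming both holdings have been crawled into CSV catalogs with the usual facet columns; the file and column names here are illustrative:

```python
import pandas as pd

# Illustrative file names; the real catalogs split each dataset
# identifier into facet columns like these.
facets = ["activity_id", "institution_id", "source_id", "experiment_id",
          "member_id", "table_id", "variable_id", "grid_label"]

zarr = pd.read_csv("pangeo-cmip6.csv")[facets].drop_duplicates()
nc = pd.read_csv("esgf-world.csv.gz")[facets].drop_duplicates()

zarr_ids = set(map(tuple, zarr.values))
nc_ids = set(map(tuple, nc.values))

print("zarr only:  ", len(zarr_ids - nc_ids))
print("netcdf only:", len(nc_ids - zarr_ids))
print("in both:    ", len(zarr_ids & nc_ids))
```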
Hi @naomi-henderson Great. Thank you. I will plug in the esgf-world csv from https://cmip6-nc.s3.us-east-2.amazonaws.com/esgf-world.csv.gz (it will be refreshed again this week). Just to clarify, the three non-contiguous lists you provided are those that I will need to exclude from the esgf-world csv before another round of comparison from my end? Thanks,
Hi @aradhakrishnanGFDL , good. I didn't bother to exclude the non-contiguous datasets since there were not so many. I just thought it might give a better idea of the issues.
The esgf-world csv used in the notebook is fairly recent, March 15, I think, and I had crawled the 'esgf-world' bucket to create it. Is there also a 'cmip6-nc' bucket? Perhaps I used the wrong bucket?
Ok, sounds good @naomi-henderson . You did use the right bucket: esgf-world. cmip6-nc is just the bucket with the intake catalogs and such. It could use a better name! I just updated the CSV https://cmip6-nc.s3.us-east-2.amazonaws.com/esgf-world.csv.gz as well, not sure if the results would change drastically. I used a quick script at https://github.com/aradhakrishnanGFDL/CatalogBuilder/blob/master/gen_intake_s3.py to generate the catalog.
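The crawling idea in that script reduces to something like this sketch (assuming anonymous access and the standard CMIP6 directory layout in the bucket; this is not the actual gen_intake_s3.py logic):

```python
import csv
import s3fs

fs = s3fs.S3FileSystem(anon=True)

# activity/institution/source/experiment/member/table/variable/grid/version/file
paths = fs.glob("esgf-world/CMIP6/*/*/*/*/*/*/*/*/*/*.nc")

facets = ["activity_id", "institution_id", "source_id", "experiment_id",
          "member_id", "table_id", "variable_id", "grid_label", "version"]
with open("esgf-world.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(facets + ["path"])
    for p in paths:
        # Each path encodes the facets of its dataset identifier.
        writer.writerow(p.split("/")[2:-1] + ["s3://" + p])
```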
Quick update and info: here is the slightly modified comparison notebook using the latest esgf-world catalog. The catalog still does not account for the time discontinuity. But we are planning to incorporate the check, to some extent, into our UDA (internal to GFDL) and S3 sanity-checker script, though the details are yet to be determined (e.g., querying the ESGF API or THREDDS to see whether a file is missing there as well, to account for the cases you described).
Each CMIP6 dataset in the ESGF-CoG nodes consists of an identifier (e.g., CMIP6.CMIP.NCC.NorESM2-LM.historical.r2i1p1f1.Omon.thetao.gr) and a version (e.g., 20190920), as seen, for example, here:
When we look at this dataset, we normally start by concatenating the netcdf files in time (here there are 17), using, for example, xarray's 'open_mfdataset'.
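A minimal sketch of that step for the dataset above, assuming anonymous S3 access (the exact version directory is illustrative):

```python
import s3fs
import xarray as xr

fs = s3fs.S3FileSystem(anon=True)
files = sorted(fs.glob(
    "esgf-world/CMIP6/CMIP/NCC/NorESM2-LM/historical/r2i1p1f1"
    "/Omon/thetao/gr/v20190920/*.nc"
))

# Concatenate all 17 files along the time axis; use_cftime avoids
# decoding problems with non-standard model calendars.
ds = xr.open_mfdataset(
    [fs.open(f) for f in files],
    engine="h5netcdf", combine="by_coords", use_cftime=True,
)
```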
The problem comes when the netcdf files are not contiguous, so that the resulting xarray dataset has an incomplete time grid. Some gaps are relatively easy to spot; for example, if just one of five files is missing, it might be obvious that there is a problem.
Example 1: S3 has 4 netcdf files, 5 are needed for continuity
The real problem comes when there are many files and just one is missing.
Example 2: S3 has 85 netcdf files, 86 are needed for continuity
In these two examples, the missing netcdf files do exist; they just have not made it into the bucket. There are many other examples in which the missing files are simply unavailable, by some oversight. For others, the files were never meant to be uploaded: particular experiments are often reported (by some, though not all, modeling centers) for just a subset of the run time. For example, some of the 'abrupt-4xCO2' datasets only report one chunk at the beginning of the experiment (the adjustment phase) and one chunk at the end (equilibrium). So I have allowed discontinuities in the 'abrupt-4xCO2' datasets (legitimate or not). Some datasets seem to have one year of daily data for only a subset of the years, so there are many discontinuities.
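In a checker built on the filename sketch earlier in this thread, that exception amounts to an allow-list (hypothetical names again):

```python
# Experiments where gaps are expected and should not be flagged;
# builds on is_contiguous() from the earlier sketch.
ALLOWED_GAPS = {"abrupt-4xCO2"}

def flag_noncontiguous(experiment_id, filenames):
    """Flag a dataset only if it has gaps and is not an exempt experiment."""
    return experiment_id not in ALLOWED_GAPS and not is_contiguous(filenames)
```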
So here are some questions for opening this issue:
A cursory check of the current contents of the 's3://esgf-world/CMIP6' collection of netcdf files shows the following for the 212,299 datasets (collections of netcdf files) currently in the bucket, where 'total' is the number of datasets at the given frequency and 'non-contiguous' is the number of those datasets whose set of netcdf files is non-contiguous. I didn't check the hourly and sub-hourly datasets, since my crude method of using the netcdf file names to infer missing days is less reliable for sub-daily data.