pangeo-forge / cmip6-pipeline

Pipeline for cloud-based CMIP6 data ingestion
Apache License 2.0

Issue with CanESM5-r*i1p1f1 ssp245 Omon data #21

Open · swartn opened this issue 3 years ago

swartn commented 3 years ago

There is an issue loading data for CanESM5 r*i1p1f1 for ssp245 from the GCS catalogue. Loading the original ESGF data works as expected. This appears to be an issue for variables in the Omon table. The same variables for different experiments / r1i1p2f1 work as expected.

It is not immediately obvious what the issue is. The basic metadata seems ok, but any attempt to load / plot the data fails (verified across multiple users and machines). Could this be an issue in the conversion from netcdf?

Thanks!

A minimal working example:

cat_url = "https://storage.googleapis.com/cmip6/pangeo-cmip6.json"       # use this outside of CCCma / for public data
col = intake.open_esm_datastore(cat_url)
query = dict(experiment_id=['ssp245'], table_id=['Omon'], member_id='r1i1p1f1',
             variable_id=['fgco2'], source_id='CanESM5')
cat = col.search(**query)

ds_g = cat.to_dataset_dict(zarr_kwargs={"consolidated": True, "decode_times": True}, cdf_kwargs={'chunks': {'time':1032}})['ScenarioMIP.CCCma.CanESM5.ssp245.Omon.gn']

ds_g.load()

ValueError                                Traceback (most recent call last)
...
ValueError: destination buffer too small; expected at least 838080, got 419040

Also see https://github.com/swartn/canesm5_pangeo_issues/blob/main/canesm5_r1i1p1f1_pangeo_issue.ipynb
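
For anyone debugging a similar traceback, one approach (not from the original thread) is to bypass intake-esm and open the zarr store directly, loading one time step at a time to locate the chunk that fails to decompress. This is only a sketch: it assumes the catalogue's zstore column holds the GCS path of the store, that gcsfs is available for anonymous access, and it reuses the cat object from the example above.

# Hypothetical debugging sketch: open the zarr store behind the failing
# dataset directly and load one time step at a time to find the bad chunk.
import gcsfs
import xarray as xr

zstore = cat.df['zstore'].values[0]              # GCS path of the zarr store
fs = gcsfs.GCSFileSystem(token='anon')
ds = xr.open_zarr(fs.get_mapper(zstore), consolidated=True)

for t in range(ds.sizes['time']):
    try:
        ds['fgco2'].isel(time=t).load()
    except ValueError as err:
        print(f"time index {t} failed: {err}")
        break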

naomi-henderson commented 3 years ago

@swartn, thanks for logging the problem! I couldn't figure out exactly what was wrong, so I deleted all of the ScenarioMIP/CCCma/CanESM5/ssp245/r[1-9]i1p1f1/Omon/fgco2/ GC zarr datasets and regenerated them. They no longer seem to raise the 'destination buffer too small' error. Please give it a try.

swartn commented 3 years ago

Thank you @naomi-henderson. It now works as expected for fgco2, but other variables under Omon still seem to have the issue. I looped over all of them and found failures for the following:

['uo', 'tauvo', 'tos', 'so', 'sos', 'tauuo', 'hfds', 'mlotst', 'zos']

This list was produced with:

# Test all Omon variables from ssp245 for failure to load
import intake

cat_url = "https://storage.googleapis.com/cmip6/pangeo-cmip6.json"
col = intake.open_esm_datastore(cat_url)
query = dict(experiment_id=['ssp245'], table_id=['Omon'], member_id='r1i1p1f1',
             source_id='CanESM5')
cat = col.search(**query)

failed_list = []
for variable in cat.df['variable_id']:
    print(f"trying {variable}")
    cat2 = cat.search(variable_id=variable)
    ds_g = cat2.to_dataset_dict()['ScenarioMIP.CCCma.CanESM5.ssp245.Omon.gn']
    try:
        ds_g.load()
    except Exception:
        print(f"Failed on {variable}")
        failed_list.append(variable)
    ds_g.close()
print(failed_list)

naomi-henderson commented 3 years ago

Okay, @swartn, I think I have figured this out. When I now regenerate the affected datasets, the new metadata has "creation_date": "2019-05-01T02:35:08Z", whereas the old had "creation_date": "2019-03-14T05:23:12Z". When the CanESM5 folks replaced the NetCDF files, they correctly gave the new data a new version label, but, due to my clumsy handling of versions back in the early days of the GC zarr collection, I had given the old datasets the version label of the new datasets rather than the old one. So, although we routinely check for new versions, these particular datasets were never flagged for replacement.

I am now deleting and regenerating the affected datasets. Thank you very much for clearly identifying the affected zarr stores!
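
As an aside (not part of the original exchange), the creation_date comparison above can be automated by reading the global attributes straight from each zarr store. The sketch below assumes the cat object from the loop earlier, anonymous GCS access via gcsfs, and that each store carries the standard CMIP6 creation_date attribute.

# Illustrative sketch: print the creation_date of every store matching the
# previously failing variables, so old and regenerated versions can be told apart.
import gcsfs
import xarray as xr

fs = gcsfs.GCSFileSystem(token='anon')
failed_list = ['uo', 'tauvo', 'tos', 'so', 'sos', 'tauuo', 'hfds', 'mlotst', 'zos']

for variable in failed_list:
    for zstore in cat.search(variable_id=variable).df['zstore']:
        ds = xr.open_zarr(fs.get_mapper(zstore), consolidated=True)
        print(variable, zstore, ds.attrs.get('creation_date'))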