finding various implementation errors in the netcdf file tracking_ids

pangeo-forge / cmip6-pipeline

Pipeline for cloud-based CMIP6 data ingestion

Apache License 2.0


The dataset version has always caused trouble in our CMIP6 pipeline. It is the only DRS (Data Reference Syntax) element that is not stored in the netCDF file's metadata. However, we rely on the version to keep track of datasets that have been modified. I have been using the tracking_id to look up the version through the handle service (http://hdl.handle.net / https://handle-esgf.dkrz.de), with partial success.

But many implementation errors pop up. The gs://cmip6 zarr datasets have tracking_ids that are concatenations of the tracking_ids of the netCDF files from which each dataset was aggregated. In a perfect world, each of these tracking_ids would correspond to exactly one netCDF file, and each netCDF file to exactly one version. So I am collecting and categorizing the various issues and trying to come up with some sensible workarounds. I will be collecting them here, if you want to help ...
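To make the "perfect world" condition concrete, here is a minimal sketch of how the checks could be organized. It is not the pipeline's actual code. It assumes the concatenated tracking_ids are newline-separated (the separator is a parameter in case they are not), that each id follows the CMIP6 form `hdl:21.14100/<uuid>`, and that a hypothetical mapping `versions_by_track` from tracking_id to the versions reported for it has already been built by some other lookup. The ids and versions in the example are made up.

```python
import re
from collections import Counter

# CMIP6 tracking_ids have the form "hdl:21.14100/<uuid>".
TRACKING_ID_RE = re.compile(
    r"^hdl:21\.14100/"
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$",
    re.IGNORECASE,
)


def split_tracking_ids(concatenated, sep="\n"):
    """Split a store-level tracking_id attribute into per-file ids.

    Assumption: the ids are joined with `sep` (newline by default).
    """
    return [t.strip() for t in concatenated.split(sep) if t.strip()]


def check_tracking_ids(tracks, versions_by_track):
    """Sort tracking_ids into issue categories.

    versions_by_track is a hypothetical mapping
    {tracking_id: [version, ...]} built by a separate lookup.
    """
    counts = Counter(tracks)
    return {
        # ids that do not look like CMIP6 handles
        "malformed": sorted(t for t in counts if not TRACKING_ID_RE.match(t)),
        # ids that appear more than once in the same store
        "duplicated": sorted(t for t, n in counts.items() if n > 1),
        # ids for which the lookup returned nothing
        "no_version": sorted(t for t in counts if not versions_by_track.get(t)),
        # ids that map to more than one distinct version
        "multiple_versions": sorted(
            t for t in counts if len(set(versions_by_track.get(t, []))) > 1
        ),
    }


# Made-up ids and versions, for illustration only.
a = "hdl:21.14100/11111111-2222-3333-4444-555555555555"
b = "hdl:21.14100/aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"
tracks = split_tracking_ids(a + "\n" + b + "\n" + a + "\nnot-a-handle")
report = check_tracking_ids(
    tracks, {a: ["20190514"], b: ["20190514", "20200101"]}
)
for category, ids in report.items():
    print(category, ids)
```

A store passes only if every category in the report is empty. Keeping the categories separate makes it easier to tally which kind of problem is most common before deciding on a workaround.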

    import pandas as pd

    # gsurl2tracks, tracks2version, gsurl2search and esgf_search are
    # helper functions that are not shown in this issue.

    # Version recorded in the Google Cloud catalog
    dfcat = pd.read_csv('https://cmip6.storage.googleapis.com/cmip6-zarr-consolidated-stores-noQC.csv', dtype='unicode')
    gsurl = 'gs://cmip6/CMIP/NCAR/CESM2/historical/r11i1p1f1/Oyr/expc/gr/'
    version_cat = dfcat[dfcat.zstore == gsurl].version.values[0]
    print('current version from GC catalog = ', version_cat)

    # Version resolved from the tracking_ids through the handle service
    tracks = gsurl2tracks(gsurl)
    (version, jdict) = tracks2version(tracks)
    print('latest version from handler = ', version)

    # Version(s) currently listed by ESGF
    asearch = gsurl2search(gsurl)
    dfs = esgf_search(asearch, toFilter=False)
    version_ESGF = list(set(dfs.version_id))
    print('version(s) available from ESGF = ', version_ESGF)
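The three values printed by the snippet above (catalog, handler, ESGF) still have to be compared by eye. The sketch below shows one way that comparison might be automated. It is an illustration only: `normalize_version` and `compare_versions` are hypothetical names, and it assumes that version labels differ at most by a leading `v` and otherwise sort correctly as strings (as same-length date stamps do). The input values are made up.

```python
def normalize_version(v):
    """Strip an optional leading 'v' so labels from different sources compare.

    Assumption: the sources differ only by that prefix.
    """
    v = str(v).strip()
    return v[1:] if v[:1] in ("v", "V") else v


def compare_versions(version_cat, version_handler, versions_esgf):
    """Summarize how the catalog version relates to the other two sources."""
    cat = normalize_version(version_cat)
    handler = normalize_version(version_handler)
    esgf = sorted({normalize_version(v) for v in versions_esgf})
    # Assumption: string order matches chronological order.
    latest_esgf = esgf[-1] if esgf else None
    return {
        "catalog_matches_handler": cat == handler,
        "catalog_on_esgf": cat in esgf,
        "catalog_is_latest_on_esgf": latest_esgf is not None and cat == latest_esgf,
        "latest_esgf": latest_esgf,
    }


# Made-up values, for illustration only.
result = compare_versions("20190514", "v20200101", ["v20190514", "v20200101"])
print(result)
```

In this made-up case the catalog version is still available on ESGF but is no longer the latest, which is the situation where a store would need to be refreshed.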


finding various implementation errors in the netcdf file tracking_ids #11