leap-stc / cmip6-leap-feedstock

Apache License 2.0

[REQUEST]: MPI-ESM1-2-HR historical #116

Closed kareed1 closed 5 months ago

kareed1 commented 7 months ago

List of requested iids

'CMIP6.CMIP.MPI-M.MPI-ESM1-2-HR.historical.r1i1p1f1.Amon.tas.gn.v20190710',

Description

Hello! On both Google Cloud and AWS, the dataset noted above appears to contain only the years 1915-1959. I'm not sure if this was intentional. I'd like to request that data for Jan 1985-Dec 2014 be added to the repositories. Thank you for making CMIP6 data easier to access!

jbusecke commented 7 months ago

Hi @kareed1,

thanks for raising an issue here!

I assume you are still using the 'old' catalog file here. Can you provide some more information (a small code snippet) on how you are currently accessing the data?

The new catalog (more info on how to access it) does not seem to contain that iid:

def zstore_to_iid(zstore: str):
    # this is a bit wacky to account for the different path layouts of old/new stores
    return '.'.join(zstore.replace('gs://','').replace('.zarr','').replace('.','/').split('/')[-11:-1])

iids_requested = [
'CMIP6.CMIP.MPI-M.MPI-ESM1-2-HR.historical.r1i1p1f1.Amon.tas.gn.v20190710',
]

import intake
# uncomment/comment lines to swap catalogs
url = "https://storage.googleapis.com/cmip6/cmip6-pgf-ingestion-test/catalog/catalog.json"
col = intake.open_esm_datastore(url)

iids_all = [zstore_to_iid(z) for z in col.df['zstore'].tolist()]
iids_uploaded = [iid for iid in iids_all if iid in iids_requested]
iids_uploaded

gives an empty list.
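For reference, here is what that conversion does on a hypothetical old-style store path (the path below is made up to illustrate the directory layout; only the helper function is taken from the snippet above):

```python
def zstore_to_iid(zstore: str):
    # account for the different path layouts of old/new stores
    return '.'.join(zstore.replace('gs://', '').replace('.zarr', '').replace('.', '/').split('/')[-11:-1])

# hypothetical old-style path: the iid facets are encoded as directory levels
old_style = 'gs://cmip6/CMIP6/CMIP/MPI-M/MPI-ESM1-2-HR/historical/r1i1p1f1/Amon/tas/gn/v20190710/'
print(zstore_to_iid(old_style))
# CMIP6.CMIP.MPI-M.MPI-ESM1-2-HR.historical.r1i1p1f1.Amon.tas.gn.v20190710
```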

I will add this to the ingestion and see what we get.

jbusecke commented 7 months ago

See my comments in #119: this seems to require some deeper debugging, unfortunately. We'll get to the bottom of this eventually!

kareed1 commented 7 months ago

Hi @jbusecke ,

Thank you for the updates and your assistance on this. My code probably does use the old catalog; I found some example code online to get started, so I'm not sure how old it was. Below is an example of the Python code I'm using.

import numpy as np
import pandas as pd
import xarray as xr
import zarr
import gcsfs

#available datasets on Google Cloud
df = pd.read_csv('https://storage.googleapis.com/cmip6/cmip6-zarr-consolidated-stores.csv')

#access to GC data sets
gcs = gcsfs.GCSFileSystem(token='anon')

#query the table
df_atm = df.query("table_id      == 'Amon' & \
                   source_id     ==  'MPI-ESM1-2-HR' & \
                   variable_id   == 'tas'  & \
                   experiment_id == 'historical' & \
                   member_id     == 'r1i1p1f1'")

#retrieve data from Google cloud
var_path = df_atm.zstore.values[0]             #pathway dataset on Google Cloud
mapper = gcs.get_mapper(var_path)              #dataset object
dat = xr.open_zarr(mapper)                     #open the dataset

jbusecke commented 7 months ago

Cool, thanks for that info. That all looks good, but I recommend using https://storage.googleapis.com/cmip6/cmip6-pgf-ingestion-test/catalog/pangeo_esgf_zarr_qc.csv going forward (https://storage.googleapis.com/cmip6/cmip6-pgf-ingestion-test/catalog/catalog.json points to that!).
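Swapping catalogs should then just mean changing the URL passed to pd.read_csv. Here is a minimal offline sketch of the same query pattern against a mock catalog table (the facet column names mirror the snippet above; whether the QC CSV carries exactly these columns is an assumption):

```python
import pandas as pd

# mock catalog table standing in for the QC catalog CSV; the real file has
# the same kind of facet columns plus a zstore column pointing at each store
df = pd.DataFrame({
    'source_id':     ['MPI-ESM1-2-HR', 'MPI-ESM1-2-HR', 'GFDL-CM4'],
    'experiment_id': ['historical',    'ssp585',        'historical'],
    'member_id':     ['r1i1p1f1',      'r1i1p1f1',      'r1i1p1f1'],
    'table_id':      ['Amon',          'Amon',          'Amon'],
    'variable_id':   ['tas',           'tas',           'tas'],
    'zstore':        ['gs://a.zarr',   'gs://b.zarr',   'gs://c.zarr'],
})

# the same query pattern as in the snippet above works unchanged
df_atm = df.query(
    "table_id == 'Amon' & source_id == 'MPI-ESM1-2-HR' & "
    "variable_id == 'tas' & experiment_id == 'historical' & member_id == 'r1i1p1f1'"
)
print(df_atm.zstore.values[0])
```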

jbusecke commented 6 months ago

I have high hopes that a solution to https://github.com/jbusecke/pangeo-forge-esgf/issues/42 will address this issue too.

jbusecke commented 5 months ago

OK, the dataset was ingested, but it ended up in our non-qc catalog.

I did some digging:

# you need specific versions of the following libraries to reproduce the following on the LEAP-Pangeo hub
pip install leap-data-management-utils[pangeo-forge] git+https://github.com/jbusecke/pangeo-forge-esgf.git@new-request-scheme

Let's load the store and run our tests (failing these causes a dataset to be put in our non-qc catalog):

import zarr
from pangeo_forge_esgf.utils import facets_from_iid
from leap_data_management_utils.cmip_testing import test_all
import intake
# uncomment/comment lines to swap catalogs
url = "https://storage.googleapis.com/cmip6/cmip6-pgf-ingestion-test/catalog/catalog_noqc.json" # Only stores that fail current QC checks
col = intake.open_esm_datastore(url)
iid = 'CMIP6.CMIP.MPI-M.MPI-ESM1-2-HR.historical.r1i1p1f1.Amon.tas.gn.v20190710'
facets = facets_from_iid(iid)
del facets['mip_era']
cat = col.search(**facets)
store = zarr.storage.FSStore(cat.df['zstore'].tolist()[0])
test_all(store, iid)

gives

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[24], line 15
     12 store = zarr.storage.FSStore(cat.df['zstore'].tolist()[0])
     13 # ds = xr.open_dataset(store, engine='zarr')
     14 # ds
---> 15 test_all(store, iid)

File /srv/conda/envs/notebook/lib/python3.11/site-packages/leap_data_management_utils/cmip_testing.py:72, in test_all(store, iid, verbose)
     70 def test_all(store: zarr.storage.FSStore, iid: str, verbose=True) -> zarr.storage.FSStore:
     71     ds = test_open_store(store, verbose=verbose)
---> 72     test_time(ds, verbose=verbose)
     73     test_attributes(ds, iid, verbose=verbose)
     74     return store

File /srv/conda/envs/notebook/lib/python3.11/site-packages/leap_data_management_utils/cmip_testing.py:49, in test_time(ds, verbose)
     47 if verbose:
     48     print(time_diff)
---> 49 assert (time_diff > 0).all()
     51 # assert that there are no large time gaps
     52 mean_time_diff = time_diff.mean()

AssertionError:

So the time is not continuous!

we can confirm that

import matplotlib.pyplot as plt
import xarray as xr
ds = xr.open_dataset(store, engine='zarr')
plt.plot(ds.time)  # note: avoid the built-in ds.time.plot() here, since it plots time against itself (rather than the array index) and makes the axis look continuous
[plot of ds.time against array index: the values jump back down partway through, confirming the time axis is out of order]

Yeah, that's not great... but it's fixable!

plt.plot(ds.sortby('time').time)

so @kareed1 you can use the above to work with the dataset for now.
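For reference, the failing check and the sortby workaround boil down to the following, sketched with plain NumPy on a synthetic out-of-order time axis (the real QC test operates on the dataset's decoded time coordinate):

```python
import numpy as np

# synthetic monthly time axis with two blocks swapped, mimicking the broken store
time = np.concatenate([np.arange(100, 200), np.arange(0, 100)])

# the QC check: every consecutive time difference must be positive
time_diff = np.diff(time)
print((time_diff > 0).all())  # False: the axis is not monotonically increasing

# sorting (what ds.sortby('time') does along the time dimension) repairs it
time_sorted = np.sort(time)
print((np.diff(time_sorted) > 0).all())  # True
```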

I want to understand how this happened though...

from pangeo_forge_esgf.client import ESGFClient
iid = 'CMIP6.CMIP.MPI-M.MPI-ESM1-2-HR.historical.r1i1p1f1.Amon.tas.gn.v20190710'
client = ESGFClient()
dataset_id = client.get_instance_id_input([iid])[iid]['id']
file_dict = client.get_recipe_inputs_from_dataset_ids([dataset_id])
list(file_dict[iid].keys())

this seems fine

'tas_Amon_MPI-ESM1-2-HR_historical_r1i1p1f1_gn_197501-197912.nc',
 'tas_Amon_MPI-ESM1-2-HR_historical_r1i1p1f1_gn_198001-198412.nc',
 'tas_Amon_MPI-ESM1-2-HR_historical_r1i1p1f1_gn_198501-198912.nc',
 'tas_Amon_MPI-ESM1-2-HR_historical_r1i1p1f1_gn_199001-199412.nc',
 'tas_Amon_MPI-ESM1-2-HR_historical_r1i1p1f1_gn_199501-199912.nc',
 'tas_Amon_MPI-ESM1-2-HR_historical_r1i1p1f1_gn_200001-200412.nc',
 'tas_Amon_MPI-ESM1-2-HR_historical_r1i1p1f1_gn_200501-200912.nc',
 'tas_Amon_MPI-ESM1-2-HR_historical_r1i1p1f1_gn_201001-201412.nc'

My first suspicion was that the files are not correctly concatenated, but that might not be it. Will dig some more and follow up.

jbusecke commented 5 months ago

Oh wait, this is not a complete set of files! How strange.
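One way to see that programmatically: parse the YYYYMM-YYYYMM ranges out of the returned filenames and check both internal contiguity and where coverage starts (the CMIP6 historical experiment nominally begins in 1850). An illustrative sketch, not part of the actual ingestion tooling:

```python
import re

# the file list returned above
filenames = [
    'tas_Amon_MPI-ESM1-2-HR_historical_r1i1p1f1_gn_197501-197912.nc',
    'tas_Amon_MPI-ESM1-2-HR_historical_r1i1p1f1_gn_198001-198412.nc',
    'tas_Amon_MPI-ESM1-2-HR_historical_r1i1p1f1_gn_198501-198912.nc',
    'tas_Amon_MPI-ESM1-2-HR_historical_r1i1p1f1_gn_199001-199412.nc',
    'tas_Amon_MPI-ESM1-2-HR_historical_r1i1p1f1_gn_199501-199912.nc',
    'tas_Amon_MPI-ESM1-2-HR_historical_r1i1p1f1_gn_200001-200412.nc',
    'tas_Amon_MPI-ESM1-2-HR_historical_r1i1p1f1_gn_200501-200912.nc',
    'tas_Amon_MPI-ESM1-2-HR_historical_r1i1p1f1_gn_201001-201412.nc',
]

def next_month(yyyymm: str) -> str:
    # '197912' -> '198001', '197501' -> '197502'
    y, m = int(yyyymm[:4]), int(yyyymm[4:])
    return f'{y + 1}01' if m == 12 else f'{y}{m + 1:02d}'

# extract (start, end) spans from the filename suffixes
spans = sorted(re.search(r'_(\d{6})-(\d{6})\.nc$', f).groups() for f in filenames)

# any pair of adjacent spans that does not line up month-to-month is a gap
gaps = [(end, start) for (_, end), (start, _) in zip(spans, spans[1:])
        if next_month(end) != start]

print(gaps)         # []     -> the returned files are internally contiguous...
print(spans[0][0])  # 197501 -> ...but coverage starts in 1975 instead of 1850
```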

jbusecke commented 5 months ago

I'll move the discussion over to https://github.com/jbusecke/pangeo-forge-esgf/issues/46, but will close this issue for now. Feel free to use the non-qc data in the meantime, but proceed with caution @kareed1.