Closed: kareed1 closed this issue 5 months ago
Hi @kareed1,
thanks for raising an issue here!
I assume you are still using the 'old' catalog file here. Can you provide some more information (a small code snippet) on how you are accessing the data currently?
The current catalog (more info on how to access it) does not seem to have that iid:
```python
import intake

def zstore_to_iid(zstore: str):
    # this is a bit whacky to account for the different way of storing old/new stores
    return '.'.join(zstore.replace('gs://', '').replace('.zarr', '').replace('.', '/').split('/')[-11:-1])

iids_requested = [
    'CMIP6.CMIP.MPI-M.MPI-ESM1-2-HR.historical.r1i1p1f1.Amon.tas.gn.v20190710',
]

# uncomment/comment lines to swap catalogs
url = "https://storage.googleapis.com/cmip6/cmip6-pgf-ingestion-test/catalog/catalog.json"
col = intake.open_esm_datastore(url)
iids_all = [zstore_to_iid(z) for z in col.df['zstore'].tolist()]
iids_uploaded = [iid for iid in iids_all if iid in iids_requested]
iids_uploaded
```
gives an empty list.
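For reference, here is what `zstore_to_iid` extracts from a store path. The example path below is hypothetical but follows the `gs://<bucket>/<mip_era>/.../<version>/` layout that the `[-11:-1]` slicing assumes:

```python
def zstore_to_iid(zstore: str):
    # same helper as above: strip scheme and suffix, keep the 10 facet segments
    return '.'.join(zstore.replace('gs://', '').replace('.zarr', '').replace('.', '/').split('/')[-11:-1])

# hypothetical store path following the assumed bucket layout
path = 'gs://cmip6/CMIP6/CMIP/MPI-M/MPI-ESM1-2-HR/historical/r1i1p1f1/Amon/tas/gn/v20190710/'
print(zstore_to_iid(path))
# -> CMIP6.CMIP.MPI-M.MPI-ESM1-2-HR.historical.r1i1p1f1.Amon.tas.gn.v20190710
```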
I will add this to the ingestion and see what we get.
See my comments in #119: this seems to require some deeper debugging, unfortunately. We'll get to the bottom of this eventually!
Hi @jbusecke ,
Thank you for the updates and your assistance on this. It probably is from the old catalog. I had found some code online to get started, so I'm not sure how old that code was. Below is an example of the Python code I'm using.
```python
import numpy as np
import pandas as pd
import xarray as xr
import zarr
import gcsfs

# available datasets on Google Cloud
df = pd.read_csv('https://storage.googleapis.com/cmip6/cmip6-zarr-consolidated-stores.csv')

# anonymous access to the Google Cloud datasets
gcs = gcsfs.GCSFileSystem(token='anon')

# query the table
df_atm = df.query("table_id == 'Amon' & "
                  "source_id == 'MPI-ESM1-2-HR' & "
                  "variable_id == 'tas' & "
                  "experiment_id == 'historical' & "
                  "member_id == 'r1i1p1f1'")

# retrieve data from Google Cloud
var_path = df_atm.zstore.values[0]  # path to the dataset on Google Cloud
mapper = gcs.get_mapper(var_path)   # mapper object for the zarr store
dat = xr.open_zarr(mapper)          # open the dataset
```
Cool, thanks for that info. That all looks good, but going forward I recommend using https://storage.googleapis.com/cmip6/cmip6-pgf-ingestion-test/catalog/pangeo_esgf_zarr_qc.csv (https://storage.googleapis.com/cmip6/cmip6-pgf-ingestion-test/catalog/catalog.json points to that!).
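Assuming the new QC'd CSV uses the same column names as the old consolidated-stores CSV (an assumption worth verifying), the snippet above should only need the URL swapped; the `.query()` filter works unchanged, demonstrated here on a small stand-in frame so it runs offline:

```python
import pandas as pd

# assumption: the QC'd catalog CSV has the same columns as the old one
url = "https://storage.googleapis.com/cmip6/cmip6-pgf-ingestion-test/catalog/pangeo_esgf_zarr_qc.csv"
# df = pd.read_csv(url)  # uncomment to load the real catalog

# stand-in frame with the same columns, for illustration only
df = pd.DataFrame({
    'table_id': ['Amon', 'Omon'],
    'source_id': ['MPI-ESM1-2-HR'] * 2,
    'variable_id': ['tas'] * 2,
    'experiment_id': ['historical'] * 2,
    'member_id': ['r1i1p1f1'] * 2,
    'zstore': ['gs://a', 'gs://b'],
})
df_atm = df.query("table_id == 'Amon' & source_id == 'MPI-ESM1-2-HR' & "
                  "variable_id == 'tas' & experiment_id == 'historical' & "
                  "member_id == 'r1i1p1f1'")
print(df_atm.zstore.values[0])  # -> gs://a
```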
I have high hopes that a solution to https://github.com/jbusecke/pangeo-forge-esgf/issues/42 will address this issue too.
OK, the dataset was ingested, but it ended up in our non-qc catalog.
I did some digging:
```
# you need specific versions of the following libraries to reproduce this on the LEAP-Pangeo hub
pip install 'leap-data-management-utils[pangeo-forge]' git+https://github.com/jbusecke/pangeo-forge-esgf.git@new-request-scheme
```
Let's load the store and run our tests (failing these causes a store to be put in our non-qc catalog):
```python
import zarr
import intake
from pangeo_forge_esgf.utils import facets_from_iid
from leap_data_management_utils.cmip_testing import test_all

# uncomment/comment lines to swap catalogs
url = "https://storage.googleapis.com/cmip6/cmip6-pgf-ingestion-test/catalog/catalog_noqc.json"  # only stores that fail the current tests

col = intake.open_esm_datastore(url)
iid = 'CMIP6.CMIP.MPI-M.MPI-ESM1-2-HR.historical.r1i1p1f1.Amon.tas.gn.v20190710'
facets = facets_from_iid(iid)
del facets['mip_era']
cat = col.search(**facets)
store = zarr.storage.FSStore(cat.df['zstore'].tolist()[0])
test_all(store, iid)
```
gives:

```
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[24], line 15
     12 store = zarr.storage.FSStore(cat.df['zstore'].tolist()[0])
     13 # ds = xr.open_dataset(store, engine='zarr')
     14 # ds
---> 15 test_all(store, iid)

File /srv/conda/envs/notebook/lib/python3.11/site-packages/leap_data_management_utils/cmip_testing.py:72, in test_all(store, iid, verbose)
     70 def test_all(store: zarr.storage.FSStore, iid: str, verbose=True) -> zarr.storage.FSStore:
     71     ds = test_open_store(store, verbose=verbose)
---> 72     test_time(ds, verbose=verbose)
     73     test_attributes(ds, iid, verbose=verbose)
     74     return store

File /srv/conda/envs/notebook/lib/python3.11/site-packages/leap_data_management_utils/cmip_testing.py:49, in test_time(ds, verbose)
     47 if verbose:
     48     print(time_diff)
---> 49 assert (time_diff > 0).all()
     51 # assert that there are no large time gaps
     52 mean_time_diff = time_diff.mean()

AssertionError:
```
So the time is not continuous!
We can confirm that:
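The failing assertion boils down to "every successive time difference is positive". A simplified standalone sketch of that check (not the actual `test_time` implementation):

```python
import numpy as np

def time_is_monotonic(time_values) -> bool:
    # the qc test asserts that all successive time differences are positive,
    # i.e. the time axis is strictly increasing
    time_diff = np.diff(np.asarray(time_values))
    return bool((time_diff > 0).all())

print(time_is_monotonic([1, 2, 3, 4]))     # -> True
print(time_is_monotonic([1, 2, 8, 3, 4]))  # -> False: time jumps backwards
```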
```python
import matplotlib.pyplot as plt
import xarray as xr

ds = xr.open_dataset(store, engine='zarr')
# note: do not use the built-in .plot(), since it plots time against itself
# (not against the array index) and will therefore look continuous
plt.plot(ds.time)
```
Yeah, that's not great... but it's fixable!
```python
plt.plot(ds.sortby('time').time)
```
So @kareed1, you can use the above to work with the dataset for now.
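`sortby('time')` reorders the data variables together with the coordinate, so the workaround is safe. A tiny synthetic demonstration (the values here are made up to mimic a scrambled time axis):

```python
import numpy as np
import xarray as xr

# synthetic dataset with a scrambled time axis
time = np.array([3, 1, 2])
ds = xr.Dataset({'tas': ('time', [30.0, 10.0, 20.0])}, coords={'time': time})

# sortby reorders the coordinate and all variables along it together
ds_fixed = ds.sortby('time')
print(ds_fixed.time.values)  # -> [1 2 3]
print(ds_fixed.tas.values)   # -> [10. 20. 30.]
```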
I want to understand how this happened though...
```python
from pangeo_forge_esgf.client import ESGFClient

iid = 'CMIP6.CMIP.MPI-M.MPI-ESM1-2-HR.historical.r1i1p1f1.Amon.tas.gn.v20190710'
client = ESGFClient()
dataset_id = client.get_instance_id_input([iid])[iid]['id']
file_dict = client.get_recipe_inputs_from_dataset_ids([dataset_id])
list(file_dict[iid].keys())
```
This seems fine:

```python
['tas_Amon_MPI-ESM1-2-HR_historical_r1i1p1f1_gn_197501-197912.nc',
 'tas_Amon_MPI-ESM1-2-HR_historical_r1i1p1f1_gn_198001-198412.nc',
 'tas_Amon_MPI-ESM1-2-HR_historical_r1i1p1f1_gn_198501-198912.nc',
 'tas_Amon_MPI-ESM1-2-HR_historical_r1i1p1f1_gn_199001-199412.nc',
 'tas_Amon_MPI-ESM1-2-HR_historical_r1i1p1f1_gn_199501-199912.nc',
 'tas_Amon_MPI-ESM1-2-HR_historical_r1i1p1f1_gn_200001-200412.nc',
 'tas_Amon_MPI-ESM1-2-HR_historical_r1i1p1f1_gn_200501-200912.nc',
 'tas_Amon_MPI-ESM1-2-HR_historical_r1i1p1f1_gn_201001-201412.nc']
```
My first suspicion was that the files are not correctly concatenated, but that might not be it. Will dig some more and follow up.
Oh wait, this is not a complete set of files! How strange.
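One way to spot this automatically is to parse the `YYYYMM` ranges out of the filenames and check that each segment starts the month after the previous one ends. A sketch, assuming the `_<start>-<end>.nc` naming shown above (`find_gaps` is a hypothetical helper, not part of any of the libraries mentioned here):

```python
import re

def find_gaps(filenames):
    """Return (expected_start, actual_start) pairs where segments are not contiguous."""
    ranges = sorted(
        tuple(int(x) for x in re.search(r'_(\d{6})-(\d{6})\.nc$', f).groups())
        for f in filenames
    )

    def next_month(yyyymm):
        # 197912 -> 198001, 197501 -> 197502, etc.
        y, m = divmod(yyyymm, 100)
        return (y + 1) * 100 + 1 if m == 12 else yyyymm + 1

    gaps = []
    for (_, prev_end), (start, _) in zip(ranges, ranges[1:]):
        if start != next_month(prev_end):
            gaps.append((next_month(prev_end), start))
    return gaps

files = [
    'tas_Amon_MPI-ESM1-2-HR_historical_r1i1p1f1_gn_197501-197912.nc',
    'tas_Amon_MPI-ESM1-2-HR_historical_r1i1p1f1_gn_198501-198912.nc',  # 1980-1984 missing
]
print(find_gaps(files))  # -> [(198001, 198501)]
```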
I'll move the discussion over to https://github.com/jbusecke/pangeo-forge-esgf/issues/46, but will close this for now. Feel free to use the non-qc data for now, but proceed with caution @kareed1.
List of requested iids
Description
Hello! On both Google and AWS, the above-noted dataset shows that it only contains the years 1915-1959. I'm not sure if this was on purpose. I'd like to request that data for Jan 1985-Dec 2014 be added to the repositories. Thank you for making CMIP6 data easier to access!