Closed aradhakrishnanGFDL closed 4 years ago
Test code and additional files (referring to the intake-esm catalog examples for glade, etc.) are on GitHub for reference:
Test notebook: https://github.com/aradhakrishnanGFDL/roadblocks/blob/main/intake_s3nc.ipynb
ESM collection file: https://github.com/aradhakrishnanGFDL/roadblocks/blob/main/gfdltest.json
DB/metadata: https://github.com/aradhakrishnanGFDL/roadblocks/blob/main/gfdltest.csv
I tried checking out https://github.com/intake/intake-esm/pull/98 and made some edits to make it compatible with the current version, but that hasn't helped so far.
@aradhakrishnanGFDL, thank you for providing useful debugging information. I think I have an idea of what's going on: accessing netCDF in S3 fails because of some assumptions made in intake-esm. Here are the culprit lines:
When dealing with netCDF on S3, instead of calling fsspec.get_mapper(path, **storage_options) we need to call fsspec.open(path, **storage_options). I will look into supporting this in the next release of intake-esm, and I will ping you once I have a working prototype for this functionality.
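The distinction matters because a zarr store is a directory-like collection of keys, while a netCDF file is a single binary object that a reader seeks around inside. A minimal sketch of the two fsspec entry points (the paths are illustrative, using the in-memory filesystem rather than S3):

```python
import fsspec

# get_mapper returns a dict-like view of a store -- the interface zarr expects.
mapper = fsspec.get_mapper("memory://example-store")

# open returns a lazily-opened file-like object -- what a netCDF reader
# needs in order to seek within a single file.
ofile = fsspec.open("memory://example.nc", mode="rb")

print(type(mapper).__name__, type(ofile).__name__)
```

Neither call touches the backing store until the mapper is indexed or the file is actually opened, which is why get_mapper "succeeds" on a netCDF path and the failure only surfaces later.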
@aradhakrishnanGFDL, when you get a chance, could you confirm that the following works for you:
In [17]: import xarray as xr
In [18]: import fsspec
In [19]: fs = fsspec.filesystem('s3', anon=True)
In [20]: x = 's3://gfdl-esgf/CMIP6/CMIP/NOAA-GFDL/GFDL-ESM4/historical/r1i1p1f1/Amon/tas/gr1/v20190726'
In [21]: root = fs.open(x)
I am unable to successfully run the fs.open(x) line; it appears that the S3 bucket isn't public (maybe?).
Hi @andersy005,
Thank you for helping with this!
The bucket should be public. I appended the file name to the path, and the following works for me. (Note: the v20190726 directory has two netCDF files.)
import fsspec
fs = fsspec.filesystem('s3', anon=True)
x = "s3://gfdl-esgf/CMIP6/CMIP/NOAA-GFDL/GFDL-ESM4/historical/r1i1p1f1/Amon/tas/gr1/v20190726/tas_Amon_GFDL-ESM4_historical_r1i1p1f1_gr1_185001-194912.nc"
root = fs.open(x)
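For completeness, fs.open returns a file-like object that xarray can read directly. A self-contained sketch of the same round trip against fsspec's in-memory filesystem, so it runs without S3 access (scipy is assumed as the netCDF backend):

```python
import fsspec
import numpy as np
import xarray as xr

# Write a tiny dataset into fsspec's in-memory filesystem...
ds = xr.Dataset({"tas": ("time", np.arange(3.0))})
fs = fsspec.filesystem("memory")
with fs.open("memory://tiny.nc", "wb") as f:
    f.write(ds.to_netcdf())  # to_netcdf() with no target returns bytes

# ...then reopen it through fs.open -- the same file-object pattern
# used for netCDF objects on S3.
with fs.open("memory://tiny.nc", "rb") as f:
    reopened = xr.open_dataset(f).load()  # .load() before the file closes

print(reopened["tas"].values)
```

Against the real bucket, the analogous call would be xr.open_dataset(fs.open(x)) with the anonymous S3 filesystem above.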
@aradhakrishnanGFDL,
I have a solution for you in #292 :). You will need to modify the path entry in your csv to point to an actual file instead of a directory:
product_id,institute,model,experiment,frequency,modeling_realm,mip_table,ensemble_member,variable,temporal_subset,version,path
output,NOAA-GFDL,GFDL-ESM4,historical,mon,atmos,Amon,r1i1p1,tas,195001-201412,v20190726,s3://gfdl-esgf/CMIP6/CMIP/NOAA-GFDL/GFDL-ESM4/historical/r1i1p1f1/Amon/tas/gr1/v20190726/tas_Amon_GFDL-ESM4_historical_r1i1p1f1_gr1_185001-194912.nc
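A quick way to sanity-check the catalog before handing it to intake-esm is to load the csv and confirm that every path entry is a file rather than a version directory. A sketch using pandas, with the csv content inlined from the two lines above:

```python
import io

import pandas as pd

csv_text = (
    "product_id,institute,model,experiment,frequency,modeling_realm,"
    "mip_table,ensemble_member,variable,temporal_subset,version,path\n"
    "output,NOAA-GFDL,GFDL-ESM4,historical,mon,atmos,Amon,r1i1p1,tas,"
    "195001-201412,v20190726,s3://gfdl-esgf/CMIP6/CMIP/NOAA-GFDL/"
    "GFDL-ESM4/historical/r1i1p1f1/Amon/tas/gr1/v20190726/"
    "tas_Amon_GFDL-ESM4_historical_r1i1p1f1_gr1_185001-194912.nc\n"
)
df = pd.read_csv(io.StringIO(csv_text))

# Every asset should be an actual netCDF file, not a directory.
print(df["path"].str.endswith(".nc").all())  # True
```

In a real catalog you would read the csv from disk (e.g. pd.read_csv("gfdltest.csv")) instead of an inline string.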
To try #292 out, you can install intake-esm via
python -m pip install git+https://github.com/andersy005/intake-esm.git@use-fsspec-open-for-netcdf-in-cloud
In [1]: import intake
In [2]: col = intake.open_esm_datastore("gfdltest.json")
In [3]: dset_dict = col.to_dataset_dict(cdf_kwargs={'chunks': {'time': 20}}, storage_options={'anon': True})
--> The keys in the returned dictionary of datasets are constructed as follows:
'product_id.institute.model.modeling_realm.experiment.frequency.mip_table'
Out[3]: ████████████████████████████| 100.00% [1/1 00:00<00:00]
{'output.NOAA-GFDL.GFDL-ESM4.atmos.historical.mon.Amon': <xarray.Dataset>
Dimensions: (bnds: 2, ensemble_member: 1, lat: 180, lon: 288, time: 1200)
Coordinates:
height float64 ...
* time (time) object 1850-01-16 12:00:00 ... 1949-12-16 12:00:00
* lat (lat) float64 -89.5 -88.5 -87.5 -86.5 ... 87.5 88.5 89.5
* lon (lon) float64 0.625 1.875 3.125 4.375 ... 356.9 358.1 359.4
* bnds (bnds) float64 1.0 2.0
* ensemble_member (ensemble_member) <U6 'r1i1p1'
Data variables:
lon_bnds (lon, bnds) float64 dask.array<chunksize=(288, 2), meta=np.ndarray>
time_bnds (time, bnds) object dask.array<chunksize=(20, 2), meta=np.ndarray>
lat_bnds (lat, bnds) float64 dask.array<chunksize=(180, 2), meta=np.ndarray>
tas (ensemble_member, time, lat, lon) float32 dask.array<chunksize=(1, 20, 180, 288), meta=np.ndarray>
Attributes:
external_variables: areacella
history: File was processed by fremetar (GFDL analog of C...
table_id: Amon
activity_id: CMIP
branch_method: standard
branch_time_in_child: [0.]
branch_time_in_parent: [36500.]
comment: <null ref>
contact: gfdl.climate.model.info@noaa.gov
Conventions: CF-1.7 CMIP-6.0 UGRID-1.0
creation_date: 2019-07-26T20:13:55Z
data_specs_version: 01.00.27
experiment: all-forcing simulation of the recent past
experiment_id: historical
forcing_index: [1]
frequency: mon
further_info_url: https://furtherinfo.es-doc.org/CMIP6.NOAA-GFDL.G...
grid: atmos data regridded from Cubed-sphere (c96) to ...
grid_label: gr1
initialization_index: [1]
institution: National Oceanic and Atmospheric Administration,...
institution_id: NOAA-GFDL
license: CMIP6 model data produced by NOAA-GFDL is licens...
mip_era: CMIP6
nominal_resolution: 100 km
parent_activity_id: CMIP
parent_experiment_id: piControl
parent_mip_era: CMIP6
parent_source_id: GFDL-ESM4
parent_time_units: days since 0001-1-1
parent_variant_label: r1i1p1f1
physics_index: [1]
product: model-output
realization_index: [1]
realm: atmos
source: GFDL-ESM4 (2018):\natmos: GFDL-AM4.1 (Cubed-sphe...
source_id: GFDL-ESM4
source_type: AOGCM AER CHEM BGC
sub_experiment: none
sub_experiment_id: none
title: NOAA GFDL GFDL-ESM4 model output prepared for CM...
tracking_id: hdl:21.14100/75e5c5a7-d7c4-4860-beb1-db454f25f13a
variable_id: tas
variant_info: N/A
references: see further_info_url attribute
variant_label: r1i1p1f1
intake_esm_varname: ['tas']
intake_esm_dataset_key: output.NOAA-GFDL.GFDL-ESM4.atmos.historical.mon....}
@aradhakrishnanGFDL, I merged #292 into master. When you get a chance, could you try the master branch and let me know how it goes?
python -m pip install git+https://github.com/intake/intake-esm.git
Hi @andersy005 It works great! Thank you so much.
Hello,
Thank you for intake-esm! I have used intake successfully on locally stored netCDF and zarr data in the past, and I recently managed to open a netCDF dataset in AWS S3 directly with xarray after several failed attempts. I am now working with some model output in netCDF format that is publicly available in AWS S3, and I was trying to get an intake-esm example on this data working for a quick demonstration of our newest experimental JupyterHub setup. I am running into an issue and I hope it's just something minor I overlooked. Has anyone else tried using intake with an S3 netCDF data source? I notice NCAR's CESM output has been handled similarly, but in zarr (in AWS S3).
Additional info can be found below. Please let me know if you'd like additional information. Any help is appreciated.