intake / intake-esm

An intake plugin for parsing an Earth System Model (ESM) catalog and loading assets into xarray datasets.
https://intake-esm.readthedocs.io
Apache License 2.0

Is there an example on using intake with AWS S3 netcdf files? #290

Closed aradhakrishnanGFDL closed 4 years ago

aradhakrishnanGFDL commented 4 years ago

Hello,

Thank you for intake-esm! I have used intake successfully on locally stored netCDF and zarr data in the past, and I recently managed to open a netCDF dataset in AWS S3 directly with xarray after several failed attempts. I am now working with some model output in netCDF format that is publicly available in AWS S3. I was trying to get an intake-esm example working on this data for a quick demonstration of our newest experimental JupyterHub setup. I am running into an issue and I hope it's just something minor I overlooked. Has anyone else tried using intake with an S3 netCDF data source? I notice the NCAR CESM model output has been used in a similar way, but in zarr format (in AWS S3).

Additional info can be found below. Please let me know if you'd like additional information. Any help is appreciated.

intake                    0.6.0                      py_0    conda-forge
intake-esm                2020.8.15                  py_0    conda-forge
..
..
print(type(data_source))
print(data_source)
<class 'intake_esm.source.ESMGroupDataSource'>
<name: output.NOAA-GFDL.GFDL-ESM4.atmos.historical.mon.Amon, assets: 1
ds = data_source(zarr_kwargs={'consolidated': True, 'decode_times': True}).to_dask()
ds

Error snippets:

AttributeError: 'FSMap' object has no attribute 'tell'

The above exception was the direct cause of the following exception:

OSError                                   Traceback (most recent call last)
<ipython-input-105-b4527cfae305> in <module>
----> 1 ds = data_source(zarr_kwargs={'consolidated': True, 'decode_times': True}).to_dask()
      2 ds

~/my-conda-envs/superenv/lib/python3.8/site-packages/intake_esm/source.py in to_dask(self)
    214     def to_dask(self):
    215         """Return xarray object (which will have chunks)"""
--> 216         self._load_metadata()
    217         return self._ds
    218 

..
OSError: 
            Failed to open netCDF/HDF dataset.

            *** Arguments passed to xarray.open_dataset() ***:

            - filename_or_obj: <fsspec.mapping.FSMap object at 0x7fa1a7312580>
            - kwargs: {'chunks': {}}

            *** fsspec options used ***:

            - root: gfdl-esgf/CMIP6/CMIP/NOAA-GFDL/GFDL-ESM4/historical/r1i1p1f1/Amon/tas/gr1/v20190726
            - protocol: ('s3', 's3a')

            ********************************************
aradhakrishnanGFDL commented 4 years ago

Test code and additional files (referring to the intake-esm catalog examples for glade, etc.) are on GitHub for reference:

Notebook: https://github.com/aradhakrishnanGFDL/roadblocks/blob/main/intake_s3nc.ipynb
ESM collection file: https://github.com/aradhakrishnanGFDL/roadblocks/blob/main/gfdltest.json
DB/metadata: https://github.com/aradhakrishnanGFDL/roadblocks/blob/main/gfdltest.csv

I tried checking out https://github.com/intake/intake-esm/pull/98 and made some edits to be compatible with the current version, but it didn't help thus far.

andersy005 commented 4 years ago

@aradhakrishnanGFDL, Thank you for providing useful debugging information.... I think I have an idea of what's going on... It appears that the reason why accessing netCDF in S3 doesn't work has to do with some assumptions made in intake-esm... Here are the culprit lines:

https://github.com/intake/intake-esm/blob/4ffd85eed14b84e2f10af4ba26ebcc67c379371c/intake_esm/merge_util.py#L12-L15

When dealing with netCDF on S3, instead of calling fsspec.get_mapper(path, **storage_options), we need to call fsspec.open(path, **storage_options)... I will look into supporting this in the next release of intake-esm. I will ping you once I have a working prototype for this functionality...
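The distinction can be illustrated without touching S3 at all: fsspec.get_mapper returns a key/value mapping (the store interface zarr expects), while fsspec.open yields a file-like object with seek()/tell() (what the netCDF/HDF5 readers need, hence the `'FSMap' object has no attribute 'tell'` error above). A minimal sketch against a local temporary file, assuming only fsspec is installed:

```python
import os
import tempfile

import fsspec  # third-party; the same library intake-esm uses internally

# A local file stands in for an S3 object.
tmp = tempfile.NamedTemporaryFile(suffix=".nc", delete=False)
tmp.write(b"dummy bytes")
tmp.close()

# zarr-style access: a key/value mapping over a store, not file-like.
mapper = fsspec.get_mapper(tmp.name)
print(hasattr(mapper, "tell"))  # False: FSMap has no tell()

# netCDF-style access: a file-like object the HDF5 reader can seek()/tell().
with fsspec.open(tmp.name, mode="rb") as f:
    print(hasattr(f, "tell"))  # True

os.unlink(tmp.name)
```

For S3 the paths would be `s3://...` URLs and `storage_options` (e.g. `anon=True`) would be passed through, but the mapping-versus-file distinction is the same.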

andersy005 commented 4 years ago

@aradhakrishnanGFDL, when you get a chance, could you confirm that the following works for you:

In [17]: import xarray as xr

In [18]: import fsspec

In [19]: fs = fsspec.filesystem('s3', anon=True)

In [20]: x = 's3://gfdl-esgf/CMIP6/CMIP/NOAA-GFDL/GFDL-ESM4/historical/r1i1p1f1/Amon/tas/gr1/v
    ...: 20190726'

In [21]: root = fs.open(x)

I am unable to successfully run the fs.open(x) line because it appears that the S3 bucket isn't public (maybe?)

aradhakrishnanGFDL commented 4 years ago

Hi @andersy005,

Thank you for helping with this!

The bucket should be public. I appended the file name to the path and the following works for me. (Note: the v20190726 directory has two netCDF files.)

import fsspec
fs = fsspec.filesystem('s3', anon=True)

x = "s3://gfdl-esgf/CMIP6/CMIP/NOAA-GFDL/GFDL-ESM4/historical/r1i1p1f1/Amon/tas/gr1/v20190726/tas_Amon_GFDL-ESM4_historical_r1i1p1f1_gr1_185001-194912.nc"
root = fs.open(x)

One more example, if needed:

x = "s3://gfdl-esgf/CMIP6/AerChemMIP/NOAA-GFDL/GFDL-ESM4/histSST/r1i1p1f1/Amon/tas/gr1/v20180701/tas_Amon_GFDL-ESM4_histSST_r1i1p1f1_gr1_185001-194912.nc"

andersy005 commented 4 years ago

@aradhakrishnanGFDL,

I have a solution for you in #292 :). You will need to modify your path entry in the csv to point to an actual file instead of the directory:

product_id,institute,model,experiment,frequency,modeling_realm,mip_table,ensemble_member,variable,temporal_subset,version,path
output,NOAA-GFDL,GFDL-ESM4,historical,mon,atmos,Amon,r1i1p1,tas,195001-201412,v20190726,s3://gfdl-esgf/CMIP6/CMIP/NOAA-GFDL/GFDL-ESM4/historical/r1i1p1f1/Amon/tas/gr1/v20190726/tas_Amon_GFDL-ESM4_historical_r1i1p1f1_gr1_185001-194912.nc
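If a catalog has many directory-style path entries, the fix can be scripted. A hypothetical sketch using only the standard library, with made-up bucket and file names for illustration (in practice the filename for each directory could be discovered with fsspec's fs.ls() or fs.glob() against the real bucket):

```python
import csv
import io

# Hypothetical catalog with a directory-style path entry.
src = io.StringIO(
    "variable,path\n"
    "tas,s3://my-bucket/Amon/tas/gr1/v20190726\n"
)

# Hypothetical map from each directory to the single file it contains.
filename_for = {
    "s3://my-bucket/Amon/tas/gr1/v20190726": "tas_Amon_example.nc",
}

out = io.StringIO()
reader = csv.DictReader(src)
writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
writer.writeheader()
for row in reader:
    # Append the filename so the path points at a file, not a directory.
    row["path"] = row["path"].rstrip("/") + "/" + filename_for[row["path"]]
    writer.writerow(row)

print(out.getvalue())
# variable,path
# tas,s3://my-bucket/Amon/tas/gr1/v20190726/tas_Amon_example.nc
```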

To try #292 out, you can install intake-esm via

python -m pip install git+https://github.com/andersy005/intake-esm.git@use-fsspec-open-for-netcdf-in-cloud
In [1]: import intake

In [2]: col = intake.open_esm_datastore("gfdltest.json")

In [3]: dset_dict = col.to_dataset_dict(cdf_kwargs={'chunks': {'time': 20}}, storage_options={'anon':True}
   ...: )

--> The keys in the returned dictionary of datasets are constructed as follows:
        'product_id.institute.model.modeling_realm.experiment.frequency.mip_table'
Out[3]: ████████████████████████████| 100.00% [1/1 00:00<00:00]
{'output.NOAA-GFDL.GFDL-ESM4.atmos.historical.mon.Amon': <xarray.Dataset>
 Dimensions:          (bnds: 2, ensemble_member: 1, lat: 180, lon: 288, time: 1200)
 Coordinates:
     height           float64 ...
   * time             (time) object 1850-01-16 12:00:00 ... 1949-12-16 12:00:00
   * lat              (lat) float64 -89.5 -88.5 -87.5 -86.5 ... 87.5 88.5 89.5
   * lon              (lon) float64 0.625 1.875 3.125 4.375 ... 356.9 358.1 359.4
   * bnds             (bnds) float64 1.0 2.0
   * ensemble_member  (ensemble_member) <U6 'r1i1p1'
 Data variables:
     lon_bnds         (lon, bnds) float64 dask.array<chunksize=(288, 2), meta=np.ndarray>
     time_bnds        (time, bnds) object dask.array<chunksize=(20, 2), meta=np.ndarray>
     lat_bnds         (lat, bnds) float64 dask.array<chunksize=(180, 2), meta=np.ndarray>
     tas              (ensemble_member, time, lat, lon) float32 dask.array<chunksize=(1, 20, 180, 288), meta=np.ndarray>
 Attributes:
     external_variables:      areacella
     history:                 File was processed by fremetar (GFDL analog of C...
     table_id:                Amon
     activity_id:             CMIP
     branch_method:           standard
     branch_time_in_child:    [0.]
     branch_time_in_parent:   [36500.]
     comment:                 <null ref>
     contact:                 gfdl.climate.model.info@noaa.gov
     Conventions:             CF-1.7 CMIP-6.0 UGRID-1.0
     creation_date:           2019-07-26T20:13:55Z
     data_specs_version:      01.00.27
     experiment:              all-forcing simulation of the recent past
     experiment_id:           historical
     forcing_index:           [1]
     frequency:               mon
     further_info_url:        https://furtherinfo.es-doc.org/CMIP6.NOAA-GFDL.G...
     grid:                    atmos data regridded from Cubed-sphere (c96) to ...
     grid_label:              gr1
     initialization_index:    [1]
     institution:             National Oceanic and Atmospheric Administration,...
     institution_id:          NOAA-GFDL
     license:                 CMIP6 model data produced by NOAA-GFDL is licens...
     mip_era:                 CMIP6
     nominal_resolution:      100 km
     parent_activity_id:      CMIP
     parent_experiment_id:    piControl
     parent_mip_era:          CMIP6
     parent_source_id:        GFDL-ESM4
     parent_time_units:       days since 0001-1-1
     parent_variant_label:    r1i1p1f1
     physics_index:           [1]
     product:                 model-output
     realization_index:       [1]
     realm:                   atmos
     source:                  GFDL-ESM4 (2018):\natmos: GFDL-AM4.1 (Cubed-sphe...
     source_id:               GFDL-ESM4
     source_type:             AOGCM AER CHEM BGC
     sub_experiment:          none
     sub_experiment_id:       none
     title:                   NOAA GFDL GFDL-ESM4 model output prepared for CM...
     tracking_id:             hdl:21.14100/75e5c5a7-d7c4-4860-beb1-db454f25f13a
     variable_id:             tas
     variant_info:            N/A
     references:              see further_info_url attribute
     variant_label:           r1i1p1f1
     intake_esm_varname:      ['tas']
     intake_esm_dataset_key:  output.NOAA-GFDL.GFDL-ESM4.atmos.historical.mon....}
andersy005 commented 4 years ago

@aradhakrishnanGFDL, I merged #292 into master. When you get a chance, could you try the master branch and let me know how it goes?

python -m pip install git+https://github.com/intake/intake-esm.git
aradhakrishnanGFDL commented 4 years ago

Hi @andersy005 It works great! Thank you so much.