pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

Support creating DataSet from streaming object #1075

Closed delgadom closed 6 years ago

delgadom commented 8 years ago

The use case is for netCDF files stored on s3 or other generic cloud storage

import requests, xarray as xr
fp = 'http://nasanex.s3.amazonaws.com/NEX-GDDP/BCSD/rcp45/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp45_r1i1p1_MPI-ESM-LR_2029.nc'

data = requests.get(fp, stream=True)
ds = xr.open_dataset(data.content)  # raises TypeError: embedded NUL character

Ideal would be integration with the (hopefully) soon-to-be implemented dask.distributed features discussed in #798.

shoyer commented 8 years ago

This does work for netCDF3 files, if you provide a file-like object (e.g., wrapped in BytesIO) or set engine='scipy'.
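A minimal, self-contained sketch of the netCDF3 path (instead of downloading, a tiny file is written in memory with scipy; the variable name `v` and dimension `x` are made up for this example):

```python
import io

import xarray as xr
from scipy.io import netcdf_file

# Write a tiny netCDF3 file entirely in memory, as a stand-in for
# bytes downloaded from S3 or another object store.
buf = io.BytesIO()
f = netcdf_file(buf, mode='w')
f.createDimension('x', 3)
v = f.createVariable('v', 'f4', ('x',))
v[:] = [1.0, 2.0, 3.0]
f.flush()
raw = buf.getvalue()

# netCDF3 bytes wrapped in a file-like object open fine with engine='scipy'
ds = xr.open_dataset(io.BytesIO(raw), engine='scipy')
```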

Unfortunately, this is a netCDF4/HDF5 file:

>>> data.raw.read(8)
'\x89HDF\r\n\x1a\n'

And as yet, there is no support for reading from file-like objects in either h5py (https://github.com/h5py/h5py/issues/552) or python-netCDF4 (https://github.com/Unidata/netcdf4-python/issues/295). So we're currently stuck :(.

One possibility is to use the new HDF5 library pyfive with h5netcdf (https://github.com/shoyer/h5netcdf/issues/25). But pyfive doesn't have enough features yet to read netCDF files.

delgadom commented 8 years ago

Got it. :( Thanks!

rabernat commented 7 years ago

Is this issue resolvable now that unidata/netcdf4-python#652 has been merged?

shoyer commented 7 years ago

Yes, we could support initializing a Dataset from a netCDF4 file image in a bytes object.

niallrobinson commented 6 years ago

FWIW this would be really useful 👍 from me, specifically for the use case above of reading from s3

shoyer commented 6 years ago

Just to clarify: I wrote above that we could support initializing a Dataset from a netCDF4 file image. But this wouldn't yet help with streaming access.

Initializing a Dataset from a netCDF4 file image should actually work with the latest versions of xarray and netCDF4-python:

import netCDF4
import xarray

# netcdf_bytes: a bytes object containing a complete netCDF4 file
nc4_ds = netCDF4.Dataset('arbitrary-name', memory=netcdf_bytes)
store = xarray.backends.NetCDF4DataStore(nc4_ds)
ds = xarray.open_dataset(store)

delgadom commented 6 years ago

Thanks @shoyer. So you can download the entire object into memory, create a file image, and read that? While not a full fix, it's definitely an improvement over the download-to-disk-then-read workflow!

shoyer commented 6 years ago

@delgadom Yes, that should work (I haven't tested it, but yes in principle it should all work now).

jhamman commented 6 years ago

@delgadom - did you find a solution here?

A few more references: we're exploring ways to do this in the Pangeo project using FUSE (https://github.com/pangeo-data/pangeo/issues/52). There is an s3 equivalent of the gcsfs library used in that issue: https://github.com/dask/s3fs

delgadom commented 6 years ago

Yes! Thanks @jhamman and @shoyer. I hadn't tried it yet, but just did, and it worked great!

In  [1]: import xarray as xr
    ...: import requests
    ...: import netCDF4
    ...: 
    ...: %matplotlib inline

In  [2]: res = requests.get(
    ...:     'http://nasanex.s3.amazonaws.com/NEX-GDDP/BCSD/rcp45/day/atmos/tasmin/' +
    ...:     'r1i1p1/v1.0/tasmin_day_BCSD_rcp45_r1i1p1_CESM1-BGC_2073.nc')

In  [3]: res.status_code
Out [3]: 200

In  [4]: res.headers['content-type']
Out [4]: 'application/x-netcdf'

In  [5]: nc4_ds = netCDF4.Dataset('tasmin_day_BCSD_rcp45_r1i1p1_CESM1-BGC_2073', memory=res.content)

In  [6]: store = xr.backends.NetCDF4DataStore(nc4_ds)

In  [7]: ds = xr.open_dataset(store)

In  [8]: ds.tasmin.isel(time=0).plot()
    /global/home/users/mdelgado/git/public/xarray/xarray/plot/utils.py:51: FutureWarning: 'pandas.tseries.converter.register' has been moved and renamed to 'pandas.plotting.register_matplotlib_converters'. 
      converter.register()
Out [8]: <matplotlib.collections.QuadMesh at 0x2aede3c922b0>

[image: output_7_2 — plot of tasmin at time=0]

In  [9]: ds
Out [9]:
    <xarray.Dataset>
    Dimensions:  (lat: 720, lon: 1440, time: 365)
    Coordinates:
      * time     (time) datetime64[ns] 2073-01-01T12:00:00 2073-01-02T12:00:00 ...
      * lat      (lat) float32 -89.875 -89.625 -89.375 -89.125 -88.875 -88.625 ...
      * lon      (lon) float32 0.125 0.375 0.625 0.875 1.125 1.375 1.625 1.875 ...
    Data variables:
        tasmin   (time, lat, lon) float64 ...
    Attributes:
        parent_experiment:              historical
        parent_experiment_id:           historical
        parent_experiment_rip:          r1i1p1
        Conventions:                    CF-1.4
        institution:                    NASA Earth Exchange, NASA Ames Research C...
        institute_id:                   NASA-Ames
        realm:                          atmos
        modeling_realm:                 atmos
        version:                        1.0
        downscalingModel:               BCSD
        experiment_id:                  rcp45
        frequency:                      day
        realization:                    1
        initialization_method:          1
        physics_version:                1
        tracking_id:                    1865ff49-b20c-4268-852a-a9503efec72c
        driving_data_tracking_ids:      N/A
        driving_model_ensemble_member:  r1i1p1
        driving_experiment_name:        historical
        driving_experiment:             historical
        model_id:                       BCSD
        references:                     BCSD method: Thrasher et al., 2012, Hydro...
        DOI:                            http://dx.doi.org/10.7292/W0MW2F2G
        experiment:                     RCP4.5
        title:                          CESM1-BGC global downscaled NEX CMIP5 Cli...
        contact:                        Dr. Rama Nemani: rama.nemani@nasa.gov, Dr...
        disclaimer:                     This data is considered provisional and s...
        resolution_id:                  0.25 degree
        project_id:                     NEXGDDP
        table_id:                       Table day (12 November 2010)
        source:                         BCSD 2014
        creation_date:                  2015-01-07T19:18:31Z
        forcing:                        N/A
        product:                        output

shoyer commented 6 years ago

We could potentially add a from_memory() constructor to NetCDF4DataStore to simplify this process.

nickwg03 commented 6 years ago

@delgadom which version of netCDF4 are you using? I'm following your same steps but am still receiving an [Errno 2] No such file or directory error

delgadom commented 6 years ago

xarray==0.10.2 netCDF4==1.3.1

Just tried it again and didn't have any issues:

import os

import netCDF4
import requests
import xarray as xr

patt = (
    'http://nasanex.s3.amazonaws.com/NEX-GDDP/BCSD/{scen}/day/atmos/{var}/' +
    'r1i1p1/v1.0/{var}_day_BCSD_{scen}_r1i1p1_{model}_{year}.nc')

def open_url_dataset(url):
    fname = os.path.splitext(os.path.basename(url))[0]
    res = requests.get(url)
    nc4_ds = netCDF4.Dataset(fname, memory=res.content)

    store = xr.backends.NetCDF4DataStore(nc4_ds)
    ds = xr.open_dataset(store)

    return ds

ds = open_url_dataset(url=patt.format(
    model='GFDL-ESM2G', scen='historical', var='tasmax', year=1988))
ds

nickwg03 commented 6 years ago

@delgadom Ah, I see. I needed libnetcdf=4.5.0, I had been using an earlier version. Sounds like prior to 4.5.0 there were still some issues with the name of the file being passed into netCDF4.Dataset, as is mentioned here: https://github.com/Unidata/netcdf4-python/issues/295

JackKelly commented 4 years ago

Is this now implemented (and hence can this issue be closed)? It appears that this works well:

    import io

    import boto3
    import xarray as xr

    # bucket and key identify the netCDF file on S3
    boto_s3 = boto3.client('s3')
    s3_object = boto_s3.get_object(Bucket=bucket, Key=key)
    netcdf_bytes = s3_object['Body'].read()
    netcdf_bytes_io = io.BytesIO(netcdf_bytes)
    ds = xr.open_dataset(netcdf_bytes_io)

Is that the right approach to opening a NetCDF file on S3, using the latest xarray code?

JackKelly commented 4 years ago

FWIW, I've also tested @delgadom's technique using netCDF4, and it also works well (it is useful in situations where we don't want to install h5netcdf). Thanks!