intake / intake-xarray

Intake plugin for xarray
https://intake-xarray.readthedocs.io/
BSD 2-Clause "Simplified" License

Remote path for Rasterio driver #31

Closed davidbrochart closed 5 years ago

davidbrochart commented 5 years ago

Passing a remote path to the Rasterio driver doesn't seem to work, as urlpath is directly passed to Rasterio. I get the following error with this catalog:

plugins:
  source:
    - module: intake_xarray
sources:
  hydrosheds:
    description: Flow accumulation at 3-second resolution
    metadata:
      url: 'https://www.hydrosheds.org'
      tags:
        - flow
    driver: rasterio
    args:
      urlpath: 'gcs://pangeo-data/hydrosheds/acc.vrt'
      chunks: {'lat': 6000, 'lon': 6000}

---------------------------------------------------------------------------
CPLE_OpenFailedError                      Traceback (most recent call last)
rasterio/_base.pyx in rasterio._base.DatasetBase.__init__()

rasterio/_shim.pyx in rasterio._shim.open_dataset()

rasterio/_err.pyx in rasterio._err.exc_wrap_pointer()

CPLE_OpenFailedError: gcs://pangeo-data/hydrosheds/acc.vrt: No such file or directory

During handling of the above exception, another exception occurred:

RasterioIOError                           Traceback (most recent call last)
<ipython-input-25-94f3048d9635> in <module>()
----> 1 acc = hydrosheds.to_dask()

/srv/conda/lib/python3.6/site-packages/intake_xarray/base.py in to_dask(self)
     68     def to_dask(self):
     69         """Return xarray object where variables are dask arrays"""
---> 70         return self.read_chunked()
     71 
     72     def close(self):

/srv/conda/lib/python3.6/site-packages/intake_xarray/base.py in read_chunked(self)
     43     def read_chunked(self):
     44         """Return xarray object (which will have chunks)"""
---> 45         self._load_metadata()
     46         return self._ds
     47 

/srv/conda/lib/python3.6/site-packages/intake/source/base.py in _load_metadata(self)
    128         """load metadata only if needed"""
    129         if self._schema is None:
--> 130             self._schema = self._get_schema()
    131             self.datashape = self._schema.datashape
    132             self.dtype = self._schema.dtype

/srv/conda/lib/python3.6/site-packages/intake_xarray/raster.py in _get_schema(self)
     91 
     92         if self._ds is None:
---> 93             self._open_dataset()
     94 
     95             ds2 = xr.Dataset({'raster': self._ds})

/srv/conda/lib/python3.6/site-packages/intake_xarray/raster.py in _open_dataset(self)
     81         else:
     82             self._ds = xr.open_rasterio(self.urlpath, chunks=self.chunks,
---> 83                                         **self._kwargs)
     84 
     85     def _get_schema(self):

/srv/conda/lib/python3.6/site-packages/xarray/backends/rasterio_.py in open_rasterio(filename, parse_coordinates, chunks, cache, lock)
    212 
    213     # Get bands
--> 214     if riods.value.count < 1:
    215         raise ValueError('Unknown dims')
    216     coords['band'] = np.asarray(riods.value.indexes)

/srv/conda/lib/python3.6/site-packages/xarray/backends/common.py in value(self)
    526     @property
    527     def value(self):
--> 528         self._ds = self.opener()
    529         return self._ds
    530 

/srv/conda/lib/python3.6/site-packages/rasterio/env.py in wrapper(*args, **kwds)
    419 
    420         with env_ctor(session=session):
--> 421             return f(*args, **kwds)
    422 
    423     return wrapper

/srv/conda/lib/python3.6/site-packages/rasterio/__init__.py in open(fp, mode, driver, width, height, count, crs, transform, dtype, nodata, sharing, **kwargs)
    214         # None.
    215         if mode == 'r':
--> 216             s = DatasetReader(path, driver=driver, **kwargs)
    217         elif mode == 'r+':
    218             s = get_writer_for_path(path)(path, mode, driver=driver, **kwargs)

rasterio/_base.pyx in rasterio._base.DatasetBase.__init__()

RasterioIOError: gcs://pangeo-data/hydrosheds/acc.vrt: No such file or directory

scottyhq commented 5 years ago

Hi @davidbrochart, it looks like your path should start with gs://, not gcs://. In theory this should work, but you may run into issues with Google authentication. A simple workaround for now is to grant public access to the bucket you are working with, so that each file has an http:// access point.

Also, careful with VRT files if you are running dask distributed, since I'm guessing each worker will need a local copy of the VRT!

Make sure you have GDAL 2.3 or later, which is the version that introduced the /vsigs/ virtual file system: https://www.gdal.org/gdal_virtual_file_systems.html#gdal_virtual_file_systems_vsigs

The latest versions of rasterio are also simplifying the authentication process: https://github.com/mapbox/rasterio/pull/1577/files/4a441cbcd2beff6c5fe5acce5b54644cc56839d4
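
To make the path forms mentioned above concrete, here is a minimal sketch (not part of intake-xarray or rasterio) that rewrites a gcs:// or gs:// URL into the three spellings discussed: the gs:// scheme, GDAL's /vsigs/ virtual file system path, and the public https:// endpoint that is only reachable if the bucket allows public access. The helper name `gcs_path_variants` is hypothetical, and whether a given rasterio/GDAL build accepts each form depends on the installed versions and on authentication.

```python
from urllib.parse import urlparse


def gcs_path_variants(url):
    """Return alternative spellings of a Google Cloud Storage object path.

    Hypothetical helper for illustration only; it does string rewriting
    and never touches the network.
    """
    parsed = urlparse(url)
    if parsed.scheme not in ("gs", "gcs"):
        raise ValueError(f"not a Google Cloud Storage URL: {url!r}")

    bucket = parsed.netloc            # e.g. 'pangeo-data'
    key = parsed.path.lstrip("/")     # e.g. 'hydrosheds/acc.vrt'

    return {
        # Scheme suggested in this thread instead of gcs://
        "gs": f"gs://{bucket}/{key}",
        # GDAL virtual file system path for Google Cloud Storage
        "vsigs": f"/vsigs/{bucket}/{key}",
        # Public HTTP endpoint; only works for publicly readable objects
        "https": f"https://storage.googleapis.com/{bucket}/{key}",
    }


variants = gcs_path_variants("gcs://pangeo-data/hydrosheds/acc.vrt")
for name, path in variants.items():
    print(f"{name}: {path}")
```

Any of these strings could then be tried as `urlpath` in the catalog entry; none of them is guaranteed to open without the right GDAL version and credentials.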

jsignell commented 5 years ago

If you are comfortable with caching the data locally, that would be a way around any limitations rasterio might have with remote paths or file formats. To enable that, add this to your catalog entry:

    driver: rasterio
    cache:
      - argkey: urlpath
        regex: 'pangeo-data'
        type: file

davidbrochart commented 5 years ago

Thanks @scottyhq. Rasterio v1.0.14 has not been released on conda yet; I think that's why it doesn't work. I won't access the dataset with dask, so that's not an issue. @jsignell, this works better: now I get an authentication error, as @scottyhq mentioned, but that should be easy to solve. Thanks!

davidbrochart commented 5 years ago

It works fine using @jsignell's caching mechanism. Closing the issue, thanks again!