pangeo-data / pangeo-datastore

Pangeo Cloud Datastore
https://catalog.pangeo.io
48 stars 16 forks source link

Problems loading cloud-optimized-geotiff from Pangeo's catalog #109

Closed jhamman closed 3 years ago

jhamman commented 4 years ago

I am trying to open a cloud optimized geotiff in one of Pangeo's GCS buckets. Opening this dataset worked in the past but has stopped working. One thing to consider is that the bucket was recently (#95) moved to a requestor pays bucket. I have confirmed the dataset in question is present in gcs.

The following snippet of code once ran on ocean.pangeo.io, but now returns the error below.

MCVE Code Sample

import intake
cat2 = intake.open_catalog('https://raw.githubusercontent.com/pangeo-data/pangeo-datastore/master/intake-catalogs/master.yaml')
display(cat2.hydro.soil_grids_single_level.describe())
cat2.hydro.soil_grids_single_level.to_dask()

returns

{'name': 'soil_grids_single_level',
 'container': 'xarray',
 'plugin': ['rasterio'],
 'description': 'Global gridded soil information in COG format',
 'direct_access': 'forbid',
 'user_parameters': [{'name': 'variable',
   'description': 'soil variable',
   'type': 'str',
   'default': 'TAXNWRB'}],
 'metadata': {'url': 'https://soilgrids.org/', 'tags': 'Soil'},
 'args': {'urlpath': 'gs://pangeo-ncar-soilgrids/{{ variable }}_250m.tif',
  'chunks': {'y': 5120, 'x': 5120},
  'storage_options': {'requester_pays': True}}}

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/srv/conda/envs/notebook/lib/python3.7/site-packages/xarray/backends/file_manager.py in _acquire_with_cache_info(self, needs_lock)
    197             try:
--> 198                 file = self._cache[self._key]
    199             except KeyError:

/srv/conda/envs/notebook/lib/python3.7/site-packages/xarray/backends/lru_cache.py in __getitem__(self, key)
     52         with self._lock:
---> 53             value = self._cache[key]
     54             self._cache.move_to_end(key)

KeyError: [<function open at 0x7f516dc838c8>, ('gs://pangeo-ncar-soilgrids/TAXNWRB_250m.tif',), 'r', ()]

During handling of the above exception, another exception occurred:

CPLE_OpenFailedError                      Traceback (most recent call last)
rasterio/_base.pyx in rasterio._base.DatasetBase.__init__()

rasterio/_shim.pyx in rasterio._shim.open_dataset()

rasterio/_err.pyx in rasterio._err.exc_wrap_pointer()

CPLE_OpenFailedError: '/vsigs/pangeo-ncar-soilgrids/TAXNWRB_250m.tif' does not exist in the file system, and is not recognized as a supported dataset name.

During handling of the above exception, another exception occurred:

RasterioIOError                           Traceback (most recent call last)
<ipython-input-5-9b7a4e7350bd> in <module>
      2 cat2 = intake.open_catalog('https://raw.githubusercontent.com/pangeo-data/pangeo-datastore/master/intake-catalogs/master.yaml')
      3 display(cat2.hydro.soil_grids_single_level.describe())
----> 4 cat2.hydro.soil_grids_single_level.to_dask()

/srv/conda/envs/notebook/lib/python3.7/site-packages/intake_xarray/base.py in to_dask(self)
     67     def to_dask(self):
     68         """Return xarray object where variables are dask arrays"""
---> 69         return self.read_chunked()
     70 
     71     def close(self):

/srv/conda/envs/notebook/lib/python3.7/site-packages/intake_xarray/base.py in read_chunked(self)
     42     def read_chunked(self):
     43         """Return xarray object (which will have chunks)"""
---> 44         self._load_metadata()
     45         return self._ds
     46 

/srv/conda/envs/notebook/lib/python3.7/site-packages/intake/source/base.py in _load_metadata(self)
    115         """load metadata only if needed"""
    116         if self._schema is None:
--> 117             self._schema = self._get_schema()
    118             self.datashape = self._schema.datashape
    119             self.dtype = self._schema.dtype

/srv/conda/envs/notebook/lib/python3.7/site-packages/intake_xarray/raster.py in _get_schema(self)
     93 
     94         if self._ds is None:
---> 95             self._open_dataset()
     96 
     97             ds2 = xr.Dataset({'raster': self._ds})

/srv/conda/envs/notebook/lib/python3.7/site-packages/intake_xarray/raster.py in _open_dataset(self)
     82         else:
     83             self._ds = xr.open_rasterio(self.urlpath, chunks=self.chunks,
---> 84                                         **self._kwargs)
     85 
     86     def _get_schema(self):

/srv/conda/envs/notebook/lib/python3.7/site-packages/xarray/backends/rasterio_.py in open_rasterio(filename, parse_coordinates, chunks, cache, lock)
    237 
    238     manager = CachingFileManager(rasterio.open, filename, lock=lock, mode="r")
--> 239     riods = manager.acquire()
    240     if vrt_params is not None:
    241         riods = WarpedVRT(riods, **vrt_params)

/srv/conda/envs/notebook/lib/python3.7/site-packages/xarray/backends/file_manager.py in acquire(self, needs_lock)
    178         An open file object, as returned by ``opener(*args, **kwargs)``.
    179         """
--> 180         file, _ = self._acquire_with_cache_info(needs_lock)
    181         return file
    182 

/srv/conda/envs/notebook/lib/python3.7/site-packages/xarray/backends/file_manager.py in _acquire_with_cache_info(self, needs_lock)
    202                     kwargs = kwargs.copy()
    203                     kwargs["mode"] = self._mode
--> 204                 file = self._opener(*self._args, **kwargs)
    205                 if self._mode == "w":
    206                     # ensure file doesn't get overriden when opened again

/srv/conda/envs/notebook/lib/python3.7/site-packages/rasterio/env.py in wrapper(*args, **kwds)
    443 
    444         with env_ctor(session=session):
--> 445             return f(*args, **kwds)
    446 
    447     return wrapper

/srv/conda/envs/notebook/lib/python3.7/site-packages/rasterio/__init__.py in open(fp, mode, driver, width, height, count, crs, transform, dtype, nodata, sharing, **kwargs)
    214         # None.
    215         if mode == 'r':
--> 216             s = DatasetReader(path, driver=driver, sharing=sharing, **kwargs)
    217         elif mode == 'r+':
    218             s = get_writer_for_path(path)(path, mode, driver=driver, sharing=sharing, **kwargs)

rasterio/_base.pyx in rasterio._base.DatasetBase.__init__()

RasterioIOError: '/vsigs/pangeo-ncar-soilgrids/TAXNWRB_250m.tif' does not exist in the file system, and is not recognized as a supported dataset name.

Version info

(All running on ocean.pangeo.io, JUPYTER_IMAGE_SPEC=us.gcr.io/pangeo-181919/ocean-pangeo-io-notebook:b36b8b7)

INSTALLED VERSIONS ------------------ commit: None python: 3.7.3 | packaged by conda-forge | (default, Dec 6 2019, 08:54:18) [GCC 7.3.0] python-bits: 64 OS: Linux OS-release: 4.19.102+ machine: x86_64 processor: x86_64 byteorder: little LC_ALL: en_US.UTF-8 LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 libhdf5: 1.10.5 libnetcdf: 4.6.2 xarray: 0.15.1 pandas: 0.25.3 numpy: 1.17.3 scipy: 1.3.2 netCDF4: 1.5.1.2 pydap: installed h5netcdf: 0.7.4 h5py: 2.10.0 Nio: None zarr: 2.3.2 cftime: 1.1.1.2 nc_time_axis: 1.2.0 PseudoNetCDF: None rasterio: 1.0.25 cfgrib: None iris: 2.2.0 bottleneck: 1.3.2 dask: 2.9.0 distributed: 2.9.0 matplotlib: 3.1.2 cartopy: 0.17.0 seaborn: 0.9.0 numbagg: None setuptools: 42.0.2.post20191201 pip: 20.1 conda: None pytest: None IPython: 7.10.1 sphinx: None intake 0.5.3 intake-xarray 0.3.1 rasterio 1.0.25

pinging @scottyhq, @martindurant, @charlesbluca, @TomAugspurger and @rabernat for some help here. Thanks!

martindurant commented 4 years ago

That is a rather unpleasant traceback to read! I don't see where the path that rasterio is attempting to read comes from.

jhamman commented 4 years ago

@martindurant - drilling down a bit, here's a shorter example that eximplifies the problem:

with rasterio.open('gs://pangeo-ncar-soilgrids/TAXNWRB_250m.tif') as dataset:
    pass
---------------------------------------------------------------------------
CPLE_OpenFailedError                      Traceback (most recent call last)
rasterio/_base.pyx in rasterio._base.DatasetBase.__init__()

rasterio/_shim.pyx in rasterio._shim.open_dataset()

rasterio/_err.pyx in rasterio._err.exc_wrap_pointer()

CPLE_OpenFailedError: '/vsigs/pangeo-ncar-soilgrids/TAXNWRB_250m.tif' does not exist in the file system, and is not recognized as a supported dataset name.

During handling of the above exception, another exception occurred:

RasterioIOError                           Traceback (most recent call last)
<ipython-input-55-64c2b4a518c2> in <module>
----> 1 with rasterio.open('gs://pangeo-ncar-soilgrids/TAXNWRB_250m.tif') as dataset:
      2     pass

/srv/conda/envs/notebook/lib/python3.7/site-packages/rasterio/env.py in wrapper(*args, **kwds)
    443 
    444         with env_ctor(session=session):
--> 445             return f(*args, **kwds)
    446 
    447     return wrapper

/srv/conda/envs/notebook/lib/python3.7/site-packages/rasterio/__init__.py in open(fp, mode, driver, width, height, count, crs, transform, dtype, nodata, sharing, **kwargs)
    214         # None.
    215         if mode == 'r':
--> 216             s = DatasetReader(path, driver=driver, sharing=sharing, **kwargs)
    217         elif mode == 'r+':
    218             s = get_writer_for_path(path)(path, mode, driver=driver, sharing=sharing, **kwargs)

rasterio/_base.pyx in rasterio._base.DatasetBase.__init__()

RasterioIOError: '/vsigs/pangeo-ncar-soilgrids/TAXNWRB_250m.tif' does not exist in the file system, and is not recognized as a supported dataset name.
martindurant commented 4 years ago

I don't at all know how rasterio works... but where does the "/vsigs/" path come from, is this some symbolic link embedded in the file? Is the "gs" part of that path significant? I notice that the DatasetReader is cython, so it's possible that you just can't pass arbitrary file-like objects, although I thought that was a known and tested case.

scottyhq commented 4 years ago

I don't at all know how rasterio works... but where does the "/vsigs/" path come from, is this some symbolic link embedded in the file? Is the "gs" part of that path significant?

rasterio uses GDAL behind the scenes and automatically converts 'gs://' prefixes to '/vsigs', which are described here https://gdal.org/user/virtual_file_systems.html.

@jhamman - I'd suggest bypassing all the python pieces and confirm that you can list or read the file from the machine you are running with gsutil. On AWS you can check which credentials are being used with aws configure list, how about gsutil? Also on AWS, there is either a flag to command line utilities or an environment variable you can set for requester pays buckets (AWS_REQUEST_PAYER='requester')

jhamman commented 4 years ago

We can certainly read the data directly. Below are two examples that use gcsfs:

import gcsfs
import rasterio
import xarray as xr

fs = gcsfs.GCSFileSystem(token='cloud', project='pangeo-181919', requester_pays=True)

# Example 1: get the tif's bytes using fs.cat()
tif = fs.cat('pangeo-ncar-soilgrids/TAXNWRB_250m.tif')

# Example 2: use a file-like object from gcsfs
fobj = fs.open('pangeo-ncar-soilgrids/TAXNWRB_250m.tif')
with rasterio.open(fobj) as dataset:
    print(dataset.bounds)

Now, using a similiar approach with Xarray's rasterio backend does not work:

da = xr.open_rasterio(fobj)

Yields:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-8-4074e0459ea2> in <module>
----> 1 da = xr.open_rasterio(fobj)
      2 da

/srv/conda/envs/notebook/lib/python3.7/site-packages/xarray/backends/rasterio_.py in open_rasterio(filename, parse_coordinates, chunks, cache, lock)
    247 
    248     # Get bands
--> 249     if riods.count < 1:
    250         raise ValueError("Unknown dims")
    251     coords["band"] = np.asarray(riods.indexes)

AttributeError: '_GeneratorContextManager' object has no attribute 'count'

Rasterio does not have an GCP equivalent to their AWS_REQUEST_PAYER option. @martindurant - can we make intake-xarray use gcsfs here? Currently, intake-xarray is passing the path onto xarray for gs paths: https://github.com/intake/intake-xarray/blob/bf98a3c69eea81be716b310e33aeefbf1a89b1d0/intake_xarray/raster.py#L77-L82

jhamman commented 4 years ago

fwiw, this does work:

with fs.open('pangeo-ncar-soilgrids/TAXNWRB_250m.tif') as fobj:
    with rasterio.open(fobj) as dataset:
        da = xr.open_rasterio(dataset)
        display(da)

Not pretty but it works.

martindurant commented 4 years ago

It is odd to me that xarray doesn't do exactly that internally, maybe it should. Yes, intake-xarray could be a place to get the right call, but it would of course be simpler if xarray handles the file object directly.

jhamman commented 4 years ago

@martindurant - I looked into this a bit. The problem really comes from rasterio:

https://github.com/mapbox/rasterio/blob/ce00d482faa24068f41164b79901eeec31777263/rasterio/__init__.py#L176-L186

The treatment of all file-like objects is to a) read the full file, and b) turn it into a local MemoryFile. AND then return a context manager (which is different that the rest of the return objects from rasterio.io).

@scottyhq - are you aware of any way to use rasterio with file-like objects without passing through this MemoryFile route?

scottyhq commented 4 years ago

Thanks for digging in @jhamman, this is pretty in the weeds and not something I've looked into. I encourage posting a question and example to https://rasterio.groups.io/g/main in order to poll collective wisdom on this one!

jhamman commented 4 years ago

I've just opened a PR in xarray that would support using file-like objects in xarray.open_rasterio. If that fix seems acceptable, we should move on to talking about how to get Intake-xarray to use gcsfs when loading data via rasterio's virtual file system fails.

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] commented 3 years ago

This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date.