Closed jhamman closed 3 years ago
That is a rather unpleasant traceback to read! I don't see where the path that rasterio is attempting to read comes from.
@martindurant - drilling down a bit, here's a shorter example that exemplifies the problem:
with rasterio.open('gs://pangeo-ncar-soilgrids/TAXNWRB_250m.tif') as dataset:
    pass
---------------------------------------------------------------------------
CPLE_OpenFailedError Traceback (most recent call last)
rasterio/_base.pyx in rasterio._base.DatasetBase.__init__()
rasterio/_shim.pyx in rasterio._shim.open_dataset()
rasterio/_err.pyx in rasterio._err.exc_wrap_pointer()
CPLE_OpenFailedError: '/vsigs/pangeo-ncar-soilgrids/TAXNWRB_250m.tif' does not exist in the file system, and is not recognized as a supported dataset name.
During handling of the above exception, another exception occurred:
RasterioIOError Traceback (most recent call last)
<ipython-input-55-64c2b4a518c2> in <module>
----> 1 with rasterio.open('gs://pangeo-ncar-soilgrids/TAXNWRB_250m.tif') as dataset:
2 pass
/srv/conda/envs/notebook/lib/python3.7/site-packages/rasterio/env.py in wrapper(*args, **kwds)
443
444 with env_ctor(session=session):
--> 445 return f(*args, **kwds)
446
447 return wrapper
/srv/conda/envs/notebook/lib/python3.7/site-packages/rasterio/__init__.py in open(fp, mode, driver, width, height, count, crs, transform, dtype, nodata, sharing, **kwargs)
214 # None.
215 if mode == 'r':
--> 216 s = DatasetReader(path, driver=driver, sharing=sharing, **kwargs)
217 elif mode == 'r+':
218 s = get_writer_for_path(path)(path, mode, driver=driver, sharing=sharing, **kwargs)
rasterio/_base.pyx in rasterio._base.DatasetBase.__init__()
RasterioIOError: '/vsigs/pangeo-ncar-soilgrids/TAXNWRB_250m.tif' does not exist in the file system, and is not recognized as a supported dataset name.
I don't at all know how rasterio works... but where does the "/vsigs/" path come from, is this some symbolic link embedded in the file? Is the "gs" part of that path significant? I notice that the DatasetReader is cython, so it's possible that you just can't pass arbitrary file-like objects, although I thought that was a known and tested case.
I don't at all know how rasterio works... but where does the "/vsigs/" path come from, is this some symbolic link embedded in the file? Is the "gs" part of that path significant?
rasterio uses GDAL behind the scenes and automatically converts 'gs://' prefixes to '/vsigs/' paths, which are described here: https://gdal.org/user/virtual_file_systems.html
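To make the prefix conversion concrete, here is a minimal sketch of the mapping. The function name `to_vsi_path` and the `VSI_PREFIXES` table are hypothetical and only illustrate the idea; this is not rasterio's actual implementation, and only the 'gs' and 's3' schemes are covered.

```python
from urllib.parse import urlparse

# Hypothetical, partial table of URL schemes and the GDAL virtual
# file system prefixes they correspond to.
VSI_PREFIXES = {"gs": "vsigs", "s3": "vsis3"}


def to_vsi_path(url):
    """Return a GDAL-style '/vsi.../' path for a cloud URL (illustration only)."""
    parsed = urlparse(url)
    prefix = VSI_PREFIXES.get(parsed.scheme)
    if prefix is None:
        # Unknown scheme or plain local path: leave it untouched.
        return url
    return f"/{prefix}/{parsed.netloc}{parsed.path}"


print(to_vsi_path("gs://pangeo-ncar-soilgrids/TAXNWRB_250m.tif"))
# /vsigs/pangeo-ncar-soilgrids/TAXNWRB_250m.tif
```

The printed path is the same string that appears in the CPLE_OpenFailedError message above, which is why the traceback mentions '/vsigs/' even though the code passed a 'gs://' URL.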
@jhamman - I'd suggest bypassing all the Python pieces and confirming that you can list or read the file from the machine you are running on with gsutil. On AWS you can check which credentials are being used with aws configure list; how about gsutil? Also on AWS, there is either a flag to command line utilities or an environment variable you can set for requester pays buckets (AWS_REQUEST_PAYER='requester').
We can certainly read the data directly. Below are two examples that use gcsfs:
import gcsfs
import rasterio
import xarray as xr
fs = gcsfs.GCSFileSystem(token='cloud', project='pangeo-181919', requester_pays=True)
# Example 1: get the tif's bytes using fs.cat()
tif = fs.cat('pangeo-ncar-soilgrids/TAXNWRB_250m.tif')
# Example 2: use a file-like object from gcsfs
fobj = fs.open('pangeo-ncar-soilgrids/TAXNWRB_250m.tif')
with rasterio.open(fobj) as dataset:
    print(dataset.bounds)
Now, using a similar approach with Xarray's rasterio backend does not work:
da = xr.open_rasterio(fobj)
Yields:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-8-4074e0459ea2> in <module>
----> 1 da = xr.open_rasterio(fobj)
2 da
/srv/conda/envs/notebook/lib/python3.7/site-packages/xarray/backends/rasterio_.py in open_rasterio(filename, parse_coordinates, chunks, cache, lock)
247
248 # Get bands
--> 249 if riods.count < 1:
250 raise ValueError("Unknown dims")
251 coords["band"] = np.asarray(riods.indexes)
AttributeError: '_GeneratorContextManager' object has no attribute 'count'
Rasterio does not have a GCP equivalent to its AWS_REQUEST_PAYER option. @martindurant - can we make intake-xarray use gcsfs here? Currently, intake-xarray is passing the path on to xarray for gs paths: https://github.com/intake/intake-xarray/blob/bf98a3c69eea81be716b310e33aeefbf1a89b1d0/intake_xarray/raster.py#L77-L82
fwiw, this does work:
with fs.open('pangeo-ncar-soilgrids/TAXNWRB_250m.tif') as fobj:
    with rasterio.open(fobj) as dataset:
        da = xr.open_rasterio(dataset)
        display(da)
Not pretty but it works.
It is odd to me that xarray doesn't do exactly that internally, maybe it should. Yes, intake-xarray could be a place to get the right call, but it would of course be simpler if xarray handles the file object directly.
@martindurant - I looked into this a bit. The problem really comes from rasterio:
The treatment of all file-like objects is to a) read the full file, b) turn it into a local MemoryFile, and then c) return a context manager (which is different from the rest of the return objects from rasterio.io).
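The AttributeError in the xarray traceback follows from that last point. The sketch below reproduces it with the standard library only; `FakeDataset` and `open_filelike` are hypothetical stand-ins, not rasterio code, and simply assume that the file-like branch is written as a generator-based context manager.

```python
from contextlib import contextmanager


class FakeDataset:
    """Hypothetical stand-in for an opened rasterio dataset."""
    count = 1


@contextmanager
def open_filelike(fobj):
    # Stand-in for the file-like branch: the dataset only becomes
    # available once the context manager is entered.
    dataset = FakeDataset()
    try:
        yield dataset
    finally:
        pass  # a real implementation would release resources here


cm = open_filelike(object())
print(type(cm).__name__)     # _GeneratorContextManager
print(hasattr(cm, "count"))  # False

with cm as dataset:
    print(dataset.count)     # 1
```

Code that reads `.count` directly on the return value, without entering the `with` block, fails in the way the traceback shows.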
@scottyhq - are you aware of any way to use rasterio with file-like objects without passing through this MemoryFile route?
Thanks for digging in @jhamman, this is pretty in the weeds and not something I've looked into. I encourage posting a question and example to https://rasterio.groups.io/g/main in order to poll collective wisdom on this one!
I've just opened a PR in xarray that would support using file-like objects in xarray.open_rasterio. If that fix seems acceptable, we should move on to talking about how to get Intake-xarray to use gcsfs when loading data via rasterio's virtual file system fails.
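One possible shape for such a fallback, as a rough sketch only: the helper `open_with_fallback` and the two opener functions are hypothetical and are not part of intake-xarray, and the example path is made up. It assumes the failure from the virtual file system surfaces as an OSError subclass, which is my understanding of RasterioIOError but should be checked.

```python
def open_with_fallback(path, primary_open, fallback_open, errors=(OSError,)):
    """Try primary_open(path); if it raises one of `errors`, use fallback_open(path)."""
    try:
        return primary_open(path)
    except errors:
        return fallback_open(path)


# Hypothetical openers standing in for the GDAL virtual file system route
# and the gcsfs file-object route.
def vsi_open(path):
    raise OSError(f"'{path}' does not exist in the file system")


def gcsfs_open(path):
    return f"opened {path} via gcsfs"


result = open_with_fallback("gs://bucket/file.tif", vsi_open, gcsfs_open)
print(result)  # opened gs://bucket/file.tif via gcsfs
```

Passing the openers in as arguments keeps the retry logic separate from any particular library, so either route could be swapped out.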
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date.
I am trying to open a cloud optimized GeoTIFF in one of Pangeo's GCS buckets. Opening this dataset worked in the past but has stopped working. One thing to consider is that the bucket was recently (#95) changed to a requester pays bucket. I have confirmed the dataset in question is present in GCS.
The following snippet of code once ran on ocean.pangeo.io, but now returns the error below.
MCVE Code Sample
returns
Version info
(All running on ocean.pangeo.io, JUPYTER_IMAGE_SPEC=us.gcr.io/pangeo-181919/ocean-pangeo-io-notebook:b36b8b7)
Pinging @scottyhq, @martindurant, @charlesbluca, @TomAugspurger and @rabernat for some help here. Thanks!