Closed aaronspring closed 3 years ago
Instead of using a "cache:" block, does it work if you use URLs like
urlpath: "simplecache::http://maps.tnc.org/files/shp/terr-ecoregions-TNC.zip"
with
MEOW:
  description: MEOW
  driver: intake_geopandas.geopandas.ShapefileSource
  args:
    urlpath: "simplecache::http://maps.tnc.org/files/shp/MEOW-TNC.zip"
  #cache:
  #  - type: file
  #    argkey: urlpath
>>> shp_cat = intake.open_catalog('~/remote_shapefiles.yml')
>>> shp_cat.MEOW.read()
Traceback (most recent call last):
File "fiona/_shim.pyx", line 82, in fiona._shim.gdal_open_vector
File "fiona/_err.pyx", line 270, in fiona._err.exc_wrap_pointer
fiona._err.CPLE_OpenFailedError: simplecache::http://maps.tnc.org/files/shp/MEOW-TNC.zip: No such file or directory
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/work/mh0727/m300524/miniconda3/envs/pymistral/lib/python3.7/site-packages/intake_geopandas/geopandas.py", line 42, in read
File "/work/mh0727/m300524/miniconda3/envs/pymistral/lib/python3.7/site-packages/intake_geopandas/geopandas.py", line 27, in _get_schema
File "/work/mh0727/m300524/miniconda3/envs/pymistral/lib/python3.7/site-packages/intake_geopandas/geopandas.py", line 81, in _open_dataset
File "/work/mh0727/m300524/miniconda3/envs/pymistral/lib/python3.7/site-packages/geopandas/io/file.py", line 89, in read_file
with reader(path_or_bytes, **kwargs) as features:
File "/work/mh0727/m300524/miniconda3/envs/pymistral/lib/python3.7/site-packages/fiona/env.py", line 397, in wrapper
return f(*args, **kwargs)
File "/work/mh0727/m300524/miniconda3/envs/pymistral/lib/python3.7/site-packages/fiona/__init__.py", line 253, in open
layer=layer, enabled_drivers=enabled_drivers, **kwargs)
File "/work/mh0727/m300524/miniconda3/envs/pymistral/lib/python3.7/site-packages/fiona/collection.py", line 159, in __init__
self.session.start(self, **kwargs)
File "fiona/ogrext.pyx", line 484, in fiona.ogrext.Session.start
File "fiona/_shim.pyx", line 89, in fiona._shim.gdal_open_vector
fiona.errors.DriverError: simplecache::http://maps.tnc.org/files/shp/MEOW-TNC.zip: No such file or directory
I see, so fiona doesn't work with arbitrary file-like objects...
So, the code of intake_geopandas could be changed to call fsspec.open_local to resolve caching and use the URL that I suggested. This is probably the place where any change should go.
For the "old" kind of caching that you are trying to use, the driver also needs to be explicitly set up to call the caching machinery, which basically means calling load as you are trying to do and making use of the return value. This is one of the reasons that I would prefer implementing the previous open_local-based method. The "cache:" blocks have served to confuse, and I should mark them as deprecated.
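A minimal sketch of the open_local-based approach described above, assuming fsspec is installed. The helper name cached_local_path and the cache directory argument are hypothetical and not part of intake_geopandas. Passing same_names=True keeps the original file name (including the .zip extension) in the cache, on the assumption that the downstream reader relies on the extension.

```python
def cached_local_path(urlpath, cache_dir):
    """Return a local path for `urlpath`, copying it into `cache_dir` on first use.

    `urlpath` is expected to carry a caching prefix,
    e.g. "simplecache::http://host/file.zip".
    Hypothetical helper, not an intake_geopandas function.
    """
    # Imported lazily so this module can be loaded without fsspec installed.
    import fsspec

    # Options for the "simplecache" layer are passed under its protocol name.
    return fsspec.open_local(
        urlpath,
        simplecache={"cache_storage": cache_dir, "same_names": True},
    )
```

Called with the URL from the catalog, for example cached_local_path("simplecache::http://maps.tnc.org/files/shp/MEOW-TNC.zip", "/tmp/shp_cache"), this should download the archive once and return the cached path on later calls. Whether the reader then needs a zip:// prefix for that path depends on the installed fiona and GeoPandas versions.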
I like this old caching as it seems easy to use.
I didn't get simplecache:: running with either intake-xarray or intake-geopandas. I haven't seen a working example for simplecache. If this is nicer, I would be happy to adopt it.
If the old caching doesn't become dysfunctional, I would try to implement it as you suggested.
I would love for this project to get caching working. The tricky thing is that fiona/GDAL has its own virtual filesystem and URL format that is incompatible with that of fsspec. So a zip file in S3 is something like /vsis3/vsizip/path/to.shp using GDAL, and zip+s3://path/to.shp in fiona. Composing that with the fsspec virtual filesystem is pretty tricky to do, and stymied my initial ill-formed thoughts about how to do that in #10.
Another wrinkle is that the most recent version of GeoPandas added experimental support for storing vector geodata in parquet files, and that functionality does not use the GDAL VSI system. It would be cool to add that here, and I think that has an easier path forward to using intake-style caching.
I see, so fiona doesn't work with arbitrary file-like objects...
GeoPandas sort-of works with file-like objects, but I don't think it is duck-typed enough to play nicely with fsspec.
So, the code of intake_geopandas could be changed to call fsspec.open_local to resolve caching and use the URL that I suggested. This is probably the place where any change should go.
The main question in my mind is whether it is at all possible to tokenize the URL so that the fsspec-relevant caching parts are resolved, but the GDAL-relevant fiona parts are left untouched. I wonder whether at that point the URL will simply have too much semantic content to be wieldy. Perhaps a better approach would be to add another set of caching args to the driver?
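To make the tokenizing question concrete, here is a rough sketch that peels leading fsspec caching protocols off a "::"-chained URL and leaves the remainder (for example a fiona-style zip+s3:// URL) untouched. The function name split_cache_prefix and the set of protocol names it checks are assumptions for illustration; nothing like this exists in intake_geopandas, and it does not address how the remainder would then be fetched.

```python
# Caching protocol names this sketch treats as "fsspec-relevant" (an assumption).
FSSPEC_CACHE_PROTOCOLS = {"simplecache", "filecache", "blockcache"}


def split_cache_prefix(urlpath):
    """Split leading fsspec caching protocols from the rest of a chained URL.

    Returns a tuple (cache_protocols, remainder), where `remainder` is the
    part of the URL that would be left for fiona/GDAL to interpret.
    """
    parts = urlpath.split("::")
    caches = []
    # Only strip a prefix if something is left over to act as the target URL.
    while len(parts) > 1 and parts[0] in FSSPEC_CACHE_PROTOCOLS:
        caches.append(parts.pop(0))
    return caches, "::".join(parts)


caches, remainder = split_cache_prefix("simplecache::zip+s3://path/to.shp")
print(caches, remainder)
```

Even in this small form the URL carries two separate grammars, which is the wieldiness concern raised above.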
For the "old" kind of caching that you are trying to use, the driver also needs to be explicitly set up to call the caching machinery, which basically means calling load as you are trying to do and make use of the return value.
@martindurant is that not also somewhat true of the new-style? That is, the drivers need to explicitly opt-in to using fsspec in order to get the caching ability.
@martindurant is that not also somewhat true of the new-style? That is, the drivers need to explicitly opt-in to using fsspec in order to get the caching ability.
It is true for both, yes, although I feel that fsspec.open (or fsspec.open_local) is much easier to use than the intake cache mechanism, and is not specific to intake, so should get much more usage.
@martindurant is that not also somewhat true of the new-style? That is, the drivers need to explicitly opt-in to using fsspec in order to get the caching ability.
It is true for both, yes, although I feel that fsspec.open (or fsspec.open_local) is much easier to use than the intake cache mechanism, and is not specific to intake, so should get much more usage.
Yeah, it certainly seems much easier to use from my perspective. I do think that in this case the upstream path-handling may require a different treatment here.
You can't have both local caching and use of the GDAL URL compounding at the same time, right? We could ask fsspec to open_local (which means, open local file, or cache and open if the URL is remote-with-caching), and pass to GDAL if it fails. The current error when doing open_local on a pure remote URL is not particularly clear...
You can't have both local caching and use of the GDAL URL compounding at the same time, right?
Maybe not impossible, but I think that it would be painful to try to support that.
We could ask fsspec to open_local (which means, open local file, or cache and open if the URL is remote-with-caching), and pass to GDAL if it fails. The current error when doing open_local on a pure remote URL is not particularly clear...
Interesting, I think that could be a reasonable way forward, though it could be confusing to users. In that case simplecache::s3://... would delegate to fsspec, and s3://... would delegate to GDAL, with possibly different edge cases/failure modes.
could be confusing to users
I can see that. You could have a use_fsspec kwarg or something
I can see that. You could have a use_fsspec kwarg or something
Yeah, I think something like that would be better than some sort of (probably leaky) automagic delegation logic.
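A rough sketch of what such an explicit switch might look like, in place of automatic delegation. The function name resolve_urlpath and its parameters are hypothetical; only the use_fsspec name comes from the discussion, and this is not how intake_geopandas is known to implement it. With use_fsspec=False the URL is passed through untouched for fiona/GDAL to interpret; with use_fsspec=True it is resolved to a local path via fsspec.open_local.

```python
def resolve_urlpath(urlpath, use_fsspec=False, storage_options=None):
    """Decide explicitly who interprets `urlpath`.

    use_fsspec=False: return the URL unchanged, so fiona/GDAL handles it.
    use_fsspec=True:  let fsspec fetch/cache it and return a local path.
    Hypothetical helper for illustration only.
    """
    if not use_fsspec:
        return urlpath

    # Imported lazily so the GDAL-only code path does not require fsspec.
    import fsspec

    return fsspec.open_local(urlpath, **(storage_options or {}))


# GDAL-style handling: the URL is left exactly as the user wrote it.
print(resolve_urlpath("s3://path/to.shp"))
```

Keeping the choice in a keyword argument means a URL such as s3://... never changes behaviour silently; the user states which machinery should handle it.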
I would like to cache shapefiles to disk (also because I don't have internet access within Jupyter notebooks). I was hoping to get the same caching functionality as with intake-xarray (https://github.com/intake/intake-xarray/blob/f4e6211d5c402256cd1036a5f490ae9a00a61832/intake_xarray/netcdf.py).
Here is my catalog:
But it never stores data to disk.
Then I found this hacky way to cache data to disk:
But even though the data is correctly downloaded and present in ~/.intake/cache_metadata.json, intake tries to get the data from the URL (hence the error) and not from the cached folder:
Interestingly, inside the catalog item, cache is moved into metadata:
Is this only a catalog problem with cache wrongly assigned to metadata? Why is cache under metadata? @jacobtomlinson @martindurant
Or do we need to tell intake_geopandas/geopandas.py explicitly to search for cached data? I thought this was done upstream by intake. Or is this what intake-xarray achieved with if getattr(self, 'on_server', False): metadata['internal'] = serialize_zarr_ds(self._ds) (https://github.com/intake/intake-xarray/blob/f4e6211d5c402256cd1036a5f490ae9a00a61832/intake_xarray/base.py)? @jsignell