intake / intake_geopandas

An intake plugin for loading datasets with geopandas
BSD 2-Clause "Simplified" License

cache shapefiles #15

Closed aaronspring closed 3 years ago

aaronspring commented 4 years ago

I would like to cache shapefiles to disk (also because I don't have internet access within Jupyter notebooks). I was hoping to get the same caching functionality as with intake-xarray (https://github.com/intake/intake-xarray/blob/f4e6211d5c402256cd1036a5f490ae9a00a61832/intake_xarray/netcdf.py).

Here is my catalog:

plugins:
  source:
    - module: intake_geopandas
sources:
  TNC:
    description: TNC
    driver: intake_geopandas.geopandas.ShapefileSource
    args:
      urlpath: http://maps.tnc.org/files/shp/terr-ecoregions-TNC.zip
    cache:
      - type: file
        argkey: urlpath

  MEOW:
    description: MEOW
    driver: intake_geopandas.geopandas.ShapefileSource
    args:
      urlpath: http://maps.tnc.org/files/shp/MEOW-TNC.zip
    cache:
      - type: file
        argkey: urlpath

But it never stores data to disk.

Then I found this hacky way to cache data to disk:

shp_cat.MEOW.cache[0].load('http://maps.tnc.org/files/shp/MEOW-TNC.zip')
['/work/mh0727/m300524/intake/cache/4c41e317f0efd49e5a79df4a286585dc/maps.tnc.org/files/shp/MEOW-TNC.zip']

But even though data is correctly downloaded and present in ~/.intake/cache_metadata.json, intake tries to get the data from the URL (hence the error) and not from the cached folder:

import intake
intake.config.conf['cache_dir'] = '/work/mh0727/m300524/intake/cache'
shp_cat = intake.open_catalog('~/remote_shapefiles.yml')
shp = shp_cat.MEOW.read()
---------------------------------------------------------------------------
OSError 
Connected refused

Interestingly, inside the catalog item, cache is moved into metadata:

>>> shp_cat.TNC
sources:
  TNC:
    args:
      urlpath: http://maps.tnc.org/files/shp/terr-ecoregions-TNC.zip
    description: TNC
    driver: intake_geopandas.geopandas.ShapefileSource
    metadata:
      cache:
      - argkey: urlpath
        type: file

Is this only a catalog problem with cache wrongly assigned to metadata? Why is cache under metadata? @jacobtomlinson @martindurant

Or do we need to tell intake_geopandas/geopandas.py explicitly to search for cached data? I thought this was done upstream by intake. Or is this what intake-xarray achieved with if getattr(self, 'on_server', False): metadata['internal'] = serialize_zarr_ds(self._ds) (https://github.com/intake/intake-xarray/blob/f4e6211d5c402256cd1036a5f490ae9a00a61832/intake_xarray/base.py)? @jsignell

martindurant commented 4 years ago

Instead of using a "cache:" block, does it work if you use URLs like

urlpath: "simplecache::http://maps.tnc.org/files/shp/terr-ecoregions-TNC.zip"

aaronspring commented 4 years ago

with

  MEOW:
    description: MEOW
    driver: intake_geopandas.geopandas.ShapefileSource
    args:
        urlpath: "simplecache::http://maps.tnc.org/files/shp/MEOW-TNC.zip"
    #cache:
    #  - type: file
    #    argkey: urlpath

>>> shp_cat = intake.open_catalog('~/remote_shapefiles.yml')
>>> shp_cat.MEOW.read()
Traceback (most recent call last):
  File "fiona/_shim.pyx", line 82, in fiona._shim.gdal_open_vector
  File "fiona/_err.pyx", line 270, in fiona._err.exc_wrap_pointer
fiona._err.CPLE_OpenFailedError: simplecache::http://maps.tnc.org/files/shp/MEOW-TNC.zip: No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/work/mh0727/m300524/miniconda3/envs/pymistral/lib/python3.7/site-packages/intake_geopandas/geopandas.py", line 42, in read
  File "/work/mh0727/m300524/miniconda3/envs/pymistral/lib/python3.7/site-packages/intake_geopandas/geopandas.py", line 27, in _get_schema
  File "/work/mh0727/m300524/miniconda3/envs/pymistral/lib/python3.7/site-packages/intake_geopandas/geopandas.py", line 81, in _open_dataset
  File "/work/mh0727/m300524/miniconda3/envs/pymistral/lib/python3.7/site-packages/geopandas/io/file.py", line 89, in read_file
    with reader(path_or_bytes, **kwargs) as features:
  File "/work/mh0727/m300524/miniconda3/envs/pymistral/lib/python3.7/site-packages/fiona/env.py", line 397, in wrapper
    return f(*args, **kwargs)
  File "/work/mh0727/m300524/miniconda3/envs/pymistral/lib/python3.7/site-packages/fiona/__init__.py", line 253, in open
    layer=layer, enabled_drivers=enabled_drivers, **kwargs)
  File "/work/mh0727/m300524/miniconda3/envs/pymistral/lib/python3.7/site-packages/fiona/collection.py", line 159, in __init__
    self.session.start(self, **kwargs)
  File "fiona/ogrext.pyx", line 484, in fiona.ogrext.Session.start
  File "fiona/_shim.pyx", line 89, in fiona._shim.gdal_open_vector
fiona.errors.DriverError: simplecache::http://maps.tnc.org/files/shp/MEOW-TNC.zip: No such file or directory

martindurant commented 4 years ago

I see, so fiona doesn't work with arbitrary file-like objects...

So, the code of intake_geopandas could be changed to call fsspec.open_local to resolve caching and use the URL that I suggested. This is probably the place where any change should go.

For the "old" kind of caching that you are trying to use, the driver also needs to be explicitly set up to call the caching machinery, which basically means calling load, as you are trying to do, and making use of the return value. This is one of the reasons that I would prefer implementing the open_local-based method above. The "cache:" blocks have served to confuse, and I should mark them as deprecated.
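
A minimal sketch of what "calling load and making use of the return value" could look like inside a driver. It assumes only what this thread shows: that `source.cache[0].load(urlpath)` returns a list of local file paths. The helper name `cached_path` is hypothetical and this is not existing intake_geopandas code.

```python
def cached_path(source, urlpath):
    """Return a local path for ``urlpath`` via the source's first cache entry.

    Assumption: ``source.cache`` is a list whose entries expose
    ``load(urlpath)`` and return a list of local file paths, as in the
    output shown earlier in this thread. If no cache is configured, the
    original ``urlpath`` is returned unchanged.
    """
    caches = getattr(source, "cache", None) or []
    if not caches:
        return urlpath
    paths = caches[0].load(urlpath)
    # A zipped shapefile is a single file, so the first cached path is used.
    return paths[0] if paths else urlpath
```

The driver would then hand the returned path to geopandas instead of the original URL.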

aaronspring commented 4 years ago

I like this old caching as it seems easy to use. I didn't get simplecache:: running with either intake-xarray or intake-geopandas, and I haven't seen a working example for simplecache. If it is nicer, I would be happy to adopt it.

If the old caching doesn't become dysfunctional, I will try to implement it as you suggested.

ian-r-rose commented 4 years ago

I would love for this project to get caching working. The tricky thing is that fiona/GDAL has its own virtual filesystem and URL format that is incompatible with that of fsspec. So a zip file in S3 is something like /vsis3/vsizip/path/to.shp using GDAL, and zip+s3://path/to.shp in fiona. Composing that with the fsspec virtual filesystem is pretty tricky to do, and stymied my initial ill-formed thoughts about how to do that in #10.

Another wrinkle is that the most recent version of GeoPandas added experimental support for storing vector geodata in parquet files, and that functionality does not use the GDAL VSI system. It would be cool to add that here, and I think that has an easier path forwards to using intake-style caching.

> I see, so fiona doesn't work with arbitrary file-like objects...

GeoPandas sort-of works with file-like objects, but I don't think it is duck-typed enough to play nicely with fsspec.

> So, the code of intake_geopandas could be changed to call fsspec.open_local to resolve caching and use the URL that I suggested. This is probably the place where any change should go.

The main question in my mind is whether it is at all possible to tokenize the URL so that the fsspec-relevant caching parts are resolved, but the GDAL-relevant fiona parts are left untouched. I wonder whether at that point the URL simply will have too much semantic content to be wieldy. Perhaps a better approach would be to add another caching set of args to the driver?
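
As a rough illustration of the tokenizing question, the first step could be separating fsspec's caching prefixes (chained with `::`) from whatever follows. The function name `split_cache_prefix` and the default list of caching protocols are assumptions for this sketch. It does not resolve the cache or translate the remainder for GDAL, which is the hard part.

```python
def split_cache_prefix(urlpath, cache_protocols=("simplecache", "filecache", "blockcache")):
    """Split leading fsspec caching protocols off a chained URL.

    Returns ``(prefixes, remainder)``. The remainder is left untouched so
    that it could still be interpreted by fiona/GDAL.
    """
    parts = urlpath.split("::")
    prefixes = []
    # Only strip leading segments that are known caching protocols.
    while len(parts) > 1 and parts[0] in cache_protocols:
        prefixes.append(parts.pop(0))
    return prefixes, "::".join(parts)
```

For example, `split_cache_prefix("simplecache::zip+s3://path/to.shp")` separates `simplecache` from the fiona-style `zip+s3://path/to.shp`. Fetching the remote file into the cache and rebuilding a fiona URL around the local copy would still need to be worked out.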

> For the "old" kind of caching that you are trying to use, the driver also needs to be explicitly set up to call the caching machinery, which basically means calling load as you are trying to do and make use of the return value.

@martindurant is that not also somewhat true of the new-style? That is, the drivers need to explicitly opt-in to using fsspec in order to get the caching ability.

martindurant commented 4 years ago

> @martindurant is that not also somewhat true of the new-style? That is, the drivers need to explicitly opt-in to using fsspec in order to get the caching ability.

It is true for both, yes, although I feel that fsspec.open (or fsspec.open_local) is much easier to use than the intake cache mechanism, and is not specific to intake, so should get much more usage.

ian-r-rose commented 4 years ago

> > @martindurant is that not also somewhat true of the new-style? That is, the drivers need to explicitly opt-in to using fsspec in order to get the caching ability.

> It is true for both, yes, although I feel that fsspec.open (or fsspec.open_local) is much easier to use than the intake cache mechanism, and is not specific to intake, so should get much more usage.

Yeah, it certainly seems much easier to use from my perspective. I do think that in this case the upstream path-handling may require a different treatment.

martindurant commented 4 years ago

You can't have both local caching and use of the GDAL URL compounding at the same time, right? We could ask fsspec to open_local (which means, open local file, or cache and open if the URL is remote-with-caching), and pass to GDAL if it fails. The current error when doing open_local on a pure remote URL is not particularly clear...
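
A sketch of that fallback, assuming the failure surfaces as a `ValueError` (which, as far as I know, is what `fsspec.open_local` raises when the URL's filesystem is not local). The name `resolve_urlpath` and the injectable `open_local` argument are hypothetical. The latter exists only so the logic can be exercised without a network or fsspec installed.

```python
def resolve_urlpath(urlpath, open_local=None, **storage_options):
    """Try to obtain a local (possibly cached) path, else return ``urlpath``.

    ``open_local`` defaults to ``fsspec.open_local``. When it fails, the
    original URL is returned so that it can be passed to fiona/GDAL as-is.
    """
    if open_local is None:
        import fsspec  # third-party; imported lazily

        open_local = fsspec.open_local
    try:
        return open_local(urlpath, **storage_options)
    except ValueError:
        # Assumption: a plain remote URL without a caching prefix
        # (e.g. "http://...") ends up here and is left for GDAL to handle.
        return urlpath
```

With this, `simplecache::http://...` would resolve to a path inside the fsspec cache, while a bare `http://...` or a fiona-style `zip+s3://...` would be passed through unchanged.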

ian-r-rose commented 4 years ago

> You can't have both local caching and use of the GDAL URL compounding at the same time, right?

Maybe not impossible, but I think that it would be painful to try to support that.

> We could ask fsspec to open_local (which means, open local file, or cache and open if the URL is remote-with-caching), and pass to GDAL if it fails. The current error when doing open_local on a pure remote URL is not particularly clear...

Interesting, I think that could be a reasonable way forward, though it could be confusing to users. In that case simplecache::s3://... would delegate to fsspec, and s3://... would delegate to GDAL, with possibly different edge cases/failure modes.

martindurant commented 4 years ago

> could be confusing to users

I can see that. You could have a use_fsspec kwarg or something

ian-r-rose commented 4 years ago

> I can see that. You could have a use_fsspec kwarg or something

Yeah, I think something like that would be better than some sort of (probably leaky) automagic delegation logic.
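
A minimal sketch of the explicit opt-in, under the assumption that the driver would accept a `use_fsspec` flag and a `storage_options` dict. The function name `path_for_reader` and the injectable `open_local` argument are hypothetical, and none of this is existing intake_geopandas API as far as this thread shows.

```python
def path_for_reader(urlpath, use_fsspec=False, storage_options=None, open_local=None):
    """Decide which path string is handed to the geopandas/fiona reader.

    With ``use_fsspec=False`` (the default) the URL goes to fiona/GDAL
    untouched. With ``use_fsspec=True`` the URL is resolved to a local
    path via ``fsspec.open_local``, and any error is raised rather than
    silently falling back.
    """
    if not use_fsspec:
        return urlpath
    if open_local is None:
        import fsspec  # third-party; imported lazily

        open_local = fsspec.open_local
    return open_local(urlpath, **(storage_options or {}))
```

Because the caller states which machinery is responsible, the same URL never takes two different code paths. One caveat: fsspec's caching filesystems hash cached file names by default, so an option such as `same_names` for simplecache may be needed if the reader infers the format from the file extension.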