intake / intake-xarray

Intake plugin for xarray
https://intake-xarray.readthedocs.io/
BSD 2-Clause "Simplified" License

Can't cache files #116

Closed rabernat closed 2 years ago

rabernat commented 2 years ago

I am trying to use intake-xarray to download a grib file and open it with xarray. Here is the catalog:

plugins:
  source:
    - module: intake_xarray
sources:
  sample_grib_data:
    description: Sample GRIB file
    driver: netcdf
    args:
      urlpath: 'https://github.com/ecmwf/climetlab/raw/main/docs/examples/test.grib'
    cache:
      - argkey: urlpath
        type: file

And here is the code:

import intake
cat = intake.open_catalog("catalog.yaml")
cat.sample_grib_data.to_dask()

And here is the traceback:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/var/folders/n8/63q49ms55wxcj_gfbtykwp5r0000gn/T/ipykernel_27912/2893924497.py in <module>
      1 import intake
      2 cat = intake.open_catalog("catalog.yaml")
----> 3 cat.sample_grib_data.to_dask()

/opt/miniconda3/envs/pangeo-forge-recipes/lib/python3.9/site-packages/intake_xarray/base.py in to_dask(self)
     67     def to_dask(self):
     68         """Return xarray object where variables are dask arrays"""
---> 69         return self.read_chunked()
     70 
     71     def close(self):

/opt/miniconda3/envs/pangeo-forge-recipes/lib/python3.9/site-packages/intake_xarray/base.py in read_chunked(self)
     42     def read_chunked(self):
     43         """Return xarray object (which will have chunks)"""
---> 44         self._load_metadata()
     45         return self._ds
     46 

/opt/miniconda3/envs/pangeo-forge-recipes/lib/python3.9/site-packages/intake/source/base.py in _load_metadata(self)
    234         """load metadata only if needed"""
    235         if self._schema is None:
--> 236             self._schema = self._get_schema()
    237             self.dtype = self._schema.dtype
    238             self.shape = self._schema.shape

/opt/miniconda3/envs/pangeo-forge-recipes/lib/python3.9/site-packages/intake_xarray/base.py in _get_schema(self)
     16 
     17         if self._ds is None:
---> 18             self._open_dataset()
     19 
     20             metadata = {

/opt/miniconda3/envs/pangeo-forge-recipes/lib/python3.9/site-packages/intake_xarray/netcdf.py in _open_dataset(self)
     88         else:
     89             # https://github.com/intake/filesystem_spec/issues/476#issuecomment-732372918
---> 90             url = fsspec.open(self.urlpath, **self.storage_options).open()
     91 
     92         self._ds = _open_dataset(url, chunks=self.chunks, **kwargs)

/opt/miniconda3/envs/pangeo-forge-recipes/lib/python3.9/site-packages/fsspec/core.py in open(urlpath, mode, compression, encoding, errors, protocol, newline, **kwargs)
    460     ``OpenFile`` object.
    461     """
--> 462     return open_files(
    463         urlpath=[urlpath],
    464         mode=mode,

/opt/miniconda3/envs/pangeo-forge-recipes/lib/python3.9/site-packages/fsspec/core.py in open_files(urlpath, mode, compression, encoding, errors, name_function, num, protocol, newline, auto_mkdir, expand, **kwargs)
    292     be used as a single context
    293     """
--> 294     fs, fs_token, paths = get_fs_token_paths(
    295         urlpath,
    296         mode,

/opt/miniconda3/envs/pangeo-forge-recipes/lib/python3.9/site-packages/fsspec/core.py in get_fs_token_paths(urlpath, mode, num, name_function, storage_options, protocol, expand)
    632             cls = get_filesystem_class(protocol)
    633             optionss = list(map(cls._get_kwargs_from_urls, urlpath))
--> 634             paths = [cls._strip_protocol(u) for u in urlpath]
    635             options = optionss[0]
    636             if not all(o == options for o in optionss):

/opt/miniconda3/envs/pangeo-forge-recipes/lib/python3.9/site-packages/fsspec/core.py in <listcomp>(.0)
    632             cls = get_filesystem_class(protocol)
    633             optionss = list(map(cls._get_kwargs_from_urls, urlpath))
--> 634             paths = [cls._strip_protocol(u) for u in urlpath]
    635             options = optionss[0]
    636             if not all(o == options for o in optionss):

/opt/miniconda3/envs/pangeo-forge-recipes/lib/python3.9/site-packages/fsspec/implementations/local.py in _strip_protocol(cls, path)
    183     def _strip_protocol(cls, path):
    184         path = stringify_path(path)
--> 185         if path.startswith("file://"):
    186             path = path[7:]
    187         return make_path_posix(path).rstrip("/") or cls.root_marker

AttributeError: 'list' object has no attribute 'startswith'

So the error is ultimately raised in fsspec, but I thought I would start here, since I don't fully understand the call stack.
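
Reading the frames from the bottom up, the message implies that `self.urlpath` was already a list when it reached `fsspec.open`, which wraps its argument in another list (`urlpath=[urlpath]`) before `_strip_protocol` is called on each element. The following is a minimal, stdlib-only sketch of that failure mode. `strip_protocol` and `open_one` are simplified, hypothetical stand-ins, not fsspec's actual code, and the traceback does not show why the urlpath became a list.

```python
def strip_protocol(path):
    """Simplified stand-in for the local-filesystem helper seen in the traceback."""
    if path.startswith("file://"):
        path = path[7:]
    return path.rstrip("/") or "/"


def open_one(urlpath):
    """Simplified stand-in for fsspec.open: it wraps its argument in a list."""
    return [strip_protocol(u) for u in [urlpath]]


# A single string is handled as expected.
single = open_one("file:///tmp/test.grib")
print(single)

# A list is wrapped in a second list, so the helper receives a list, not a string.
try:
    open_one(["/tmp/test.grib"])
except AttributeError as err:
    message = str(err)
    print(message)
```

Under these assumptions, the second call fails with the same message as the real traceback, which suggests looking at whatever turned the catalog's `urlpath` into a list before fsspec saw it.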

aaronspring commented 2 years ago

You are trying to use the old way of caching in intake.

It's recommended to use caching from fsspec instead: https://filesystem-spec.readthedocs.io/en/latest/features.html#caching-files-locally, i.e. prepend simplecache:: to the URL and, if needed, specify storage_options. If you want the cached file to keep the same name, pass "cache_storage": folder and "same_names": True.

plugins:
  source:
    - module: intake_xarray
sources:
  sample_grib_data:
    description: Sample GRIB file
    driver: netcdf
    args:
      urlpath: 'simplecache::https://github.com/ecmwf/climetlab/raw/main/docs/examples/test.grib'
      xarray_kwargs:
        engine: cfgrib

import intake
cat = intake.open_catalog("catalog.yaml")
cat.sample_grib_data.to_dask()

rabernat commented 2 years ago

Thanks for the tip @aaronspring! I revised my catalog as follows and it worked:

plugins:
  source:
    - module: intake_xarray
sources:
  sample_grib_data:
    description: Sample GRIB file
    driver: netcdf
    args:
      urlpath: 'simplecache::https://github.com/ecmwf/climetlab/raw/main/docs/examples/test.grib'
      xarray_kwargs:
        engine: cfgrib

rabernat commented 2 years ago

Also seems related to https://github.com/fsspec/filesystem_spec/issues/794.

aaronspring commented 2 years ago

Can we close this issue?

rabernat commented 2 years ago

Thanks for your reply Aaron. I was confused by the intake documentation, but I now see that the feature I was trying to use is deprecated.

I would consider adding a note to the intake-xarray docs explaining how to activate caching, perhaps linking to https://intake.readthedocs.io/en/latest/catalog.html#caching. (Overall I find the documentation to be lacking in terms of useful examples and code I can just copy and paste.)
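
A copy-and-paste example of the kind suggested here might look like the sketch below. It combines the `simplecache::` prefix with the `cache_storage` and `same_names` options mentioned earlier in the thread. The helper name `cached_source_args` and the folder name `grib_cache` are assumptions made for illustration. The nesting of options under the `simplecache` key follows fsspec's convention for chained URLs, and the traceback above shows intake-xarray forwarding `storage_options` to fsspec as keyword arguments.

```python
REMOTE_URL = "https://github.com/ecmwf/climetlab/raw/main/docs/examples/test.grib"


def cached_source_args(url, cache_dir):
    """Build source arguments that route `url` through fsspec's simplecache.

    Hypothetical helper for illustration; not part of intake or intake-xarray.
    """
    return {
        # Prefixing the URL chains the caching layer in front of HTTPS.
        "urlpath": f"simplecache::{url}",
        # Options for a chained URL are keyed by protocol name.
        "storage_options": {
            "simplecache": {
                "cache_storage": cache_dir,  # folder for the local copy (assumed name)
                "same_names": True,  # keep the original file name in the cache
            }
        },
        "xarray_kwargs": {"engine": "cfgrib"},
    }


args = cached_source_args(REMOTE_URL, "grib_cache")
print(args["urlpath"])

# With intake, intake-xarray and cfgrib installed, the arguments could be used as:
#   import intake
#   ds = intake.open_netcdf(**args).to_dask()
```

The same three keys can be written under `args:` in a catalog YAML entry instead of being passed in Python.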

Other than that suggestion, you may consider my issue resolved.

martindurant commented 2 years ago

The original Intake caching should still work - but it's of course not high priority. I don't immediately see anything wrong in how it was defined.