intake / intake-xarray

Intake plugin for xarray
https://intake-xarray.readthedocs.io/
BSD 2-Clause "Simplified" License
74 stars 36 forks source link

Subsetting before Caching #47

Open rmg55 opened 5 years ago

rmg55 commented 5 years ago

Question: I am wondering if it is possible to subset a dataset (via .sel method) before the data is cached.

Reasoning: My use case is - I would like to cache all the landsat8 data from the s3 repository for a small research (~10km * 10km) station. Currently my catalog looks like (note this subsets the landsat8 tiffs to 60 total - but eventually I would want to use the entire timeseries):

plugins:
  source:
    - module: intake_xarray
sources:
  landsat8:
    description: Geotiff image of Landsat8 - TESTING.
    driver: rasterio
    cache:
      - argkey: urlpath
        regex: 'landsat-pds/c1/L8/033/031/'
        type: file
    args:
      urlpath: 's3://landsat-pds/c1/L8/033/031/LC08_L1TP_033031_{collection_date:%Y%m%d}_20170310_01_T1/LC08_L1TP_033031_{collection_date:%Y%m%d}_20170310_01_T1_B{band:d}.TIF'
      chunks:
        band: 1
        x: 1000
        y: 1000
      concat_dim: band
      storage_options: {'anon': True}

Being able to subset before caching would reduce the amount of storage significantly (see example below for my 60 tiff subset):

c = intake.open_catalog('test.yml')
l8 = c.landsat8
l8_r = l8.read_chunked()
l8_r.data.nbytes*1e-9
133.64736048

l8_r_sm = l8_r.sel(x=slice(minx,maxx),y=slice(miny,maxy))
l8_r_sm.nbytes*1e-9
0.19845552000000002

Is there currently a way of doing this? If not, how difficult would it be to add this functionality? Is it adding an extra argument to the intake-xarray driver (which would look like: slice: {x:(xmin,xmax),y:(ymin,ymax)} in the catalog) or would it need to include modifications to the caching mechanisms deeper in intake?

Thanks!

martindurant commented 5 years ago

Sorry for not replying sooner. The current caching mechanism does not cover this case, it operates before handing the paths to xarray and gets all or none of the files. One options would be to use the experimental caching file system, which keeps local copies of only files which are actually accessed. and only those parts which were accessed. I am not sure if loading to xarray involves touching all of the files, though. Something similar and custom could be written with a little effort; but more typical in the Intake context is to write a driver which handles the subset as an extra parameter to produce a file-list and xarray of only the files of interest.

rmg55 commented 5 years ago

Martin - thanks for your input!

In my use case, the region of interest is much smaller (geographically) than a single geotiff image. Therefore, I am unable to subset based on a file-list. Please let me know if I am misinterpreting your suggestion.

The experimental CachingFileSystem option would be excellent assuming xarray does not trigger an entire file download... At the moment I am having an issue getting it to work with intake. Below is my current effort with traceback... will the fsspec class be able to "drop-in" as a new caching mechanism?

import intake
from fsspec.implementations import cached
CachingFileSystem = cached.CachingFileSystem

intake.source.cache.registry['file_system']=CachingFileSystem
intake.source.cache.registry

{'file': intake.source.cache.FileCache, 'dir': intake.source.cache.DirCache, 'compressed': intake.source.cache.CompressedCache, 'dat': intake.source.cache.DATCache, 'file_system': fsspec.implementations.cached.CachingFileSystem}

c = intake.open_catalog('test.yml')
l8 = c.landsat8
l8 = l8.to_dask()

Traceback:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-8-655b61aa6a66> in <module>
      1 l8 = c.landsat8
----> 2 l8 = l8.to_dask()

/opt/conda/lib/python3.7/site-packages/intake/catalog/entry.py in __getattr__(self, attr)
    119             return self.__dict__[attr]
    120         else:
--> 121             return getattr(self._get_default_source(), attr)
    122 
    123     def __dir__(self):

/opt/conda/lib/python3.7/site-packages/intake/catalog/entry.py in _get_default_source(self)
     91         """Instantiate DataSource with default agruments"""
     92         if self._default_source is None:
---> 93             self._default_source = self()
     94         return self._default_source
     95 

/opt/conda/lib/python3.7/site-packages/intake/catalog/entry.py in __call__(self, persist, **kwargs)
     76             raise ValueError('Persist value (%s) not understood' % persist)
     77         persist = persist or self._pmode
---> 78         s = self.get(**kwargs)
     79         if s.has_been_persisted and persist is not 'never':
     80             s2 = s.get_persisted()

/opt/conda/lib/python3.7/site-packages/intake/catalog/local.py in get(self, **user_parameters)
    274         """Instantiate the DataSource for the given parameters"""
    275         plugin, open_args = self._create_open_args(user_parameters)
--> 276         data_source = plugin(**open_args)
    277         data_source.catalog_object = self._catalog
    278         data_source.name = self.name

/opt/conda/lib/python3.7/site-packages/intake_xarray/raster.py in __init__(self, urlpath, chunks, concat_dim, xarray_kwargs, metadata, path_as_pattern, **kwargs)
     51         self._kwargs = xarray_kwargs or {}
     52         self._ds = None
---> 53         super(RasterIOSource, self).__init__(metadata=metadata)
     54 
     55     def _open_files(self, files):

/opt/conda/lib/python3.7/site-packages/intake/source/base.py in __init__(self, metadata)
     79                                      catdir=self.metadata.get('catalog_dir',
     80                                                               None),
---> 81                                      storage_options=storage_options)
     82         self.datashape = None
     83         self.dtype = None

/opt/conda/lib/python3.7/site-packages/intake/source/cache.py in make_caches(driver, specs, catdir, cache_dir, storage_options)
    555         driver, spec, catdir=catdir, cache_dir=cache_dir,
    556         storage_options=storage_options)
--> 557         for spec in specs]

/opt/conda/lib/python3.7/site-packages/intake/source/cache.py in <listcomp>(.0)
    555         driver, spec, catdir=catdir, cache_dir=cache_dir,
    556         storage_options=storage_options)
--> 557         for spec in specs]

TypeError: __init__() got an unexpected keyword argument 'catdir'

Catalog:

plugins:
  source:
    - module: intake_xarray
sources:
  landsat8:
    description: Geotiff image of Landsat Surface Reflectance Level-2 Science Product L5.
    driver: rasterio
    cache:
      - argkey: urlpath
        regex: 'landsat-pds/c1/L8/033/031/'
        type: file_system
    args:
      urlpath: 's3://landsat-pds/c1/L8/033/031/LC08_L1TP_033031_{collection_date:%Y%m%d}_20170310_01_T1/LC08_L1TP_033031_{collection_date:%Y%m%d}_20170310_01_T1_B{band:d}.TIF'
      chunks:
        band: 1
        x: 1000
        y: 1000
      concat_dim: band
      storage_options: {'anon': True}
martindurant commented 5 years ago

The CachingFileSystem is not an Intake cache implementation, but lower-level, providing python file-like objects. It could be made into a type of cache for Intake too, I hadn't even got that far. Unfortunately, rasterio can't read from python file-like objects, it needs a real, local file - which is exactly why the Intake caching scheme was designed to download whole files.

At the moment, I think the only solution would be to use this filesystem implementation, or s3 directly, via FUSE - which is usually painful and slower than you might think. Maybe worth a go, though?

It does give me idea: https://github.com/intake/filesystem_spec/issues/102 , which would not be too hard, I think.

aaronspring commented 2 years ago

I think for zarr only the chunks requested are actually cached. For nc files the whole file is cached. @observingClouds