intake / intake-xarray

Intake plugin for xarray
https://intake-xarray.readthedocs.io/
BSD 2-Clause "Simplified" License
74 stars 36 forks source link

open_local can only be used on a filesystem which has attribute local_file=True #81

Closed scottyhq closed 3 years ago

scottyhq commented 3 years ago

referencing a failing test in intake-stac with intake and intake-xarray dependencies from master https://github.com/intake/intake-stac/pull/61

Note that test passes with intake=0.6.0 and intake-xarray=0.3.1

Comment from @martindurant :

This is probably due to the release of intake-xarray; please forward or reference this issue there. I think it probably has to do with the distinction between file-based and URL-based open functions.

2020-10-01T16:10:36.0312034Z =================================== FAILURES ===================================
2020-10-01T16:10:36.0312414Z ____________________________ test_cat_item_stacking ____________________________
2020-10-01T16:10:36.0312697Z 
2020-10-01T16:10:36.0312953Z stac_item_obj = LC08_L1TP_152038_20200611_20200611_01_RT
2020-10-01T16:10:36.0313232Z 
2020-10-01T16:10:36.0313572Z     def test_cat_item_stacking(stac_item_obj):
2020-10-01T16:10:36.0314028Z         item = StacItem(stac_item_obj)
2020-10-01T16:10:36.0317370Z         list_of_bands = ['B1', 'B2']
2020-10-01T16:10:36.0317736Z         new_entry = item.stack_bands(list_of_bands)
2020-10-01T16:10:36.0322624Z         assert isinstance(new_entry, StacEntry)
2020-10-01T16:10:36.0323639Z         assert new_entry._description == 'B1, B2'
2020-10-01T16:10:36.0324260Z         assert new_entry.name == 'B1_B2'
2020-10-01T16:10:36.0324641Z >       new_da = new_entry().to_dask()
2020-10-01T16:10:36.0324842Z 
2020-10-01T16:10:36.0325186Z intake_stac/tests/test_catalog.py:132: 
2020-10-01T16:10:36.0325607Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2020-10-01T16:10:36.0326375Z /usr/share/miniconda/envs/test_env/lib/python3.7/site-packages/intake_xarray/base.py:69: in to_dask
2020-10-01T16:10:36.0326994Z     return self.read_chunked()
2020-10-01T16:10:36.0327990Z /usr/share/miniconda/envs/test_env/lib/python3.7/site-packages/intake_xarray/base.py:44: in read_chunked
2020-10-01T16:10:36.0328629Z     self._load_metadata()
2020-10-01T16:10:36.0329470Z /usr/share/miniconda/envs/test_env/lib/python3.7/site-packages/intake/source/base.py:235: in _load_metadata
2020-10-01T16:10:36.0330117Z     self._schema = self._get_schema()
2020-10-01T16:10:36.0330906Z /usr/share/miniconda/envs/test_env/lib/python3.7/site-packages/intake_xarray/raster.py:93: in _get_schema
2020-10-01T16:10:36.0331526Z     self._open_dataset()
2020-10-01T16:10:36.0332378Z /usr/share/miniconda/envs/test_env/lib/python3.7/site-packages/intake_xarray/raster.py:77: in _open_dataset
2020-10-01T16:10:36.0333178Z     files = fsspec.open_local(self.urlpath, **self.storage_options)
2020-10-01T16:10:36.0333669Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2020-10-01T16:10:36.0333904Z 
2020-10-01T16:10:36.0335083Z url = ['https://landsat-pds.s3.amazonaws.com/c1/L8/152/038/LC08_L1TP_152038_20200611_20200611_01_RT/LC08_L1TP_152038_2020061....amazonaws.com/c1/L8/152/038/LC08_L1TP_152038_20200611_20200611_01_RT/LC08_L1TP_152038_20200611_20200611_01_RT_B2.TIF']
2020-10-01T16:10:36.0336279Z mode = 'rb', storage_options = {}, of = <List of 2 OpenFile instances>
2020-10-01T16:10:36.0336651Z 
2020-10-01T16:10:36.0337016Z     def open_local(url, mode="rb", **storage_options):
2020-10-01T16:10:36.0337482Z         """Open file(s) which can be resolved to local
2020-10-01T16:10:36.0337987Z     
2020-10-01T16:10:36.0338344Z         For files which either are local, or get downloaded upon open
2020-10-01T16:10:36.0338805Z         (e.g., by file caching)
2020-10-01T16:10:36.0339111Z     
2020-10-01T16:10:36.0339386Z         Parameters
2020-10-01T16:10:36.0339806Z         ----------
2020-10-01T16:10:36.0340093Z         url: str or list(str)
2020-10-01T16:10:36.0340391Z         mode: str
2020-10-01T16:10:36.0340637Z             Must be read mode
2020-10-01T16:10:36.0340967Z         storage_options:
2020-10-01T16:10:36.0341438Z             passed on to FS for or used by open_files (e.g., compression)
2020-10-01T16:10:36.0341837Z         """
2020-10-01T16:10:36.0342121Z         if "r" not in mode:
2020-10-01T16:10:36.0342571Z             raise ValueError("Can only ensure local files when reading")
2020-10-01T16:10:36.0343122Z         of = open_files(url, mode=mode, **storage_options)
2020-10-01T16:10:36.0343584Z         if not getattr(of[0].fs, "local_file", False):
2020-10-01T16:10:36.0343935Z             raise ValueError(
2020-10-01T16:10:36.0344375Z >               "open_local can only be used on a filesystem which"
2020-10-01T16:10:36.0344843Z                 " has attribute local_file=True"
2020-10-01T16:10:36.0345185Z             )
2020-10-01T16:10:36.0345709Z E           ValueError: open_local can only be used on a filesystem which has attribute local_file=True
2020-10-01T16:10:36.0346127Z 
2020-10-01T16:10:36.0346865Z /usr/share/miniconda/envs/test_env/lib/python3.7/site-packages/fsspec/core.py:461: ValueError

Note sure where to best remedy this...

martindurant commented 3 years ago

I think https://github.com/intake/intake-xarray/pull/76 has the explanation - netCDF only deals with files (which must be local), and the OpenDAP driver handles URLs; previously the two were more confused. @d70-t , would you like to comment on this?

d70-t commented 3 years ago

Hi @martindurant and @scottyhq. Yes, the issue seems to be related to #76 but as far as I see, that is not because of the netcdf or opendap drivers. This failing test cases seems to open the image using the raster driver. However, #62 introduced caching not only for netcdf files but also for raster files.

The idea of separating opendap and netcdf which was part of the discussion of #76 was based on the assumption that remote files which are actually netcdf files should be cached locally while remote resources which are actually opendap resources should not be cached locally (in fact, those resources are not a single file, so downloading and caching a single URL will never work for opendap resources). Complicating the situation is the fact that URLs to plain netCDF-files are not distingushable from URLs to opendap resources and the standard netCDF library is also able to open both cases. That is why it is useful to distinguish those URLs on driver level.

I think the situation for the raster driver is different as to my knowledge, there is no direct equivalent to the netcdf/opendap duality*. So I would expect that fsspec would be able to do its caching logic on URLs pointing to a tif image as in your test. However, I have barely any knowledge about fsspec and what would be the right way of introducing caching with fsspec. Additionally, the fsspec doc is a bit sparse about what open_local actually does and what it returns. From skimming over it, I would have assumed that it downloads the file to some place on the local file system and returns a path to the downloaded file (but that may be wrong). In this case, xarray would open the local copy in stead of accessing the file via https. But @martindurant, it looks like you introduced the change, so maybe you can tell a little about what open_local actually does?

*It would be possible to create an opendap server which serves TIF files as an opendap resource (which is not the case here). But that situation would be handled properly in the current version by opening it with the opendap driver.

martindurant commented 3 years ago

You are right about your understanding of what open_local does, but it only works for a URL which specifies the caching mechanism to be used (typically by including "simplecache::" prefixed to the URL). Is it the case, then, that the driver can read directly from a python file-like object? read_local ought to be only used for the case that this doesn't work and you can only operate on file names.

scottyhq commented 3 years ago

Thanks for the input @d70-t and @martindurant !

Just to clarify we are opening a TIF file here ultimately via xarray.open_rasterio(), which handles both URLs and local paths. After glancing around current open issues, this strikes me as also related to https://github.com/intake/intake-xarray/issues/61 in terms of selecting the correct open() function.

I'm learning the plumbing as I go along here, but I'm actually a bit confused as to why fsspec is involved in this particular chain, because this works just fine without fsspec:

import xarray as xr
da = xr.open_rasterio('https://landsat-pds.s3.amazonaws.com/c1/L8/152/038/LC08_L1TP_152038_20200611_20200611_01_RT/LC08_L1TP_152038_20200611_20200611_01_RT_B1.TIF')

I noticed the full URLs didn't appear in the traceback, but they are public. Perhaps some flag or change needs to be added driver: rasterio to remedy this.

We definitely do not want to download the full TIF locally and then operate on it.

d70-t commented 3 years ago

I think the answer is sometimes... For rasterio, I don't know. For netcdf it would force xarray.open_dataset to use a different backend driver. I think the xarray open_* functions are built such that they can typically open either local files by filename or remote things by URL.

But apart from that, if open_local is only to be used if the url specifies a caching mechanism (why is that actually the case? shouldn't the user specify how caching on their machine works and not the catalog?), then probably both, the netcdf-driver and the raster-driver should only use open_local if such a caching mechanism is specified in the URL and otherwise pass the URL down to the backend driver.

martindurant commented 3 years ago

Yeah, so there are three paths, and I don't really know which driver supports which:

These could be options, but it seems like intake-xarray should know which ones are possible in each case and have sensible defaults.

scottyhq commented 3 years ago

pass through the original URL to the loader

I can't claim to understand all the consequences, but this seems like a safe bet for driver: rasterio. Should I open a PR to modify this line: https://github.com/intake/intake-xarray/blob/a654542c03fd5238253672dd0d54afe40b53835e/intake_xarray/raster.py#L77

to be simply files = self.urlpath ? Seems like that might be all that is needed to pass through directly to xr.open_rasterio (whether local path, url, or even file-like object)...

martindurant commented 3 years ago

Yes... except that that would break the case that someone does want caching. How would you know the difference? You could see whether fsspec would resolve the URL into a local-allowed instance or perhaps more simply, special case http(s) for passthrough and still call fsspec otherwise.

(self.urlpath will never be a file-like, since you wouldn't be able to put that in a catalog, so we still have ambiguity between creating a file-like, or downloading and passing a local path)

scottyhq commented 3 years ago

right, i'm also confused as to what happens when multiple libraries in the dependency chain do their own caching (fsspec and rasterio/GDAL under the hood).

This open PR on xarray is also relevant https://github.com/pydata/xarray/pull/4140, maybe @jhamman has some thoughts on the best way to proceed?

d70-t commented 3 years ago

How would you know the difference?

@martindurant if I understood you correctly, caching as it is currently implemented only works if the urlpath is prefixed by a cache-specification (e.g. "simplecache::"). Doesn't this qualify as a way of distinguishing if someone wants the thing to be cached or not? - and could this be used to decide if open_local should be used at all?

On the other hand, is it really useful to do whole-file-caching? I could imagine many cases where the user only wants to access parts of the original dataset. In these cases, caching can introduce a lot more traffic on the network than actually needed. E.g. the netCDF4-library has a mechanism of generating HTTP range requests and I could imagine that rasterio has something similar as well. I see that fsspec file-like objects are supposed to do something similar, but maybe the backend libraries can do that in more clever ways and sometimes it might be difficult to convince these backend libraries to use file-like objects.

Probably out of scope of this issue, but is there a good way of supporting python file-like objects in cases where the file is actually opened by an underlying c-library?

martindurant commented 3 years ago

I'll answer this part quickly:

Probably out of scope of this issue, but is there a good way of supporting python file-like objects in cases where the file is actually opened by an underlying c-library?

No. C libraries could choose to accept python file-like objects only by linking against the python runtime (I think h5py can do this), so it's far from typical. You could use FUSE on posix, but that's a whole other bag of worms.

martindurant commented 3 years ago

caching only works if the urlpath is prefixed by a cache-specification

Yes; except that other implementations could well appear. The class attribute local_file = True appears on classes that can give local file-names.

(https://github.com/intake/filesystem_spec/issues/438 for another thought on caching, perhaps not too useful in this discussion)

d70-t commented 3 years ago

Hmm, I think I am slowly getting closer to how fsspec should be thought about... Still, probably it is a bad idea to generally cache everything which could be accessed via HTTP. I'd guess that if I had to order people by their qualification to decide about caching, the user should be the most qualified, the catalog author should be second and the library author would be third. In that sense, it would be better to decide on caching in the catalog then doing it completely automatically.

Would it be an option to ask fsspec if it can provide a local name and just pass on the URL to the backend driver if it can't? In this case, caching maybe would be enabled if it doesn't hurt and in other cases, the catalog author could opt-in by prefixing simplecache? One such case would be that the server is serving very small files such that subsetting them isn't useful in almost every case.

martindurant commented 3 years ago

This does miss the file-object case, though, which might be important for reading only part of a remote file, from a resource that the loader doesn't know how to ask for ranges on (i.e., anything other than HTTP).

To tell whether you can expect to get a local file:

getattr(fsspec.core.url_to_fs(path)[0], "local_file", False)

but this does instantiate the filesystem instance.

(perhaps local_file should be defined as False on the superclass)

d70-t commented 3 years ago

Is it a bad thing to instantiate the filessytem instance?

Maybe the procedure should then be (for the raster and netcdf drivers):

  1. check if the backend is able to handle a file-like object and does not do more clever requests if it gets an URL (probably do this check by reading the docs), if yes -> open the file with fsspec and pass it to the backend
  2. check if the filesystem supports the creation of a local file, if yes -> do that and open it with open_local and pass it to the backend
  3. if the backend supports opening of URLs, pass the url to the backend
  4. fail

Currently, I think the answer to 1 is no for the current implementations of the raster and netcdf drivers. For 2, the answer depends on the filesystem, and maybe on the catalog author (i.e. simplecache...). 3 would currently be true for both raster and netcdf.

In future, 1 may be the toughest one to decide though. It is probable that the backend can do more clever requests but is not able to do caching, which then might be less clever.

Probably 3 and 4 could be migrated into one as the backend will fail anyways if it is not able to handle the input.

martindurant commented 3 years ago

Is it a bad thing to instantiate the filessytem instance?

It may cause network requests, to establish credentials for instance

getattr(fsspec.get_filesystem_class(re.split(r"(\:\:|\://)", path, 1)[0]), "local_file", False)

would not instantiate, only import - so that would be OK (but import would fail, if there are any URLs that a loader can handle which fsspec cannot, although there may be none of those).

martindurant commented 3 years ago

Ref: https://github.com/intake/filesystem_spec/pull/439

scottyhq commented 3 years ago

@martindurant If I understand correctly the change you referenced is a partial solution and we still need some changes either in intake-xarray or at the catalog level in intake-stac to remedy loading URLs with rasterio (https://github.com/intake/intake-stac/pull/61).

Is the following reasonable for raster.py? It would add the requirement fsspec>0.8.3 to intake-xarray

        if isinstance(self.urlpath, list):
            self._can_be_local = fsspec.utils.can_be_local(self.urlpath[0])
        else:
            self._can_be_local = fsspec.utils.can_be_local(self.urlpath)

       if self._can_be_local:
            files = fsspec.open_local(self.urlpath, **self.storage_options)
        else:
            files = self.urlpath

If so I can follow up with a PR, perhaps adding a test like:

url = 'https://landsat-pds.s3.amazonaws.com/c1/L8/152/038/LC08_L1TP_152038_20200611_20200611_01_RT/LC08_L1TP_152038_20200611_20200611_01_RT_B1.TIF'
source = intake.open_rasterio(url, chunks={})
da = source.to_dask()
assert isinstance(da, xrray.core.dataarray.DataArray)
martindurant commented 3 years ago

Yes, but only if the case where we pass a file-like object is not useful.

Note that it would be much better to have a test that didn't need the network, which would mean setting up a server process in the CI.

martindurant commented 3 years ago

but yes, let's set that PR up. The truth will be in the test cases! For a test server, you could use https://github.com/intake/filesystem_spec/blob/master/fsspec/implementations/tests/test_http.py#L72 , or its fixture (a few lines below), which you can just import if fsspec is a dep anyway.

d70-t commented 3 years ago

I think it is a good idea to add the same change also to netcdf.py and probably that should be included in the same PR. What do you think?

martindurant commented 3 years ago

Yes, please. We should have tests for all the cases... So we can use the local server for HTTP, and perhaps the memory: filesystem for file-like. Than we can find out for ourselves what the loaders can handle.

scottyhq commented 3 years ago

closed by #82 and #93