intake / intake-xarray

Intake plugin for xarray
https://intake-xarray.readthedocs.io/
BSD 2-Clause "Simplified" License

Remote hdf5 file access over https #56

Closed weiji14 closed 5 years ago

weiji14 commented 5 years ago

This is me trying, so far unsuccessfully, to work out how to access HDF5 files over https:

```python
import intake

url = "https://gamma.hdfgroup.org/ftp/pub/outgoing/NASAHDF/ATL06_20190223232535_08780212_001_01.h5"
dataset = intake.open_netcdf(
    urlpath=url, xarray_kwargs={"engine": "h5netcdf"}
)
dataset.read()
```

Full JSONDecodeError message:

```python-traceback
---------------------------------------------------------------------------
JSONDecodeError                           Traceback (most recent call last)
 in 
----> 1 dataset.read()

~/.local/share/virtualenvs/condaenv-AbcDeF1z/src/intake-xarray/intake_xarray/base.py in read(self)
     37     def read(self):
     38         """Return a version of the xarray with all the data in memory"""
---> 39         self._load_metadata()
     40         return self._ds.load()
     41 

~/.local/share/virtualenvs/condaenv-AbcDeF1z/lib/python3.7/site-packages/intake/source/base.py in _load_metadata(self)
    115         """load metadata only if needed"""
    116         if self._schema is None:
--> 117             self._schema = self._get_schema()
    118             self.datashape = self._schema.datashape
    119             self.dtype = self._schema.dtype

~/.local/share/virtualenvs/condaenv-AbcDeF1z/src/intake-xarray/intake_xarray/base.py in _get_schema(self)
     16 
     17         if self._ds is None:
---> 18             self._open_dataset()
     19 
     20         metadata = {

~/.local/share/virtualenvs/condaenv-AbcDeF1z/src/intake-xarray/intake_xarray/netcdf.py in _open_dataset(self)
     58             _open_dataset = xr.open_dataset
     59 
---> 60         self._ds = _open_dataset(url, chunks=self.chunks, **kwargs)
     61 
     62     def _add_path_to_ds(self, ds):

~/.local/share/virtualenvs/condaenv-AbcDeF1z/lib/python3.7/site-packages/xarray/backends/api.py in open_dataset(filename_or_obj, group, decode_cf, mask_and_scale, decode_times, autoclose, concat_characters, decode_coords, engine, chunks, lock, cache, drop_variables, backend_kwargs, use_cftime)
    533 
    534         with close_on_error(store):
--> 535             ds = maybe_decode_store(store)
    536 
    537     # Ensure source filename always stored in dataset object (GH issue #2550)

~/.local/share/virtualenvs/condaenv-AbcDeF1z/lib/python3.7/site-packages/xarray/backends/api.py in maybe_decode_store(store, lock)
    448             decode_coords=decode_coords,
    449             drop_variables=drop_variables,
--> 450             use_cftime=use_cftime,
    451         )
    452 

~/.local/share/virtualenvs/condaenv-AbcDeF1z/lib/python3.7/site-packages/xarray/conventions.py in decode_cf(obj, concat_characters, mask_and_scale, decode_times, decode_coords, drop_variables, use_cftime)
    568         encoding = obj.encoding
    569     elif isinstance(obj, AbstractDataStore):
--> 570         vars, attrs = obj.load()
    571         extra_coords = set()
    572         file_obj = obj

~/.local/share/virtualenvs/condaenv-AbcDeF1z/lib/python3.7/site-packages/xarray/backends/common.py in load(self)
    121         """
    122         variables = FrozenDict(
--> 123             (_decode_variable_name(k), v) for k, v in self.get_variables().items()
    124         )
    125         attributes = FrozenDict(self.get_attrs())

~/.local/share/virtualenvs/condaenv-AbcDeF1z/lib/python3.7/site-packages/xarray/backends/h5netcdf_.py in get_variables(self)
    154     def get_variables(self):
    155         return FrozenDict(
--> 156             (k, self.open_store_variable(k, v)) for k, v in self.ds.variables.items()
    157         )
    158 

~/.local/share/virtualenvs/condaenv-AbcDeF1z/lib/python3.7/site-packages/xarray/backends/h5netcdf_.py in ds(self)
    113     @property
    114     def ds(self):
--> 115         return self._acquire()
    116 
    117     def open_store_variable(self, name, var):

~/.local/share/virtualenvs/condaenv-AbcDeF1z/lib/python3.7/site-packages/xarray/backends/h5netcdf_.py in _acquire(self, needs_lock)
    105 
    106     def _acquire(self, needs_lock=True):
--> 107         with self._manager.acquire_context(needs_lock) as root:
    108             ds = _nc4_require_group(
    109                 root, self._group, self._mode, create_group=_h5netcdf_create_group

~/miniconda3/envs/condaenv/lib/python3.7/contextlib.py in __enter__(self)
    110         del self.args, self.kwds, self.func
    111         try:
--> 112             return next(self.gen)
    113         except StopIteration:
    114             raise RuntimeError("generator didn't yield") from None

~/.local/share/virtualenvs/condaenv-AbcDeF1z/lib/python3.7/site-packages/xarray/backends/file_manager.py in acquire_context(self, needs_lock)
    184     def acquire_context(self, needs_lock=True):
    185         """Context manager for acquiring a file."""
--> 186         file, cached = self._acquire_with_cache_info(needs_lock)
    187         try:
    188             yield file

~/.local/share/virtualenvs/condaenv-AbcDeF1z/lib/python3.7/site-packages/xarray/backends/file_manager.py in _acquire_with_cache_info(self, needs_lock)
    202                 kwargs = kwargs.copy()
    203                 kwargs["mode"] = self._mode
--> 204                 file = self._opener(*self._args, **kwargs)
    205                 if self._mode == "w":
    206                     # ensure file doesn't get overriden when opened again

~/.local/share/virtualenvs/condaenv-AbcDeF1z/lib/python3.7/site-packages/h5netcdf/core.py in __init__(self, path, mode, invalid_netcdf, **kwargs)
    588                                  "opening urls: {}".format(path))
    589             try:
--> 590                 with h5pyd.File(path, 'r') as f:  # noqa
    591                     pass
    592                 self._preexisting_file = True

~/.local/share/virtualenvs/condaenv-AbcDeF1z/src/h5pyd/h5pyd/_hl/files.py in __init__(self, domain, mode, endpoint, username, password, api_key, use_session, use_cache, logger, owner, linked_domain, retries, **kwds)
    186 
    187         if rsp.status_code == 200:
--> 188             root_json = json.loads(rsp.text)
    189         if rsp.status_code != 200 and mode in ('r', 'r+'):
    190             # file must exist

~/miniconda3/envs/condaenv/lib/python3.7/json/__init__.py in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    346             parse_int is None and parse_float is None and
    347             parse_constant is None and object_pairs_hook is None and not kw):
--> 348         return _default_decoder.decode(s)
    349     if cls is None:
    350         cls = JSONDecoder

~/miniconda3/envs/condaenv/lib/python3.7/json/decoder.py in decode(self, s, _w)
    335 
    336         """
--> 337         obj, end = self.raw_decode(s, idx=_w(s, 0).end())
    338         end = _w(s, end).end()
    339         if end != len(s):

~/miniconda3/envs/condaenv/lib/python3.7/json/decoder.py in raw_decode(self, s, idx)
    353             obj, end = self.scan_once(s, idx)
    354         except StopIteration as err:
--> 355             raise JSONDecodeError("Expecting value", s, err.value) from None
    356         return obj, end

JSONDecodeError: Expecting value: line 7 column 1 (char 12)
```

My question is whether it's feasible to have intake download the HDF5 file from the http URL and persist it locally, since it's not able to stream it directly (at least not easily; see the blog post on the difficulties of accessing HDF in the cloud).

I actually encountered this problem in my work, saw it mentioned in a stackoverflow question, and thought I'd ask. Granted, I'm not sure this issue is even in the right place: should I move it upstream to intake instead, or to h5netcdf, where these lines are the key to the error? There's also the h5pyd library, which seems to allow remote access to HDF5 files if they are served over an HDF REST API interface.

Thoughts?

martindurant commented 5 years ago

> if it is served over a HDF REST API interface

Yes, it is possible to set up an HDF server specifically for a purpose like this, but of course that's only relevant if you control both sides of the communication. You could do the same (and, I would argue, more easily) with the Intake server too.

So, no, I don't think HDF can be loaded from normal http. s3 support was added recently, and given that both s3 and http can be read by fsspec, it would not take much for http to be supported too.

In the meantime, you can indeed use Intake caching to download the file locally and read that, as is done here for some image data. These spec blocks are a little tricky to get right; see the Intake documentation.

It seems, unfortunately, that h5netcdf assumes you are trying to use h5pyd when you give a URL. We could plausibly change this, since caching at the file-system layer is also now possible thanks to fsspec. What do you think, @jsignell?

jsignell commented 5 years ago

I agree that in the short term either fsspec- or intake-level caching would solve this issue by caching the whole file locally before trying to access it. So that seems like the best solution unless there is a strong need to access only part of the file.

martindurant commented 5 years ago

It may be worth asking on the h5py (or h5netcdf?) tracker about the status of remote access. If they can do s3, why not other protocols...

martindurant commented 5 years ago

Actually, the following totally works:

```python
import fsspec
import xarray as xr

url = "https://gamma.hdfgroup.org/ftp/pub/outgoing/NASAHDF/ATL06_20190223232535_08780212_001_01.h5"
with fsspec.open(url) as f:
    ds = xr.open_dataset(f)
```

... so should the netCDF driver be changed to assume URLs are fsspec-openable things, rather than passing them to xarray? Should we ask an XR person?

If you wanted FS-level caching on the above, you would do

```python
import fsspec
import xarray as xr

url = "filecache://gamma.hdfgroup.org/ftp/pub/outgoing/NASAHDF/ATL06_20190223232535_08780212_001_01.h5"
with fsspec.open(url, target_protocol='http', cache_storage="/path/to/cache") as f:
    ds = xr.open_dataset(f)
```
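
To make that concrete, here is a minimal sketch of the "URLs are fsspec-openable things" idea. It is not the actual intake-xarray code; the helper name and signature are made up for illustration, and a real driver would also need to manage the lifetime of the file handle:

```python
import fsspec
import xarray as xr


def open_remote_dataset(urlpath, storage_options=None, **xarray_kwargs):
    """Hypothetical helper: let fsspec resolve the protocol (http, s3,
    filecache, ...) and hand xarray a file-like object instead of a path."""
    openfile = fsspec.open(urlpath, mode="rb", **(storage_options or {}))
    f = openfile.open()  # keep the handle open so xarray can read lazily
    return xr.open_dataset(f, **xarray_kwargs)


# e.g.
# ds = open_remote_dataset(url, engine="h5netcdf")
```
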
weiji14 commented 5 years ago

Right, I definitely think using fsspec is the way to go since it's meant for reducing code duplication. That file-caching method looks really awesome! Now the big question is where to insert that piece of logic - here on intake-xarray or upstream at intake. Just trying to think of other file formats (besides hdf5) that might find this useful too which might help us decide where to put it.

martindurant commented 5 years ago

Most drivers for Intake already use fsspec, since they call python libraries which are happy with the python file-like interface. That did not include HDF5, but it seems it now does. @jhamman @rabernat , is that now generally true, that xarray happily takes file objects for the various backends? If yes, must any of them be specifically local (i.e., with an OS file handle) files?

weiji14 commented 5 years ago

Hmm, I just tried using the filecache://-based code block you mentioned above to download the file. The file actually downloads fine, but it emits this error:

```
ValueError: Got more bytes (60681386) than requested (0)
```

Not sure if it's just the example we're using, but looking at this, it seems the http headers aren't giving the right Content-Length for the HDF5 file.
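
For reference, one quick way to see what the server actually reports is a header check. This is just a diagnostic sketch using the `requests` library, not anything intake or fsspec does internally:

```python
import requests

url = "https://gamma.hdfgroup.org/ftp/pub/outgoing/NASAHDF/ATL06_20190223232535_08780212_001_01.h5"
resp = requests.head(url, allow_redirects=True)
# If Content-Length is missing or wrong, size-based checks downstream get confused.
print(resp.status_code, resp.headers.get("Content-Length"))
```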

weiji14 commented 5 years ago

> ... so should the netCDF driver be changed to assume URLs are fsspec-openable things, rather than passing to xarray? Should we ask an XR person?

> @jhamman @rabernat , is that now generally true, that xarray happily takes file objects for the various backends? If yes, must any of them be specifically local (i.e., with an OS file handle) files?

A quick search for fsspec in the xarray code repository shows up nothing... I think they're still using gcsfs and s3 explicitly rather than through fsspec?

martindurant commented 5 years ago

> A quick search for fsspec in the xarray code repository shows up nothing

That's not what I meant - we are passing a file-like object here, and I'm wondering what assumptions are made about it within xarray and the libraries it calls. Not long ago, it used to extract the path or file handle and load that in the C code, which would of course not work for something remote. I believe it may now be checking explicitly for s3 and http paths and handling them (instead of using the object directly), but I'm not certain.

rabernat commented 5 years ago

Xarray can accept file-like objects to open_dataset and pass them along to h5py.

Here is a gist from @scottyhq which shows this functionality. https://nbviewer.jupyter.org/urls/gist.githubusercontent.com/scottyhq/790bf19c7811b5c6243ce37aae252ca1/raw/e2632e928647fd91c797e4a23116d2ac3ff62372/0-load-hdf5.ipynb
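
The pattern shown in that gist is roughly the following; the bucket and key below are illustrative placeholders, not the gist's exact paths:

```python
import s3fs
import xarray as xr

# h5py >= 2.9 lets the h5netcdf engine read from a Python file-like object,
# so a remote S3 object can be opened without downloading it first.
fs = s3fs.S3FileSystem(anon=True)
with fs.open("some-bucket/ATL06_example.h5", "rb") as f:  # hypothetical key
    ds = xr.open_dataset(f, engine="h5netcdf")
    print(ds)
```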

Xarray has no dependence on gcsfs or fsspec. For accessing cloud storage, we are usually using xarray in conjunction with zarr. Zarr also has no dependence on gcsfs or fsspec, but it can accept mutable mapping objects produced by those libraries which point to cloud storage.
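
For the zarr route that looks something like the sketch below; the store path is made up, and `get_mapper` is the fsspec/gcsfs way of producing that mutable mapping:

```python
import gcsfs
import xarray as xr

# The mapper is a MutableMapping view of the object store; zarr reads its
# chunks and metadata through it.
fs = gcsfs.GCSFileSystem(token="anon")
mapper = fs.get_mapper("some-bucket/example.zarr")  # hypothetical store
ds = xr.open_zarr(mapper)
```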

martindurant commented 5 years ago

The quote in the gist is:

> Seems that h5py >2.9.0 can handle file-like-objects:

So that's all we need in order to do general fsspec stuff in the intake-xarray netCDF loader. Clearly I was out of date...
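
As a self-contained illustration of that h5py capability (independent of intake or remote storage):

```python
import io

import h5py  # needs h5py >= 2.9

# Write a tiny HDF5 file into an in-memory buffer, then read it back through
# the same file-like interface that a remote fsspec file object would provide.
buf = io.BytesIO()
with h5py.File(buf, "w") as f:
    f.create_dataset("x", data=[1, 2, 3])
buf.seek(0)
with h5py.File(buf, "r") as f:
    print(f["x"][:])
```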

I suspect the file download error above is a simple fix, will have a look.

martindurant commented 5 years ago

So the cache code will now work with fsspec from master.

weiji14 commented 5 years ago

Thank you so much for the quick fix! I've installed fsspec from master and it now works, though I'm still trying to wrap my head around how the pieces fit together.

I've actually found another problem related to downloading too big a file (?) but I'll raise that in a separate issue.