Closed (weiji14 closed this issue 5 years ago)
> if it is served over a HDF REST API interface
Yes, it is possible to set up an HDF server specifically for a purpose like this, but of course that's only relevant if you control both sides of the communication. You could do the same (and I would argue more easily) with the Intake server too.
So no, I don't think HDF can be loaded from normal HTTP. S3 support was added recently, and given that both S3 and HTTP can be read by fsspec, it would not take much for HTTP to be supported too.
In the meantime, you can indeed use Intake caching to download the file locally and read that, like here for some image data. These spec blocks are a little tricky to get right; see the Intake documentation.
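For reference, a minimal sketch of what such a catalog entry might look like, assuming Intake's cache spec (argkey/regex/type fields) and the intake-xarray netcdf driver; the source name and catalog filename here are illustrative only:

import intake

# write a small catalog that caches the remote HDF5 file before opening it
catalog_yaml = """
sources:
  atl06:
    driver: netcdf
    cache:
      - argkey: urlpath
        regex: 'gamma.hdfgroup.org'
        type: file
    args:
      urlpath: 'https://gamma.hdfgroup.org/ftp/pub/outgoing/NASAHDF/ATL06_20190223232535_08780212_001_01.h5'
"""
with open("atl06_catalog.yaml", "w") as fh:
    fh.write(catalog_yaml)

cat = intake.open_catalog("atl06_catalog.yaml")
ds = cat.atl06.read()  # downloads to the local Intake cache, then opens with xarray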
It seems, unfortunately, that h5netcdf assumes you are trying to use h5pyd when you give a URL. We could plausibly change this, since caching at the file system layer is also now possible thanks to fsspec. What do you think, @jsignell ?
I agree that in the short term either fsspec- or Intake-level caching would solve this issue by caching the whole file locally before trying to access it. So that seems like the best solution, unless there is a strong need to access only part of the file.
It may be worth asking on the h5py (or h5netcdf?) issue tracker about the status of remote access. If they can do S3, why not other protocols...
Actually, the following totally works:
import fsspec
import xarray as xr

url = "https://gamma.hdfgroup.org/ftp/pub/outgoing/NASAHDF/ATL06_20190223232535_08780212_001_01.h5"
with fsspec.open(url) as f:
    ds = xr.open_dataset(f)
... so should the netCDF driver be changed to assume URLs are fsspec-openable things, rather than passing to xarray? Should we ask an XR person?
If you wanted filesystem-level caching on the above, you would do:
import fsspec
import xarray as xr

url = "filecache://gamma.hdfgroup.org/ftp/pub/outgoing/NASAHDF/ATL06_20190223232535_08780212_001_01.h5"
with fsspec.open(url, target_protocol='http', cache_storage="/path/to/cache") as f:
    ds = xr.open_dataset(f)
Right, I definitely think using fsspec is the way to go since it's meant for reducing code duplication. That file-caching method looks really awesome! Now the big question is where to insert that piece of logic: here in intake-xarray, or upstream in intake. Just trying to think of other file formats (besides HDF5) that might find this useful too, which might help us decide where to put it.
Most drivers for Intake already use fsspec, since they call python libraries which are happy with the python file-like interface. That did not include HDF5, but it seems it now does. @jhamman @rabernat , is that now generally true, that xarray happily takes file objects for the various backends? If yes, must any of them be specifically local (i.e., with an OS file handle) files?
Hmm, I just tried using the filecache://-based code block you mentioned above to download the file. The file actually downloads fine, but it emits this error:

ValueError: Got more bytes (60681386) than requested (0)

Not sure if it's just the example we're using, but looking at this, it seems the HTTP headers aren't giving the right Content-Length for the HDF5 file.
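As a quick debugging aid (not from the original thread), a HEAD request shows what the server actually reports; this assumes the requests library is available:

import requests

url = "https://gamma.hdfgroup.org/ftp/pub/outgoing/NASAHDF/ATL06_20190223232535_08780212_001_01.h5"
# inspect the headers fsspec relies on; a missing or zero Content-Length is
# consistent with the "requested (0)" part of the error above
resp = requests.head(url, allow_redirects=True)
print(resp.headers.get("Content-Length"))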
> ... so should the netCDF driver be changed to assume URLs are fsspec-openable things, rather than passing to xarray? Should we ask an XR person?

> @jhamman @rabernat, is that now generally true, that xarray happily takes file objects for the various backends? If yes, must any of them be specifically local (i.e., with an OS file handle) files?
A quick search for fsspec in the xarray code repository shows up nothing... I think they're still using gcsfs and s3 explicitly rather than through fsspec?
> A quick search for fsspec in the xarray code repository shows up nothing
That's not what I meant - we are passing a file-like object here, and I'm wondering what assumptions are made about it within xarray and the libraries it calls. Not long ago, it used to extract the path or file handle and load that in the C code, which would of course not work for something remote. I believe it may now be checking explicitly for s3 and http paths and handling them (instead of using the object directly), but I'm not certain.
Xarray can accept file-like objects to open_dataset and pass them along to h5py.
Here is a gist from @scottyhq which shows this functionality. https://nbviewer.jupyter.org/urls/gist.githubusercontent.com/scottyhq/790bf19c7811b5c6243ce37aae252ca1/raw/e2632e928647fd91c797e4a23116d2ac3ff62372/0-load-hdf5.ipynb
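In code, a minimal sketch of that pattern, assuming the h5netcdf backend is the one handling the file-like object (the engine choice is the only addition beyond the earlier snippet):

import fsspec
import xarray as xr

url = "https://gamma.hdfgroup.org/ftp/pub/outgoing/NASAHDF/ATL06_20190223232535_08780212_001_01.h5"
# open the remote file as a binary file-like object and hand it to xarray,
# which passes it through to h5py via the h5netcdf engine
with fsspec.open(url, mode="rb") as f:
    ds = xr.open_dataset(f, engine="h5netcdf")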
Xarray has no dependence on gcsfs or fsspec. For accessing cloud storage, we are usually using xarray in conjunction with zarr. Zarr also has no dependence on gcsfs or fsspec, but it can accept mutable mapping objects produced by those libraries which point to cloud storage.
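For the zarr case, that mutable-mapping hand-off looks roughly like this; the bucket path is hypothetical, and gcsfs (or s3fs) needs to be installed for fsspec to resolve the protocol:

import fsspec
import xarray as xr

# fsspec.get_mapper returns a MutableMapping view of the store; zarr reads
# keys from it without caring that they live in cloud object storage
mapper = fsspec.get_mapper("gs://hypothetical-bucket/dataset.zarr")
ds = xr.open_zarr(mapper)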
The quote in the gist is:
> Seems that h5py >2.9.0 can handle file-like-objects:
So that's all we need to do general fsspec stuff in the intake-xarray netCDF loader. Clearly I was out of date...
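A rough sketch of the kind of change being discussed (illustrative only, not the actual intake-xarray code; the helper name and signature are made up): let the netCDF driver open remote URLs through fsspec and hand the resulting file-like object to xarray, instead of passing the raw URL through.

import fsspec
import xarray as xr

def open_netcdf_url(urlpath, storage_options=None, **xr_kwargs):
    # open the (possibly remote, possibly cached) path as a file-like object
    of = fsspec.open(urlpath, mode="rb", **(storage_options or {}))
    f = of.open()  # keep the handle alive so xarray can read lazily
    return xr.open_dataset(f, **xr_kwargs)

ds = open_netcdf_url(
    "https://gamma.hdfgroup.org/ftp/pub/outgoing/NASAHDF/ATL06_20190223232535_08780212_001_01.h5",
    engine="h5netcdf",
)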
I suspect the file download error above is a simple fix, will have a look.
So the cache code will now work with fsspec from master.
Thank you so much for the quick fix! I've installed fsspec from master and it now works, though I'm still trying to wrap my head around how the pieces fit together.
I've actually found another problem related to downloading too big a file (?) but I'll raise that in a separate issue.
This is just me trying (unsuccessfully) to work out how to access HDF5 files over HTTPS:
Full JSONDecodeError message:
My question is whether it's feasible to have Intake download the HDF5 file from the HTTP URL and persist it locally, since it's not able to stream it directly (at least not easily; see this blog post on the difficulties of accessing HDF in the cloud).
I actually encountered this problem in my work and saw it mentioned in a Stack Overflow question, so I thought I'd ask. Granted, I'm not sure if this issue is even in the right place; should I move it upstream to intake, or to h5netcdf, where these lines are the key to the error? There's also the h5pyd library, which seems to allow remote access of HDF5 files, if it is served over a HDF REST API interface. Thoughts?