Closed satra closed 2 years ago
intake-xarray will pass arguments to xarray, which should pass them on to h5py (or h5netcdf). I don't know anything about the ros2 drive, but I don't see why it shouldn't work already. Note, though, that h5py already does allow partial access to files on s3/http (or any other fsspec implementation).
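As a sketch of that pattern (using a local file so the example is self-contained; the file name and dataset name here are made up, and with s3fs installed the same `fsspec.open` call works on an `s3://` URL):

```python
import os
import tempfile

import fsspec
import h5py
import numpy as np

# build a small HDF5 file to stand in for a remote object (name is made up)
path = os.path.join(tempfile.mkdtemp(), "example.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("x", data=np.arange(10))

# open through fsspec and hand the file-like object to h5py; on s3/http,
# fsspec fetches only the byte ranges h5py asks for
with fsspec.open(path, "rb") as fobj:
    with h5py.File(fobj, "r") as h5:
        vals = h5["x"][:3]

print(vals)  # -> [0 1 2]
```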
i looked at the xarray documentation and tried a few things. as far as i could tell, there is no direct way of using open_dataset with an s3 http url and asking the engine to be h5py.
with h5py i can simply do: f = h5py.File(url, driver="ros3")
Is that something you could then, in turn, open with xarray?
Can you please fill me in on what ROS3 is, and why it's important? I can't find much information about it, except that it exists.
Is that something you could then, in turn, open with xarray?
do you mean something like this? if so, that doesn't work; it returns an error about supported engines.
f = h5py.File(url, driver="ros3")
d = xr.open_dataset(f)
the ROS3 virtual file driver provides streaming access to an HDF5 file. this allows partial access across computational or visualization processes, similar to zarr or other streaming data forms. this includes support for compression and chunking as is common in hdf5 files.
many of the files we are dealing with are 100s of GBs in size, so it's really nice to be able to use the stream method. we can currently use h5py. we were just trying to work out streaming support of hdf5 through a common data model such as xarray. if it can operate on a local h5 file, it should in theory work on a remote one without downloading all the data.
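As a local illustration of why that holds (file name, dataset name, and sizes here are made up): with a chunked dataset, slicing reads only the chunks the selection touches, and the ros3 driver applies the same logic to byte ranges fetched over HTTP.

```python
import os
import tempfile

import h5py
import numpy as np

# a chunked, gzip-compressed dataset, laid out the way large HDF5 files usually are
path = os.path.join(tempfile.mkdtemp(), "big.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("6", data=np.arange(1_000_000, dtype="f4"),
                     chunks=(10_000,), compression="gzip")

with h5py.File(path, "r") as f:
    # reads (and decompresses) only the first chunk, not the whole dataset;
    # over S3 this would be h5py.File(url, driver="ros3") with the same slicing
    part = f["6"][:100]
```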
It seems to me that all of that already works with s3fs, i.e., xr.open_dataset("s3://..", engine="h5netcdf").
On Sep 29, 2021, at 11:29, Satrajit Ghosh wrote:
yup i got this to work both ways:
import h5py as h5
import hdf5plugin  # registers extra HDF5 compression filters
import xarray as xr
import s3fs
import numpy as np

# direct streaming access over HTTPS with the ros3 driver
url = "https://dandiarchive.s3.amazonaws.com/blobs/e08/02c/e0802c3e-5492-42a6-960a-bcf5b3cfd239"
f = h5.File(url, driver="ros3")
data = f["6"]

# the same file through s3fs/fsspec and the h5netcdf engine
s3_url = "s3://dandiarchive/blobs/e08/02c/e0802c3e-5492-42a6-960a-bcf5b3cfd239"
fs = s3fs.S3FileSystem(anon=True)
with fs.open(s3_url, "rb") as fp:
    ds = xr.open_dataset(fp, engine="h5netcdf", phony_dims="access")
    data2 = ds["6"].data
    np.allclose(data, data2)
Note that s3fs (and fsspec in general) provides various byte-caching options that can make a big difference to access times. I recommend s3.open(..., cache_type="first") for HDF5 data.
thanks @martindurant will test things out further.
h5py now has support for the ros3 driver, which allows streaming access to pieces of hdf5 files on s3 over http using the hdf5 library, and can work alongside hdf5plugin options. this issue is to check what would be an appropriate way to include support for this.