intake / intake-xarray

Intake plugin for xarray
https://intake-xarray.readthedocs.io/
BSD 2-Clause "Simplified" License

hdf5 access with ros3 driver #108

Closed satra closed 2 years ago

satra commented 2 years ago

h5py now supports the ros3 driver, which allows streaming access to pieces of HDF5 files on s3 over http using the HDF5 library, and it can work alongside hdf5plugin options. This issue is to ask where it would be appropriate to include support for this.
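
For context, a minimal sketch of the h5py side, assuming an h5py build with ros3 support; the URL and dataset name below are placeholders:

import h5py
import hdf5plugin  # registers extra compression filters with the HDF5 library

# ros3 issues HTTP range requests instead of downloading the whole file
f = h5py.File("https://example-bucket.s3.amazonaws.com/data.h5", driver="ros3")
dset = f["some_dataset"]  # placeholder dataset name
part = dset[:100]         # only the bytes backing this slice are fetched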

martindurant commented 2 years ago

intake-xarray will pass arguments to xarray, which should pass them on to h5py (or h5netcdf). I don't know anything about the ros3 driver, but I don't see why it shouldn't work already. Note, though, that h5py already allows partial access to files on s3/http (or any other fsspec implementation).
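
For illustration, a rough sketch of that pass-through, assuming intake-xarray is installed and using a placeholder path:

import intake

source = intake.open_netcdf(
    "s3://example-bucket/data.h5",         # placeholder path
    chunks={},                             # open lazily with dask
    storage_options={"anon": True},        # forwarded to fsspec/s3fs
    xarray_kwargs={"engine": "h5netcdf"},  # forwarded to xr.open_dataset
)
ds = source.to_dask()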

satra commented 2 years ago

I looked at the xarray documentation and tried a few things. As far as I could tell, there is no direct way of calling open_dataset with an s3/http URL and asking for the engine to be h5py.

With h5py I can simply do: f = h5py.File(url, driver="ros3")

martindurant commented 2 years ago

With h5py I can simply do

Is that something you could then, in turn, open with xarray?

Can you please fill me in on what ROS3 is, and why it's important? I can't find much information about it, except that it exists.

satra commented 2 years ago

Is that something you could then, in turn, open with xarray?

Do you mean something like this? If so, that doesn't work; it returns an error about supported engines.

f = h5py.File(url, driver="ros3")
d = xr.open_dataset(f)  # raises an error about supported engines

The ROS3 (read-only S3) virtual file driver provides streaming access to an HDF5 file. This allows partial access from computational or visualization processes, similar to zarr and other streaming data formats, and it supports the compression and chunking that are common in HDF5 files.
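
For illustration, continuing from an open ros3 file handle f, with a hypothetical chunked dataset name:

dset = f["volume"]                    # placeholder dataset name
print(dset.chunks, dset.compression)  # chunk shape and compression filter
part = dset[0:10]                     # only the chunks intersecting this slice are read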

Many of the files we are dealing with are hundreds of GB in size, so it's really nice to be able to stream them. We can currently use h5py; we were just trying to work out streaming support for HDF5 through a common data model such as xarray. If it can operate on a local h5 file, it should in theory work on a remote one without downloading all the data.

martindurant commented 2 years ago

It seems to me that all of that already works with s3fs, i.e., xr.open_dataset("s3://..", engine="h5netcdf").

satra commented 2 years ago

Yup, I got this to work both ways:

import h5py as h5
import hdf5plugin  # registers extra HDF5 compression filters
import xarray as xr
import s3fs
import numpy as np

# way 1: h5py with the ros3 driver, streaming over HTTPS
url = "https://dandiarchive.s3.amazonaws.com/blobs/e08/02c/e0802c3e-5492-42a6-960a-bcf5b3cfd239"
f = h5.File(url, driver="ros3")
data = f["6"]

# way 2: an s3fs file-like object passed to xarray's h5netcdf engine;
# phony_dims="access" makes h5netcdf invent dimension names for plain HDF5 datasets
s3_url = "s3://dandiarchive/blobs/e08/02c/e0802c3e-5492-42a6-960a-bcf5b3cfd239"
fs = s3fs.S3FileSystem(anon=True)
with fs.open(s3_url, "rb") as fp:
    ds = xr.open_dataset(fp, engine="h5netcdf", phony_dims="access")
    data2 = ds["6"].data

np.allclose(data, data2)

martindurant commented 2 years ago

Note that s3fs (and fsspec in general) provides various byte-range caching options that can make a big difference to access times. I recommend s3.open(..., cache_type="first") for HDF5 data.
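
Applied to the snippet above (reusing s3_url from the earlier comment), that might look like this sketch; "first" caches the opening block of the file, where HDF5 keeps much of its metadata:

import s3fs
import xarray as xr

fs = s3fs.S3FileSystem(anon=True)
# cache_type="first" keeps the first block in memory, which suits HDF5's
# habit of placing the superblock and much of its metadata at the file start
with fs.open(s3_url, "rb", cache_type="first") as fp:
    ds = xr.open_dataset(fp, engine="h5netcdf", phony_dims="access")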

satra commented 2 years ago

Thanks @martindurant, will test things out further.