p2p-ld / numpydantic

Type annotations for specifying, validating, and serializing arrays with arbitrary backends in Pydantic (and beyond)
https://numpydantic.readthedocs.io/
MIT License

Add ability to index fields within hdf5 compound dtypes #11

Closed sneakers-the-rat closed 2 months ago

sneakers-the-rat commented 2 months ago

HDF5 datasets can have compound dtypes, for example:

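import h5py
import numpy as np

h5f_path = "mydata.h5"  # file name taken from the model example below
dtype = np.dtype([("data", "i8"), ("extra", "f8")])
data = np.zeros((10, 20), dtype=dtype)
with h5py.File(h5f_path, "w") as h5f:
    dset = h5f.create_dataset("/dataset", data=data)
    # Read while the file is still open; the dataset cannot be read once it closes.
    first_row = dset[0:1]
>>> first_row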
array([[(0, 0.), (0, 0.), (0, 0.), (0, 0.), (0, 0.), (0, 0.), (0, 0.),
        (0, 0.), (0, 0.), (0, 0.), (0, 0.), (0, 0.), (0, 0.), (0, 0.),
        (0, 0.), (0, 0.), (0, 0.), (0, 0.), (0, 0.), (0, 0.)]],
      dtype=[('data', '<i8'), ('extra', '<f8')])
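
For reference, a minimal NumPy-only sketch of what "fields" means here: indexing a structured array with a field name gives back a plain array of just that field. The variable names `data_field` and `extra_field` are illustrative and not part of the PR.

```python
import numpy as np

# Same compound dtype as in the example above.
dtype = np.dtype([("data", "i8"), ("extra", "f8")])
data = np.zeros((10, 20), dtype=dtype)

# Indexing with a field name selects only that field,
# keeping the array shape and dropping the compound dtype.
data_field = data["data"]
extra_field = data["extra"]

print(dtype.names)                           # ('data', 'extra')
print(data_field.dtype, data_field.shape)    # int64 (10, 20)
print(extra_field.dtype, extra_field.shape)  # float64 (10, 20)
```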

Sometimes we want to split those out to separate fields like this:

from typing import Any
from pydantic import BaseModel
from numpydantic import NDArray

class MyModel(BaseModel):
    data: NDArray[Any, np.int64]
    extra: NDArray[Any, np.float64]

This PR makes that possible by adding a field parameter to H5ArrayPath:

from numpydantic.interfaces.hdf5 import H5ArrayPath

my_model = MyModel(
    data=H5ArrayPath(file="mydata.h5", path="/dataset", field="data"),
    extra=H5ArrayPath(file="mydata.h5", path="/dataset", field="extra"),
)

# or just with tuples
my_model = MyModel(
    data=("mydata.h5", "/dataset", "data"),
    extra=("mydata.h5", "/dataset", "extra"),
)
>>> my_model.data[0,0]
0
>>> my_model.data.dtype
np.dtype('int64')
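
For comparison, here is a rough sketch of reading a single field straight from h5py, which is the kind of access the field argument exposes. This is only an illustration of the h5py behavior, not necessarily how numpydantic implements it internally. The temporary file path is an assumption made so the sketch can run on its own.

```python
import os
import tempfile

import h5py
import numpy as np

dtype = np.dtype([("data", "i8"), ("extra", "f8")])
# Throwaway file so the sketch does not touch a real mydata.h5 (assumption).
h5f_path = os.path.join(tempfile.mkdtemp(), "mydata.h5")

with h5py.File(h5f_path, "w") as h5f:
    h5f.create_dataset("/dataset", data=np.zeros((10, 20), dtype=dtype))

with h5py.File(h5f_path, "r") as h5f:
    dset = h5f["/dataset"]
    # h5py accepts a field name as the selection and reads only that field.
    data_field = dset["data"]
    extra_field = dset["extra"]

print(data_field.dtype, data_field.shape)    # int64 (10, 20)
print(extra_field.dtype, extra_field.shape)  # float64 (10, 20)
```

Both reads happen inside the `with` block because an h5py dataset cannot be indexed after its file is closed; the arrays returned are ordinary in-memory NumPy arrays and stay valid afterwards.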