[Open] rly opened 2 months ago
As an HDF5 array of 1-byte void dtypes:
>>> import h5py
>>> import numpy as np
>>> with h5py.File("test.h5", "w") as f:
...     f.create_dataset(name="data", data=np.frombuffer(video_data, dtype="V1"))
...
<HDF5 dataset "data": shape (1048061,), type "|V1">
>>> with h5py.File("test.h5", "r") as f:
...     data = f["data"][:].tobytes()
...
As a scalar Zarr array:
>>> import zarr
>>> import numpy as np
>>> root = zarr.open('test.zarr', mode='w')
>>> root.create_dataset(name="data", data=np.void(video_data))
<zarr.core.Array '/data' () |V1048061>
>>> root = zarr.open('test.zarr', mode='r')
>>> data = root["data"][()].tobytes()
As a Zarr array of 1-byte void dtypes:
>>> import zarr
>>> import numpy as np
>>> root = zarr.open('test.zarr', mode='w')
>>> root.create_dataset(name="data", data=np.frombuffer(video_data, dtype="V1"))
<zarr.core.Array '/data' (1048061,) |V1>
>>> root = zarr.open('test.zarr', mode='r')
>>> data = root["data"][:].tobytes()
When the data are written as a scalar Zarr array, they are stored in a single chunk, and that chunk is byte-for-byte identical to just writing the bytes to disk. For some reason, though, the fill value is serialized as a base64 string as long as the data itself ("AAAA..." repeated), which makes the .zarray metadata file larger than the chunk it describes.
Alternatively, as shown above, we could store the bytes in a dataset with shape (N,) and dtype V1, which would allow for lazy, iterative access that doesn't overload memory on read. The data are chunked in Zarr, and the fill value is a more reasonable "AA==". That's probably better, but I'm not sure whether there would be any unexpected performance impacts.
Related to https://github.com/NeurodataWithoutBorders/nwb-schema/issues/574 to allow the storage of raw binary data that follows a particular format, e.g., MP4, PNG.
In the HDMF schema language, dtype "bytes" maps to a variable-length string with ASCII encoding. In HDMF, if I try to write an MP4 byte stream with dtype "bytes" to an HDF5 file, I get the error:

ValueError: VLEN strings do not support embedded NULLs

Here is the error with a simple h5py-based example:
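A minimal sketch that reproduces the error; the payload is just the first few bytes of a real MP4 file (the length prefix of the "ftyp" box contains NUL bytes), standing in for a full stream:

```python
import h5py

payload = b"\x00\x00\x00\x18ftypmp42"  # start of an MP4 file; contains embedded NULLs
with h5py.File("test_bytes.h5", "w") as f:
    try:
        # h5py maps a Python bytes object to a variable-length ASCII string,
        # and HDF5 VLEN strings cannot contain NUL bytes.
        f.create_dataset("data", data=payload)
    except ValueError as e:
        print(e)  # VLEN strings do not support embedded NULLs
```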
The h5py docs recommend against storing raw binary data as variable-length strings with an encoding; instead, they suggest wrapping non-text byte blobs in a void type, which maps to the HDF5 OPAQUE dtype and keeps the bytes out of the string machinery.
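Following that recommendation, wrapping the bytes in np.void stores them as a scalar OPAQUE dataset and they round-trip verbatim. A sketch with the same placeholder payload as above:

```python
import h5py
import numpy as np

payload = b"\x00\x00\x00\x18ftypmp42"  # placeholder binary payload with embedded NULLs
with h5py.File("test_opaque.h5", "w") as f:
    # np.void maps to the HDF5 OPAQUE dtype, so the bytes are stored untouched.
    f.create_dataset("data", data=np.void(payload))

with h5py.File("test_opaque.h5", "r") as f:
    # A scalar OPAQUE dataset reads back as an np.void scalar.
    restored = f["data"][()].tobytes()
assert restored == payload
```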
To enable storage of raw binary data, I propose we add a new dtype to the schema language that maps to the HDF5 OPAQUE / void dtype. We can't use the dtype name "bytes" because we already use that for ASCII data. What about "binary"?

Alternatively, raw binary data could be stored as a 1-D array of uint8 values, but using dtype uint8, as opposed to OPAQUE, may lead to the bytes being accidentally converted or interpreted as integer data.
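To illustrate the uint8 alternative and the hazard it carries, a sketch (placeholder payload again): the bytes round-trip fine, but nothing marks the dataset as opaque binary, so it reads back as ordinary integers.

```python
import h5py
import numpy as np

payload = b"\x00\x00\x00\x18ftypmp42"  # placeholder binary payload
with h5py.File("test_uint8.h5", "w") as f:
    f.create_dataset("data", data=np.frombuffer(payload, dtype=np.uint8))

with h5py.File("test_uint8.h5", "r") as f:
    arr = f["data"][:]
    # arr is a plain uint8 array: downstream code can do arithmetic on it or
    # cast it without any hint that these bytes are an encoded MP4 stream.
    restored = arr.tobytes()
assert restored == payload
```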