hdmf-dev / hdmf-schema-language

The specification language for HDMF
https://hdmf-schema-language.readthedocs.io

Add binary/opaque dtype #34

Open rly opened 2 months ago

rly commented 2 months ago

Related to https://github.com/NeurodataWithoutBorders/nwb-schema/issues/574 to allow the storage of raw binary data that follows a particular format, e.g., MP4, PNG.

In the HDMF schema language, dtype "bytes" maps to a variable-length string with ASCII encoding. In HDMF, if I try to write an MP4 byte stream with dtype "bytes" to an HDF5 file, I get the error ValueError: VLEN strings do not support embedded NULLs.

Here is the error reproduced with a simple h5py-based example:

import h5py
video_data = open("video.mp4", "rb").read()  # illustrative: any raw binary stream with embedded NULLs
f = h5py.File("test.h5", "w")
f.create_dataset(name="data", data=video_data, dtype=h5py.string_dtype('ascii'))
# NOTE: h5py.string_dtype('ascii') is equivalent to h5py.special_dtype(vlen=bytes)
# NOTE: f.create_dataset(name="data", data=video_data) assumes the data is a string and will return the same error
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/rly/mambaforge/envs/temp/lib/python3.11/site-packages/h5py/_hl/group.py", line 183, in create_dataset
    dsid = dataset.make_new_dset(group, shape, dtype, data, name, **kwds)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/rly/mambaforge/envs/temp/lib/python3.11/site-packages/h5py/_hl/dataset.py", line 166, in make_new_dset
    dset_id.write(h5s.ALL, h5s.ALL, data)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5d.pyx", line 282, in h5py.h5d.DatasetID.write
  File "h5py/_proxy.pyx", line 147, in h5py._proxy.dset_rw
  File "h5py/_conv.pyx", line 442, in h5py._conv.str2vlen
  File "h5py/_conv.pyx", line 96, in h5py._conv.generic_converter
  File "h5py/_conv.pyx", line 254, in h5py._conv.conv_str2vlen
ValueError: VLEN strings do not support embedded NULLs

The h5py docs recommend against storing raw binary data as variable-length strings with an encoding. They say:

If you have a non-text blob in a Python byte string (as opposed to ASCII or UTF-8 encoded text, which is fine), you should wrap it in a void type for storage. This will map to the HDF5 OPAQUE datatype, and will prevent your blob from getting mangled by the string machinery.

To enable storage of raw binary data, I propose we add a new dtype to the schema language that maps to the HDF5 OPAQUE / NumPy void dtype. We can't use the dtype name "bytes" because we already use that for ASCII data. What about "binary"?

>>> import h5py
>>> import numpy as np
>>> with h5py.File("test.h5", "w") as f:
...     f.create_dataset(name="data", data=np.void(video_data))
... 
<HDF5 dataset "data": shape (), type "|V1048061">
>>> with h5py.File("test.h5", "r") as f:
...     data = f["data"][()].tobytes()
...

Alternatively, raw binary data could be stored as a 1-D array of uint8 values, but using dtype uint8, as opposed to OPAQUE, may invite accidental conversion or numeric interpretation of the bytes.
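For comparison, here is a minimal sketch of that uint8 alternative (the file names are illustrative). Note that the stored dataset advertises itself as ordinary integer data, which is exactly the conversion risk:

import h5py
import numpy as np

video_data = open("video.mp4", "rb").read()  # illustrative source of raw bytes

with h5py.File("test_uint8.h5", "w") as f:
    f.create_dataset(name="data", data=np.frombuffer(video_data, dtype=np.uint8))

with h5py.File("test_uint8.h5", "r") as f:
    data = f["data"][:].tobytes()  # the uint8 array must be explicitly re-packed into bytes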

rly commented 2 months ago

As an HDF5 array of 1-byte void dtypes:

>>> import h5py
>>> import numpy as np
>>> with h5py.File("test.h5", "w") as f:
...     f.create_dataset(name="data", data=np.frombuffer(video_data, dtype="V1"))
... 
<HDF5 dataset "data": shape (1048061,), type "|V1">
>>> with h5py.File("test.h5", "r") as f:
...     data = f["data"][:].tobytes()
... 

As a scalar Zarr array:

>>> import zarr
>>> import numpy as np
>>> root = zarr.open('test.zarr', mode='w')
>>> root.create_dataset(name="data", data=np.void(video_data))
<zarr.core.Array '/data' () |V1048061>
>>> root = zarr.open('test.zarr', mode='r')
>>> data = root["data"][()].tobytes()

As a Zarr array of 1-byte void dtypes:

>>> import zarr
>>> import numpy as np
>>> root = zarr.open('test.zarr', mode='w')
>>> root.create_dataset(name="data", data=np.frombuffer(video_data, dtype="V1"))
<zarr.core.Array '/data' (1048061,) |V1>
>>> root = zarr.open('test.zarr', mode='r')
>>> data = root["data"][:].tobytes()

rly commented 2 months ago

When the data are written as a scalar Zarr array, they are stored in a single chunk, and that chunk is equal to just writing the bytes to disk. For some reason, though, the fill value in .zarray is set to a long run of "AAAA..." (apparently the base64 encoding of a full-length, all-zero |V1048061 value), which makes .zarray larger than the chunk itself... -.-
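For reference, a quick sketch for inspecting that metadata, assuming the Zarr v2 on-disk layout where the array metadata lives at test.zarr/data/.zarray:

import json

with open("test.zarr/data/.zarray") as f:
    meta = json.load(f)
print(len(meta["fill_value"]))  # ~1.4 million characters of base64 for |V1048061
print(meta["fill_value"][:16])  # "AAAAAAAAAAAAAAAA" -- base64-encoded zero bytes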

Alternatively, as shown above, we could store the bytes in a dataset with shape (N,) and dtype V1, which would allow lazy, iterative access that doesn't overload memory on read. The data are chunked in Zarr, and the fill value is a more reasonable AA==. That's probably better, but I'm not sure whether there would be any unexpected performance impacts.
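For illustration, a minimal sketch of such chunk-wise reading, assuming "test.zarr" was written as in the (N,) V1 example above (the 1 MiB block size is an arbitrary choice):

import zarr

root = zarr.open("test.zarr", mode="r")
data = root["data"]  # shape (N,), dtype V1; no bytes are read yet
block_size = 1024 * 1024  # read 1 MiB per iteration
for start in range(0, data.shape[0], block_size):
    block = data[start:start + block_size].tobytes()  # one slice at a time
    # ... process block incrementally, e.g., hash it or stream it to a decoder ...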