hdmf-dev / hdmf-common-schema

Specifications for pre-defined data structures provided by HDMF.

Usage of Compound `dtype`? #78

Closed sneakers-the-rat closed 8 months ago

sneakers-the-rat commented 9 months ago

Question for y'all -

I'm writing some tests for my translation and noticing that the newer HDMF objects in resources use compound dtypes for single-column datasets, e.g. https://github.com/hdmf-dev/hdmf-common-schema/blob/c538bc5ef2d1c18036e6973d784085ff851c6dfe/common/resources.yaml#L10

Are these intended to be treated in a distinct way from standard dtypes? I ask because, if I am reading it correctly, datasets with flat dtypes and no explicit quantity or shape are scalars (like e.g. https://github.com/NeurodataWithoutBorders/nwb-schema/blob/ec0a87979608c75a785d3a42f61e3366846ed3c2/core/nwb.file.yaml#L27), according to the schema language definition:

The default value is quantity=1.

The default behavior for shape is:

shape: null

indicating that the attribute/dataset is a scalar.

But it seems like these are intended to be vectors/table-like.
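A quick way to double-check those defaults against hdmf.spec (just a sketch; the name and doc below are made up):

from hdmf.spec import DatasetSpec

# a flat-dtype dataset spec with no explicit shape or quantity,
# mirroring the nwb.file.yaml example linked above (name/doc invented)
spec = DatasetSpec(doc="an example flat-dtype dataset", dtype="text", name="example_dataset")
print(spec.shape)     # None -> scalar per the schema language defaults
print(spec.quantity)  # 1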

I am also curious whether those will still be stored as a compound dtype in the HDF5 file, mostly because those columns have special indexing syntax and I want to know if I should plan for that.

I am asking mostly because I can't find an example dataset that uses these classes, if you have one handy I can use that to answer future questions instead of pestering y'all for the millionth time <3

rly commented 9 months ago

Datasets (and attributes) without explicit shape values are scalars. This should be true for both flat and compound dtypes.

A compound dtype with a single field (column) is different from a flat dtype because the single value has a name and docstring. I think the primary reason we made the KeyTable that way is to identify the dataset as a row-based table, but I don't remember if there was another reason. @oruebel do you remember?
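At the spec level, the difference looks roughly like this (just a sketch; the docs are paraphrased, not copied from resources.yaml):

from hdmf.spec import DatasetSpec, DtypeSpec

# flat dtype: just a type, with no per-value name or doc
flat = DatasetSpec(doc="a flat text dataset", dtype="text", name="flat_example")

# single-field compound dtype: the one value gets its own name and doc,
# roughly how resources.yaml declares the KeyTable's "key" field
compound = DatasetSpec(
    doc="a row-based table with a single named column",
    name="keys",
    dtype=[DtypeSpec("key", "the user term that maps to external resources", "text")],
)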

I had thought that these datasets, where the dtype is a compound dtype with a single column, would be stored as compound dtypes in the HDF5 file. But I was wrong. It turns out that when either HDMF or h5py (not sure which) writes a compound dtype where the field dtypes are all the same, it just writes the data with a flat dtype. If the compound dtype has multiple fields, then it adds a dimension for the number of fields. For example, this is written as a flat text dataset with shape (N, ), and this is written as a flat text dataset with shape (N, 2). That might not be the intended behavior though. Let me think about this.
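To illustrate the resulting layout (plain numpy, not HDMF's actual write path; field names invented):

import numpy as np
from numpy.lib import recfunctions as rfn

# two fields sharing one dtype -> flat array with an extra trailing dimension
compound = np.zeros(4, dtype=[("a", "i4"), ("b", "i4")])
flat = rfn.structured_to_unstructured(compound)
print(flat.dtype, flat.shape)  # int32 (4, 2)

# a single field collapses to what is effectively a flat 1-D array
single = np.zeros(4, dtype=[("a", "i4")])
print(rfn.structured_to_unstructured(single).shape)  # (4, 1); HDMF reportedly writes (N,)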

However, data for these "resources" tables will not be written to HDF5 but will be written as TSV files placed adjacent to the HDF5 file for easy access, modification, and sharing. So we don't have an example file that writes these "resources" tables to HDF5, unfortunately. We can create one in an atypical way though:

from hdmf.common.resources import KeyTable, ObjectKeyTable, ObjectTable
from hdmf.common import SimpleMultiContainer
from hdmf.common import get_hdf5io

# build minimal "resources" tables by hand
key_table = KeyTable()
key_table.add_row("test")
object_key_table = ObjectKeyTable()
object_key_table.add_row(0, 0)
# object_table = ObjectTable()
# object_table.add_row(0, "test", "test", "test", "test")

# wrap the tables in a generic container and write them to HDF5
container = SimpleMultiContainer(name="test", containers=[key_table, object_key_table])
with get_hdf5io("test.h5", "w") as io:
    io.write(container)
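
To check what actually ends up on disk after running the above (assuming h5py is installed):

import h5py

# walk the file and print each dataset's on-disk dtype and shape
with h5py.File("test.h5", "r") as f:
    def show(name, obj):
        if isinstance(obj, h5py.Dataset):
            print(name, obj.dtype, obj.shape)
    f.visititems(show)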

Instead of instantiating these tables directly, you can also use HERD for external resources, which we are working on. Here is a reference: https://hdmf.readthedocs.io/en/stable/tutorials/plot_external_resources.html#sphx-glr-tutorials-plot-external-resources-py

sneakers-the-rat commented 9 months ago

Aha! Well, that simplifies things if they are a wholly different kind of thing; I won't worry about it for now :) Thanks so much for the quick response!

I just redid how I was doing HDF5 -> pydantic translations in a way that should make it easier to adapt to changes that might include additional files. I am trying to keep my implementation independent from pynwb and hdmf for the sake of practicing this whole "standards" thing - so far, outside of the issues I've raised, I've been able to fully replicate from the docs, so that's a good sign the standard is, well, standard!

For now I have a sort of "cheap" graph-like reader: in multiple passes, try to resolve what can be resolved in that pass, and come back for the other pieces when they're ready (rough sketch below). That's handy for cases where there are multiple potential ways a model field can be stored (e.g. as an attr, a dataset, a column within a dataset), because you can just try them all and see what resolves. I split it up into fast checks and slow applies so it's not too costly either. That'll be expanded to a "true" graph traversal reader that should be able to crawl across files as well as within them. It's sort of important to what I have in mind here for interop with datasets/formats in different schema langs and serializations, so handling external resources indexed in several files sounds like a fun next step :)
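The multi-pass idea, roughly (all names here are made up; just a sketch):

# sketch of the multi-pass resolver described above (reader/node API is invented)
def resolve_all(nodes, readers):
    """Repeatedly try each reader on each unresolved node until nothing changes."""
    unresolved = list(nodes)
    resolved = {}
    while unresolved:
        progress = False
        remaining = []
        for node in unresolved:
            for reader in readers:
                if reader.check(node, resolved):  # fast check: can this be handled yet?
                    resolved[node.path] = reader.apply(node, resolved)  # slow apply
                    progress = True
                    break
            else:
                remaining.append(node)  # not ready; come back for it in a later pass
        if not progress:
            break  # nothing new resolved this pass; leave the rest unresolved
        unresolved = remaining
    return resolved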

Edit: question answered, feel free to close unless keeping it open is helpful for some reason ❤️