Open CodyCBakerPhD opened 4 months ago
`get_data_shape` is used by the shape validator. If `maxshape` is present, then `get_data_shape` returns `maxshape`. Newly written data will have `maxshape` set to None for all axes. This will break the shape validator. I don't understand why `get_data_shape` returns `maxshape` if present. But `get_data_shape` is used in many places for many types of data, including `DataChunkIterator`. Maybe the `get_data_shape` called for DCI needs to be a different function, or there should be a flag to return `maxshape` if present.
@mavaylon1 could you look into this? We can revert that PR if needed.
`get_data_shape` is also used by the validator, so I worry that newly written datasets with a shape requirement that is non-None in some axis will be marked as invalid, because the dataset will be processed as having shape `[None] * ndim`.
> Newly written data will have `maxshape` set to None for all axes.
It seems that this may be too "aggressive". If the schema specifies a dimension as fixed in size, then we should not set it to None. I.e., we should only make the dimensions expandable that the schema allows to be expandable. Is this information somehow communicated to the backend in the Builder, so that we could adjust the logic that was added in https://github.com/hdmf-dev/hdmf/pull/1093 to make only the dimensions that are not fixed in the schema expandable?
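To make the idea above concrete, here is a minimal sketch of deriving `maxshape` from a schema shape spec instead of unconditionally using None for every axis. The helper name `maxshape_from_spec` and the convention that None in a schema shape marks an expandable dimension are assumptions for illustration, not hdmf API.

```python
def maxshape_from_spec(schema_shape, data_shape):
    """Hypothetical helper: make only schema-expandable dimensions unlimited.

    schema_shape: e.g. (None, 3), where None marks an expandable dimension
                  and an int marks a fixed size; None means no constraint.
    data_shape:   the actual shape of the data being written.
    """
    if schema_shape is None:
        # No shape constraint in the schema: every axis may grow.
        return (None,) * len(data_shape)
    return tuple(
        None if spec_dim is None else actual_dim
        for spec_dim, actual_dim in zip(schema_shape, data_shape)
    )

# A spec of (None, 3) keeps the second axis fixed:
print(maxshape_from_spec((None, 3), (10, 3)))  # (None, 3)
print(maxshape_from_spec(None, (10, 3)))       # (None, None)
```

Under this sketch, only the axes the schema allows to grow become unlimited, while fixed axes keep their written size.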
The backend does not have direct access to the schema associated with a builder and is intentionally siloed from the schema. `write_builder`, `write_group`, `write_dataset`, etc. write the builder data as they were received. The builder also intentionally does not have access to the schema. The backend could query for the schema associated with a builder through the `BuildManager`, but that breaks the separation. It might be better to have the `ObjectMapper` set a `maxshape` property on a `DatasetBuilder` during build and use that when calling `write_dataset` if `expandable=True`. Alternatively, the builder could maintain a reference to the spec that it is mapped to. We would have to modify every place where a builder is created. That doesn't seem so bad, but there may be negative consequences of breaking that separation.
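A rough sketch of the first proposal, using simplified stand-in classes rather than the real hdmf `ObjectMapper`/`DatasetBuilder` (whose signatures are not shown here): the mapper computes `maxshape` from the spec at build time and stashes it on the builder, so the backend can honor it without touching the schema.

```python
class DatasetBuilder:
    """Simplified stand-in for hdmf's DatasetBuilder."""
    def __init__(self, name, data, maxshape=None):
        self.name = name
        self.data = data
        self.maxshape = maxshape  # set by the mapper, consumed by the backend

def build_dataset(name, data, spec_shape):
    """Mapper-side sketch: translate the spec's shape into a maxshape."""
    data_shape = (len(data),)  # simplified: 1-D data only
    if spec_shape is None:
        maxshape = (None,) * len(data_shape)
    else:
        maxshape = tuple(
            None if s is None else d for s, d in zip(spec_shape, data_shape)
        )
    return DatasetBuilder(name, data, maxshape=maxshape)

def write_dataset(builder, expandable=True):
    """Backend-side sketch: honor the precomputed maxshape when expandable."""
    maxshape = builder.maxshape if expandable else None
    return {"name": builder.name, "data": builder.data, "maxshape": maxshape}

b = build_dataset("timestamps", [0.0, 0.1, 0.2], spec_shape=(None,))
print(write_dataset(b)["maxshape"])  # (None,)
```

The point of the design is that `write_dataset` never sees the schema, only the `maxshape` the mapper precomputed, preserving the siloing described above.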
Let me see if I understand. `get_data_shape` is used by the shape validator, and the validator expects shape, not maxshape. If so, do we want the validator to verify the shape and not maxshape? If that is the case, then I also agree it is weird to have maxshape returned if present.
As for the DCI, we could add a boolean parameter to `get_data_shape` that, when True, returns maxshape if present, with a default of False.
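A sketch of what that flag could look like, assuming a much-simplified `get_data_shape` (the real hdmf function handles many more data types, and the parameter name `use_maxshape` is made up here):

```python
def get_data_shape(data, use_maxshape=False):
    """Return the actual shape; only return maxshape when explicitly asked."""
    if use_maxshape and hasattr(data, "maxshape"):
        return data.maxshape
    if hasattr(data, "shape"):
        return tuple(data.shape)
    # Fall back to walking nested lists/tuples.
    shape = []
    while isinstance(data, (list, tuple)):
        shape.append(len(data))
        data = data[0] if data else None
    return tuple(shape)

class FakeDataset:
    """Stand-in for an h5py-like dataset with an unlimited first axis."""
    shape = (10, 3)
    maxshape = (None, 3)

print(get_data_shape(FakeDataset()))                     # (10, 3)
print(get_data_shape(FakeDataset(), use_maxshape=True))  # (None, 3)
```

With the default of False, the validator would see the actual shape, while DCI callers that genuinely need maxshape could opt in.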
For the data that is fixed in size in the schema, I would need to give this more thought. @rly Thoughts?
@mavaylon1 Yes, the validator should validate using the actual shape, not maxshape.
TODO: `get_data_shape` should return shape unless the new bool parameter for maxshape is True.

@rly I am thinking about the problem for shapes defined in the schema. How are these allowed to be written? By setting maxshape right at the end in dset, I think we are skipping shape checks that would've prevented the data from being written in the first place. This leads to errors on read. This assumes there is a check.
I think if on write we use the shape from the schema, this should leave read alone.
I think shape is validated before write in the docval of init of the particular container class. If there is a custom container class, then the shape validation in docval is supposed to be consistent with the schema. If the container class is auto-generated, then the shape validation parameters in docval are generated from the schema.
I'm not sure if the shape is being validated elsewhere. It is on the todo list to run the validator before write though.
If `get_data_shape` returns the actual shape instead of maxshape during validation, then the errors should be fixed. However, the `maxshape` would be set to None in every axis, which is overly lenient, and such a maxshape does not conform to the shape rule in the schema. So if I understand your last comment correctly, then yes, if we set maxshape to the shape from the schema, then the maxshape would be nice and consistent with the schema.
One edge case is that the shape in the schema can be a list of shape options, e.g., images can have shape `[None, None]`, `[None, None, 3]`, or `[None, None, 4]`, which correspond to `[width, height]`, `[width, height, rgb]`, and `[width, height, rgba]`. The `maxshape` should correspond to whichever shape option out of the allowed shape options from the schema best matches the data; hopefully there is only one, but I haven't thought through all the possible variations.
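The shape-option matching described above could be sketched as follows. This is a hypothetical helper, not hdmf code; it treats a spec dimension of None as "any size" and an int as an exact requirement, and returns every compatible option so ambiguity (more than one match) is visible to the caller.

```python
def matching_shape_options(data_shape, shape_options):
    """Return the allowed shape specs compatible with data_shape.

    A spec dimension of None matches any size; an int must match exactly.
    """
    def matches(spec):
        return len(spec) == len(data_shape) and all(
            s is None or s == d for s, d in zip(spec, data_shape)
        )
    return [spec for spec in shape_options if matches(spec)]

# The image example from the schema: grayscale, RGB, or RGBA.
options = [(None, None), (None, None, 3), (None, None, 4)]
print(matching_shape_options((480, 640, 3), options))  # [(None, None, 3)]
print(matching_shape_options((480, 640), options))     # [(None, None)]
```

For the image example the number of dimensions already disambiguates the options, but if two options of the same rank both matched, some tie-breaking rule would still be needed.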
Yeah I believe the shape is validated in docval. What I was thinking about was the goal you mentioned of having the shape be validated prior to write.
Note: we need to consider this when working on implementing expandable datasets in HDMF again, @mavaylon1
What happened?
I believe the merge of https://github.com/hdmf-dev/hdmf/pull/1093 broke some things in NeuroConv.
I mainly suspect that PR because it seems to be the only one merged in the last 2 days, and our dev tests were passing fine before then: https://github.com/catalystneuro/neuroconv/actions/workflows/dev-testing.yml
It might be advantageous to set up some dev testing of NeuroConv here on HDMF to ensure PRs don't have ripple effects throughout the ecosystem (for example, NWB Inspector tests against both dev PyNWB and downstream DANDI to ensure both upward and downward compatibility).
The full log: https://github.com/catalystneuro/neuroconv/actions/runs/9006012395/job/24742623348?pr=845
Seems to be affecting all interfaces, caught during the roundtrip stage of testing (files seem to write just fine, but don't read back).
The final line of the traceback might be the most informative: some shape property has become `None` instead of a finite value (which seems to be what is expected).

Steps to Reproduce
test parameters (HDF5 datasets might have closed following pytest grabbing info)
Traceback
Operating System: macOS
Python Executable: Conda
Python Version: 3.12
Package Versions: No response