hdmf-dev / hdmf

The Hierarchical Data Modeling Framework
http://hdmf.readthedocs.io

VectorData Expand by Default via write_dataset #1093

Closed mavaylon1 closed 2 months ago

mavaylon1 commented 3 months ago

Motivation

What was the reasoning behind this change? Please explain the changes briefly.

This change makes expandable datasets the new default behavior when writing VectorData. We do this by providing a maxshape to dataset settings that do not already have a maxshape defined by the user.
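For context, a minimal h5py sketch of what providing a maxshape does (the file and dataset names are examples only, not part of this change):

```python
import h5py
import numpy as np

# Passing maxshape=(None,) makes a dataset resizable along that axis, whereas
# omitting maxshape fixes the shape at creation time.
with h5py.File("example.h5", "w") as f:
    fixed = f.create_dataset("fixed", data=np.arange(3))
    expandable = f.create_dataset("expandable", data=np.arange(3), maxshape=(None,))
    expandable.resize((5,))        # allowed, because the first axis is unlimited
    print(fixed.maxshape)          # (3,)
    print(expandable.maxshape)     # (None,)
```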

How to test the behavior?

Show how to reproduce the new behavior (can be a bug fix or a new feature)

Tests

Checklist

mavaylon1 commented 3 months ago

Notes and Questions:

  1. Does scalar_fill imply that the dataset has only 1 value and should only have 1? There is no docstring. If so, I can move the logic for max_shape into list_fill.
  2. There are three cases with references where the shape is defined within required_dataset. I assume this is because get_data_shape returns a shape (#,1) where "#" is the number of references. This is something I found as an edge case. The quickest solution is to set maxshape at each of the three locations, the precedent being that each of the three locations already repeats setting the shape. @oruebel

oruebel commented 3 months ago
  1. Does scalar_fill imply that the dataset has only 1 value and should only have 1?

Are you referring to

https://github.com/hdmf-dev/hdmf/blob/d85d0cbc36d2e0fdb25e8fbea14d58ba7bf24a40/src/hdmf/backends/hdf5/h5tools.py#L1319

If so, this function is used to write scalar datasets, i.e., datasets with a single value.
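For reference, a small h5py illustration (the names are examples) of why the scalar path does not need the new maxshape logic:

```python
import h5py

# A scalar dataset has the empty shape () and cannot be resized, so the
# expandable-by-default behavior only matters for the non-scalar write paths.
with h5py.File("scalar_example.h5", "w") as f:
    dset = f.create_dataset("scalar", data=5)
    print(dset.shape)     # ()
    print(dset.maxshape)  # ()
```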

2. There are three cases with references where the shape is defined within required_dataset.

Could you point to the case you are referring to? require_dataset is usually used to create a dataset if it doesn't exist and open the dataset if it does.

2. The quickest solution is to set maxshape at each of the three locations.

This would mean making all datasets expandable by enabling chunking for all datasets. That is a broader approach than making this the default just for VectorData, but it would make it the default behavior for all (non-scalar) datasets. If that is the approach we'd want to take, then I would suggest adding a parameter enable_chunking=True on HDF5IO.write and HDF5IO.export so that we can configure the default behavior for write. @rly thoughts?

mavaylon1 commented 3 months ago
  1. Does scalar_fill imply that the dataset has only 1 value and should only have 1?

Are you referring to

https://github.com/hdmf-dev/hdmf/blob/d85d0cbc36d2e0fdb25e8fbea14d58ba7bf24a40/src/hdmf/backends/hdf5/h5tools.py#L1319

If so, this function is used to write scalar datasets, i.e., datasets with a single value.

2. There are three cases with references where the shape is defined within required_dataset.

Could you point to the case you are referring to? require_dataset is usually used to create a dataset if it doesn't exist and open the dataset if it does.

2. The quickest solution is to set maxshape at each of the three locations.

This would mean making all datasets expandable by enabling chunking for all datasets. That is a broader approach than making this the default just for VectorData, but it would make it the default behavior for all (non-scalar) datasets. If that is the approach we'd want to take, then I would suggest adding a parameter enable_chunking=True on HDF5IO.write and HDF5IO.export so that we can configure the default behavior for write. @rly thoughts?

Is the enable_chunking parameter meant to give the user the option to turn off the expandable default? If so, would there be a reason they would want to?

oruebel commented 3 months ago

Is the enable_chunking parameter meant to give the user the option to turn off the expandable default? If so, would there be a reason they would want to?

In my experience it is best to make choices explicit and provide useful defaults rather than hiding configurations. A user may not want to use chunking if they want to use numpy memory mapping to read contiguous datasets.
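To make that trade-off concrete, here is a rough sketch (file and dataset names are hypothetical) of the numpy memory-mapping pattern, which only works when the dataset is stored contiguously rather than chunked:

```python
import h5py
import numpy as np

# Contiguous (non-chunked) datasets can be memory-mapped directly with numpy;
# chunked datasets cannot, because their bytes are not laid out contiguously.
with h5py.File("example.h5", "w") as f:
    f.create_dataset("data", data=np.arange(10, dtype="float64"))  # contiguous

with h5py.File("example.h5", "r") as f:
    dset = f["data"]
    offset = dset.id.get_offset()  # byte offset within the file; None if chunked
    dtype, shape = dset.dtype, dset.shape

mm = np.memmap("example.h5", mode="r", dtype=dtype, offset=offset, shape=shape)
print(mm[:5])
```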

mavaylon1 commented 2 months ago

@rly I will shoot to have this done by next week (formerly Friday May 3).

mavaylon1 commented 2 months ago

Dev Notes: When writing datasets, we have a few options:

  1. setup_chunked_dset: This is for DataChunkIterator, so we do not need to mess with this.
  2. __scalar_fill__: Scalar datasets do not need to be extended by default, by nature of being scalar.
  3. setup_empty_dset: This is also for DataChunkIterator, so we do not need to mess with this.
  4. __list_fill__: Writes a regular in-memory array (e.g., numpy array, list, etc.).

From my understanding, we only need to modify the input parameter options for list_fill.

Now, Oliver mentioned being more explicit with a switch enable_chunking=True (default will be True) on HDF5IO.write and HDF5IO.export so that we can configure the default behavior for write. This will need to be passed through the chain of methods from write and export to write_dataset.

oruebel commented 2 months ago

From my understanding, we only need to modify the input parameter options for list_fill.

Now, Oliver mentioned being more explicit with a switch enable_chunking=True (default will be True) on HDF5IO.write and HDF5IO.export so that we can configure the default behavior for write. This will need to be passed through the chain of methods from write and export to write_dataset.

Yes, I believe that is correct. I think only the logic in list_fill should need to be modified, and then the enable_chunking setting will need to be passed through. Note, list_fill is already being passed the argument options, which contains io_settings, so I think you may just need to set chunks=True in the io_settings (if chunks is set to None) to enable chunking. I'm not sure whether it will be easier to make this change to io_settings in list_fill or to update the io_settings outside of list_fill so that list_fill would not need to change at all.

https://github.com/hdmf-dev/hdmf/blob/126bdb100c6d5ce3e2dadd375de9d32524219404/src/hdmf/backends/hdf5/h5tools.py#L1432C5-L1438C53
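A minimal sketch of that idea (the helper name is made up; the real change would live in or around list_fill, and could equivalently set chunks=True instead of a maxshape):

```python
from hdmf.utils import get_data_shape

# Hypothetical helper: if the caller configured neither chunking nor a maxshape,
# make the dataset expandable in every dimension. A maxshape of all None also
# implicitly enables chunking in h5py.
def apply_expandable_default(io_settings, data):
    if "maxshape" not in io_settings and "chunks" not in io_settings:
        data_shape = get_data_shape(data)
        if data_shape:  # skip scalars / empty shapes
            io_settings["maxshape"] = tuple(None for _ in data_shape)
    return io_settings

print(apply_expandable_default({}, [1, 2, 3]))                # {'maxshape': (None,)}
print(apply_expandable_default({"chunks": (2,)}, [1, 2, 3]))  # left untouched
```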

mavaylon1 commented 2 months ago

Tests:

  1. Compound cases: added a check to an existing test.
  2. Regular array data: added a check to an existing test.
  3. Existing tests: updated three tests (refer to the changed files).
  4. Check, with a test, that when maxshape is already set, my update does not interfere.

codecov[bot] commented 2 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 88.70%. Comparing base (126bdb1) to head (58f3bf0).

Additional details and impacted files

```diff
@@           Coverage Diff           @@
##              dev    #1093   +/-   ##
=======================================
  Coverage   88.70%   88.70%
=======================================
  Files          45       45
  Lines        9745     9748    +3
  Branches     2767     2769    +2
=======================================
+ Hits         8644     8647    +3
  Misses        779      779
  Partials      322      322
```


mavaylon1 commented 2 months ago

@oruebel This is mostly done. I need to check/update or write a test to make sure my changes do not interfere with any existing maxshape settings. (And do another pass to make sure the logic is efficient.) However, the main point I want to bring up is your idea of having a parameter for turning the expandability on and off. This would mean HDMFIO has a parameter that is not used in ZarrIO. In fact, there are tests failing due to the parameter not being recognized. I see we have two options:

  1. Update hdmf-zarr to accept the arg and do nothing with it.
  2. Don't have the arg.

I do like the explicit nature of having it, but I also think having this trickle into hdmf-zarr is not clean.

oruebel commented 2 months ago

This would mean HDMFIO has a parameter that is not used in ZarrIO.

I don't think the parameter needs to be in HDMFIO. I think it's OK to just add it as a parameter in HDF5IO.

mavaylon1 commented 2 months ago

This would mean HDMFIO has a parameter that is not used in ZarrIO.

I don't think the parameter needs to be in HDMFIO. I think it's OK to just add it as a parameter in HDF5IO.

HDF5IO.write needs to call write_builder. It does that by calling super().write(**kwargs). This then gets us to HDMFIO.write, which calls write_builder.

oruebel commented 2 months ago

HDF5IO.write needs to call write_builder. It does that by calling super().write(**kwargs). This then gets us to HDMFIO.write, which calls write_builder.

Yes, but HDMFIO.write allows extra keyword arguments:

https://github.com/hdmf-dev/hdmf/blob/126bdb100c6d5ce3e2dadd375de9d32524219404/src/hdmf/backends/io.py#L80-L81

and those are being passed through to write_builder

https://github.com/hdmf-dev/hdmf/blob/126bdb100c6d5ce3e2dadd375de9d32524219404/src/hdmf/backends/io.py#L99

So you can add custom keyword arguments without having to add them in HDMFIO. HDF5IO already has several additional arguments on write and write_builder that are not in HDMFIO, such as the exhaust_dci parameter.
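A self-contained illustration of that pass-through pattern (these are toy classes, not hdmf's actual ones; enable_chunking is the name proposed in this thread):

```python
from hdmf.utils import docval, popargs

# The base write() accepts extra keyword arguments (allow_extra=True) and
# forwards whatever is left in kwargs to write_builder, so a backend-specific
# option never has to appear in the base class signature.
class BaseIO:
    @docval({'name': 'container', 'type': object, 'doc': 'the object to write'},
            allow_extra=True)
    def write(self, **kwargs):
        container = popargs('container', kwargs)
        self.write_builder(container, **kwargs)   # extras ride along untouched

    def write_builder(self, builder, **kwargs):
        raise NotImplementedError

class HDF5LikeIO(BaseIO):
    def write_builder(self, builder, enable_chunking=True, **kwargs):
        print(f"writing {builder!r} with enable_chunking={enable_chunking}")

HDF5LikeIO().write(container='my table', enable_chunking=False)
```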

mavaylon1 commented 2 months ago

HDF5IO.write needs to call write_builder. It does that by calling super().write(**kwargs). This then gets us to HDMFIO.write, which calls write_builder.

Yes, but HDMFIO.write allows extra keyword arguments:

https://github.com/hdmf-dev/hdmf/blob/126bdb100c6d5ce3e2dadd375de9d32524219404/src/hdmf/backends/io.py#L80-L81

and those are being passed through to write_builder

https://github.com/hdmf-dev/hdmf/blob/126bdb100c6d5ce3e2dadd375de9d32524219404/src/hdmf/backends/io.py#L99

So you can add custom keyword arguments without having to add them in HDMFIO. HDF5IO already has several additional arguments on write and write_builder that are not in HDMFIO, such as the exhaust_dci parameter.

Well isn't that just right in front of my face.

mavaylon1 commented 2 months ago

Notes: My approach to the tests:

  1. Make sure the default is now expandable. This is accomplished by adding an additional assert to an existing test. This is done for both regular and compound data (see the sketch after this list).
  2. Make sure the default does not override user-specified chunking. This is accomplished by existing tests passing.
  3. Make sure that when told "False", it turns off said chunking.

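A rough sketch of check 1 (illustrative names and file path, not the actual test code), assuming the new default applies whenever the user sets no maxshape:

```python
from hdmf.common import DynamicTable, VectorData, get_manager
from hdmf.backends.hdf5 import HDF5IO

# Write a table whose column has no user-specified maxshape, read it back,
# and confirm the column dataset is expandable along the row dimension.
table = DynamicTable(
    name='test_table',
    description='a test table',
    columns=[VectorData(name='col', description='a column', data=[1, 2, 3])],
)

with HDF5IO('test_expandable.h5', manager=get_manager(), mode='w') as io:
    io.write(table)

with HDF5IO('test_expandable.h5', manager=get_manager(), mode='r') as io:
    read_table = io.read()
    col = read_table.columns[0]          # data is an h5py.Dataset after reading
    assert col.data.maxshape == (None,)  # expandable by default
```
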
mavaylon1 commented 2 months ago

I added some minor suggestions, but otherwise this looks good to me.

Thanks for the quick review. I will make the docstring more detailed, but take a look at my comments for the other changes. The pass was deliberate (vs. a leftover from a draft), and I like the warning.

rly commented 2 months ago

Could you add documentation on how to expand a VectorData?

It looks like creation of a dataset of references is not modified here. Some tables in NWB contain columns that are all references, e.g., the electrode table has a column with references to the ElectrodeGroup. I think such datasets should be expandable as well.

mavaylon1 commented 2 months ago

Could you add documentation on how to expand a VectorData?

It looks like creation of a dataset of references is not modified here. Some tables in NWB contain columns that are all references, e.g., the electrode table has a column with references to the ElectrodeGroup. I think such datasets should be expandable as well.

Yeah, the lack of support for datasets of references was just about keeping the scope smaller for this idea. I agree it makes a lot of sense to have. I will make an issue ticket for it.

As for the documentation on expanding a VectorData, I thought we had that. Maybe I am thinking of the HDF5 documentation, but I will look. If it does not exist, I will loop it into the ticket for datasets of references.
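For reference, expanding such a column at the HDF5 level looks roughly like this once it has been written with an unlimited maxshape (the path and values are hypothetical and depend on where the table lives in the file):

```python
import h5py

# Resize the expandable dataset, then fill the new rows.
with h5py.File('test_expandable.h5', 'a') as f:
    col = f['test_table/col']   # hypothetical path; adjust to the actual column location
    n = col.shape[0]
    col.resize((n + 2,))
    col[n:] = [4, 5]
```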