hdmf-dev / hdmf

The Hierarchical Data Modeling Framework
http://hdmf.readthedocs.io

[Bug]: When appending to file with newer schema, the older schema is not deleted #1098

Open rly opened 2 months ago

rly commented 2 months ago

What happened?

See https://github.com/flatironinstitute/neurosift/issues/157

Currently, a file may contain multiple versions of a schema, e.g., "core" 1.5.0 and "core" 1.6.0 after a user appends to a file that has an older version of a namespace cached than the one currently loaded by pynwb/hdmf. I don't see the value of that, given that individual neurodata type objects are attached to a namespace and not a namespace & version. It adds confusion.

I think, when writing the file and the schema is cached, the older version of a namespace should be deleted and the newer one added.

This raises the case where the existing file has a cached schema, the user chooses not to cache the new schema, but the data are written with the new schema. We should not allow that. Is there a use case where we don't want to cache the schema on write? Should we just disallow that to prevent edge cases like this during append/modification? I think so.

Related: https://github.com/NeurodataWithoutBorders/pynwb/issues/967, which mentions possible performance slowdowns, but those seem minor. I think the value of caching the schema trumps that, and we can optimize the writing of the schema separately - it is just a group with a few string datasets.
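For context, a minimal sketch of where that choice is exposed today: hdmf's HDF5IO.write (and pynwb's NWBHDF5IO.write) accept a cache_spec flag, which is the opt-out being questioned above (file name and workflow here are illustrative):

```python
from pynwb import NWBHDF5IO

# Append to an existing file; "existing.nwb" is an illustrative path.
with NWBHDF5IO("existing.nwb", mode="a") as io:
    nwbfile = io.read()
    # ... modify nwbfile here ...
    # cache_spec=True (the default) writes the loaded schema into the file's
    # /specifications group; cache_spec=False skips caching and is the opt-out
    # that this issue proposes disallowing.
    io.write(nwbfile, cache_spec=True)
```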

Steps to Reproduce

1. Create a file with an older version of a schema (cache=True by default)
2. Append to the file using the API that uses a newer version of the schema (cache=True by default)
3. Inspect the file (see the sketch below)
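A minimal sketch of step 3, assuming the cached specifications live under the /specifications group (where HDMF caches them) and using h5py directly; the file path is illustrative:

```python
import h5py

# "test.nwb" is an illustrative path to the file created/appended above.
with h5py.File("test.nwb", "r") as f:
    # HDMF caches each namespace under /specifications/<namespace>/<version>/...
    for namespace in f["/specifications"]:
        versions = list(f["/specifications"][namespace])
        print(namespace, versions)
    # After the append, "core" is expected to list both the old and the new
    # version, which is the duplication described above.
```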

Traceback

No response

Operating System

macOS

Python Executable

Conda

Python Version

3.12

Package Versions

No response

rly commented 2 months ago

@stephprince I assigned you this because you'll be working with namespaces and validation soon, and this is related.

oruebel commented 2 months ago

I don't see the value of that, given that individual neurodata type objects are attached to a namespace and not a namespace & version.

This assumes that the schemas (not just the APIs) are compatible. E.g., let's say you stored an OpticalSeries with version 2.1 in a file and then modified the definition of OpticalSeries in version 2.?. Erasing and replacing the namespace now implies that the older OpticalSeries follows 2.? when it actually follows 2.1. In particular for extensions, I think that is problematic behavior. The actual solution would be to also store the version of the namespace. HDMF/PyNWB does not support multiple versions of the same namespace, which is probably why this was omitted, but I think it would be worth considering supporting namespace <version> values as part of the namespace attribute.
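To make the suggestion concrete, a hypothetical sketch of what a versioned namespace attribute could look like on a typed group; the group path and the "core 2.1.0"-style value are illustrative, not an existing convention:

```python
import h5py

with h5py.File("test.nwb", "r") as f:
    grp = f["/acquisition/my_optical_series"]  # illustrative path to a typed group
    # Today, a typed group carries only the namespace name and the type name:
    print(grp.attrs["namespace"])       # e.g., "core"
    print(grp.attrs["neurodata_type"])  # e.g., "OpticalSeries"
    # The suggestion above would also record the namespace version, e.g., via a
    # combined value or a separate attribute (both hypothetical):
    # grp.attrs["namespace"] = "core 2.1.0"
```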

rly commented 2 months ago

Under that example, different objects of the NWB file would be valid only under different schemas. The file could fail validation (which validates all objects against one schema) against both the new and the old schema. Allowing this would open a big can of edge case worms. I would rather be more restrictive and say that saving an NWB file using the newest schema is allowed only if all objects are valid in the newest schema. The new schema may not be compatible with the old schema, though I can't think of a great example of that right now. We could validate before write to check this, which is something we were thinking of doing anyway.

oruebel commented 2 months ago

I would rather be more restrictive and say that saving an NWB file using the newest schema is allowed only if all objects are valid in the newest schema.

So that means validating before updating a file to the newer schema?

rly commented 2 months ago

If the file with an older schema is opened in pynwb/hdmf that has loaded a newer schema, probably we should validate the file against the newer schema in case there are weird discrepancies that are not accounted for. During the write process, we should validate the builders against the newer schema and not allow the user to proceed if something is invalid (this usually means the API is buggy because it allowed the user to create non-conforming builders).
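As a rough sketch of the file-level check, using pynwb's validate() utility; which namespaces (cached in the file vs. loaded by the installed pynwb) it validates against is controlled by options that have changed across versions, so treat this as illustrative only:

```python
from pynwb import NWBHDF5IO, validate

# "old_file.nwb" is an illustrative path to a file written with an older schema.
with NWBHDF5IO("old_file.nwb", mode="r") as io:
    errors = validate(io=io)  # returns a list of spec violations for the file's contents
    for err in errors:
        print(err)
```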

oruebel commented 2 months ago

@rly If I understand this correctly, the proposal is effectively to create a migration path to upgrade a file to a new schema.

rly commented 2 months ago

@oruebel Effectively, yes.

Here is a concrete use case: In NWB 2.1.0, the NWBFile datasets "experimenter" and "related_publications" changed from scalars to 1-D arrays. Now, PyNWB's object mapper reads a scalar "experimenter" value as a 1-D array into the NWBFile container. If a user opens a pre-2.1.0 file in current PyNWB in append mode and adds a TimeSeries, currently, I believe, only the TimeSeries is written; NWBFile.experimenter is not rewritten as a 1-D array. The current 2.7.0 schema would be appended to the file alongside the pre-2.1.0 schema. Technically, this data is invalid against the 2.7.0 schema unless we add an exception to the validator (I think we don't actually have that right now, but it hasn't come up...)

Before we can delete the pre-2.1.0 schema from the cache, we should migrate the no-longer-valid scalar "experimenter" value to a valid 1D array in memory and write that change to the file on write.

rly commented 2 months ago

We can use the modified flags on containers and builders to flag that a data type has been migrated, but I'm not sure that would replace a dataset. I also think we should not rewrite all datasets in the NWBFile. We'll need to test that out...
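For reference, a small sketch of the container-level flag being referred to, assuming hdmf's AbstractContainer.modified / set_modified(); as noted above, it is not clear that this alone would cause an existing dataset to be rewritten:

```python
from pynwb import NWBHDF5IO

with NWBHDF5IO("old_file.nwb", mode="a") as io:  # illustrative path
    nwbfile = io.read()
    # A hypothetical migration step would fix up, e.g., nwbfile.experimenter in memory here.
    # Containers expose a modified flag (hdmf AbstractContainer):
    print(nwbfile.modified)  # True once something has been changed in memory
    nwbfile.set_modified()   # explicitly mark the container as modified
    io.write(nwbfile)
```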

oruebel commented 2 months ago

Before we can delete the pre-2.1.0 schema from the cache, we should migrate the no-longer-valid scalar "experimenter" value to a valid 1D array in memory and write that change to the file on write.

This may not be trivial. Changing the data type, shape, etc. of datasets and attributes is not possible in HDF5. I think one would need to first remove the existing dataset (or attribute) and create a new one with the same name.
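A minimal sketch of that remove-and-recreate step with h5py, assuming the scalar value lives at /general/experimenter; the path and handling are illustrative, and in practice this would happen inside the HDMF write path rather than by hand:

```python
import h5py

with h5py.File("old_file.nwb", "r+") as f:
    path = "/general/experimenter"  # illustrative location of the scalar dataset
    old_value = f[path][()]         # read the existing scalar string
    if isinstance(old_value, bytes):
        old_value = old_value.decode("utf-8")
    del f[path]  # HDF5 cannot change the shape/dtype of a dataset in place
    f.create_dataset(path, data=[old_value], dtype=h5py.string_dtype())  # length-1 1-D array
```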