hdmf-dev / hdmf-schema-language

The specification language for HDMF
https://hdmf-schema-language.readthedocs.io
Other
0 stars 2 forks source link

Clarify whether fields can be added to groups/dsets with only `data_type_inc` #13

Closed rly closed 4 months ago

rly commented 3 years ago

Raised in https://github.com/hdmf-dev/hdmf/pull/542 in the context of how extra fields are treated.

It is not clear in the documentation whether the schema language allows new fields for groups/datasets defined with only a data_type_inc. We should clarify this in the documentation.

It is my understanding that it is allowed (though perhaps not recommended) for fields to be added to groups/datasets defined with only a data_type_inc.

For example: a new group may contain a dataset with:

- data_type_def: MyTable
  datasets:
  - name: new_column
    data_type_inc: VectorData
    attributes:
    - name: new_attr
      dtype: text
      doc: a new attribute

A concrete example is in the Units table in NWB: https://github.com/NeurodataWithoutBorders/nwb-schema/blob/dev/core/nwb.misc.yaml#L187-L199

The resolution attribute is added to the base VectorData definition.

Or a new group may be defined as:

- data_type_def: MyContainer
  groups:
  - name: new_group
    data_type_inc: Container
    groups:
    - name: new_subgroup
      data_type_inc: Container
      doc: a new subgroup
    # also add new datasets, new links, and new attributes

A related but not fully matching example is the 'electrodes' table in the NWBFile: https://github.com/NeurodataWithoutBorders/nwb-schema/blob/dev/core/nwb.file.yaml#L269

The 'electrodes' table lacks a data_type_def but specifies many named VectorData types. However, this differs from the above because DynamicTable explicitly allows quantity: * of data_type_inc: VectorData without a name specified.

Note that a group/dataset defined with only a data_type_inc may have a different doc, quantity, dims, shape, and sometimes even dtype than in the data_type_def definition, as long as these are compatible with the included spec (e.g., a square is a special case of a rectangle). This should also be clarified in the documentation and fully supported in HDMF.

If adding fields to types with only data_type_inc is not allowed, then the NWB schema should be amended. If the above is allowed, then the HDMF validator should be amended to validate against the spec with any additions to the included type.

oruebel commented 3 years ago

Effectively, this is the question of "monkey-patching", i.e., we have an existing type and add properties without changing the class type. If we ignore backward compatibility, then my vote would be to require that we create a new data_type_def in those cases, if only to make this case easier to handle. However, I assume this would mess with a bunch of existing schema, as such I think it may be best to: 1) Explicitly discourage this in the schema docs, 2) Add a test on the ndx-catalog that checks for this and raises a warning to encourage better schema, 2) continue support as is in the API.

rly commented 3 years ago

Yes, removing those cases would mess with existing schema, so I think that plan of action makes sense. And even if we remove those cases from existing schema, I think we would have to continue to support them in order to be able to read and validate files with older schema.

To be clear, based on this, in the example where the Units table has a VectorData column for spike_times and we want to add an attribute to that column, we should create a new data type SpikeTimesColumn that inherits from VectorData and adds the attribute, and use that in the Units table instead, right?

What about the cases when fields are not added but properties of a data type are modified (restricted), e.g., VectorData allows any dtype and shape, but a new data type includes a VectorData and restricts dtype to int and shape to (None, )? To be clear, these cases should not be discouraged, right?

oruebel commented 3 years ago

Since we need to support this case anyways, I think this is ultimately then more like a best practice. I think the best practice should be:

rly commented 1 year ago

TODO: Add warnings in HDMF extensions API when adding new elements to a group/dataset with only data_type_inc.

rly commented 4 months ago

Discussion should be continued in #14