hdmf-dev / hdmf-schema-language

The specification language for HDMF
https://hdmf-schema-language.readthedocs.io

Support specification of at least one of {x, y, z} #17

Open rly opened 1 year ago

rly commented 1 year ago

This comes up in NWB:

oruebel commented 1 year ago

Constraints

Option 1

Support expressions in the required field, e.g., required: if not image_mask and not voxel_mask

Option 2

Define constraint at the parent level, e.g.: constraints: oneOf(pixel_mask, image_mask, voxel_mask)

Option 3 (suggested by @rly below)

Define specific keys for defining constraints, e.g., anyOf, oneOf, and allOf.
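For comparison, the three options might look like the following in schema YAML. None of this syntax exists in the language today; it is only a sketch of the proposals above, with all key names illustrative:

```yaml
# Option 1: expression inside the child's required field
datasets:
- name: pixel_mask
  required: if not image_mask and not voxel_mask

# Option 2: a single constraints expression on the parent
constraints: oneOf(pixel_mask, image_mask, voxel_mask)

# Option 3: dedicated constraint keys on the parent
constraints:
- oneOf:
  - pixel_mask
  - image_mask
  - voxel_mask
```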

Comment:

rly commented 1 year ago

Related solutions:

JSON schema has the anyOf, oneOf, and allOf keys, each of which takes a json (sub)schema. Here is a (meta) example from the JSON schema that is used to validate the NWB schema: https://github.com/NeurodataWithoutBorders/nwb-schema/blob/dev/nwb.schema.json#L250
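The semantics of the three JSON Schema combinators are easy to state precisely. A toy evaluator (not the jsonschema library, just an illustration of the semantics) might look like:

```python
def any_of(checks):
    # valid if at least one subschema check passes
    return lambda value: sum(1 for c in checks if c(value)) >= 1

def one_of(checks):
    # valid if exactly one subschema check passes
    return lambda value: sum(1 for c in checks if c(value)) == 1

def all_of(checks):
    # valid only if every subschema check passes
    return lambda value: all(c(value) for c in checks)

# Example: "exactly one of pixel_mask, image_mask, voxel_mask is present"
has = lambda key: (lambda obj: key in obj)
exactly_one_mask = one_of(
    [has("pixel_mask"), has("image_mask"), has("voxel_mask")]
)

print(exactly_one_mask({"pixel_mask": [1, 2]}))                      # True
print(exactly_one_mask({"pixel_mask": [1], "image_mask": [2]}))      # False
```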

LinkML also has the any_of, exactly_one_of, and all_of slots, each of which takes a set of expressions

oruebel commented 1 year ago

Related solutions:

Thanks @rly. That seems to be a variant of Option 2, with the constraint being defined by the key. I'll update the options list above so we have all the options in one place.

oruebel commented 1 year ago

Looking at the options, I think there are two distinct design decisions to make:

  1. Where should the required constraints be specified: i) as part of the required key of each object, or ii) outside, as part of the parent object that contains the relevant object?
  2. How should the constraints be expressed: i) via keys acting as functions (e.g., anyOf, oneOf, etc.), or ii) via conditional expressions, e.g., if not (pixel_mask, voxel_mask)?
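To make the second decision concrete, the same rule ("pixel_mask is required unless image_mask or voxel_mask is present") can be checked either way. A toy sketch, with all function names and APIs hypothetical:

```python
# Style (i): a key acting as a function on the parent, e.g. anyOf
def check_any_of(group: dict, names: list) -> bool:
    """True if at least one of the named children is present in the group."""
    return any(name in group for name in names)

# Style (ii): a conditional `required` expression on the child, modeled here
# as a predicate over the set of sibling names that are present
def check_conditional(group: dict, child: str, condition) -> bool:
    required = condition(set(group))
    return (child in group) or not required

group = {"voxel_mask": [1, 2, 3]}

print(check_any_of(group, ["pixel_mask", "image_mask", "voxel_mask"]))  # True

rule = lambda present: ("image_mask" not in present
                        and "voxel_mask" not in present)
print(check_conditional(group, "pixel_mask", rule))  # True: not required here
```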
rly commented 1 year ago

In the extreme, where there are many keys, Option 1 is clunky. Each dataset needs to specify a unique constraint relative to the other datasets, so adding or removing one key requires editing them all, and consistency is difficult to maintain. For example, "one of {x, y, z}" is defined by a combination of three different specifications: "required: if not x and not y", "required: if not y and not z", and "required: if not x and not z". It is harder to understand what is going on.

Options 2 and 3 are more intuitive to a naive reader, in my opinion, and to people familiar with JSON Schema, which is fairly popular. I prefer to defer to existing solutions if they suit our needs. Having "required" be part of the key never really made sense to me; same with "quantity". These are fields that specify the relationship between the parent and the child. We rarely look at a child in isolation, and when we do, it does not make sense to define "required" on it, because it is a property of the parent-child relationship. For example, a standalone data type def should never be defined with a "required" field. As such, I think it is OK to have "oneOf" be part of the parent, adjacent to the child. I think we should consider moving "required" outside the child as well if we end up making this change...

oruebel commented 1 year ago

As such, I think it is OK to have "oneOf" be part of the parent, adjacent to the child. I think we should consider moving "required" outside the child as well if we end up making this change...

If I understand your suggestion correctly, you are proposing that the quantity (and, for attributes, the required) properties be specified by the parent that "owns" the object, rather than by the object itself. I.e., the quantity for a dataset would be specified by the group that contains the dataset, rather than by the dataset itself. Something like:

groups:
- neurodata_type_def: DataSeriesExample
  neurodata_type_inc: TimeSeries
  doc: ...
  attributes:
  - name: data_gain
    dtype: float32
    doc: data gain.
  datasets:
  - name: field_xyz
    dtype: float32
    dims:
    - x|y|z
    shape:
    - 3
    doc: x,y,z in meters
  - name: field_xy
    dtype: float32
    dims:
    - x|y
    shape:
    - 3
    doc: x,y in meters
  quantities:
  - oneOf:
    - - field_xyz
      - field_xy
    - ...
  - required:
    - data_gain
  - optional:
    - ...
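A validator for such a parent-level quantities block could be quite small. A toy sketch of the checking logic (nothing here is implemented in hdmf; the spec format follows the hypothetical YAML above):

```python
def validate_quantities(quantities: list, present: set) -> list:
    """Return a list of violation messages for the children in `present`.

    `quantities` is a list of single-key dicts as in the sketch above:
    oneOf -> list of alternative sets, required -> list of names,
    optional -> imposes no constraint.
    """
    errors = []
    for entry in quantities:
        for key, names in entry.items():
            if key == "oneOf":
                # exactly one name from each set of alternatives must exist
                for alternatives in names:
                    found = [n for n in alternatives if n in present]
                    if len(found) != 1:
                        errors.append(
                            f"expected exactly one of {alternatives}, "
                            f"found {found}"
                        )
            elif key == "required":
                for name in names:
                    if name not in present:
                        errors.append(f"missing required child: {name}")
    return errors

spec = [
    {"oneOf": [["field_xyz", "field_xy"]]},
    {"required": ["data_gain"]},
]
print(validate_quantities(spec, {"field_xyz", "data_gain"}))  # []
print(validate_quantities(spec, {"field_xyz", "field_xy"}))   # two violations
```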

In principle, I think this is a reasonable approach. The main challenge with such an approach is compatibility, since it is a major change in the language, i.e.:

rly commented 1 year ago

Yes, that's what I'm proposing, and yes, it would require new code in hdmf and matnwb to maintain compatibility. Would it require a major version bump though? I think that is necessary only if we remove support for the existing way of required and quantity. It might be confusing not to remove support, but we have to support reading both anyway for a while. What do you think about simply adding this more powerful way of describing quantity constraints, and doing a minor version bump?

oruebel commented 1 year ago

simply adding this more powerful way of describing quantity constraints, and doing a minor version bump?

I can see advantages and disadvantages to both:

Major Version (allow only the new way to define quantity)

➕ Avoids collision between definitions of quantity
➕ Provides a single approach for users to specify quantity and a clear path for deprecation/retirement of the current approach
➕/➖ Schema definitions will be automatically updated to the new version once folks export
➖ Possibly a breaking change in the spec API to not allow setting of quantity for a spec itself (only via the parent)?

Minor Version (allow both the current and new way to define quantity)

➕ Does not require changes for the user
➕ Allows definition of a default quantity (using the existing approach) while allowing quantity to be easily overridden by the parent (using the new approach)
➖ Need to look up the definition of quantity in two places and resolve conflicts between the local definition of quantity (using the current approach) and the definition of quantity in the parent (using the proposed approach). I.e., we need a clear rule for this: e.g., if the definition in the parent always takes precedence, then the "local" definition of quantity effectively becomes a default value rather than the actual value.
➖ Possibly confusing for users which approach to use
➖ Effectively implies that both variants will be maintained in the language (i.e., no deprecation path)
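The conflict-resolution rule for the minor-version path could be as simple as: the local quantity is a default, overridden whenever the parent says anything. A sketch of that rule (hypothetical function, not part of the hdmf spec API; it assumes the language's default quantity of 1, i.e., required, exactly one):

```python
def resolve_quantity(local_quantity, parent_quantity):
    """Resolve the effective quantity of a child spec.

    `local_quantity` is what the child declares (current approach); it acts
    only as a default. `parent_quantity` is what the parent declares for this
    child (proposed approach); when present, it always takes precedence.
    """
    if parent_quantity is not None:
        return parent_quantity  # parent always wins
    if local_quantity is not None:
        return local_quantity   # local value acts as a default
    return 1                    # language default: required, exactly one

print(resolve_quantity("?", None))            # ? (local default applies)
print(resolve_quantity("?", "zero_or_many"))  # zero_or_many (parent wins)
print(resolve_quantity(None, None))           # 1
```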

Anything else?

rly commented 1 year ago

What do you mean by deprecation path?

I think even if we go the minor version path, we can still provide warnings in the spec API saying that specifying quantity using the existing approach is deprecated and will be removed at the next major version. Of course, we need to actually release a major version at some point.
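Such a warning could use Python's standard warnings machinery. A hypothetical sketch (the function name and message are made up, not hdmf API):

```python
import warnings

def warn_local_quantity(spec_name: str) -> None:
    """Warn that a locally specified quantity is deprecated (hypothetical)."""
    warnings.warn(
        f"Spec '{spec_name}' sets 'quantity' on the child itself; this is "
        "deprecated and will be removed in the next major version of the "
        "schema language. Specify quantity constraints on the parent instead.",
        DeprecationWarning,
        stacklevel=2,
    )

# demonstrate that the warning is raised with the expected category
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warn_local_quantity("field_xyz")

print(caught[0].category.__name__)  # DeprecationWarning
```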

The major version change only approach is certainly cleaner.