hdmf-dev / hdmf-schema-language

The specification language for HDMF
https://hdmf-schema-language.readthedocs.io

Support specification of at least one of {x, y, z} #17

Open rly opened 1 year ago

rly commented 1 year ago

This comes up in NWB:

oruebel commented 1 year ago

Constraints

Option 1

Support expressions in the required field, e.g., required: if not image_mask and not voxel_mask

Option 2

Define constraint at the parent level, e.g.: constraints: oneOf(pixel_mask, image_mask, voxel_mask)

Option 3 (suggested by @rly below)

Define specific keys for defining constraints, e.g., anyOf, oneOf, and allOf.
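For comparison, the three options might look like the following in schema YAML. None of this syntax exists in the language today; it is only a sketch of the proposals above, with all key names illustrative:

```yaml
# Option 1: expression inside the child's required field
datasets:
- name: pixel_mask
  required: if not image_mask and not voxel_mask

# Option 2: a single constraints expression on the parent
constraints: oneOf(pixel_mask, image_mask, voxel_mask)

# Option 3: dedicated constraint keys on the parent
constraints:
- oneOf:
  - pixel_mask
  - image_mask
  - voxel_mask
```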

Comment:

rly commented 1 year ago

Related solutions:

JSON schema has the anyOf, oneOf, and allOf keys, each of which takes a json (sub)schema. Here is a (meta) example from the JSON schema that is used to validate the NWB schema: https://github.com/NeurodataWithoutBorders/nwb-schema/blob/dev/nwb.schema.json#L250
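The semantics of the three JSON Schema combinators are easy to state precisely. A toy evaluator (not the jsonschema library, just an illustration of the semantics) might look like:

```python
def any_of(checks):
    # valid if at least one subschema check passes
    return lambda value: sum(1 for c in checks if c(value)) >= 1

def one_of(checks):
    # valid if exactly one subschema check passes
    return lambda value: sum(1 for c in checks if c(value)) == 1

def all_of(checks):
    # valid only if every subschema check passes
    return lambda value: all(c(value) for c in checks)

# Example: "exactly one of pixel_mask, image_mask, voxel_mask is present"
has = lambda key: (lambda obj: key in obj)
exactly_one_mask = one_of(
    [has("pixel_mask"), has("image_mask"), has("voxel_mask")]
)

print(exactly_one_mask({"pixel_mask": [1, 2]}))                      # True
print(exactly_one_mask({"pixel_mask": [1], "image_mask": [2]}))      # False
```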

LinkML also has the any_of, exactly_one_of, and all_of slots, each of which takes a set of expressions

oruebel commented 1 year ago

Related solutions:

Thanks @rly. That seems to be a variant of Option 2, with the constraint being defined by the key. I'll update the options list above so we have all the options in one place.

oruebel commented 1 year ago

Looking at the options, I think there are two distinct design decisions to make:

  1. Where should the required constraints be specified: i) as part of the required key of each object, or ii) outside, as part of the parent object that contains the relevant object?
  2. How should the constraints be expressed: i) via keys acting as functions (e.g., anyOf, oneOf, etc.), or ii) via conditional expressions, e.g., if not (pixel_mask, voxel_mask)?
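To make the second decision concrete, the same rule ("pixel_mask is required unless image_mask or voxel_mask is present") can be checked either way. A toy sketch, with all function names and APIs hypothetical:

```python
# Style (i): a key acting as a function on the parent, e.g. anyOf
def check_any_of(group: dict, names: list) -> bool:
    """True if at least one of the named children is present in the group."""
    return any(name in group for name in names)

# Style (ii): a conditional `required` expression on the child, modeled here
# as a predicate over the set of sibling names that are present
def check_conditional(group: dict, child: str, condition) -> bool:
    required = condition(set(group))
    return (child in group) or not required

group = {"voxel_mask": [1, 2, 3]}

print(check_any_of(group, ["pixel_mask", "image_mask", "voxel_mask"]))  # True

rule = lambda present: ("image_mask" not in present
                        and "voxel_mask" not in present)
print(check_conditional(group, "pixel_mask", rule))  # True: not required here
```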
rly commented 1 year ago

In the extreme, where there are many keys, Option 1 is clunky. Each dataset needs to specify a unique constraint relative to the other datasets, so adding or removing one key requires editing them all, and consistency is difficult to maintain. For example, "one of {x, y, z}" is defined by a combination of three different specifications: "required: if not x and not y", "required: if not y and not z", and "required: if not x and not z". It is harder to understand what is going on.

Options 2 and 3 are more intuitive to a naive reader, in my opinion, and to people familiar with JSON Schema, which is fairly popular. I prefer to defer to existing solutions if they suit our needs. Having "required" be part of the key never really made sense to me; same with "quantity". These are fields that specify the relationship between the parent and the child. We rarely look at a child in isolation, and when we do, it does not make sense to define "required" on it, because it is a property of the parent-child relationship. For example, a standalone data type def should never be defined with a "required" field. As such, I think it is OK to have "oneOf" be part of the parent, adjacent to the child. I think we should consider moving "required" outside the child as well if we end up making this change...

oruebel commented 1 year ago

As such, I think it is OK to have "oneOf" be part of the parent, adjacent to the child. I think we should consider moving "required" outside the child as well if we end up making this change...

If I understand your suggestion correctly, you are proposing that the quantity (and, for attributes, the required) properties be specified by the parent that "owns" the object, rather than by the object itself. I.e., the quantity for a dataset would be specified by the group that contains the dataset, rather than by the dataset itself. Something like:

groups:
- neurodata_type_def: DataSeriesExample
  neurodata_type_inc: TimeSeries
  doc: ...
  attributes:
  - name: data_gain
    dtype: float32
    doc: data gain.
  datasets:
  - name: field_xyz
    dtype: float32
    dims:
    - x|y|z
    shape:
    - 3
    doc: x,y,z in meters
  - name: field_xy
    dtype: float32
    dims:
    - x|y
    shape:
    - 3
    doc: x,y in meters
  quantities:
  - oneOf:
    - - field_xyz
      - field_xy
    - ...
  - required:
    - data_gain
  - optional:
    - ...
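A validator for such a parent-level quantities block could be quite small. A toy sketch of the checking logic (nothing here is implemented in hdmf; the spec format follows the hypothetical YAML above):

```python
def validate_quantities(quantities: list, present: set) -> list:
    """Return a list of violation messages for the children in `present`.

    `quantities` is a list of single-key dicts as in the sketch above:
    oneOf -> list of alternative sets, required -> list of names,
    optional -> imposes no constraint.
    """
    errors = []
    for entry in quantities:
        for key, names in entry.items():
            if key == "oneOf":
                # exactly one name from each set of alternatives must exist
                for alternatives in names:
                    found = [n for n in alternatives if n in present]
                    if len(found) != 1:
                        errors.append(
                            f"expected exactly one of {alternatives}, "
                            f"found {found}"
                        )
            elif key == "required":
                for name in names:
                    if name not in present:
                        errors.append(f"missing required child: {name}")
    return errors

spec = [
    {"oneOf": [["field_xyz", "field_xy"]]},
    {"required": ["data_gain"]},
]
print(validate_quantities(spec, {"field_xyz", "data_gain"}))  # []
print(validate_quantities(spec, {"field_xyz", "field_xy"}))   # two violations
```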

In principle, I think this is a reasonable approach. The main challenge with such an approach is compatibility, since it is a major change in the language, i.e.:

rly commented 1 year ago

Yes, that's what I'm proposing, and yes, it would require new code in hdmf and matnwb to maintain compatibility. Would it require a major version bump though? I think that is necessary only if we remove support for the existing way of required and quantity. It might be confusing not to remove support, but we have to support reading both anyway for a while. What do you think about simply adding this more powerful way of describing quantity constraints, and doing a minor version bump?

oruebel commented 1 year ago

simply adding this more powerful way of describing quantity constraints, and doing a minor version bump?

I can see advantages and disadvantages to both:

Major Version (allow only the new way to define quantity)

➕ Avoids collision between definitions of quantity
➕ Provides a single approach for users to specify quantity and a clear path for deprecation/retirement of the current approach
➕/➖ Schema definitions will be automatically updated to the new version once folks export
➖ Possibly a breaking change in the spec API to not allow setting of quantity for a spec itself (only via the parent)?

Minor Version (allow both the current and new way to define quantity)

➕ Does not require changes for the user
➕ Allows definition of a default quantity (using the existing approach) while allowing quantity to be easily overridden by the parent (using the new approach)
➖ Need to look up the definition of quantity in two places and resolve conflicts between the local definition of quantity (using the current approach) and the definition of quantity in the parent (using the proposed approach). I.e., we need a clear rule for this: e.g., if the definition in the parent always takes precedence, then the "local" definition of quantity effectively becomes a default value rather than the actual value.
➖ Possibly confusing for users which approach to use
➖ Effectively implies that both variants will be maintained in the language (i.e., no deprecation path)
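The conflict-resolution rule for the minor-version path could be as simple as: the local quantity is a default, overridden whenever the parent says anything. A sketch of that rule (hypothetical function, not part of the hdmf spec API; it assumes the language's default quantity of 1, i.e., required, exactly one):

```python
def resolve_quantity(local_quantity, parent_quantity):
    """Resolve the effective quantity of a child spec.

    `local_quantity` is what the child declares (current approach); it acts
    only as a default. `parent_quantity` is what the parent declares for this
    child (proposed approach); when present, it always takes precedence.
    """
    if parent_quantity is not None:
        return parent_quantity  # parent always wins
    if local_quantity is not None:
        return local_quantity   # local value acts as a default
    return 1                    # language default: required, exactly one

print(resolve_quantity("?", None))            # ? (local default applies)
print(resolve_quantity("?", "zero_or_many"))  # zero_or_many (parent wins)
print(resolve_quantity(None, None))           # 1
```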

Anything else?

rly commented 1 year ago

What do you mean by deprecation path?

I think even if we go the minor version path, we can still provide warnings in the spec API saying that specifying quantity using the existing approach is deprecated and will be removed at the next major version. Of course, we need to actually release a major version at some point.
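Such a warning could use Python's standard warnings machinery. A hypothetical sketch (the function name and message are made up, not hdmf API):

```python
import warnings

def warn_local_quantity(spec_name: str) -> None:
    """Warn that a locally specified quantity is deprecated (hypothetical)."""
    warnings.warn(
        f"Spec '{spec_name}' sets 'quantity' on the child itself; this is "
        "deprecated and will be removed in the next major version of the "
        "schema language. Specify quantity constraints on the parent instead.",
        DeprecationWarning,
        stacklevel=2,
    )

# demonstrate that the warning is raised with the expected category
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warn_local_quantity("field_xyz")

print(caught[0].category.__name__)  # DeprecationWarning
```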

The major version change only approach is certainly cleaner.