rly opened this issue 1 year ago
Constraints

Constraints should only be supported within a data_type definition. I.e., do not support constraints on the global schema level. On a data_type_inc, constraints may modify the quantity only of the included data type itself. E.g., if TimeSeries is included, we can modify whether the TimeSeries is required, but we cannot change whether TimeSeries.data should be required.

Option 1
Support expressions in the required field, e.g., `required: if not image_mask and not voxel_mask`
Option 2
Define the constraint at the parent level, e.g.: `constraints: oneOf(pixel_mask, image_mask, voxel_mask)`
Option 3 (suggested by @rly below)
Define specific keys for defining constraints, e.g., anyOf, oneOf, and allOf.
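As a hedged sketch, Option 3 could look like the following in the YAML schema language (the key name `oneOf` and its placement on the parent group are assumptions, not settled syntax):

```yaml
# Option 3 sketch: a dedicated constraint key defined on the parent group.
# The key name (oneOf) and its placement are illustrative only.
groups:
- neurodata_type_def: PlaneSegmentation
  doc: ...
  datasets:
  - name: pixel_mask
    doc: ...
  - name: image_mask
    doc: ...
  - name: voxel_mask
    doc: ...
  oneOf:
  - pixel_mask
  - image_mask
  - voxel_mask
```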
Comment: Defining dedicated keys such as anyOf, oneOf, etc. keeps the constraints from cluttering the main definition. Related solutions:
JSON schema has the anyOf, oneOf, and allOf keys, each of which takes a JSON (sub)schema. Here is a (meta) example from the JSON schema that is used to validate the NWB schema: https://github.com/NeurodataWithoutBorders/nwb-schema/blob/dev/nwb.schema.json#L250
LinkML also has the any_of, exactly_one_of, and all_of slots, each of which takes a set of expressions.
> Related solutions:
Thanks @rly. That seems to be a variant on option 2, with the constraint being defined by the key. I'll update the options list above just so we have all the options in one place.
Looking at the options, I think there are two distinct design decisions to make: 1) whether the constraint is defined on the child itself or on the parent, and 2) whether constraints are expressed i) via set-operation keys (e.g., anyOf, oneOf, etc.) or ii) via conditional expressions, e.g., `if not (pixel_mask, voxel_mask)`.
In the extreme, where there are many keys, Option 1 is clunky. Each dataset needs to specify a unique constraint relative to the other datasets, adding or removing one key requires editing them all, and consistency is difficult to maintain. "one of {x, y, z}" is defined by a combination of three different specifications: "required: if not x and not y" (on z), "required: if not y and not z" (on x), and "required: if not x and not z" (on y). It is more difficult to understand what is going on.
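As a concrete sketch of that clunkiness (the conditional-expression syntax here is hypothetical):

```yaml
# Option 1 sketch: "one of {pixel_mask, image_mask, voxel_mask}" spread across
# three datasets. The `required: if ...` expression syntax is hypothetical.
datasets:
- name: pixel_mask
  required: if not image_mask and not voxel_mask
- name: image_mask
  required: if not pixel_mask and not voxel_mask
- name: voxel_mask
  required: if not pixel_mask and not image_mask
# Adding a fourth mask type would require editing all three expressions above.
```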
Options 2 and 3 are more intuitive to a naive reader in my opinion, and to people familiar with JSON schema, which is fairly popular. I prefer to defer to existing solutions if they suit our needs. Having "required" be a part of the child key never really made sense to me. Same as "quantity". These are fields that specify the relationship between the parent and the child. Rarely do we ever look at a child in isolation, and when we do, it does not make sense to define "required" on it, because it is a property of the parent-child relationship. For example, a standalone data type def should never be defined with a "required" field. As such, I think it is OK to have "oneOf" be part of the parent, adjacent to the child. I think we should consider moving "required" outside the child as well if we end up making this change...
> As such, I think it is OK to have "oneOf" be part of the parent, adjacent to the child. I think we should consider moving "required" outside the child as well if we end up making this change...
If I understand your suggestion correctly, you are proposing that the quantity (and, for attributes, required) properties be specified by the parent that "owns" the object rather than by the object itself, i.e., the quantity for a dataset would be specified by the group that contains the dataset, rather than by the dataset itself. Something like:
```yaml
groups:
- neurodata_type_def: DataSeriesExample
  neurodata_type_inc: TimeSeries
  doc: ...
  attributes:
  - name: data_gain
    dtype: float32
    doc: data gain.
  datasets:
  - name: field_xyz
    dtype: float32
    dims:
    - x|y|z
    shape:
    - 3
    doc: x,y,z in meters
  - name: field_xy
    dtype: float32
    dims:
    - x|y
    shape:
    - 2
    doc: x,y in meters
  quantities:
  - oneOf:
    - - field_xyz
      - field_xy
    - ...
  - required:
    - data_gain
  - optional:
    - ...
```
In principle, I think this is a reasonable approach. The main challenge with such an approach is compatibility, since it is a major change in the language, i.e., the spec API would need to continue to expose the DatasetSpec.quantity property as usual even if that property is now defined in the parent, and it would need to deal with cases of mixed schema, e.g., an extension using an older version of the schema language than the core NWB schema.

Yes, that's what I'm proposing, and yes, it would require new code in hdmf and matnwb to maintain compatibility. Would it require a major version bump though? I think that is necessary only if we remove support for the existing way of specifying required and quantity. It might be confusing not to remove support, but we have to support reading both anyway for a while. What do you think about simply adding this more powerful way of describing quantity constraints, and doing a minor version bump?
> What do you think about simply adding this more powerful way of describing quantity constraints, and doing a minor version bump?
I can see advantages / disadvantages to both.

Major Version (allow only the new way to define quantity)
- ➕ Avoids collision between definitions of quantity
- ➕ Provides a single approach for users to specify quantity and a clear path for deprecation/retirement of the current approach
- ➕/➖ Schema definitions will be automatically updated to the new version once folks export
- ➖ Possibly a breaking change in the spec API to not allow setting of quantity for a spec itself (only via the parent)?

Minor Version (allow both the current and new way to define quantity)
- ➕ Does not require changes for the user
- ➕ Allows definition of a default quantity (using the existing approach) while allowing quantity to be easily overwritten by the parent (using the new approach)
- ➖ Need to look up the definition of quantity in two places, and resolve conflicts between the local definition of quantity (using the current approach) and the definition of quantity in the parent (using the proposed approach). I.e., we need a clear rule for this. E.g., if the definition in the parent always takes precedence, then the "local" definition of quantity effectively becomes a default value, rather than the actual value.
- ➖ Possibly confusing for users which approach to use
- ➖ Effectively implies that both variants will be maintained in the language (i.e., no deprecation path)
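Under the minor-version path, the parent-takes-precedence rule might look like this (a hedged sketch; the `quantities` key mirrors the example earlier in the thread, and the precedence behavior is an assumption, not implemented semantics):

```yaml
# Minor-version sketch: the local quantity acts only as a default;
# a parent-level definition, if present, takes precedence.
datasets:
- name: field_xyz
  quantity: '?'      # local/default: optional
quantities:
- required:
  - field_xyz        # parent overrides: field_xyz is effectively required
```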
Anything else?
What do you mean by deprecation path?
I think even if we go the minor-version path, we can still have the spec API emit warnings that specifying quantity using the existing approach will be deprecated at the next major version. Of course, we would need to actually release a major version at some point.
The major version change only approach is certainly cleaner.
This comes up in NWB:
- starting_time and rate
- pixel_mask, image_mask, voxel_mask
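For the starting_time/rate case in TimeSeries, the grouped-list form from the example above could express "either timestamps, or starting_time together with rate" (a hedged sketch; the nested-list grouping convention is an assumption):

```yaml
# Hypothetical sketch for TimeSeries: one of {timestamps} or
# {starting_time, rate} must be present.
quantities:
- oneOf:
  - - timestamps
  - - starting_time
    - rate
```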