hdmf-dev / hdmf

The Hierarchical Data Modeling Framework
http://hdmf.readthedocs.io
Other
47 stars 26 forks source link

[Feature]: Create NWBDatasetIncSpec and NWBGroupIncSpec #884

Open rly opened 1 year ago

rly commented 1 year ago

What would you like to see added to HDMF?

When building extensions, users are often confused over how to define groups and datasets within a parent group. For example, one could define:

Parcellations = NWBGroupSpec(
        default_name='parcellations',
        neurodata_type_def='Parcellations',
        neurodata_type_inc='NWBDataInterface',
        doc='parcellations of this surface',
        datasets=[
            NWBDatasetSpec(
                neurodata_type_def='Parcellation',
                doc='a parcellation of the surface',
                quantity='+'
            )
        ]
    )

While technically legal, this nested type definition is difficult to parse. This code also hides the fact that the Parcellation is now defined and can be reused in other types. As a result, this goes against our schema best practices https://github.com/hdmf-dev/hdmf-schema-language/issues/14

This is the preferred code:

Parcellation = NWBDatasetSpec(
        neurodata_type_def='Parcellation',
        doc='a parcellation of the surface',
    )

Parcellations = NWBGroupSpec(
        default_name='parcellations',
        neurodata_type_def='Parcellations',
        neurodata_type_inc='NWBDataInterface',
        doc='parcellations of this surface',
        datasets=[
            NWBDatasetSpec(
                neurodata_type_inc='Parcellation',
                doc='a parcellation of this surface',
                quantity='+'
            )
        ]
    )

However, NWBDatasetSpec serves two different functions here. The first defines the new type. The second says only instances of this type can be used and at least one must be provided. In the first usage where the type is defined, quantity makes no sense, and name really shouldn't be set. In the second usage, neurodata_type_def should not be used, and default_name doesn't really make sense. Same thing for groups.

This is explained in the language, but honestly, I think this is confusing and the third row of the table is distinctly different from the other three and should be separated out as I describe below. https://hdmf-schema-language.readthedocs.io/en/latest/description.html#data-type-def-and-data-type-inc

Is your feature request related to a problem?

No response

What solution would you like?

To make this distinction easier for users, I suggest adding new classes NWBDatasetIncSpec and NWBGroupIncSpec that are the only types allowed in a NWBGroupSpec definition under groups and datasets. neurodata_type_def is not allowed here. I would also change the keyword to just neurodata_type. We would also need to update the tutorials.

We can also warn when quantity and name are used in a NWBDatasetSpec that is not a NWBDatasetIncSpec (we cannot restrict it because that would break backwards compatibility).

I suggest we also add these new spec types to the hdmf-schema-language.

Do you have any interest in helping implement the feature?

Yes.

Code of Conduct

oruebel commented 1 year ago

I suggest adding new classes NWBDatasetIncSpec and NWBGroupIncSpec

That's an interesting proposal. I think from an interface perspective this could work. I think the name IncSpec is not very intuitive, maybe IncludeDatasetSpec? How does this affect definition of datasets/groups that do not have a neurodata_type_def but only have a fixed name? Would those also be defined outsite and then included?

rly commented 1 year ago

I like IncludeDatasetSpec or maybe InnerDatasetSpec. Good point. I forgot that use case. No, those should be allowed in this new spec type.

oruebel commented 1 year ago

I forgot that use case. No, those should be allowed in this new spec type.

Just to clarify. The primary difference then is that neurodata_type_def is not allowed in IncludeDatasetSpec. In some sense, this is the difference then between defining a class and instantiating an instance of it (only the term instantiating a type would be confusing here).

For name, I think describing the "operation" (e.g., include) rather than location (e.g., inner) seems more intuitive to me. The name IncludeDatasetSpec is not bad, but might be confused with the include term in a program language sense of an include in C/C++ (which may be confusing if the class allows creating datasets specs with fixed name and no type). How about maybe InsertDatasetSpec?

rly commented 10 months ago

Just noting another instance where this IncludeDatasetSpec (or whatever we want to call it) class, automatic checks for nested type definitions, and appropriate documentation around these would have prevented copied and nested spec definitions: https://github.com/nwb-extensions/staged-extensions/pull/44

I will bump this issue up in priority because we will only see more of these instances occur.