hdmf-dev / hdmf

The Hierarchical Data Modeling Framework
http://hdmf.readthedocs.io

nested type definitions #73

Open bendichter opened 5 years ago

bendichter commented 5 years ago

There are two ways to define a new Container type in the spec language. You can either define a Container type in the /groups of the schema with a _def field (and optionally an _inc) and then use it in another place with an _inc field (and no _def field), or you can define it and use it at the same time by including _def and _inc fields in a nested spot of the schema. Most types are defined in /groups and /datasets and included separately, but there are cases in the NWB:N core schema where types are defined as they are used, e.g.

https://github.com/NeurodataWithoutBorders/pynwb/blob/a081b39803f0ffb535b2198a2fdd81007ffa442e/src/pynwb/data/nwb.file.yaml#L234-L243
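For concreteness, the two styles look roughly like this in the spec language (the `MyType`/`MyHolder` names are purely illustrative, not from any real schema):

```yaml
groups:
# Style 1: define the type at the top level of /groups, include it elsewhere by reference
- data_type_def: MyType            # definition only
  doc: a standalone type definition
- data_type_def: MyHolder
  doc: a holder that includes MyType by reference
  groups:
  - data_type_inc: MyType          # include only; no new type is created here
    doc: an included MyType

# Style 2: define the type inline, at the point of use (_def and _inc together)
- data_type_def: MyHolder2
  doc: a holder that defines a new type inline
  groups:
  - data_type_def: MyType2         # nested definition ...
    data_type_inc: MyType          # ... that also extends MyType
    doc: a type defined where it is used
```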

As far as I can tell, this capability is purely for convenience and offers no additional features or expressive power for the defined types. Currently our get_class (i.e. type_map.get_container_cls in hdmf) function only works at the surface level (in '/groups') and will not register nested Container definitions. As a result, there are schemas that follow the schema language rules, but for which the API auto-generation will not work. I see a few possible solutions.

Here is a minimal example. The last 4 commands are the critical piece.

from hdmf.build import BuildManager, TypeMap, ObjectMapper
from hdmf.spec.namespace import NamespaceCatalog, SpecNamespace
from hdmf.spec.spec import AttributeSpec, DatasetSpec, GroupSpec, ZERO_OR_MANY, ONE_OR_MANY
from hdmf.spec.catalog import SpecCatalog

# FooFile is used below when registering container types, so it is imported here as well
from tests.unit.test_utils import Foo, FooBucket, FooFile, CORE_NAMESPACE

foo_spec = GroupSpec('A test group specification with a data type',
                     data_type_def='Foo',
                     datasets=[DatasetSpec('an example dataset',
                                           'int',
                                           name='my_data',
                                           attributes=[AttributeSpec('attr2',
                                                                     'an example integer attribute',
                                                                     'int')])],
                     attributes=[AttributeSpec('attr1', 'an example string attribute', 'text')])

tmp_spec = GroupSpec('A subgroup for Foos',
                     name='foo_holder',
                     groups=[GroupSpec('the Foos in this bucket', data_type_inc='Foo', quantity=ZERO_OR_MANY)])

bucket_spec = GroupSpec('A test group specification for a data type containing data type',
                        data_type_def='FooBucket',
                        groups=[tmp_spec])

class BucketMapper(ObjectMapper):
    def __init__(self, spec):
        super(BucketMapper, self).__init__(spec)
        foo_spec = spec.get_group('foo_holder').get_data_type('Foo')
        self.map_spec('foos', foo_spec)

file_spec = GroupSpec("A file of Foos contained in FooBuckets",
                      name='root',
                      data_type_def='FooFile',
                      groups=[GroupSpec('Holds the FooBuckets',
                                        name='buckets',
                                        groups=[GroupSpec("One or more FooBuckets",
                                                          data_type_inc='FooBucket',
                                                          quantity=ONE_OR_MANY)])])

class FileMapper(ObjectMapper):
    def __init__(self, spec):
        super(FileMapper, self).__init__(spec)
        bucket_spec = spec.get_group('buckets').get_data_type('FooBucket')
        self.map_spec('buckets', bucket_spec)

spec_catalog = SpecCatalog()
spec_catalog.register_spec(foo_spec, 'test.yaml')
spec_catalog.register_spec(bucket_spec, 'test.yaml')
spec_catalog.register_spec(file_spec, 'test.yaml')
namespace = SpecNamespace(
    'a test namespace',
    CORE_NAMESPACE,
    [{'source': 'test.yaml'}],
    catalog=spec_catalog)
namespace_catalog = NamespaceCatalog()
namespace_catalog.add_namespace(CORE_NAMESPACE, namespace)
type_map = TypeMap(namespace_catalog)

type_map.register_container_type(CORE_NAMESPACE, 'Foo', Foo)
type_map.register_container_type(CORE_NAMESPACE, 'FooBucket', FooBucket)
type_map.register_container_type(CORE_NAMESPACE, 'FooFile', FooFile)

type_map.register_map(FooBucket, BucketMapper)
type_map.register_map(FooFile, FileMapper)

manager = BuildManager(type_map)

spec_catalog = manager.namespace_catalog.get_namespace(CORE_NAMESPACE).catalog
foo_spec = spec_catalog.get_spec('Foo')
# Baz1 class contains an object of Baz2 class
baz_spec2 = GroupSpec('A composition inside',
                      data_type_def='Baz2',
                      data_type_inc=foo_spec,
                      attributes=[
                          AttributeSpec('attr3', 'an example float attribute', 'float'),
                          AttributeSpec('attr4', 'another example float attribute', 'float')])

baz_spec1 = GroupSpec('A composition test outside',
                      data_type_def='Baz1',
                      data_type_inc=foo_spec,
                      attributes=[AttributeSpec('attr3', 'an example float attribute', 'float'),
                                  AttributeSpec('attr4', 'another example float attribute', 'float')],
                      groups=[baz_spec2])

# add directly into the existing spec_catalog. would not do this normally.
spec_catalog.register_spec(baz_spec1, 'test.yaml')

Baz2 = manager.type_map.get_container_cls(CORE_NAMESPACE, 'Baz2')

1) Inform users that if they want to auto-generate the API, all type definitions must be in /groups and /datasets. This is a limitation only in style and not in function. The biggest downside here is that the rules used for writing extensions will be slightly more strict than those for the core (again, only stylistically).

2) You could see this freedom of the schema language as a bug, since it provides multiple right ways to do something without offering any benefit. We could change the schema language rules so that types can only be defined at the surface level and change the schema to match. This should not cause any compatibility issues for the core, but would raise some eyebrows. If we are going to do this we should do it now, before community extensions start to accumulate.

3) We could build logic to search the tree for new types.
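A rough sketch of option 3, operating on the raw dict form of a spec (hdmf's `GroupSpec` is dict-like, but this treats specs as plain dicts for illustration; the function name and keys shown are assumptions, not existing hdmf API):

```python
def find_nested_type_defs(spec, type_key='data_type_def'):
    """Recursively collect every type definition in a spec dict,
    including ones nested inside groups and datasets."""
    found = []
    if isinstance(spec, dict):
        if type_key in spec:
            found.append(spec[type_key])
        # descend into any sub-specs that may themselves define new types
        for sub_key in ('groups', 'datasets', 'links'):
            for sub in spec.get(sub_key, []):
                found.extend(find_nested_type_defs(sub, type_key))
    return found

# a Baz1 spec that defines Baz2 inline, mirroring the example above
baz1 = {'data_type_def': 'Baz1',
        'groups': [{'data_type_def': 'Baz2', 'data_type_inc': 'Foo'}]}
print(find_nested_type_defs(baz1))  # ['Baz1', 'Baz2']
```

`get_container_cls` (or the catalog registration step) could then walk every registered source spec this way and register each discovered definition, rather than looking only at the top level.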

Thoughts?


oruebel commented 5 years ago
1. Inform users ...

Having a "best practice" that types should be defined at the top level seems reasonable either way.

2. You could see this freedom of the schema language as a bug ...

From the perspective of the language I don't see why this would be considered a bug. While specifying new types at the top level may be a good practice, being able to define new types in nested levels seems perfectly reasonable. However, I understand that restricting it to the top-level may help simplify implementation of the schema language.

3. We could build logic to search the tree for new types.

My current inclination is that this is probably what we should do.