FYI: Adding an `identifier:true` property in a derived class does not work

mih commented 9 months ago

It took me a day of poking to identify the cause of a whole slew of weird errors (some in loading, some in validation). It turns out that adding a slot to a derived class does not work smoothly when that slot has identifier:true. The failure is not obvious or superficial, though. I noticed it when trying to reference an instance of the derived class via that identifier: linkml failed to recognize the target as an instance of the derived class, and instead considered only the base class type (which did not have that identifier slot).

Moving the identifier slot to the base class "fixed" it. This is not optimal, but tolerable for the model at hand.
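
For readers who want to picture the pattern, here is a minimal sketch of the schema shape described above. All class and slot names (Thing, Dataset, Catalog, datasets) are hypothetical and not taken from the actual model, and it has not been verified that this exact fragment reproduces the errors.

```yaml
# Hypothetical illustration of the pattern; not the actual datalad-concepts schema.
id: https://example.org/identifier-in-derived-class
name: identifier_in_derived_class
prefixes:
  linkml: https://w3id.org/linkml/
  ex: https://example.org/
default_prefix: ex
default_range: string
imports:
  - linkml:types

classes:
  Thing:
    description: Base class without any identifier slot.
    attributes:
      name:
        range: string

  Dataset:
    is_a: Thing
    attributes:
      id:
        identifier: true   # identifier introduced only in the derived class

  Catalog:
    attributes:
      datasets:
        range: Dataset     # meant to reference Dataset instances via their id
        multivalued: true
```

The workaround mentioned above would correspond to moving the `id` attribute from Dataset up into Thing.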

mih commented 8 months ago

OK, so this took me a while to get back to. I had written up https://github.com/linkml/linkml/issues/1812 at the end of the last iteration, hoping that it would shed light on the problem, but no luck.

With a fresh perspective I believe I found a way to achieve what I wanted -- without straining linkml too much. The following is a summary of the problem, and a sketch of the solution. Once a data model is ready, a PR with code will close this issue (but it needs some cleaning to make it digestible, still).

The development target here is two-fold:

- data model/schema to validate and transform actual records
- "ontology" with essential concepts and their alignment to standard vocabularies

I believe that the problems I saw were caused by linkml not being intended as a tool for the second target. From its docs:

OWL is used for building ontologies, whereas LinkML is a schema language.

Nevertheless, I continue to find it useful to stick to linkml for both parts, because our needs are relatively modest in comparison to standard OWL use cases, and using one tool chain, like linkml, is already an expensive addition -- little desire to add another one.

The technical description of the problem I ran into is in the issue linked above, and I will not discuss it again here. Instead I will focus on describing my current mental concept of the solution -- hoping that this helps to avoid similar dead-ends for further additions. It seems to all come down to this:

Separate the definition of data semantics from data structure in (two) classes

I cannot yet comprehend the full picture and implications of this statement, so I will go by examples.

Separate schema definitions from "ontology" components

Ideally a particular concept is represented by a single class in an ontology. However, data records representing instances of such concepts can come in many forms. For example, there can be a Dataset and a DatasetVersion class, where the former is a version-less concept and the latter is a particular version of such a dataset, of which there could be many. One way to represent this could be a Dataset class with a multi-valued has_version property that has DatasetVersion instances inlined. Or maybe DatasetVersion has a single globally unique identifier (in the linkml sense) and instances are inlined, not as a list but keyed by their identifier. Or DatasetVersion could be the top-level class with an is_version_of property that links or inlines a Dataset instance. These are all valid and sensible ways to represent data, and which one fits depends on the specifics of a use case. Still, only two concepts are involved.
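
The three variants could be sketched roughly as follows. Each YAML document below is only a `classes` fragment that would need the usual schema header (id, name, prefixes, imports), and the slot name version_label is a made-up placeholder.

```yaml
# Variant A: Dataset inlines its versions as a list
classes:
  Dataset:
    attributes:
      has_version:
        range: DatasetVersion
        multivalued: true
        inlined_as_list: true
  DatasetVersion:
    attributes:
      version_label:        # hypothetical slot
        range: string
---
# Variant B: versions carry an identifier and are inlined keyed by it
classes:
  Dataset:
    attributes:
      has_version:
        range: DatasetVersion
        multivalued: true
        inlined: true       # inlined as a mapping, keyed by the identifier
  DatasetVersion:
    attributes:
      id:
        identifier: true
---
# Variant C: DatasetVersion is the top-level class and points to its Dataset
classes:
  DatasetVersion:
    attributes:
      is_version_of:
        range: Dataset
        inlined: true
  Dataset:
    attributes:
      version_label:        # hypothetical slot
        range: string
```

All three describe the same two concepts; only the record structure differs.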

A large number of distinct (data structure) schemas can be defined based on few(er) concepts. It makes sense to keep them separate for efficiency and clarity.

Only define data structure and linkage mechanics in schemas (not concepts)

In order not to constrain reuse of concepts, never declare how two entities are connected in a concept class. For example, there should never be an inlined(_as_list) or even an identifier:true declaration in such classes. This should be done in dedicated classes that tailor a concept for a schema.

Use `mixins` for concept classes (exclusively)

linkml wants derivations (is_a) to be homogeneous: a mixin should only derive from other mixins, and a regular class only from other regular classes.

https://linkml.io/linkml/faq/modeling.html#what-is-the-difference-between-is-a-and-mixins

Because only a "downstream" schema class should define things like identifier:true for slots, and linkml seemingly does not support that for standard class derivation, concept classes should all be mixins. This enables a normal class hierarchy on the concept/ontology side of things (via is_a and/or mixins). A separate class hierarchy can/should be used for schema classes. The connection between the two hierarchies will be a mixins declaration. So for example, DataladDatasetVersion is a concept (mixin) class (derived from a DCAT Dataset and maybe some other interfaces/alignments). A schema that defines one way to write a metadata record on a DataLad dataset version would then declare something like this:

```yaml
classes:
  DataladDatasetVersionSE:
    mixins:
      - DataladDatasetVersion
    slot_usage:
      id:
        identifier: true
        equals_expression: annex:{uuid}
```

The SE suffix might stand for "schema element" or "structure element", whatever. It is merely a consequence of having to have a unique class name. The only purpose of this class is to declare an id property that is an identifier in the linkml sense (i.e. highlander-style "there can only be one"), and to do this in a "downstream" class that avoids forcing the requirement of an identifier on other consumers of the concept class DataladDatasetVersion.
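
To make the two hierarchies more concrete, here is a rough sketch of how the concept side and the schema side might sit next to each other. Only DataladDatasetVersion, DataladDatasetVersionSE, and the id slot come from the text above. The Dataset mixin, the DataladDatasetVersionInlined class, and the schema header values are assumptions for illustration, not the code that landed in main.

```yaml
# Hypothetical sketch of the concept/schema split; not the actual implementation.
id: https://example.org/concept-schema-split
name: concept_schema_split
prefixes:
  linkml: https://w3id.org/linkml/
  dcat: http://www.w3.org/ns/dcat#
  ex: https://example.org/
default_prefix: ex
imports:
  - linkml:types

slots:
  id:
    range: uriorcurie

classes:
  # Concept side: mixins only, no identifier or inlining declarations.
  Dataset:
    mixin: true
    class_uri: dcat:Dataset
    slots:
      - id
  DataladDatasetVersion:
    mixin: true
    is_a: Dataset            # mixin deriving from mixin keeps is_a homogeneous

  # Schema side: regular classes that tailor a concept for one record layout.
  DataladDatasetVersionSE:
    mixins:
      - DataladDatasetVersion
    slot_usage:
      id:
        identifier: true     # identifier requirement lives only here

  DataladDatasetVersionInlined:   # hypothetical second consumer of the concept
    mixins:
      - DataladDatasetVersion
    # no slot_usage: this consumer does not require id to be an identifier
```

The second schema class illustrates the point of the split: it reuses the same concept without inheriting the identifier requirement.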

mih commented 8 months ago

The approach sketched above and now implemented in main seems to solve this issue elegantly enough.

Closing.

psychoinformatics-de / datalad-concepts