Closed mih closed 8 months ago
OK, so this took me a while to get back to. I had written up https://github.com/linkml/linkml/issues/1812 at the end of the last iteration, hoping that it would shed light on the problem, but no luck.
With a fresh perspective I believe I found a way to achieve what I wanted -- without straining linkml too much. The following is a summary of the problem, and a sketch of the solution. Once a data model is ready, a PR with code will close this issue (but it needs some cleaning to make it digestible, still).
The development target here is two-fold:
I believe that the problems I saw were caused by linkml not wanting to be a tool to be used for the 2nd target. From its docs:
OWL is used for building ontologies, whereas LinkML is a schema language.
Nevertheless, I continue to find it useful to stick to linkml for both parts, because our needs are relatively modest in comparison to standard OWL use cases, and using one tool chain, like linkml, is already an expensive addition -- little desire to add another one.
The technical description of the problem I ran into is in the issue linked above, and I will not discuss it again here. Instead I will focus on describing my current mental concept of the solution -- hoping that this helps to avoid similar dead-ends for further additions. It seems to all come down to this:
I cannot yet comprehend the full picture and implications of this statement, so I will go by examples.
Ideally a particular concept is represented by a single class in an ontology. However, data records representing instances of such concepts can come in many forms. For example, there can be a `Dataset` and a `DatasetVersion` class, where the former is a version-less concept and the latter is a particular version of such a dataset, of which there could be many. One way to represent this could be a `Dataset` class with a `has_version` property that is multi-valued and has `DatasetVersion` instances inlined. Or maybe `DatasetVersion` has a single globally unique identifier (in the linkml sense) and instances are inlined, not as a list, but keyed by their identifier. Or `DatasetVersion` could be the top-level class with an `is_version_of` property that links or inlines a `Dataset` instance. These are all valid and sensible ways to represent data, and which one fits depends on the specifics of a use case -- still, only two concepts are involved.
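The three layouts described above could be sketched in LinkML roughly as follows. This is only an illustration: the `A`/`B`/`C` class name suffixes and the `id` slot are made up for this example, and each variant would normally live in its own schema rather than side by side.

```yaml
# Illustrative sketch only. The A/B/C suffixes and the `id` slot are
# hypothetical; each variant would live in a separate schema.
classes:
  # Variant A: the dataset holds its versions inlined as a list
  DatasetA:
    attributes:
      has_version:
        range: DatasetVersionA
        multivalued: true
        inlined_as_list: true
  DatasetVersionA: {}

  # Variant B: versions carry an identifier and are inlined as a
  # dictionary keyed by that identifier
  DatasetB:
    attributes:
      has_version:
        range: DatasetVersionB
        multivalued: true
        inlined: true
  DatasetVersionB:
    attributes:
      id:
        identifier: true

  # Variant C: the version is the top-level record and carries its dataset
  DatasetVersionC:
    attributes:
      is_version_of:
        range: DatasetC
        inlined: true   # drop this and give DatasetC an identifier to link by reference instead
  DatasetC: {}
```

All three variants describe the same two concepts; only the record structure differs.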
A large number of distinct (data structure) schemas can be defined based on few(er) concepts. It makes sense to keep them separate for efficiency and clarity.
In order to not constrain reuse of concepts, never declare how two entities are connected in a concept class. For example, there should never be an `inlined(_as_list)` or even `identifier: true` declaration in such classes. This should be done in dedicated classes that tailor a concept for a schema.
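`mixins` for concept classes (exclusively)

linkml wants derivations (`is_a`) to be homogeneous inside or outside a class hierarchy of mixins (that is, a mixin should only derive from other mixins, and a regular class only from regular classes).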
https://linkml.io/linkml/faq/modeling.html#what-is-the-difference-between-is-a-and-mixins
Because only a "downstream" schema class should define things like `identifier: true` for slots, and linkml seemingly does not support that for standard class derivation, concept classes should all be `mixins`. This enables a normal class hierarchy on the concept/ontology side of things (via `is_a` and/or `mixins`). A separate class hierarchy can/should be used for schema classes. The connection between the two hierarchies will be a `mixins` declaration. So for example, `DataladDatasetVersion` is a concept (mixin) class (derived from a DCAT `Dataset` and maybe some other interfaces/alignments). A schema that defines one way to write a metadata record on a DataLad dataset version would then declare something like this:

    classes:
      DataladDatasetVersionSE:
        mixins:
          - DataladDatasetVersion
        slot_usage:
          id:
            identifier: true
            equals_expression: annex:{uuid}

The `SE` suffix might stand for "schema element" or "structure element", whatever. It is merely a consequence of having to have a unique class name. The only purpose of this class is to declare an `id` property that is an `identifier` in the linkml sense (i.e. highlander-style "there can only be one"), and to do this in a "downstream" class that avoids forcing the requirement of an identifier on other consumers of the concept class `DataladDatasetVersion`.
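For context, the concept side that such a schema class mixes in might look roughly like the sketch below. Only the class names `DataladDatasetVersion` and `Dataset` and the slot names `id` and `uuid` come from the text above; using `is_a` for the DCAT derivation, the `class_uri`, and the slot descriptions are assumptions made for illustration.

```yaml
# Sketch of the concept side. Everything except the class and slot
# names taken from the discussion above is an assumption.
prefixes:
  dcat: http://www.w3.org/ns/dcat#

classes:
  Dataset:
    mixin: true
    class_uri: dcat:Dataset
  DataladDatasetVersion:
    mixin: true
    is_a: Dataset        # a mixin deriving from a mixin keeps is_a homogeneous
    slots:
      - id
      - uuid

slots:
  id:
    # Deliberately no `identifier: true` here; a schema class adds it
    # via slot_usage when its record format needs one.
    description: Identifier-like property of the concept
  uuid:
    description: Value referenced by the equals_expression in the schema class
```

Keeping `identifier: true` out of the concept class means another schema can mix in `DataladDatasetVersion` and inline its instances without any identifier at all.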
The approach sketched above and now implemented in main seems to solve this issue elegantly enough.
Closing.
It took me a day of poking to identify the cause of a whole slew of weird errors (some in loading, some in validation). It turns out that adding a `slot` to a derived class does not work smoothly when that slot has `identifier: true`. The failure is not obvious or superficial, though. I noticed it when trying to reference an instance of the derived class via that identifier: it was not recognized as an instance of that class, and only the base class type (which did not have that identifier slot) was considered. Moving the identifier slot to the base class "fixed" it. This is not optimal, but tolerable for the model at hand.
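The arrangement that caused trouble might be reconstructed roughly as below. This is a hypothetical sketch, not the actual model: `Base`, `Derived`, `Container`, and the slot names are placeholders, and the exact errors likely depend on the linkml version and the tool involved.

```yaml
# Hypothetical reconstruction; Base, Derived, and Container are
# placeholder names, not classes from the actual model.
classes:
  Base:
    attributes:
      name: {}
  Derived:
    is_a: Base
    attributes:
      id:
        identifier: true   # identifier introduced only in the derived class
  Container:
    attributes:
      member:
        range: Derived     # meant to reference a Derived instance via its id
```

The workaround described above corresponds to moving the `id` attribute, including `identifier: true`, from `Derived` up into `Base`.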