linkml / linkml

Linked Open Data Modeling Language
https://linkml.io/linkml

Support for `identifier:true` properties starting from derived classes only? #1812

Open mih opened 10 months ago

mih commented 10 months ago

Disclaimer: I am very new to linkml and have worked my way through the docs and examples over the past weeks.

I am trying to compose a schema for describing datasets that are tracked with Git/git-annex. For compatibility with broader infrastructures, the schema is based on the DCAT v3 model and has corresponding classes. The DCAT concepts do not require globally unique identifiers (in the linkml identifier:true sense). However, in the Git world, everything tracked does have such an identifier. Consequently, the schema needs to declare such a property and also use it in data instances.

Specifically, two variants of the "qualified relation" pattern are needed. One (QualifiedAccess) to add an access_id for retrieving a resource, and another (QualifiedPart) for declaring a location of a dataset part within a dataset.

I apologize for the complexity of the linkml schema below, but it is the smallest extract of the actual schema that I could come up with that still shows my problem in both ways. For readability I removed all descriptions and mappings.

When I convert the example below to RDF, I get:

❯ linkml-convert -s MonoDataladDatasetVersion-schema.yaml --target-class-from-path MonoDataladDatasetVersion-example.yaml -t rdf
<traceback>
TypeError: test.Resource() argument after ** must be a mapping, not FileInGitMetaId

When the last line in the example is commented out (relation: gitsha:...) the conversion works, but is then missing the critical link.
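
For readers unfamiliar with this message: it is Python's generic complaint when `**` unpacking is applied to something that is not a mapping. The following is a minimal plain-Python sketch, not linkml's generated code; `Resource`, `FileInGitMetaId`, and `build_inlined` are simplified, hypothetical stand-ins. It reproduces the same kind of error when a bare identifier string reaches a constructor call that expects an inlined object:

```python
from dataclasses import dataclass
from typing import Optional


class FileInGitMetaId(str):
    """Stand-in for an identifier type: just a string subclass."""


@dataclass
class Resource:
    """Stand-in for a class whose instances are expected inline, as a mapping."""
    meta_type: Optional[str] = None


def build_inlined(value):
    # Treats `value` as an inlined object, i.e. a mapping of slot names to values.
    return Resource(**value)


# An inlined object (a dict) can be unpacked into the constructor.
ok = build_inlined({"meta_type": "dlco:AnnexedFile"})

# A reference by identifier (a string subclass) cannot.
try:
    build_inlined(FileInGitMetaId("gitsha:b94ef1797f7bfc1ac979be122e1b538bbb0d1d58"))
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

If something similar happens in the generated code, it would suggest (but not prove) that the `relation` value is being handled as an inlined object rather than as a reference by identifier.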

The "same" pattern (link by ID) works for the qualified_access specification. However, in order to make it work, I had to declare the respective identifier:true slot in the base class. This is problematic, because the slot thereby becomes required:true, although in general matching data instances do not have an appropriate identifier.

For the qualified_part approach, I tried injecting the identifier via a mixin. Evidently, this is not working.

My question now is: Am I doing it wrong, conceptually or technically? Or is this a linkml limitation?

I suspect that something (else) is fishy with my schema, because running it through gen-linkml and then converting with the generated schema gives a suspiciously related error:

❯ gen-linkml MonoDataladDatasetVersion-schema.yaml -f yaml > merged-schema.yaml
❯ linkml-convert -s merged-schema.yaml --target-class-from-path MonoDataladDatasetVersion-example.yaml -t rdf
<traceback>
  File "/home/mih/env/datalad-dev/lib/python3.11/site-packages/linkml/utils/generator.py", line 239, in _initialize_using_schemaloader
    loader.resolve()
  File "/home/mih/env/datalad-dev/lib/python3.11/site-packages/linkml/utils/schemaloader.py", line 510, in resolve
    self.raise_value_error(
  File "/home/mih/env/datalad-dev/lib/python3.11/site-packages/linkml/utils/schemaloader.py", line 940, in raise_value_error
    SchemaLoader.raise_value_errors(error, loc_str)
  File "/home/mih/env/datalad-dev/lib/python3.11/site-packages/linkml/utils/schemaloader.py", line 948, in raise_value_errors
    raise ValueError(f'{TypedNode.yaml_loc(loc_str, suffix="")} {error}')
ValueError:  Class "GitTracked" - multiple keys/identifiers not allowed (meta_id, gitTracked__meta_id)

Thanks in advance for your time!

Schema `MonoDataladDatasetVersion-schema.yaml`

```yaml
id: https://example.com/reproducer
name: reproducer
prefixes:
  annex: https://concepts.datalad.org/namespace/annex-uuid/
  DCAT: http://www.w3.org/ns/dcat#
  dct: http://purl.org/dc/terms/
  dlco: https://concepts.datalad.org/ontology/
  gitsha: https://concepts.datalad.org/namespace/gitsha/
  linkml: https://w3id.org/linkml/
  prov: http://www.w3.org/ns/prov#
  xsd: http://www.w3.org/2001/XMLSchema#
default_prefix: dlco
imports:
  - linkml:types
types:
  PosixRelPath:
    uri: dlco:PosixRelPath
    base: str
  SHA1:
    uri: dlco:sha1
    base: str
  UUID:
    uri: http://purl.obolibrary.org/obo/NCIT_C54100
    base: str
slots:
  access_id:
    range: string
  at_location:
    slot_uri: prov:atLocation
    range: Location
  distribution:
    range: Distribution
  endpoint_url:
    range: uri
  gitsha:
    range: SHA1
  has_annex_remote:
    range: AnnexRemote
  has_part:
    slot_uri: dct:hasPart
  meta_id:
    identifier: true
    range: uriorcurie
  meta_type:
    designates_type: true
    range: uriorcurie
  qualified_access:
    range: QualifiedAccess
  qualified_part:
    range: QualifiedPart
  relation:
    slot_uri: dct:relation
  uuid:
    range: UUID
classes:
  Location:
    class_uri: prov:Location
  MetaObject:
    class_uri: linkml:Any
  GitTracked:
    mixin: true
    slots:
      - gitsha
      - meta_id
    slot_usage:
      gitsha:
        required: true
  Resource:
    class_uri: DCAT:Resource
    slots:
      - has_part
      - meta_type
      - qualified_part
    slot_usage:
      has_part:
        range: Resource
        multivalued: true
      qualified_part:
        multivalued: true
        inlined: true
        inlined_as_list: true
  Dataset:
    is_a: Resource
    slots:
      - distribution
  Distribution:
    slots:
      - qualified_access
  AnnexDistribution:
    is_a: Distribution
    slot_usage:
      qualified_access:
        range: QualifiedAnnexAccess
  MonoDataladDatasetVersion:
    is_a: Dataset
    slots:
      - has_annex_remote
    slot_usage:
      has_annex_remote:
        multivalued: true
        inlined: true
      has_part:
        range: FileInGit
        multivalued: true
        inlined: true
        inlined_as_list: true
      qualified_part:
        range: QualifiedGitTrackedPart
        multivalued: true
        inlined: true
        inlined_as_list: true
  QualifiedAccess:
    slots:
      - access_id
      - relation
    slot_usage:
      relation:
        range: DataService
  QualifiedAnnexAccess:
    is_a: QualifiedAccess
    slot_usage:
      relation:
        range: AnnexRemote
  QualifiedPart:
    slots:
      - relation
      - at_location
    slot_usage:
      at_location:
        range: PosixRelPath
      relation:
        range: Resource
  QualifiedGitTrackedPart:
    is_a: QualifiedPart
    slot_usage:
      relation:
        range: FileInGit
  File:
    is_a: Resource
    slots:
      - distribution
  FileInGit:
    is_a: File
    mixins:
      - GitTracked
  AnnexedFile:
    is_a: FileInGit
    slot_usage:
      distribution:
        range: AnnexDistribution
  DataService:
    slots:
      - endpoint_url
      # although we do not expect any data service to have a unique identifier
      # we must add this slot here, rather than in derived classes, due to
      # a potential linkml limitation/bug
      # https://github.com/psychoinformatics-de/datalad-concepts/issues/30
      - meta_id
  AnnexRemote:
    is_a: DataService
    slots:
      - uuid
    slot_usage:
      meta_id:
        required: true
```
Example `MonoDataladDatasetVersion-example.yaml`

```yaml
has_annex_remote:
  annex:7e0bf3e7-7d46-4093-813e-b4009826c3bf:
    uuid: 7e0bf3e7-7d46-4093-813e-b4009826c3bf
has_part:
  gitsha:b94ef1797f7bfc1ac979be122e1b538bbb0d1d58:
    meta_type: dlco:AnnexedFile
    gitsha: b94ef1797f7bfc1ac979be122e1b538bbb0d1d58
    distribution:
      qualified_access:
        access_id: MD5E-s3425--32a617360d10e3dcbfdd0885e8d64ab8.txt
        relation: annex:7e0bf3e7-7d46-4093-813e-b4009826c3bf
qualified_part:
  - at_location: README.txt
    # comment out the following line to get a working conversion
    relation: gitsha:b94ef1797f7bfc1ac979be122e1b538bbb0d1d58
```

❯ pip freeze | grep linkml
linkml==1.6.7
linkml-dataops==0.1.0
linkml-runtime==1.6.3
yarikoptic commented 9 months ago

FWIW example doesn't validate given provided schema

❯ linkml-validate -s MonoDataladDatasetVersion-schema.yaml --target-class MonoDataladDatasetVersion MonoDataladDatasetVersion-example.yaml
[ERROR] [MonoDataladDatasetVersion-example.yaml/0] {'gitsha:b94ef1797f7bfc1ac979be122e1b538bbb0d1d58': {'meta_type': 'dlco:AnnexedFile', 'gitsha': 'b94ef1797f7bfc1ac979be122e1b538bbb0d1d58', 'distribution': {'qualified_access': {'access_id': 'MD5E-s3425--32a617360d10e3dcbfdd0885e8d64ab8.txt', 'relation': 'annex:7e0bf3e7-7d46-4093-813e-b4009826c3bf'}}}} is not of type 'array' in /has_part

and requires fix to schema (or change to example I guess):

diff --git a/MonoDataladDatasetVersion-schema.yaml b/MonoDataladDatasetVersion-schema.yaml
index 8b85b1e..7357ded 100644
--- a/MonoDataladDatasetVersion-schema.yaml
+++ b/MonoDataladDatasetVersion-schema.yaml
@@ -127,7 +127,7 @@ classes:
         range: FileInGit
         multivalued: true
         inlined: true
-        inlined_as_list: true
+        inlined_as_list: false
       qualified_part:
         range: QualifiedGitTrackedPart
         multivalued: true
Here is the patched schema:

```yaml
id: https://example.com/reproducer
name: reproducer
prefixes:
  annex: https://concepts.datalad.org/namespace/annex-uuid/
  DCAT: http://www.w3.org/ns/dcat#
  dct: http://purl.org/dc/terms/
  dlco: https://concepts.datalad.org/ontology/
  gitsha: https://concepts.datalad.org/namespace/gitsha/
  linkml: https://w3id.org/linkml/
  prov: http://www.w3.org/ns/prov#
  xsd: http://www.w3.org/2001/XMLSchema#
default_prefix: dlco
imports:
  - linkml:types
types:
  PosixRelPath:
    uri: dlco:PosixRelPath
    base: str
  SHA1:
    uri: dlco:sha1
    base: str
  UUID:
    uri: http://purl.obolibrary.org/obo/NCIT_C54100
    base: str
slots:
  access_id:
    range: string
  at_location:
    slot_uri: prov:atLocation
    range: Location
  distribution:
    range: Distribution
  endpoint_url:
    range: uri
  gitsha:
    range: SHA1
  has_annex_remote:
    range: AnnexRemote
  has_part:
    slot_uri: dct:hasPart
  meta_id:
    identifier: true
    range: uriorcurie
  meta_type:
    designates_type: true
    range: uriorcurie
  qualified_access:
    range: QualifiedAccess
  qualified_part:
    range: QualifiedPart
  relation:
    slot_uri: dct:relation
  uuid:
    range: UUID
classes:
  Location:
    class_uri: prov:Location
  MetaObject:
    class_uri: linkml:Any
  GitTracked:
    mixin: true
    slots:
      - gitsha
      - meta_id
    slot_usage:
      gitsha:
        required: true
  Resource:
    class_uri: DCAT:Resource
    slots:
      - has_part
      - meta_type
      - qualified_part
    slot_usage:
      has_part:
        range: Resource
        multivalued: true
      qualified_part:
        multivalued: true
        inlined: true
        inlined_as_list: true
  Dataset:
    is_a: Resource
    slots:
      - distribution
  Distribution:
    slots:
      - qualified_access
  AnnexDistribution:
    is_a: Distribution
    slot_usage:
      qualified_access:
        range: QualifiedAnnexAccess
  MonoDataladDatasetVersion:
    is_a: Dataset
    slots:
      - has_annex_remote
    slot_usage:
      has_annex_remote:
        multivalued: true
        inlined: true
      has_part:
        range: FileInGit
        multivalued: true
        inlined: true
        inlined_as_list: false
      qualified_part:
        range: QualifiedGitTrackedPart
        multivalued: true
        inlined: true
        inlined_as_list: true
  QualifiedAccess:
    slots:
      - access_id
      - relation
    slot_usage:
      relation:
        range: DataService
  QualifiedAnnexAccess:
    is_a: QualifiedAccess
    slot_usage:
      relation:
        range: AnnexRemote
  QualifiedPart:
    slots:
      - relation
      - at_location
    slot_usage:
      at_location:
        range: PosixRelPath
      relation:
        range: Resource
  QualifiedGitTrackedPart:
    is_a: QualifiedPart
    slot_usage:
      relation:
        range: FileInGit
  File:
    is_a: Resource
    slots:
      - distribution
  FileInGit:
    is_a: File
    mixins:
      - GitTracked
  AnnexedFile:
    is_a: FileInGit
    slot_usage:
      distribution:
        range: AnnexDistribution
  DataService:
    slots:
      - endpoint_url
      # although we do not expect any data service to have a unique identifier
      # we must add this slot here, rather than in derived classes, due to
      # a potential linkml limitation/bug
      # https://github.com/psychoinformatics-de/datalad-concepts/issues/30
      - meta_id
  AnnexRemote:
    is_a: DataService
    slots:
      - uuid
    slot_usage:
      meta_id:
        required: true
```

yarikoptic commented 9 months ago

FWIW, the second failure (using the merged schema) reproduces consistently even without any example data, which points to a possible issue with (loading of) the schema itself. For example, I get the same result when passing /dev/null as the example (the output also includes some debug messages I added that print the slots):

```shell
❯ `which linkml-convert` -s merged-schema.yaml --target-class-from-path /dev/null -t rdf
DEBUG-: GitTracked slots: ['gitsha', 'meta_id']
DEBUG0: GitTracked slots: ['gitsha', 'meta_id']
DEBUG1: GitTracked slots: ['gitsha', 'meta_id', 'gitTracked__gitsha', 'gitTracked__meta_id']
DEBUG2: GitTracked slots: ['gitsha', 'meta_id', 'gitTracked__gitsha', 'gitTracked__meta_id']
DEBUG_END: GitTracked slots: ['GitTracked_gitsha', 'meta_id', 'gitTracked__gitsha', 'gitTracked__meta_id']
Traceback (most recent call last):
  File "/home/yoh/proj/misc/linkml/trash/gh-1812/venvs/dev/bin/linkml-convert", line 8, in <module>
    sys.exit(cli())
             ^^^^^
  File "/home/yoh/proj/misc/linkml/trash/gh-1812/venvs/dev/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yoh/proj/misc/linkml/trash/gh-1812/venvs/dev/lib/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/home/yoh/proj/misc/linkml/trash/gh-1812/venvs/dev/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yoh/proj/misc/linkml/trash/gh-1812/venvs/dev/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yoh/proj/misc/linkml/linkml/linkml/utils/converter.py", line 139, in cli
    python_module = PythonGenerator(schema).compile_module()
                    ^^^^^^^^^^^^^^^^^^^^^^^
  File "<string>", line 27, in __init__
  File "/home/yoh/proj/misc/linkml/linkml/linkml/generators/pythongen.py", line 71, in __post_init__
    super().__post_init__()
  File "/home/yoh/proj/misc/linkml/linkml/linkml/utils/generator.py", line 197, in __post_init__
    self._initialize_using_schemaloader(schema)
  File "/home/yoh/proj/misc/linkml/linkml/linkml/utils/generator.py", line 240, in _initialize_using_schemaloader
    loader.resolve()
  File "/home/yoh/proj/misc/linkml/linkml/linkml/utils/schemaloader.py", line 525, in resolve
    self.raise_value_error(
  File "/home/yoh/proj/misc/linkml/linkml/linkml/utils/schemaloader.py", line 955, in raise_value_error
    SchemaLoader.raise_value_errors(error, loc_str)
  File "/home/yoh/proj/misc/linkml/linkml/linkml/utils/schemaloader.py", line 963, in raise_value_errors
    raise ValueError(f'{TypedNode.yaml_loc(loc_str, suffix="")} {error}')
ValueError: Class "GitTracked" - multiple keys/identifiers not allowed (meta_id, gitTracked__meta_id)
```
edit: and here is the GitTracked class from that "merged" version, which shows that GitTracked has both a meta_id slot and a meta_id attribute with identifier: true. So the bug/shortcoming was likely "revealed" by gen-linkml:

```yaml
GitTracked:
  name: GitTracked
  from_schema: https://example.com/reproducer
  mixin: true
  slots:
  - gitsha
  - meta_id
  slot_usage:
    gitsha:
      name: gitsha
      domain_of:
      - GitTracked
      - FileInGit
      required: true
  attributes:
    gitsha:
      name: gitsha
      from_schema: https://example.com/reproducer
      alias: gitsha
      owner: GitTracked
      domain_of:
      - GitTracked
      range: SHA1
      required: true
    meta_id:
      name: meta_id
      from_schema: https://example.com/reproducer
      identifier: true
      alias: meta_id
      owner: GitTracked
      domain_of:
      - GitTracked
      - DataService
      range: uriorcurie
      required: true
```
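
The error condition can be sketched in a few lines of Python. This is not the SchemaLoader implementation; it is a hypothetical check over plain dictionaries, with slot names taken from the debug output above, and it assumes that both `meta_id` and `gitTracked__meta_id` end up flagged as identifiers once the merged schema is loaded:

```python
def find_identifiers(class_slots, slot_defs):
    """Return the slots of a class that are flagged as key or identifier."""
    return [
        name
        for name in class_slots
        if slot_defs[name].get("identifier") or slot_defs[name].get("key")
    ]


def check_single_identifier(class_name, class_slots, slot_defs):
    """Raise if a class ends up with more than one key/identifier slot."""
    ids = find_identifiers(class_slots, slot_defs)
    if len(ids) > 1:
        raise ValueError(
            f'Class "{class_name}" - multiple keys/identifiers not allowed '
            f'({", ".join(ids)})'
        )
    return ids


# Assumed slot definitions: the slot and the derived attribute are both identifiers.
slot_defs = {
    "gitsha": {"range": "SHA1"},
    "meta_id": {"identifier": True},
    "gitTracked__gitsha": {"range": "SHA1"},
    "gitTracked__meta_id": {"identifier": True},
}

# With only the declared slots, there is exactly one identifier.
original = check_single_identifier("GitTracked", ["gitsha", "meta_id"], slot_defs)

# With slots plus same-named attributes, the identifier is counted twice.
try:
    check_single_identifier(
        "GitTracked",
        ["gitsha", "meta_id", "gitTracked__gitsha", "gitTracked__meta_id"],
        slot_defs,
    )
    merged_error = None
except ValueError as exc:
    merged_error = str(exc)

print(merged_error)
```

Under these assumptions, the slot list on its own passes the check, and the failure only appears once the same identifier is present a second time as an attribute.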
yarikoptic commented 9 months ago

and FWIW, commenting out that relation: doesn't resolve the situation for me -- just leads to another crash

  File "/home/yoh/proj/misc/linkml/trash/gh-1812/venvs/dev/lib/python3.11/site-packages/linkml_runtime/loaders/yaml_loader.py", line 41, in load_any
    return self._construct_target_class(data_as_dict, target_class)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yoh/proj/misc/linkml/trash/gh-1812/venvs/dev/lib/python3.11/site-packages/linkml_runtime/loaders/loader_root.py", line 132, in _construct_target_class
    return target_class(**data_as_dict)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<string>", line 8, in __init__
  File "test", line 254, in __post_init__
  File "test", line 192, in __post_init__
  File "test", line 141, in __post_init__
  File "test", line 141, in <listcomp>
  File "<string>", line 6, in __init__
  File "test", line 149, in __post_init__
  File "/home/yoh/proj/misc/linkml/trash/gh-1812/venvs/dev/lib/python3.11/site-packages/linkml_runtime/utils/yamlutils.py", line 48, in __post_init__
    raise ValueError('\n'.join(messages))
ValueError:  Unknown argument: gitsha:b94ef1797f7bfc1ac979be122e1b538bbb0d1d58 = AnnexedFile(has_part=[], meta_type='dlco

does it reproduce for you @mih solely from the information above?

yarikoptic commented 9 months ago

Sorry for the noise; I am learning as I go. Apparently there is some aspect I still do not quite grasp here, since the fix I suggested above (changing inlined_as_list from true to false) was wrong: I had to comment out that line entirely and instead add inlined_as_list: true at the Resource level, thus overriding the inlining there. The better way to make the problem reproducible was simply to change the original example to use a list, not a dict, for has_part, so that it becomes:

has_annex_remote:
  annex:7e0bf3e7-7d46-4093-813e-b4009826c3bf:
    uuid: 7e0bf3e7-7d46-4093-813e-b4009826c3bf
has_part:
  - meta_id: gitsha:b94ef1797f7bfc1ac979be122e1b538bbb0d1d58
    meta_type: dlco:AnnexedFile
    gitsha: b94ef1797f7bfc1ac979be122e1b538bbb0d1d58
    distribution:
      qualified_access:
        access_id: MD5E-s3425--32a617360d10e3dcbfdd0885e8d64ab8.txt
        relation: annex:7e0bf3e7-7d46-4093-813e-b4009826c3bf
qualified_part:
  - at_location: README.txt
    # comment out the following line to get a working conversion
    relation: gitsha:b94ef1797f7bfc1ac979be122e1b538bbb0d1d58

and then the issue "fully" reproduces.
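
The two shapes of `has_part` differ only in where the identifier lives: as the dictionary key, or as a `meta_id` field inside each list item. A small plain-Python helper that rewrites the dictionary form of the original example into the list form shown above might look like this (`dict_to_list` is a hypothetical name, not part of linkml):

```python
def dict_to_list(items, id_slot="meta_id"):
    """Turn {identifier: {slot: value, ...}} into [{id_slot: identifier, slot: value, ...}]."""
    return [{id_slot: key, **value} for key, value in items.items()]


# Dictionary form, as in the original example (abridged to two slots).
has_part_as_dict = {
    "gitsha:b94ef1797f7bfc1ac979be122e1b538bbb0d1d58": {
        "meta_type": "dlco:AnnexedFile",
        "gitsha": "b94ef1797f7bfc1ac979be122e1b538bbb0d1d58",
    }
}

# List form, matching inlined_as_list: true.
has_part_as_list = dict_to_list(has_part_as_dict)
print(has_part_as_list)
```

The list form is what the schema's `inlined_as_list: true` setting on `has_part` asks for, which is consistent with the "is not of type 'array'" validation error reported earlier in this thread.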