linkml / linkml

Linked Open Data Modeling Language
https://linkml.io/linkml

Using wrong XSD type for URIorCURIE #2213

Open Silvanoc opened 1 month ago

Silvanoc commented 1 month ago

Describe the bug

The Uriorcurie type is declared with URI xsd:anyURI, which is wrong since a CURIE might not be a valid URI.

To reproduce

Steps to reproduce the behavior:

  1. Get an XML Schema validator (I have tested it with two different online validators)
  2. Provide an XML Schema that specifies the type xsd:anyURI on an attribute value. This is the one that I have used: ``` ```
  3. Provide a minimalistic XML document containing a CURIE with an underscore '_' in the prefix. This is what I have used: <slot_type src="pre_fix:reference"/>
  4. See error cvc-datatype-valid.1.2.1: 'pre_fix:reference' is not a valid value for 'anyURI'. or similar.

Expected behavior

A type that gets an XML Schema URI as its URI should comply with the corresponding XML Schema type.

There is no type in the "W3C XML Schema Definition Language (XSD) 1.1" for CURIEs. The CURIE specification provides an XML Schema for CURIEs, but this does not help for the URI of the LinkML Uriorcurie or Curie types.

In this specific case, having xsd:anyURI as the URI for the Uriorcurie LinkML type should mean that any value that is a valid URI or CURIE passes XML Schema validation, which is not true for the valid CURIE test_pref:Boat.

Additional context

LinkML Model version: 1.8.x

I have discovered the issue by chance when trying to improve the JSON-Schema generated for URI, CURIE and URIorCURIE in MR #2212.

In commit 82b753b1ae9cf23b06707b04cb8838dea654b161 I have assigned JSON-Schema type string with format uri for all three types, assuming that CURIEs would be valid URIs because of the use of xsd:anyURI for the URI of the LinkML type Uriorcurie.

Luckily, a test exists covering the corner case of a CURIE prefix containing an underscore (test_pref:Boat), and this test is failing with the changes proposed in PR #2212.

mahdanoura commented 1 month ago

which is wrong since a CURIE might not be a valid URI.

Yes, this is also clearly stated in the CURIE syntax Note document:

CURIEs and SafeCURIEs map to IRIs, but neither a CURIE nor a Safe_CURIE is an IRI or URI.

The specification also has a special case for the _ character: The CURIE prefix '_' is reserved for use by languages that support RDF. For this reason, the prefix '_' SHOULD be avoided by authors.

Therefore, a CURIE whose prefix contains a _ character is still a valid CURIE, but it is not a valid URI: a URI scheme cannot contain the character _.

A use case for a valid CURIE starting with _ is an RDF Blank Node Identifier e.g., _:b0
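
To make the point about the scheme grammar concrete, here is a minimal Python sketch. It compares the RFC 3986 scheme production with a simplified, ASCII-only approximation of the NCName rule that the CURIE syntax uses for prefixes. The function names are hypothetical and not part of LinkML, and the prefix pattern is an assumption made for brevity (real NCNames also allow many non-ASCII characters).

```python
import re

# RFC 3986, section 3.1: scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
URI_SCHEME = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*")

# Simplified, ASCII-only approximation of an XML NCName (assumption).
# Unlike a URI scheme, it may start with or contain an underscore.
CURIE_PREFIX = re.compile(r"[A-Za-z_][A-Za-z0-9_.-]*")


def split_prefix(value: str) -> str:
    """Return the text before the first colon, or '' if there is no colon."""
    prefix, sep, _ = value.partition(":")
    return prefix if sep else ""


def is_valid_uri_scheme(value: str) -> bool:
    """True if the part before the first colon is a legal RFC 3986 scheme."""
    return URI_SCHEME.fullmatch(split_prefix(value)) is not None


def is_valid_curie_prefix(value: str) -> bool:
    """True if the part before the first colon looks like an (ASCII) NCName."""
    return CURIE_PREFIX.fullmatch(split_prefix(value)) is not None


for value in ("https://linkml.io/linkml", "pre_fix:reference", "test_pref:Boat", "_:b0"):
    print(value, is_valid_uri_scheme(value), is_valid_curie_prefix(value))
```

Under these assumptions, pre_fix:reference, test_pref:Boat and _:b0 all fail the scheme check while passing the prefix check, which matches the validator error reported above.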

turbomam commented 1 month ago

Thanks @Silvanoc and @mahdanoura. You guys are much better at interpreting and testing against W3C specifications than I am. I would still like to add something to this issue:

The National Microbiome Data Collaborative has a use case in which the values of a uriorcurie-type slot are expected to be converted into CURIEs or URIs as part of the linkml-convert conversion from JSON or YAML into RDF. This doesn't currently work. As you might imagine, the values are just asserted as xsd:anyURI-typed string literals.

I have written some Python code that detects the xsd:anyURI-typed strings in a Turtle serialization and does the conversion to CURIEs. @cmungall is aware of this and I think he has some intention of addressing it in LinkML. In the meantime, I hope we can keep the association between the Uriorcurie type and xsd:anyURI, because without it, I don't see how I can make the last-step conversion to Turtle CURIEs.
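
For readers following along, a post-processing step of this kind could look roughly like the sketch below. This is not the code mentioned above, only an illustration under stated assumptions: the prefix map is hypothetical, the xsd namespace is assumed to be declared with the prefix "xsd", literals are assumed to contain no escaped quotes, and the matching @prefix declarations are assumed to be present in the Turtle document already.

```python
import re

# Hypothetical prefix map; a real one would come from the schema.
PREFIXES = {
    "ex": "https://example.org/",
    "test_pref": "https://example.org/test/",
}

# Matches a double-quoted literal typed as xsd:anyURI, e.g. "ex:Boat"^^xsd:anyURI.
ANYURI_LITERAL = re.compile(r'"([^"\\]*)"\^\^xsd:anyURI')

# RFC 3986 scheme grammar, used to recognise absolute URIs.
URI_SCHEME = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*")


def rewrite_literal(match: re.Match) -> str:
    """Turn one xsd:anyURI literal into a prefixed name or an IRI reference."""
    value = match.group(1)
    prefix, sep, _ = value.partition(":")
    if sep and prefix in PREFIXES:
        # Known prefix: emit the value as a Turtle prefixed name.
        return value
    if sep and URI_SCHEME.fullmatch(prefix):
        # Looks like an absolute URI. This is ambiguous for unknown
        # CURIE prefixes that also happen to be legal schemes.
        return f"<{value}>"
    # Neither a known CURIE nor a plausible URI: leave the literal alone.
    return match.group(0)


def convert_anyuri_literals(turtle: str) -> str:
    """Rewrite every xsd:anyURI-typed literal in a Turtle string."""
    return ANYURI_LITERAL.sub(rewrite_literal, turtle)


print(convert_anyuri_literals('ex:s ex:p "test_pref:Boat"^^xsd:anyURI .'))
```

Note that the lookup against the prefix map, not the xsd:anyURI datatype itself, is what decides whether a value is treated as a CURIE in this sketch.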

We don't have any prefixes containing underscores in our main schema file. I don't think we have any in the other import either but I haven't checked yet.

Silvanoc commented 1 month ago

The National Microbiome Data Collaborative has a use case in which the values of a uriorcurie-type slot are expected to be converted into CURIEs or URIs as part of the linkml-convert conversion from JSON or YAML into RDF. This doesn't currently work. As you might imagine, the values are just asserted as xsd:anyURI-typed string literals.

The Uriorcurie type should accept both; therefore, any change fulfilling this requirement should not break this use case.

WRT the assertion as xsd:anyURI, I wonder how and where that assertion is taking place. AFAIK no XML Schema validation is taking place, and most of the validation is done against a generated JSON Schema. In that sense, the usage of xsd:anyURI to specify the URI of the type Uriorcurie is only a hint for the JSON Schema generator to know how to validate it. My expectation would be that any change in the JSON Schema generation that keeps the value valid either as a URI or as a CURIE should not break anything.

In the end IMO as long as we are confident that the tests are covering that use-case, we should be able to work on improving the validation.

We don't have any prefixes containing underscores in our main schema file. I don't think we have any in the other import either but I haven't checked yet.

You might not have any, but:

  1. For whatever reason, you have an example containing one: https://github.com/linkml/linkml/blob/669fbff40a2e7683797d0c4e510bc21dee7815aa/tests/test_issues/input/linkml_issue_1608_data.yaml#L4
  2. For LinkML to comply with the W3C specification of the CURIE format, it must support prefixes containing underscores. And IMO LinkML should comply with the standards it claims to support.

The current situation is that you are declaring an XML Schema type that does not comply with the specification (an XML Schema validation of test_pref:Boat where a Uriorcurie is expected would fail). But since the validation uses a JSON Schema type that is much more relaxed than the declared XML Schema type, nobody seems to notice it.

turbomam commented 1 month ago

OMG, I did write that test. I agree that it was a bad choice and am in support of any mechanism that would invalidate it!

Silvanoc commented 1 month ago

OMG, I did write that test. I agree that it was a bad choice and am in support of any mechanism that would invalidate it!

Hopefully you don't mean to invalidate the test... That test is the only one covering CURIEs that are valid but cannot be mistaken for a URI merely because they are syntactically valid.

Silvanoc commented 1 month ago

I have written some Python code that detects the xsd:anyURI-typed strings in a Turtle serialization and does the conversion to CURIEs. @cmungall is aware of this and I think he has some intention of addressing it in LinkML. In the meantime, I hope we can keep the association between the Uriorcurie type and xsd:anyURI, because without it, I don't see how I can make the last-step conversion to Turtle CURIEs.

@turbomam how does your code handle references of LinkML type Curie? They are going to get the datatype xsd:string, right? How do you tell them apart from simple strings, since strings also get xsd:string?

Apart from that, since both LinkML types Uri and UriOrCurie get the URI xsd:anyURI, you need to "probe" both, although that should be needed only for UriOrCurie.

https://github.com/linkml/linkml-model/pull/202 (base for discussion as of now) is trying to resolve those ambiguities; you might want to have a look at it. The fact that UriOrCurie leaves some room for ambiguity is known and intrinsically accepted in this type, since we are not using SafeCURIEs. But at least my proposal constrains those ambiguities to that type.

sierra-moxon commented 1 month ago

from dev call: let's focus on fixing the RDFWriter to convert this correctly.