mapping-commons / sssom-py

Python toolkit for SSSOM mapping format
https://mapping-commons.github.io/sssom-py/index.html#
MIT License

Implementing separate methods for JSON and JSONLD #494

Open matentzn opened 7 months ago

matentzn commented 7 months ago

This PR adds methods

Which are exactly analogous to what was there before for JSON.

But its actual purpose is not so much to add those methods as to carefully review the format (to make sure we are happy with it), so that we can start making headway on https://github.com/mapping-commons/sssom/issues/321.

Breaking changes

JSON Format

We need to make sure that the JSON format looks exactly as we envision it. Problems I see so far

Here is an example JSON file:

```
{
  "mapping_set_id": "https://w3id.org/sssom/mapping/tests/data/basic.tsv",
  "license": "https://creativecommons.org/publicdomain/zero/1.0/",
  "mappings": [
    {
      "subject_id": "a:something",
      "predicate_id": "rdfs:subClassOf",
      "object_id": "b:something",
      "mapping_justification": "semapv:LexicalMatching",
      "subject_label": "XXXXX",
      "subject_category": "biolink:AnatomicalEntity",
      "object_label": "xxxxxx",
      "object_category": "biolink:AnatomicalEntity",
      "subject_source": "a:example",
      "object_source": "b:example",
      "mapping_tool": "rdf_matcher",
      "confidence": 0.8,
      "subject_match_field": [ "rdfs:label" ],
      "object_match_field": [ "rdfs:label" ],
      "match_string": [ "xxxxx" ],
      "comment": "mock data"
    },
    {
      "subject_id": "a:something",
      "predicate_id": "owl:equivalentClass",
      "object_id": "c:something",
      "mapping_justification": "semapv:LexicalMatching",
      "subject_label": "XYXYX",
      "subject_category": "biolink:AnatomicalEntity",
      "object_label": "xyxyxy",
      "object_category": "biolink:AnatomicalEntity",
      "subject_source": "a:example",
      "object_source": "c:example",
      "mapping_tool": "rdf_matcher",
      "confidence": 0.83,
      "subject_match_field": [ "rdfs:label" ],
      "object_match_field": [ "rdfs:label" ],
      "match_string": [ "xxxxx" ],
      "comment": "mock data"
    }
  ],
  "creator_id": [ "orcid:1234", "orcid:5678" ],
  "mapping_tool": "https://github.com/cmungall/rdf_matcher",
  "mapping_date": "2020-05-30"
}
```

The two remaining errors are also exactly due to this problem:

FAILED tests/test_conversion.py::SSSOMReadWriteTestSuite::test_conversion - AssertionError: 6 != 8 : JSON document has less elements than the orginal one for basic.tsv. Json: {"mapping_set_id": "https:...
FAILED tests/test_parsers.py::TestParseExplicit::test_round_trip_json - ValueError: {'UMLS', 'orcid', 'DOID'} are used in the SSSOM mapping set but it does not exist in the prefix map
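
To make the second failure concrete, here is a rough sketch (plain Python, not sssom-py's actual validation code) of how a reader ends up with undeclared prefixes when the JSON document carries no prefix map of its own. The helper names, the choice of slots to inspect, and the `known` prefix map are all assumptions made for illustration.

```python
import json

# Trimmed copy of an SSSOM/JSON document (assumption: only a few slots kept).
doc = json.loads("""
{
  "mapping_set_id": "https://w3id.org/sssom/mapping/tests/data/basic.tsv",
  "mappings": [
    {"subject_id": "a:something",
     "predicate_id": "rdfs:subClassOf",
     "object_id": "b:something",
     "mapping_justification": "semapv:LexicalMatching"}
  ],
  "creator_id": ["orcid:1234", "orcid:5678"]
}
""")

# Hypothetical selection of mapping slots that hold CURIEs.
ENTITY_SLOTS = ("subject_id", "predicate_id", "object_id", "mapping_justification")


def used_prefixes(document):
    """Collect the CURIE prefixes that appear in the inspected slots."""
    values = list(document.get("creator_id", []))
    for mapping in document.get("mappings", []):
        values.extend(mapping[slot] for slot in ENTITY_SLOTS if slot in mapping)
    prefixes = set()
    for value in values:
        prefix, sep, _ = value.partition(":")
        # Skip full IRIs; anything else with a colon is treated as a CURIE.
        if sep and not value.startswith(("http://", "https://")):
            prefixes.add(prefix)
    return prefixes


def missing_prefixes(document, prefix_map):
    """Prefixes used in the document but absent from the prefix map."""
    return used_prefixes(document) - set(prefix_map)


# Illustrative stand-in for whatever defaults a reader falls back on.
known = {
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "semapv": "https://w3id.org/semapv/vocab/",
}
missing = missing_prefixes(doc, known)
print(sorted(missing))
```

Anything left in `missing` is a prefix the reader cannot resolve, which is the situation the `ValueError` above complains about.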

gouttegd commented 7 months ago

We will probably have to address https://github.com/mapping-commons/sssom/issues/225.

The problem we might run into with that is that, as far as I know (and as I have noted in the discussion about the extension slots), LinkML does not have a map type. We’d want to declare a field that could be used like this:

"curie_map": {
  "FBbt": "http://purl.obolibrary.org/obo/FBbt_"
}

but unless I missed something in LinkML’s docs, this is not possible. All we can do is to have a list (i.e. a “multi-valued” field) of custom “dictionary entry” types, like this:

"curie_map": [
  { "key": "FBbt",
    "value": "http://purl.obolibrary.org/obo/FBbt_" }
]

which of course would work but would be… weird, at the very least.
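
For what it is worth, if we did go with the list form, turning it back into an ordinary map on the consumer side is cheap. A minimal sketch in plain Python; the function names are hypothetical and nothing here is existing sssom-py or LinkML API:

```python
def entries_to_map(entries):
    """Collapse [{"key": ..., "value": ...}, ...] into a {key: value} dict."""
    result = {}
    for entry in entries:
        key, value = entry["key"], entry["value"]
        # A list can repeat a key, which a real map cannot; reject conflicts.
        if key in result and result[key] != value:
            raise ValueError(f"conflicting expansions for prefix {key!r}")
        result[key] = value
    return result


def map_to_entries(prefix_map):
    """Inverse operation, sorted by prefix so the output is deterministic."""
    return [{"key": k, "value": v} for k, v in sorted(prefix_map.items())]


entries = [{"key": "FBbt", "value": "http://purl.obolibrary.org/obo/FBbt_"}]
curie_map = entries_to_map(entries)
round_tripped = map_to_entries(curie_map)
```

The one thing the list form allows that a map does not is duplicate keys, hence the conflict check.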

My own solution (that nobody will like, I know) to that is simple: decide that CURIEfied identifiers are only for the TSV format (which is what the spec currently says, incidentally) and that JSON should only contain full-length identifiers. No CURIE map needed, problem solved.
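
Roughly, the writer would expand every identifier before serialising. A sketch under assumptions: the helper names, the slot selection, and the example identifier are made up for illustration, and this is not how sssom-py currently writes JSON.

```python
def expand_curie(value, prefix_map):
    """Return the full IRI for a CURIE; full IRIs pass through unchanged."""
    if value.startswith(("http://", "https://")):
        return value
    prefix, sep, local = value.partition(":")
    if not sep or prefix not in prefix_map:
        raise ValueError(f"cannot expand {value!r}: unknown prefix")
    return prefix_map[prefix] + local


# Hypothetical selection of slots whose values are identifiers.
ID_SLOTS = ("subject_id", "predicate_id", "object_id")


def expand_mapping(mapping, prefix_map):
    """Copy a mapping record, expanding only the identifier slots."""
    return {
        slot: expand_curie(value, prefix_map) if slot in ID_SLOTS else value
        for slot, value in mapping.items()
    }


prefix_map = {
    "FBbt": "http://purl.obolibrary.org/obo/FBbt_",
    "skos": "http://www.w3.org/2004/02/skos/core#",
}
mapping = {
    "subject_id": "FBbt:00000001",  # illustrative identifier
    "predicate_id": "skos:exactMatch",
    "subject_label": "some label",
}
expanded = expand_mapping(mapping, prefix_map)
```

The prefix map is then only needed at write time, and the JSON output no longer depends on it.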