Add new "literal profile" solution based on "subject_label" and "object_label"

matentzn commented 3 months ago

Resolves #234

[X] docs/ have been added/updated if necessary
[X] make test has been run locally
[ ] tests have been added/updated (if applicable)
[x] CHANGELOG.md has been updated.
[ ] Change base branch to main

If you are proposing a change to the SSSOM metadata model, you must

[x] provide a full, working and valid example in examples/
[x] provide a link to the related GitHub issue in the see_also field of the linkml model
[x] provide a link to a valid example in the see_also field of the linkml model
[x] run SSSOM-Py test suite against the updated model

This PR introduces an alternative model to the "literal mapping" proposal we have previously added. It is built on the following principles:

Introducing a profile causes a needless overhead which requires specialised tooling
We can achieve the same functionality with a very minimal intervention in the main metadatamodel

I do understand that there are various opposing views on the need of a "literal" profile, but I think this super minimal intervention will satisfy both sides.

Huge thanks to @gouttegd who managed to steer this massive carrier ship after it has left the harbour. This is rarely successful and I am very happy we managed to make it!

gouttegd commented 3 months ago

Two more questions:

1) Do we want to represent unmappable literal mappings? If yes, how?

This is basically a question of how the “literal profile” and sssom:NoTermFound should play together.

There are two distinct cases.

1.a) One side is a literal, the other side is sssom:NoTermFound.

Unless I am missing something, this case should be straightforward and should not require any special treatment. That’s a good thing, because it’s the case that is most likely to be useful: if you have to map a list of literals (say, a list of scRNAseq cluster names) to a target vocabulary (say, the CL ontology), you want to be able to say that a given literal could not be mapped to any term in the target vocabulary.

1.b) The literal and the “term not found” are on the same side.

Say you want to map a vocabulary (again, let’s say CL) to a list of literals. You want to say that a given term in CL is not represented in the list of literals. We have a problem here, because the way to represent a “term not found” is to use sssom:NoTermFound as the object_id (assuming the literals are on the object side – it would not change anything if they were on the subject side), but object_id is not used for a literal mapping (literals have no IDs, almost by definition).

So what to do here? I can imagine 3 possibilities:

1.b.1) We don’t support that case. Seems quite reasonable to me a priori, as I don’t think it is a very relevant use case (contrary to the 1.a case). That would be the current situation (with the profile as defined in this PR).

1.b.2) Use sssom:NoTermFound as the object_id, as we would do for non-literal mappings. This would mean that, even for literal mappings (where we are normally concerned only with object_label), we would need to lookup object_id just to check whether it contains the special sssom:NoTermFound value. It leaves open the question of what to put in the object_label slot in that case (we have to put something in that slot: it cannot be empty).

1.b.3) Put in the object_label slot a special value similar to sssom:NoTermFound, intended to indicate that there is no mapped literal. It could even be the value sssom:NoTermFound itself (probably the easiest way).

2) Computing mapping_cardinality for literal mappings

Not really a question, but when computing the mapping_cardinality, if we are dealing with literal mappings, we must deal with the _label slots rather than the _id slots.

That is, for example: if we are dealing with a set of mappings with literals on the subject side and entities on the object side, then a cardinality of 1:n means that one subject_label (NOT one subject_id) maps to several object_id.

That is, I think, quite obvious, but it should still be mentioned explicitly somewhere.

matentzn commented 3 months ago

Ok, I will fold that in (think about it first). As an aside, I like it if actionable comments to be added as comments on the diffs so that they show up as "resolveable", so that they also block PRs being merged unless they are explicitly resolved. If there is no good place for such a comment you can just put in on the Changelog diff or on any other file that makes a little bit of sense.

gouttegd commented 3 months ago

As an aside, I like it if actionable comments to be added as comments on the diffs so that they show up as "resolveable"

Duly noted. :)

gouttegd commented 3 months ago

The SSSOM-Py test suite fails on three tests:

test_utils.py::TestUtils::test_get_dict_from_mapping fails because it expects a semantic_similarity_score slot. This has nothing to do with this PR and is instead obviously the result of #386, which was not tested against SSSOM-Py before being merged.
test_parsers.py::TestParse::test_parse_alignment_xml fails when parsing this:

<Cell>"
  <entity1 rdf:resource="http://randomurlwithnochancetobeinprefixmap.org/ID_123"/>
  <entity2 rdf:resource="http://purl.obolibrary.org/obo/WBbt_0005815"/>
  <measure rdf:datatype="xsd:float">0.5</measure>
  <relation>=</relation>
</Cell>

My understanding is that, after some checking against a prefix map, the entity1 ends up being being empty/None/null, and prior to that PR SSSOM-Py could count on the Mapping constructor to automatically reject the mapping as invalid (since the Python generated code included a non-null check because of the required=true attribute on subject_id). With this PR, subject_id is no longer required, so the mapping is “successfully” constructed – which is wrong, and the test catches that.

It looks like SSSOM-Py cannot simply rely on the LinkML runtime implementing the rules set forth in this PR, so it will have to implement its own validation of subject_id / object_id.

test_parsers.py::TestParse::test_parse_obographs: I have not looked in details, but it’s most likely a similar problem as above.

gouttegd commented 3 months ago

The issues above are putatively fixed by https://github.com/mapping-commons/sssom-py/pull/552.

mapping-commons / sssom

Add new "literal profile" solution based on "subject_label" and "object_label" #384