Closed: julesjacobsen closed this issue 3 months ago.
@julesjacobsen great that you are looking at this. I was thinking of doing the semantic similarity case like this:
subject_id | predicate_id | object_id | match_type | information_content_mica_score | semantic_similarity_score | mapping_tool | other |
---|---|---|---|---|---|---|---|
HP:0002651 | skos:relatedMatch | HP:0002651 | SemanticSimilarity | 0.9 | 0.2 | https://en.wikipedia.org/wiki/Jaccard_index | subject_information_content:0.1\|object_information_content:0.3\|mica_id:"HP:0002651" |
Basically, we use the `other` field to convey any kind of information that is so specific it is beyond SSSOM. What do you think?
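A minimal sketch of how a consumer might unpack such an `other` field, assuming the pipe-delimited `key:value` convention shown in the example row above (the function name, and the assumption that keys never contain `:`, are mine rather than part of SSSOM):

```python
def parse_other(other: str) -> dict:
    """Split a pipe-delimited 'key:value' string into a dict.

    Assumes keys never contain ':' and that values may be quoted CURIEs.
    """
    entries = {}
    for part in other.split("|"):
        if not part.strip():
            continue
        key, _, value = part.partition(":")
        entries[key.strip()] = value.strip().strip('"')
    return entries


row_other = 'subject_information_content:0.1|object_information_content:0.3|mica_id:"HP:0002651"'
print(parse_other(row_other))
# {'subject_information_content': '0.1', 'object_information_content': '0.3', 'mica_id': 'HP:0002651'}
```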
I think you will need to declare by which algorithm the semantic similarity was computed, ideally via a PID
Yeah I was JUST thinking that!
So just imagine instead of the wikipedia link a PMID!
or a URI or a DOI or a..... :-)
@matentzn OK, seems reasonable to use the `other` field for the other bits. However, what about the 0-1 for the IC?
@mellybelly better than a PMID would be a GitHub build link or equivalent - ultimately it was a specific version of a piece of software, plus the versions of the input data, which created the output data. Bugs happen, and implementation details change, leading to subtle changes in output. None of this is captured in a PMID. For example, we have owlsim-2 and owlsim-3, which both implement some of the same core algorithms. OWLTools has a couple of official releases, whereas owlsim-3 never had an official tag or release, but you can still build the source from a specific commit SHA, e.g. https://github.com/monarch-initiative/owlsim-v3/tree/788fc8b54f30a160744f99fa09f65dd60352bc59.
I agree that the best thing would be to link to whatever is most enlightening for understanding the mapping result - this is often a link to a GitHub script, but sometimes a published algorithm can be useful too. There is a bit of an argument here for adding three fields: mapping_tool (which can be a link to a version of a piece of software like owlsim or phenol), mapping_tool_config (a string that could be used to regenerate the mappings, say a shell command) and mapping_tool_reference (an associated publication).
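To make that concrete, a hypothetical record using those three fields might look like the sketch below; mapping_tool_config and mapping_tool_reference do not exist in SSSOM, and their values here are placeholders rather than real invocations or identifiers (the commit URL is the one cited in the previous comment):

```python
# Hypothetical provenance block for a mapping set; the field names are the
# three proposed above, the values are illustrative placeholders only.
mapping_provenance = {
    # pin the exact build that produced the mappings
    "mapping_tool": "https://github.com/monarch-initiative/owlsim-v3/tree/788fc8b54f30a160744f99fa09f65dd60352bc59",
    # enough of the invocation to regenerate the output (placeholder command)
    "mapping_tool_config": "--sim-save-phenodigm-class-scores <scores.tsv>",
    # the published description of the algorithm (placeholder identifier)
    "mapping_tool_reference": "PMID:<algorithm-publication>",
}
```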
@julesjacobsen I agree; I would like the PID to point at the actual artifact/algorithm itself. A GitHub-to-Zenodo release is one good mechanism, as Zenodo is considered archival, but really anything pointing at the actual artifact is better than a PMID.
@matentzn thinking about this a bit more, what is the reason for including `information_content_mica_score` but excluding `mica_id` and therefore `mica_label`? Conversely, given there is an `information_content_mica_score`, why not have `information_content_subject_score` and `information_content_object_score`?
It's a bit of scope creep to include these as first-class fields, but wouldn't they be reasonable to include for the `match_type:SemanticSimilarity` use case in SSSOM? I guess the test for this is whether (and how many) other use cases need extra information in the `other` field, and whether or not adding them makes a complete bloated mess of the current set of elements.
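For context, a small sketch of how the common IC-based measures combine exactly these quantities (subject IC, object IC, and the IC of the MICA); the Resnik, Lin and Jiang-Conrath formulas are standard, but the function and variable names are just illustrative:

```python
def resnik(ic_mica: float) -> float:
    # Resnik similarity is simply the IC of the most informative common ancestor
    return ic_mica


def lin(ic_subject: float, ic_object: float, ic_mica: float) -> float:
    # Lin normalises the MICA IC by the ICs of the two terms, which is why
    # subject/object IC are useful alongside information_content_mica_score
    denominator = ic_subject + ic_object
    return 2 * ic_mica / denominator if denominator else 0.0


def jiang_conrath_distance(ic_subject: float, ic_object: float, ic_mica: float) -> float:
    # Jiang-Conrath is a distance, not a similarity, and is not bounded by 1
    return ic_subject + ic_object - 2 * ic_mica


# illustrative numbers, not taken from any real ontology
print(lin(ic_subject=6.2, ic_object=5.8, ic_mica=4.0))  # ~0.667
```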
So, given your own expertise: if we added all those, would that cover 90% of all cases for representing semantic-similarity-based mappings? Or is it just Monarch that cares about this exact combination of factors?
I am inclined to add them all, but only if I know this does not mean we have to add 30 others for all kinds of semantic similarity requirements.
It would cover a lot (most? all?) of the IC-based similarity measures, or at least help in their final calculation (e.g. Phenodigm), but there are others out there which it wouldn't help with. I'm really not the best person to ask, though; @cmungall or @drseb would be far more knowledgeable.
I seem to be missing some context here; is there a document to read up on what a Standard for Sharing Ontology Mappings has to do with semantic similarity computations?
BTW: semantic similarity scores are not always between 0 and 1 - if I remember correctly, even phenodigm fails in this unless you apply tricks
Semantic similarity is just a special case of a complex mapping based on an algorithm (`mapping_tool`); for example, you can derive an MP-HP map by computing the similarity scores between all individual terms and then, say, cutting them off at some threshold of similarity. It is slightly out of scope to provide so many details about semantic similarity, but it is such an important case that we decided to include it.
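A minimal sketch of that threshold approach, assuming some externally supplied pairwise `similarity(a, b)` function (Lin, Jaccard, ...); the term lists, predicate and cutoff are placeholders:

```python
from itertools import product


def derive_mappings(mp_terms, hp_terms, similarity, cutoff=0.8):
    """Keep only the term pairs whose pairwise similarity clears the cutoff.

    `similarity` is any score function over two term IDs; the cutoff value
    and the skos:relatedMatch predicate are placeholders.
    """
    mappings = []
    for mp_id, hp_id in product(mp_terms, hp_terms):
        score = similarity(mp_id, hp_id)
        if score >= cutoff:
            mappings.append((mp_id, "skos:relatedMatch", hp_id, score))
    return mappings
```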
I don't really mind about the similarity score range. I find it useful to define similarity between 0 and 1, but if that's too tight a constraint, we can drop it; I did not know that some scores were on a different scale. Let's just drop that constraint then.
Other than that, Seb, does this make sense? Is there any other important metadata for a single term-term similarity measure we would need to embed? I know you deal mostly with termset<->termset similarity, but maybe you can think of something analogous.
> BTW: semantic similarity scores are not always between 0 and 1 - if I remember correctly, even phenodigm fails in this unless you apply tricks
Good point, I forgot we apply tricks to fit it to a 0-1 range. We scale the phenix scores too, but only for the final profile match hits. The single term-term scores are unbounded.
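One common flavour of such a trick is plain min-max scaling of the raw scores against the best and worst scores observed in the run; this is only a generic sketch, not the exact scaling Phenodigm or PhenIX applies:

```python
def min_max_scale(scores):
    """Squash a list of unbounded raw similarity scores into [0, 1].

    Generic min-max scaling; real pipelines often scale against the score
    of a theoretical 'perfect' self-match rather than the observed maximum.
    """
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]
```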
We are now using a different model in Monarch for sharing semantic similarity tables, so closing here as out of scope: https://github.com/INCATools/ontology-access-kit/blob/main/src/oaklib/datamodels/similarity.yaml
This looks great. Can you give some examples of how the phenodigm class-scores similarity output from OWLSim might look?
The OWLSim format from `--sim-save-phenodigm-class-scores` can be partly translated to the SSSOM column headers. Assuming this is within scope for SSSOM:
- There is no field to say by which measure the `semantic_similarity_score` was calculated, e.g. Jaccard, Lin.
- There is no field for `mica_id`, but there is a field for `information_content_mica_score`. Is this deliberate?
- Is `information_content_mica_score` supposed to be a normalised 0-1 value for the dataset?
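For reference, the usual way to bring raw information content onto a 0-1 scale for a given dataset is to divide by the maximum IC observed in that dataset; this is only one convention, and not something the thread settles on:

```python
import math


def information_content(term_count: int, corpus_size: int) -> float:
    # standard corpus-based IC: negative log of the term's annotation frequency
    return -math.log(term_count / corpus_size)


def normalised_ic(ic: float, max_ic: float) -> float:
    # one common convention for a per-dataset 0-1 scale: divide by the
    # largest IC observed in that dataset
    return ic / max_ic if max_ic else 0.0
```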