mapping-commons / sssom

Simple Standard for Sharing Ontology Mappings
https://mapping-commons.github.io/sssom/
BSD 3-Clause "New" or "Revised" License

Semantic similarity example #30

Open julesjacobsen opened 3 years ago

julesjacobsen commented 3 years ago

This looks great. Can you give some examples of how the phenodigm class-score similarity output from OWLSim might look?

The OWLSim format from --sim-save-phenodigm-class-scores is (partly translating the column headers to SSSOM):

# subject_id    object_id   simj    IC  mica_id
HP_0002651  HP_0002651  1.0 8.829843768215113   HP_0002651;

Assuming this is within scope for SSSOM

  1. Is there anywhere to add the algorithm name to describe how the semantic_similarity_score was calculated, e.g. Jaccard, Lin?
  2. In the sssom_metadata there is no field for mica_id, but there is a field for information_content_mica_score. Is this deliberate?
  3. Is the information_content_mica_score supposed to be normalised to 0-1 for the dataset?
| Element ID | Description | TSV Example | RDF example | Scope | Entity Type | Required | Datatype | Equivalent property | Sub Property |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| sssom:semantic_similarity_score | A score between 0 and 1 to denote the semantic similarity, where 1 denotes equivalence. | 0.8 | 0.8 | L | owl:AnnotationProperty | 0 | xsd:double | | sssom:metadata_element |
| sssom:information_content_mica_score | A score between 0 and 1 to denote the information content of the most informative common ancestor, where 1 denotes the maximum level of informativeness. | 0.3 | 0.3 | L | owl:AnnotationProperty | 0 | xsd:double | | sssom:metadata_element |
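
(For readers new to the algorithm names mentioned above, here is a minimal, hypothetical Python sketch of the distinction, using invented ancestor sets and IC values rather than real OWLSim output: Jaccard is bounded 0-1 by construction, Resnik returns a raw MICA IC, and Lin re-normalises that IC back onto a 0-1 scale.)

```python
# Hypothetical sketch of the algorithms named above (Jaccard, Lin) plus Resnik,
# using invented ancestor sets and IC values, not real OWLSim output.

def jaccard(anc_a: set, anc_b: set) -> float:
    """simJ: shared ancestors over combined ancestors; always in [0, 1]."""
    return len(anc_a & anc_b) / len(anc_a | anc_b)

def resnik(anc_a: set, anc_b: set, ic: dict) -> float:
    """IC of the most informative common ancestor (MICA); not bounded by 1."""
    common = anc_a & anc_b
    return max(ic[t] for t in common) if common else 0.0

def lin(anc_a: set, anc_b: set, ic: dict, ic_a: float, ic_b: float) -> float:
    """Lin: MICA IC normalised by the two terms' own IC; back in [0, 1]."""
    return 2 * resnik(anc_a, anc_b, ic) / (ic_a + ic_b) if (ic_a + ic_b) else 0.0

# Toy data (made up): two terms sharing two ancestors.
ic = {"HP:0000001": 0.0, "HP:0011842": 3.2, "HP:0002651": 8.8}
anc_x = {"HP:0000001", "HP:0011842", "HP:0002651"}
anc_y = {"HP:0000001", "HP:0011842"}
print(jaccard(anc_x, anc_y))            # 0.666...
print(resnik(anc_x, anc_y, ic))         # 3.2 (raw IC, unbounded)
print(lin(anc_x, anc_y, ic, 8.8, 3.2))  # ~0.53
```
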
matentzn commented 3 years ago

@julesjacobsen great you are looking at this. I was thinking to do the semantic similarity case like this:

subject_id predicate_id object_id match_type information_content_mica_score semantic_similarity_score mapping_tool other
HP:0002651 skos:relatedMatch HP:0002651 SemanticSimilarity 0.9 0.2 https://en.wikipedia.org/wiki/Jaccard_index subject_information_content:0.1|object_information_content:0.3|mica_id:"HP:0002651"

Basically, we use the other field to convey any kind of information that is so specific it is beyond the scope of SSSOM. What do you think?
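
(A quick, hypothetical sketch of how a consumer could unpack such a pipe-delimited other value; the key:value convention here is just what the example above uses, not something the SSSOM spec prescribes.)

```python
# Rough sketch: unpack the pipe-delimited "other" field from the row above.
# The key:value convention is an assumption of this thread, not part of SSSOM.
def parse_other(other: str) -> dict:
    """Split 'key:value|key:value' pairs, keeping CURIEs like HP:0002651 intact."""
    pairs = {}
    for chunk in other.split("|"):
        key, _, value = chunk.partition(":")  # split only at the first colon
        pairs[key] = value.strip('"')
    return pairs

row_other = 'subject_information_content:0.1|object_information_content:0.3|mica_id:"HP:0002651"'
print(parse_other(row_other))
# {'subject_information_content': '0.1', 'object_information_content': '0.3', 'mica_id': 'HP:0002651'}
```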

mellybelly commented 3 years ago

I think you will need to declare which algorithm the semantic similarity was computed with, ideally via a PID

matentzn commented 3 years ago

Yeah I was JUST thinking that!

matentzn commented 3 years ago

So just imagine instead of the wikipedia link a PMID!

mellybelly commented 3 years ago

or a URI or a DOI or a..... :-)

julesjacobsen commented 3 years ago

@matentzn OK, it seems reasonable to use the other field for the other bits. However, what about the 0-1 range for the IC?

@mellybelly better than a PMID would be a GitHub build link or equivalent - ultimately it was a specific version of a piece of software, together with specific versions of the input data, that created the output data. Bugs happen, and implementation details change, leading to subtle changes in output; none of this is captured in a PMID. For example, we have OWL-sim-2 and OWL-sim-3, which both implement some of the same core algorithms. OWL-tools has a couple of official releases, whereas owlsim-3 never had an official tag or release, but you can still build the source from a specific commit sha, e.g. https://github.com/monarch-initiative/owlsim-v3/tree/788fc8b54f30a160744f99fa09f65dd60352bc59.

matentzn commented 3 years ago

I agree that the best thing would be to link whatever is most enlightening to understand the mapping result - this is often a link to a GitHub script, but sometimes a published algorithm can be useful too. There is a bit of an argument here to add three fields: mapping_tool (which can be a link to a version of a piece of software like owlsim or phenol), mapping_tool_config (a string that could be used to regenerate the mappings, say a shell command), and mapping_tool_reference (an associated publication).
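
(If those three fields were adopted, the commented YAML header of an SSSOM TSV might carry them roughly like this; the values below are invented for illustration, and the command line is hypothetical.)

```
#mapping_tool: "https://github.com/monarch-initiative/owlsim-v3/tree/788fc8b54f30a160744f99fa09f65dd60352bc59"
#mapping_tool_config: "owltools --sim-save-phenodigm-class-scores <input.owl>"   # hypothetical shell command
#mapping_tool_reference: "<PMID or DOI of the published algorithm>"
```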

mellybelly commented 3 years ago

@julesjacobsen I agree - I would like the PID to point at the actual artifact/algorithm itself. A GitHub->Zenodo release is one good mechanism, as Zenodo is considered archival, but really anything that points at the actual artifact is better than a PMID.

julesjacobsen commented 3 years ago

@matentzn thinking about this a bit more, what is the reason for including information_content_mica_score but excluding mica_id and therefore mica_label? Conversely, given there is an information_content_mica_score, why not have information_content_subject_score and information_content_object_score?

It's a bit of scope creep to include these as first-class fields, but wouldn't they be reasonable to include for the match_type:SemanticSimilarity use-case in SSSOM? I guess the test for this is how many other use-cases there are which need extra information in the other field, and whether or not adding them makes a complete bloated mess of the current set of elements.
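
(To make that concrete, a hypothetical row carrying those candidate first-class columns might look like the sketch below; column names and values are illustrative reuses of the numbers from earlier in the thread, not an agreed design.)

```
subject_id  predicate_id       object_id   match_type          semantic_similarity_score  information_content_mica_score  information_content_subject_score  information_content_object_score  mica_id     mica_label
HP:0002651  skos:relatedMatch  HP:0002651  SemanticSimilarity  0.2                        0.9                             0.1                                0.3                               HP:0002651  <label of HP:0002651>
```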

matentzn commented 3 years ago

So given your own expertise, if we added all of those, would that cover 90% of all cases for representing semantic similarity based mappings? Or is it just Monarch that cares about this exact combination of factors?

I am inclined to add them all, but only if I know that this does not mean we have to add 30 others for all kinds of semantic similarity requirements.

julesjacobsen commented 3 years ago

It would cover a lot (most? all?) of the IC-based similarity measures, or at least help in their final calculation (e.g. Phenodigm), but there are others out there which it wouldn't help with. I'm really not the best person to ask though - @cmungall or @drseb would be far more knowledgeable.

drseb commented 3 years ago

I seem to be missing some context here. Is there a document where I can read up on what a Standard for Sharing Ontology Mappings has to do with semantic similarity computations?

drseb commented 3 years ago

BTW: semantic similarity scores are not always between 0 and 1 - if I remember correctly, even phenodigm fails in this unless you apply tricks

matentzn commented 3 years ago

Semantic similarity is just a special case of a complex mapping based on an algorithm (mapping_tool); for example, you can derive an MP-HP map by computing the similarity scores between all individual terms and then, say, cutting them off at some threshold of similarity. It is slightly out of scope to provide so many details about semantic similarity, but it is such an important case that we decided to include it.
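
(As a minimal sketch of that workflow, assuming made-up MP/HP term pairs, scores, and a 0.7 cutoff; nothing here is a prescribed SSSOM procedure.)

```python
# Minimal sketch: turn pairwise MP-HP similarity scores into SSSOM-style
# mapping rows by keeping only pairs above a chosen similarity cutoff.
# The input tuples and the 0.7 threshold are made up for illustration.

pairwise_scores = [
    # (mp_id, hp_id, semantic_similarity_score)
    ("MP:0001262", "HP:0004325", 0.81),
    ("MP:0001262", "HP:0001508", 0.55),
    ("MP:0002169", "HP:0000568", 0.92),
]

THRESHOLD = 0.7

mappings = [
    {
        "subject_id": mp,
        "predicate_id": "skos:relatedMatch",
        "object_id": hp,
        "match_type": "SemanticSimilarity",
        "semantic_similarity_score": score,
    }
    for mp, hp, score in pairwise_scores
    if score >= THRESHOLD
]

for row in mappings:
    print(row)  # two rows survive the cutoff
```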

I don't really mind about the similarity score range. I find it useful to define similarity between 0 and 1, but if that's too tight a constraint, we can drop it; I did not know that some scores were on a different scale. Let's just drop that constraint then.

Other than that, Seb, does this make sense? Is there any other important metadata for a single term-term similarity measure we would need to embed? I know you deal mostly with termset<->termset similarity, but maybe you can think of something analogous.

julesjacobsen commented 3 years ago

> BTW: semantic similarity scores are not always between 0 and 1 - if I remember correctly, even phenodigm fails in this unless you apply tricks

Good point, I forgot we apply tricks to fit it to a 0-1 range. We scale the phenix scores too, but only for the final profile match hits. The single term-term scores are unbounded.
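
(For context, one generic flavour of such a "trick" is min-max scaling of a raw, unbounded score against the best score achievable in the dataset. This is only an illustration; it is not the actual Phenodigm or PhenIX scaling.)

```python
def scale_to_unit(raw: float, best_possible: float, worst_possible: float = 0.0) -> float:
    """Min-max scale an unbounded similarity score into [0, 1].

    Illustrative only: real pipelines (e.g. Phenodigm) combine and scale
    scores in their own specific ways.
    """
    if best_possible == worst_possible:
        return 0.0
    scaled = (raw - worst_possible) / (best_possible - worst_possible)
    return max(0.0, min(1.0, scaled))

# Made-up example: a raw IC-based score of 8.83 against a best achievable 12.0.
print(scale_to_unit(8.83, best_possible=12.0))  # ~0.74
```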