monarch-initiative / semsimian

Simple rust implementation of semantic similarity
BSD 3-Clause "New" or "Revised" License

Explain why similarity values are the way they are #82

Closed matentzn closed 1 year ago

matentzn commented 1 year ago

Can we get a high level description of how the jaccard score is computed?

We note two oddities to start with:

  1. The distribution of semantic similarity values is roughly "normal" around a strange mean of ~19%, while in my opinion the mean similarity should be much closer to 0, like 2% (accounting for the case that every class has at least one shared parent). Maybe this is correct, but we need to understand it.

[image: histogram of jaccard_similarity values]

  2. The jaccard similarity is really low - much too low IMO, and also not very discriminatory:

Score stats:

                    count      mean      std       min  25%       50%     75%       max
jaccard_similarity  231537059  0.189693  0.076021  0    0.145161  0.1875  0.234043  0.69697

Examples:

HP:0011015          MP:0000188          UPHENO:0051766                  13.008877205969117  0.46    NaN     2.4462386463192414
HP:0001943          MP:0000189          UPHENO:0051766                  13.008877205969117  0.38333333333333336 NaN     2.2331001460499174
HP:0004380          MP:0006116          UPHENO:0067809                  15.686949111081756  0.45901639344262296 NaN     2.6833871888131813

All of these should:

  1. Be much closer to 1 (i.e. 0.8, 0.9)
  2. Be further apart from the less similar ones.

Before we even decide what to do here (we can always normalise after the fact):

Can we get a list of all features that went into the Jaccard computation? Apart from subclass? And also, where can I read about the jaccard formula implemented in semsimian?

cc @souzadevinicius

matentzn commented 1 year ago

@souzadevinicius what was the Y axis of the plot above?

caufieldjh commented 1 year ago

For context, can you let us know what exact command was run and on what version of the ontology?

caufieldjh commented 1 year ago

I agree that those values look strange and I'd also expect them to be higher. For these examples, what is the stock OAK Jaccard (i.e., without semsimian)?

matentzn commented 1 year ago

@souzadevinicius can you compute the semantic similarity values with plain OAK (without Semsimian) for just these example terms?

cmungall commented 1 year ago

I welcome all exploratory analysis like this but it should all be in jupyter notebooks in GitHub so everything is transparent and reproducible!

cmungall commented 1 year ago

Another plot that may help is union size on one axis and intersection size on the other.

souzadevinicius commented 1 year ago

> @souzadevinicius what was the Y axis of the plot above?

The y-axis is the count of jaccard_similarity values.

1e7 is scientific notation for 10,000,000.

Here is an example without scientific notation, because the counts are smaller:

[image: histogram of jaccard_similarity values above the 0.7 threshold]

We have fewer values here because this analysis was generated with a 0.7 threshold.

matentzn commented 1 year ago

Can you share a link to the jupyter notebook generating this plot? The first plot looks like there is no single value above 0.7, the second suggests there are thousands. Are you using a different ontology for the second? The threshold should only cut off, not change the similarity scores.
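As a small illustration of the last point, here is a sketch with made-up scores (not taken from the profiles discussed in this thread). Applying a cutoff only drops rows; the scores that survive are unchanged. The `>=` comparison is an assumption about how the cutoff is applied.

```python
# Hypothetical scores for illustration only (not real semsimian output).
scores = [0.12, 0.19, 0.23, 0.46, 0.71, 0.85, 1.0]

THRESHOLD = 0.7  # the cutoff value mentioned in the thread

# Filtering keeps a subset of the original values; no score is rescaled.
kept = [s for s in scores if s >= THRESHOLD]

print(kept)
print(len(scores), "->", len(kept))
```

So two histograms of the same comparison, one with and one without a threshold, should agree on every bin above the cutoff.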

souzadevinicius commented 1 year ago

> Can you share a link to the jupyter notebook generating this plot? The first plot looks like there is no single value above 0.7, the second suggests there are thousands. Are you using a different ontology for the second? The threshold should only cut off, not change the similarity scores.

Sorry, @matentzn. Since the screenshots have no titles and I didn't explain them, there was some confusion. The first screenshot was generated from the HPxMP comparison. The second was generated from HPxHP.

As a reminder, here are some numbers:

HPxMP

                    count      mean      std       min  25%       50%       75%       max
jaccard_similarity  231537059  0.189693  0.076021  0    0.145161  0.187500  0.234043  0.696970

HPxHP

                    count      mean      std       min  25%       50%       75%       max
jaccard_similarity  284037008  0.235657  0.116312  0    0.168831  0.229167  0.296296  1

HPxHP (threshold 0.7)

                    count    mean      std       min       25%       50%       75%       max
jaccard_similarity  1186052  0.799334  0.073693  0.700637  0.736842  0.782609  0.851852  1

BTW, here is the link to the jupyter notebook script:

https://github.com/monarch-initiative/monarch-semantic-similarity-profiles/blob/main/semsim_analysis.ipynb

caufieldjh commented 1 year ago

After some discussion with @hrshdhgd I am increasingly willing to believe these are, in fact, correct (or at least not too far from what we'd expect).

Ancestors of HP:0000118:

id      label
BFO:0000001     entity
BFO:0000001     entity
BFO:0000002     continuant
BFO:0000002     continuant
BFO:0000004     independent continuant
BFO:0000004     independent continuant
BFO:0000020     specifically dependent continuant
BFO:0000040     material entity
BFO:0000040     material entity
CARO:0000000    anatomical entity
CARO:0000000    anatomical entity
CARO:0000003    anatomical structure
CARO:0000003    connected anatomical structure
CARO:0000003    connected anatomical structure
CARO:0000006    material anatomical entity
CARO:0000006    material anatomical entity
CARO:0000012    multicellular organism
CARO:0001008    gross anatomical part
CARO:0001010    organism or virus or viroid
CARO:0010000    multicellular anatomical structure
CARO:0010004    cellular organism
CARO:0030000    biological entity
HP:0000001      All
HP:0000001      All (HPO)
HP:0000118      Phenotypic abnormality
HP:0000118      Phenotypic abnormality (HPO)
PATO:0000001    quality
UBERON:0000061  anatomical structure
UBERON:0000465  material anatomical entity
UBERON:0000468  multicellular organism
UBERON:0001062  anatomical entity
UBERON:0010000  multicellular anatomical structure
UPHENO:0001001  Phenotype
UPHENO:0001001  phenotype
UPHENO:0001002  Phenotypic abnormality
UPHENO:0001003  phenotype by ontology source
UPHENO:0001005  abnormal phenotype by ontology source

My naïve assumption is that HP:0000118 and MP:0000001 would be very similar by Jaccard, but their calculated intersection may be smaller than expected in cases where, for example, the HP term's path to the root it shares with MP passes through additional terms such as UPHENO:0001005.
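The effect described above can be sketched with toy sets. The identifiers below are placeholders, not real ontology terms, and the set sizes are arbitrary; the point is only that ancestors present on one side but not the other enlarge the union and pull the score down, even when the shared part is unchanged.

```python
from fractions import Fraction


def jaccard(a: set, b: set) -> Fraction:
    """|a ∩ b| / |a ∪ b| as an exact fraction (0 if both sets are empty)."""
    union = a | b
    if not union:
        return Fraction(0)
    return Fraction(len(a & b), len(union))


# Toy ancestor sets; the identifiers are placeholders, not real ontology terms.
shared = {f"SHARED:{i}" for i in range(6)}
term_a = shared | {"A:self"}
term_b = shared | {"B:self"}

print(jaccard(term_a, term_b))  # 6 shared out of 8 in the union

# Give one side ten extra ancestors that the other side lacks,
# e.g. additional upper-level or cross-ontology classes.
extra = {f"EXTRA:{i}" for i in range(10)}
print(jaccard(term_a | extra, term_b))  # same 6 shared, union grows to 18
```

A long ancestor list like the one above for HP:0000118 therefore lowers the score against any term that does not carry the same upper-level classes.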

matentzn commented 1 year ago

😱

matentzn commented 1 year ago

I opened this ticket in response:

https://github.com/monarch-initiative/monarch-semantic-similarity-profiles/issues/7

Can we still provide an answer to:

> Can we get a list of all features that went into the Jaccard computation? Apart from subclass? And also, where can I read about the jaccard formula implemented in semsimian?

In other words: what do you mean by "ancestor"?

caufieldjh commented 1 year ago

An ancestor is a member of the transitive closure over all is-a and part-of relationships. This is by necessity part of the Jaccard calculation, and in Semsimian that's here: https://github.com/monarch-initiative/semsimian/blob/f9d53f0295f7bdd033b43bdebfab9a401ae1e95e/src/similarity.rs#L29-L35 (this operates on sets of strings; there's an equivalent function below it for sets of integers and I suspect that's faster)

So that's the number of members shared by the two sets over the number of members in their union. Jaccard similarity, yes indeed.
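A minimal Python sketch of that calculation follows. It is not a port of the Rust function linked above; the function names, the integer encoding helper, and the choice to return 0.0 for two empty sets are assumptions made for illustration. The two input sets are small illustrative subsets built from IDs that appear in this thread, not actual closures.

```python
def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    union_size = len(a | b)
    # Returning 0.0 for two empty sets is a choice made for this sketch.
    return len(a & b) / union_size if union_size else 0.0


def encode(terms, index):
    """Map string IDs to small integers, reusing `index` across calls."""
    return {index.setdefault(term, len(index)) for term in terms}


# Illustrative subsets only, not the real ancestor closures.
hp_side = {"BFO:0000001", "BFO:0000002", "PATO:0000001", "UPHENO:0001001", "HP:0000118"}
mp_side = {"BFO:0000001", "BFO:0000002", "PATO:0000001", "UPHENO:0001001", "MP:0000001"}

score_str = jaccard(hp_side, mp_side)

# The same calculation on integer-encoded sets gives the same score.
index = {}
score_int = jaccard(encode(hp_side, index), encode(mp_side, index))

print(score_str, score_int)
```

Encoding IDs as integers does not change the result; it only changes what is being hashed and compared.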

But where do those sets come from?

Given a pair of entities and a set of predicates, they're looked up from a predefined closure table: https://github.com/monarch-initiative/semsimian/blob/f9d53f0295f7bdd033b43bdebfab9a401ae1e95e/src/similarity.rs#L10-L27

But where does that closure table come from?

It's loaded from the sqlite representation, as that already contains closures and we don't want to have to instantiate those from scratch every time we calc semsim.
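The lookup described above can be sketched with an in-memory SQLite database. The table name `entailed_edge` and its subject/predicate/object columns are an assumption about the layout, not something stated in this thread, and the rows are toy data with hypothetical `EX:` identifiers. The sketch shows that the ancestor sets, and therefore the Jaccard score, depend on which predicates are passed in.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed layout: one row per (subject, predicate, ancestor) in the closure.
conn.execute("CREATE TABLE entailed_edge (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany(
    "INSERT INTO entailed_edge VALUES (?, ?, ?)",
    [
        ("EX:a", "rdfs:subClassOf", "EX:root"),
        ("EX:a", "rdfs:subClassOf", "EX:mid"),
        ("EX:a", "BFO:0000050", "EX:whole"),  # BFO:0000050 is part-of
        ("EX:b", "rdfs:subClassOf", "EX:root"),
        ("EX:b", "rdfs:subClassOf", "EX:mid"),
        ("EX:b", "rdfs:subClassOf", "EX:other"),
    ],
)


def ancestors(conn, term, predicates):
    """Ancestors of `term` in the closure table, restricted to `predicates`."""
    placeholders = ",".join("?" for _ in predicates)
    sql = (
        "SELECT object FROM entailed_edge "
        f"WHERE subject = ? AND predicate IN ({placeholders})"
    )
    return {row[0] for row in conn.execute(sql, [term, *predicates])}


def jaccard(a, b):
    union_size = len(a | b)
    return len(a & b) / union_size if union_size else 0.0


is_a = ["rdfs:subClassOf"]
is_a_part_of = ["rdfs:subClassOf", "BFO:0000050"]

isa_score = jaccard(ancestors(conn, "EX:a", is_a), ancestors(conn, "EX:b", is_a))
both_score = jaccard(
    ancestors(conn, "EX:a", is_a_part_of), ancestors(conn, "EX:b", is_a_part_of)
)
print(isa_score, both_score)
```

In this toy data, adding part-of brings in an ancestor that only one term has, so the score drops from 2/3 to 1/2.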

matentzn commented 1 year ago

That's great, thanks.

  1. What if we want to use relationships other than part of and is a? How much of an architectural change would that be?
  2. So basically what you are saying is: somehow semsql thinks that HP:0000118 is a part of UBERON:anatomical entity?

caufieldjh commented 1 year ago

> 1. What if we want to use relationships other than part of and is a? How much of an architectural change would that be?

No change at all! Semsimian just updates the closure map based on whatever predicate types are required.

> 2. So basically what you are saying is: somehow semsql thinks that HP:0000118 is a part of UBERON:anatomical entity?

Based on https://github.com/monarch-initiative/monarch-semantic-similarity-profiles/issues/7 I think the above results are from an older PHENIO version. I no longer see any path between HP:0000118 and UBERON:0001062 in PHENIO. There is a shared ancestor between MP:0000001 and UBERON:0001062, though:

$ runoak -i sqlite:obo:phenio paths MP:0000001 @ UBERON:0001062
subject subject_label   object  object_label    path    path_label
MP:0000001      mammalian phenotype (MPO)       UBERON:0001062  anatomical entity       ['MP:0000001', 'UPHENO:0001003', 'UPHENO:0001001', 'PATO:0000001', 'BFO:0000020', 'BFO:0000002', 'BFO:0000004', 'UBERON:0001062']    mammalian phenotype (MPO)|phenotype by ontology source|Phenotype|quality|specifically dependent continuant|continuant|independent continuant|anatomical entity

matentzn commented 1 year ago

> $ runoak -i sqlite:obo:phenio paths MP:0000001 @ UBERON:0001062
> subject subject_label   object  object_label    path    path_label
> MP:0000001      mammalian phenotype (MPO)       UBERON:0001062  anatomical entity       ['MP:0000001', 'UPHENO:0001003', 'UPHENO:0001001', 'PATO:0000001', 'BFO:0000020', 'BFO:0000002', 'BFO:0000004', 'UBERON:0001062']    mammalian phenotype (MPO)|phenotype by ontology source|Phenotype|quality|specifically dependent continuant|continuant|independent continuant|anatomical entity

This makes no sense. It should look something like:

This part of the path is wrong:

BFO:0000002 (up until BFO:0000002 everything is correct) --> 'BFO:0000004' (BFO:4 is a subclass of BFO:2 - not the other way around).

'BFO:0000004' --> 'UBERON:0001062' ---> There is absolutely no connection between the two that would make sense, other than perhaps that UBERON:0001062 is a subclass of BFO:0000004.

I am assuming here that the path in your output was more or less in the right order.

caufieldjh commented 1 year ago

> BFO:0000002 (up until BFO:0000002 everything is correct) --> 'BFO:0000004' (BFO:4 is a subclass of BFO:2 - not the other way around).

Yes - BFO:0000002 is the common ancestor, so the only way you can get to UBERON:0001062 is by going back down the hierarchy. So if UBERON:0001062 is intended to be a subclass of BFO:0000004 (and that is how UBERON defines it) then this makes sense to me except for the fact that HP doesn't do the same - there's not even a path between HP:0000118 and BFO:0000002.
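The difference between "there is a path" and "is an ancestor" can be sketched with the terms from the path printed above. The child-to-parent directions are an assumption based on this discussion (everything up to BFO:0000002 points upward, and BFO:0000004 and UBERON:0001062 sit below it); the real predicates on those edges are not shown in the output.

```python
# Child -> parent edges, reconstructed from the path in the runoak output.
# Edge directions are an assumption based on the discussion in this thread.
parents = {
    "MP:0000001": ["UPHENO:0001003"],
    "UPHENO:0001003": ["UPHENO:0001001"],
    "UPHENO:0001001": ["PATO:0000001"],
    "PATO:0000001": ["BFO:0000020"],
    "BFO:0000020": ["BFO:0000002"],
    "BFO:0000004": ["BFO:0000002"],
    "UBERON:0001062": ["BFO:0000004"],
}


def ancestors(term):
    """All terms reachable by following child -> parent edges upward."""
    seen = set()
    stack = [term]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


mp_ancestors = ancestors("MP:0000001")
ae_ancestors = ancestors("UBERON:0001062")

# Is anatomical entity an ancestor of the MP root under these edges?
print("UBERON:0001062" in mp_ancestors)
# What the two terms actually share.
print(sorted(mp_ancestors & ae_ancestors))
```

Under these assumed directions the two terms share only BFO:0000002, and neither is an ancestor of the other; the path reported by `paths` climbs to that shared class and then descends.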

matentzn commented 1 year ago

We don't really care, I guess, that there is a shared ancestor between MP:PA and UBERON:AE - every two entities in PHENIO (or most) will have BFO:entity as a shared ancestor.

My question is more: how is it possible that HP:0000118 (or MP) has UBERON:AE as an ancestor? This is what your above comment seems to suggest: https://github.com/monarch-initiative/semsimian/issues/82#issuecomment-1658950359

Even if it is an old version, something is going very wrong if that is the case.

souzadevinicius commented 1 year ago

> 1. What if we want to use relationships other than part of and is a? How much of an architectural change would that be?
>
> No change at all! Semsimian just updates the closure map based on whatever predicate types are required.
>
> 2. So basically what you are saying is: somehow semsql thinks that HP:0000118 is a part of UBERON:anatomical entity?
>
> Based on monarch-initiative/monarch-semantic-similarity-profiles#7 I think the above results are from an older PHENIO version. I no longer see any path between HP:0000118 and UBERON:0001062 in PHENIO. There is a shared ancestor between MP:0000001 and UBERON:0001062, though:
>
> $ runoak -i sqlite:obo:phenio paths MP:0000001 @ UBERON:0001062
> subject subject_label   object  object_label    path    path_label
> MP:0000001      mammalian phenotype (MPO)       UBERON:0001062  anatomical entity       ['MP:0000001', 'UPHENO:0001003', 'UPHENO:0001001', 'PATO:0000001', 'BFO:0000020', 'BFO:0000002', 'BFO:0000004', 'UBERON:0001062']    mammalian phenotype (MPO)|phenotype by ontology source|Phenotype|quality|specifically dependent continuant|continuant|independent continuant|anatomical entity

In fact, these results were produced using phenio version 2023-05-01.

cmungall commented 1 year ago

Remember that paths operates over all predicates by default, just like any OAK graph command. There are many spurious paths over all predicates!
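To illustrate why an unrestricted traversal can connect terms that are unrelated under is-a alone, here is a toy sketch. It is not how OAK implements `paths`; the graph, the `EX:` identifiers, and the `EX:affects` predicate are all hypothetical.

```python
from collections import deque

# (subject, predicate, object) edges in a toy graph; all identifiers are hypothetical.
edges = [
    ("EX:phenotype", "rdfs:subClassOf", "EX:quality"),
    ("EX:quality", "rdfs:subClassOf", "EX:entity"),
    ("EX:phenotype", "EX:affects", "EX:organ"),
    ("EX:organ", "rdfs:subClassOf", "EX:anatomical_entity"),
]


def reachable(start, predicates=None):
    """Terms reachable from `start`; predicates=None follows every edge type."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for subj, pred, obj in edges:
            allowed = predicates is None or pred in predicates
            if subj == node and allowed and obj not in seen:
                seen.add(obj)
                queue.append(obj)
    return seen


all_preds = reachable("EX:phenotype")
is_a_only = reachable("EX:phenotype", {"rdfs:subClassOf"})

print(sorted(all_preds))
print(sorted(is_a_only))
```

With every predicate followed, the phenotype term reaches the anatomical classes through the non-is-a edge; restricted to is-a, it does not.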

Some recommendations:

And yes, as Nico says, there are always trivial common ancestors, even when following only is-a.

phenio viz MP:0000001 HP:0000118 UBERON:0001062

gives:

[image: graph visualization of MP:0000001, HP:0000118 and UBERON:0001062]

this is on 2023-07-11

This looks like what I would expect

matentzn commented 1 year ago

@souzadevinicius can you make sure that you can comfortably run all commands mentioned in this issue, and understand the parameters hinted at by @cmungall in his previous post?

I consider this issue closed as stated, but it is very interesting so I want to make sure @souzadevinicius you understand everything that was discussed here.

justaddcoffee commented 1 year ago

Closed per @matentzn's comment just above.