Closed matentzn closed 1 year ago
@souzadevinicius what was the Y axis of the plot above?
For context, can you let us know what exact command was run and on what version of the ontology?
I agree that those values look strange and I'd also expect them to be higher. For these examples, what is the stock OAK Jaccard (i.e., without semsimian)?
@souzadevinicius can you compute the SemSim values from normal non SemSimian oak with just these example terms?
I welcome all exploratory analysis like this but it should all be in jupyter notebooks in GitHub so everything is transparent and reproducible!
Another plot that may help elucidate this is union on one axis and intersection on the other.
> @souzadevinicius what was the Y axis of the plot above?
The y-axis represents the count of jaccard_similarity values.
1e7 is scientific notation for 10,000,000.
Here we have an example without scientific notation, because the counts are smaller.
We have fewer values here because this analysis was generated with a 0.7 threshold.
Can you share a link to the jupyter notebook generating this plot? The first plot looks like there is no single value above 0.7, the second suggests there are thousands. Are you using a different ontology for the second? The threshold should only cut off, not change the similarity scores.
Sorry, @matentzn. Because the screenshots have no titles and I didn't explain them, there was some confusion. The first screenshot was generated from the HPxMP comparison; the last one from HPxHP.
As a reminder, here are some numbers:
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| jaccard_similarity | 231537059 | 0.189693 | 0.076021 | 0 | 0.145161 | 0.187500 | 0.234043 | 0.696970 |

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| jaccard_similarity | 284037008 | 0.235657 | 0.116312 | 0 | 0.168831 | 0.229167 | 0.296296 | 1 |

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| jaccard_similarity | 1186052 | 0.799334 | 0.073693 | 0.700637 | 0.736842 | 0.782609 | 0.851852 | 1 |
BTW, here is the link to the jupyter notebook script:
After some discussion with @hrshdhgd I am increasingly willing to believe these are, in fact, correct (or at least not too far from what we'd expect).
Ancestors of MP:0000001:
id label
BFO:0000001 entity
BFO:0000001 entity
BFO:0000002 continuant
BFO:0000002 continuant
BFO:0000020 specifically dependent continuant
MP:0000001 mammalian phenotype (MPO)
PATO:0000001 quality
UPHENO:0001001 Phenotype
UPHENO:0001001 phenotype
UPHENO:0001003 phenotype by ontology source
Ancestors of HP:0000118:
id label
BFO:0000001 entity
BFO:0000001 entity
BFO:0000002 continuant
BFO:0000002 continuant
BFO:0000004 independent continuant
BFO:0000004 independent continuant
BFO:0000020 specifically dependent continuant
BFO:0000040 material entity
BFO:0000040 material entity
CARO:0000000 anatomical entity
CARO:0000000 anatomical entity
CARO:0000003 anatomical structure
CARO:0000003 connected anatomical structure
CARO:0000003 connected anatomical structure
CARO:0000006 material anatomical entity
CARO:0000006 material anatomical entity
CARO:0000012 multicellular organism
CARO:0001008 gross anatomical part
CARO:0001010 organism or virus or viroid
CARO:0010000 multicellular anatomical structure
CARO:0010004 cellular organism
CARO:0030000 biological entity
HP:0000001 All
HP:0000001 All (HPO)
HP:0000118 Phenotypic abnormality
HP:0000118 Phenotypic abnormality (HPO)
PATO:0000001 quality
UBERON:0000061 anatomical structure
UBERON:0000465 material anatomical entity
UBERON:0000468 multicellular organism
UBERON:0001062 anatomical entity
UBERON:0010000 multicellular anatomical structure
UPHENO:0001001 Phenotype
UPHENO:0001001 phenotype
UPHENO:0001002 Phenotypic abnormality
UPHENO:0001003 phenotype by ontology source
UPHENO:0001005 abnormal phenotype by ontology source
My naïve assumption is that HP:0000118 and MP:0000001 would have very similar jaccard values, but their calculated intersection may be lower than expected due to cases like the HP term's path to the same root as MP involving additional terms like UPHENO:0001005.
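To make this concrete, here is a minimal sketch that computes the Jaccard similarity directly from the two ancestor listings above. It assumes the first listing is the ancestor set of MP:0000001, collapses duplicate rows to unique IDs, and uses plain Python sets rather than a call into OAK or Semsimian.

```python
# Unique ancestor IDs copied from the two listings above.
# Assumption: the first listing belongs to MP:0000001.
mp_ancestors = {
    "BFO:0000001", "BFO:0000002", "BFO:0000020", "MP:0000001",
    "PATO:0000001", "UPHENO:0001001", "UPHENO:0001003",
}
hp_ancestors = {
    "BFO:0000001", "BFO:0000002", "BFO:0000004", "BFO:0000020",
    "BFO:0000040", "CARO:0000000", "CARO:0000003", "CARO:0000006",
    "CARO:0000012", "CARO:0001008", "CARO:0001010", "CARO:0010000",
    "CARO:0010004", "CARO:0030000", "HP:0000001", "HP:0000118",
    "PATO:0000001", "UBERON:0000061", "UBERON:0000465", "UBERON:0000468",
    "UBERON:0001062", "UBERON:0010000", "UPHENO:0001001", "UPHENO:0001002",
    "UPHENO:0001003", "UPHENO:0001005",
}


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


shared = mp_ancestors & hp_ancestors
union = mp_ancestors | hp_ancestors
score = jaccard(mp_ancestors, hp_ancestors)

print(sorted(shared))
print(len(shared), len(union), round(score, 3))
```

With these listings the intersection has 6 terms and the union has 27, so the score is about 0.22. The CARO and UBERON terms on the HP side enlarge the union rather than shrink the intersection.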
😱
I opened this ticket in response:
https://github.com/monarch-initiative/monarch-semantic-similarity-profiles/issues/7
Can we still provide an answer to:
> Can we get a list of all features that went into the Jaccard computation? Apart from subclass? And also, where can I read about the jaccard formula implemented in semsimian?
In other words: what do you mean by "ancestor"?
An ancestor is a member of the transitive closure of all is-a and part-of relationships. This is by necessity part of the Jaccard calculation, and in Semsimian that's here: https://github.com/monarch-initiative/semsimian/blob/f9d53f0295f7bdd033b43bdebfab9a401ae1e95e/src/similarity.rs#L29-L35 (this operates on sets of strings; there's an equivalent function below it for sets of integers, and I suspect that's faster)
So that's the number of members in the intersection of the two sets over the number of members in their union. Jaccard similarity, yes indeed.
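For reference, this is the standard textbook definition, with $A$ and $B$ standing for the ancestor sets of the two terms being compared (the symbols are chosen here for illustration and are not taken from the Semsimian source):

$$
J(A, B) = \frac{|A \cap B|}{|A \cup B|}
$$

The score is 1 when the two sets are identical and 0 when they share no members.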
But where do those sets come from?
Given a pair of entities and a set of predicates, they're looked up from a predefined closure table: https://github.com/monarch-initiative/semsimian/blob/f9d53f0295f7bdd033b43bdebfab9a401ae1e95e/src/similarity.rs#L10-L27
But where does that closure table come from?
It's loaded from the sqlite representation, as that already contains closures and we don't want to have to instantiate those from scratch every time we calc semsim.
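The lookup described above can be sketched roughly as follows. This is a toy illustration, not Semsimian's code: the `EX:` terms, the row layout, and the function names are all hypothetical, and the real closure table comes from the sqlite build rather than a Python list.

```python
from collections import defaultdict

# Toy precomputed closure rows: (subject, predicate, ancestor).
# Hypothetical data; the EX: terms do not exist in any ontology.
ROWS = [
    ("EX:mouse_term", "rdfs:subClassOf", "EX:by_source"),
    ("EX:mouse_term", "rdfs:subClassOf", "EX:phenotype"),
    ("EX:human_term", "rdfs:subClassOf", "EX:phenotype"),
    ("EX:human_term", "rdfs:subClassOf", "EX:abnormality"),
    ("EX:human_term", "BFO:0000050", "EX:anatomy"),  # part-of row
]


def build_closure_map(rows, predicates):
    """Keep only rows whose predicate is in the requested set."""
    closure = defaultdict(set)
    for subject, predicate, ancestor in rows:
        if predicate in predicates:
            closure[subject].add(ancestor)
    return closure


def jaccard_from_closure(closure, term_a, term_b):
    """Look up both ancestor sets, then compute intersection over union."""
    set_a = closure.get(term_a, set())
    set_b = closure.get(term_b, set())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


is_a_only = build_closure_map(ROWS, {"rdfs:subClassOf"})
is_a_part_of = build_closure_map(ROWS, {"rdfs:subClassOf", "BFO:0000050"})

score_is_a = jaccard_from_closure(is_a_only, "EX:mouse_term", "EX:human_term")
score_both = jaccard_from_closure(is_a_part_of, "EX:mouse_term", "EX:human_term")
print(score_is_a, score_both)
```

In this toy data, adding the part-of predicate gives one term an extra ancestor, which grows the union and lowers the score from 1/3 to 1/4. The choice of predicates therefore changes the result even though the formula stays the same.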
That's great, thanks.
- What if we want to use relationships other than part of and is a? How much of an architectural change would that be?
No change at all! Semsimian just updates the closure map based on whatever predicate types are required.
- So basically what you are saying is: somehow semsql thinks that HPO:0000118 is a part of UBERON:anatomical entity?
Based on https://github.com/monarch-initiative/monarch-semantic-similarity-profiles/issues/7 I think the above results are from an older PHENIO version. I no longer see any path between HPO:0000118 and UBERON:0001062 in PHENIO. There is a shared ancestor between MP:0000001 and UBERON:0001062, though:
$ runoak -i sqlite:obo:phenio paths MP:0000001 @ UBERON:0001062
subject subject_label object object_label path path_label
MP:0000001 mammalian phenotype (MPO) UBERON:0001062 anatomical entity ['MP:0000001', 'UPHENO:0001003', 'UPHENO:0001001', 'PATO:0000001', 'BFO:0000020', 'BFO:0000002', 'BFO:0000004', 'UBERON:0001062'] mammalian phenotype (MPO)|phenotype by ontology source|Phenotype|quality|specifically dependent continuant|continuant|independent continuant|anatomical entity
This makes no sense; it should look something like:
This part of the path is wrong:
BFO:0000002 (up until BFO:0000002 everything is correct) --> 'BFO:0000004' (BFO:4 is a subclass of BFO:2 - not the other way around).
'BFO:0000004' --> 'UBERON:0001062' ---> There is absolutely no connection between the two that would make sense, other than perhaps that UBERON:0001062 is a subclass of BFO:0000004.
I am assuming here that the path in your output was sort of in the right order.
> BFO:0000002 (up until BFO:0000002 everything is correct) --> 'BFO:0000004' (BFO:4 is a subclass of BFO:2 - not the other way around).
Yes - BFO:0000002 is the common ancestor, so the only way you can get to UBERON:0001062 is by going back down the hierarchy. So if UBERON:0001062 is intended to be a subclass of BFO:0000004 (and that is how UBERON defines it) then this makes sense to me except for the fact that HP doesn't do the same - there's not even a path between HPO:0000118 and BFO:0000002.
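The difference between "there is a path" and "is an ancestor" can be sketched as below. The child-to-parent edges are taken from the path reported above, with the directions as interpreted in these comments; this is an assumption for illustration, not a dump of PHENIO, and the traversal functions are hypothetical rather than OAK's implementation.

```python
from collections import deque

# (child, parent) is-a edges, directions as interpreted in this thread.
IS_A = [
    ("MP:0000001", "UPHENO:0001003"),
    ("UPHENO:0001003", "UPHENO:0001001"),
    ("UPHENO:0001001", "PATO:0000001"),
    ("PATO:0000001", "BFO:0000020"),
    ("BFO:0000020", "BFO:0000002"),
    ("BFO:0000004", "BFO:0000002"),
    ("UBERON:0001062", "BFO:0000004"),
]


def ancestors(term, edges):
    """Follow edges upward only (child to parent)."""
    parents = {}
    for child, parent in edges:
        parents.setdefault(child, set()).add(parent)
    seen, stack = set(), [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def undirected_path(start, goal, edges):
    """Breadth-first search that ignores edge direction."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for nxt in neighbours.get(node, ()):
            if nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    return None


path = undirected_path("MP:0000001", "UBERON:0001062", IS_A)
mp_ancestors = ancestors("MP:0000001", IS_A)
print(path)
print("UBERON:0001062" in mp_ancestors)
```

Under these assumptions the direction-agnostic search reproduces the reported path (up to BFO:0000002, then back down), while the upward-only ancestor set of MP:0000001 does not contain UBERON:0001062.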
We don't really care, I guess, that there is a shared ancestor between MP:PA and UBERON:AE: every two entities in PHENIO (or most of them) will have BFO:entity as a shared ancestor.
My question is more: how is it possible that HP:0000118 (or MP) has UBERON:AE as an ancestor? This is what your above comment seems to suggest: https://github.com/monarch-initiative/semsimian/issues/82#issuecomment-1658950359
Even if it is an old version, something is going very wrong if that is the case.
In fact, these results were produced using phenio version 2023-05-01.
Remember that `paths` operates over all predicates by default, just like any oak graph command. There are many spurious paths over all predicates!
Some recommendations:
- `--include-predicates` to show edge labels
- `--predicates` explicitly
- `--viz` to visualize the path

And yes, as Nico says, there are always trivial common ancestors, even when following only is-a.
phenio viz MP:0000001 HP:0000118 UBERON:0001062
gives:
this is on 2023-07-11
This looks like what I would expect
@souzadevinicius can you make sure that you can comfortably run all commands mentioned in this issue, and understand the parameters hinted at by @cmungall in his previous post?
I consider this issue closed as stated, but it is very interesting so I want to make sure @souzadevinicius you understand everything that was discussed here.
closed per @matentzn's comment just above
Can we get a high level description of how the jaccard score is computed?
We note two oddities to start with:
Score stats:
Examples:
All of these should:
Before we even decide what to do here (we can always normalise after the fact):
Can we get a list of all features that went into the Jaccard computation? Apart from subclass? And also, where can I read about the jaccard formula implemented in semsimian?
cc @souzadevinicius