Closed irbraun closed 2 months ago
I wonder if the set being used to compute Jaccard is missing HP:0001250
itself on one side and is only including the parents? (like accidentally using a non-reflexive closure in one spot)
What do you think @hrshdhgd @justaddcoffee @caufieldjh?
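To make the non-reflexive closure hypothesis concrete, here is a minimal sketch on a toy three-term hierarchy. The term names and the `toy_parents`, `closure`, and `jaccard` helpers are invented for illustration; this is not semsimian's API or the real HPO graph. It only shows that mixing a reflexive closure on one side with a non-reflexive closure on the other pushes the Jaccard score of a term against itself below 1.0.

```rust
use std::collections::{HashMap, HashSet};

/// Toy hierarchy (child -> parents). Hypothetical names, not real HPO terms.
fn toy_parents() -> HashMap<&'static str, Vec<&'static str>> {
    HashMap::from([
        ("seizure", vec!["nervous_system_abnormality"]),
        ("nervous_system_abnormality", vec!["phenotypic_abnormality"]),
    ])
}

/// Collect all ancestors of `term`. If `reflexive` is true, the term itself
/// is included in the returned set.
fn closure(parents: &HashMap<&str, Vec<&str>>, term: &str, reflexive: bool) -> HashSet<String> {
    let mut seen = HashSet::new();
    let mut stack: Vec<&str> = vec![term];
    while let Some(t) = stack.pop() {
        if let Some(ps) = parents.get(t) {
            for p in ps {
                if seen.insert(p.to_string()) {
                    stack.push(*p);
                }
            }
        }
    }
    if reflexive {
        seen.insert(term.to_string());
    }
    seen
}

/// Jaccard similarity: |A ∩ B| / |A ∪ B| (0.0 when both sets are empty).
fn jaccard(a: &HashSet<String>, b: &HashSet<String>) -> f64 {
    let union = a.union(b).count() as f64;
    if union == 0.0 {
        return 0.0;
    }
    a.intersection(b).count() as f64 / union
}

fn main() {
    let parents = toy_parents();
    let reflexive = closure(&parents, "seizure", true);
    let non_reflexive = closure(&parents, "seizure", false);
    // Same closure type on both sides: a term matches itself perfectly.
    println!("consistent: {}", jaccard(&reflexive, &reflexive));
    // Mixed closure types: the term is missing on one side, so the score drops.
    println!("mixed:      {}", jaccard(&reflexive, &non_reflexive));
}
```

With this toy graph the mixed case yields 2/3, because the term itself appears in only one of the two ancestor sets.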
@justaddcoffee and I are discussing this now, and we think it's a semsimian bug.
See https://github.com/monarch-initiative/semsimian/issues/136
We ran a test here with the same parameters as above in semsimian
directly and this is what we are seeing:
Input:
// Excerpt from a test: `db_path`, `db`, and `predicates` are set up earlier and omitted here.
db_path.push(".data/oaklib/phenio.db");
let mut rss = RustSemsimian::new(None, predicates, None, db, None);
rss.update_closure_and_ic_map();
// Subject termset: the single term Seizure.
let entity1 = HashSet::from([
    "HP:0001250".to_string(),
]);
// Object termset: 14 terms, including Seizure (HP:0001250) itself.
let entity2 = HashSet::from([
    "HP:0002383".to_string(),
    "HP:0002197".to_string(),
    "HP:0007359".to_string(),
    "HP:0012469".to_string(),
    "HP:0000873".to_string(),
    "HP:0002046".to_string(),
    "HP:0000821".to_string(),
    "HP:0011468".to_string(),
    "HP:0002444".to_string(),
    "HP:0001250".to_string(),
    "HP:0002069".to_string(),
    "HP:0001252".to_string(),
    "HP:0002059".to_string(),
    "HP:0001290".to_string(),
]);
let score_metric = MetricEnum::JaccardSimilarity;
let tsps = rss.termset_pairwise_similarity(&entity1, &entity2, &score_metric);
The output we get is:
&tsps = TermsetPairwiseSimilarity {
subject_termset: [
{
"HP:0001250": {
"id": "HP:0001250",
"label": "Seizure (HPO)",
},
},
],
subject_best_matches: {
"HP:0001250": {
"match_source": "HP:0001250",
"match_source_label": "Seizure (HPO)",
"match_target": "HP:0001250",
"match_target_label": "Seizure (HPO)",
"score": "1",
"score_metric": "jaccard_similarity",
},
},
subject_best_matches_similarity_map: {
"HP:0001250": {
"ancestor_id": "HP:0001250",
"ancestor_information_content": "9.565933710120389",
"ancestor_label": "Seizure (HPO)",
"cosine_similarity": "NaN",
"jaccard_similarity": "1",
"object_id": "HP:0001250",
"phenodigm_score": "3.092884367402116",
"subject_id": "HP:0001250",
},
},
object_termset: [
{
"HP:0001252": {
"id": "HP:0001252",
"label": "Hypotonia (HPO)",
},
},
{
"HP:0002069": {
"id": "HP:0002069",
"label": "Bilateral tonic-clonic seizure (HPO)",
},
},
{
"HP:0001290": {
"id": "HP:0001290",
"label": "Generalized hypotonia (HPO)",
},
},
{
"HP:0002059": {
"id": "HP:0002059",
"label": "Cerebral atrophy (HPO)",
},
},
{
"HP:0002444": {
"id": "HP:0002444",
"label": "Hypothalamic hamartoma (HPO)",
},
},
{
"HP:0002383": {
"id": "HP:0002383",
"label": "Infectious encephalitis (HPO)",
},
},
{
"HP:0001250": {
"id": "HP:0001250",
"label": "Seizure (HPO)",
},
},
{
"HP:0000873": {
"id": "HP:0000873",
"label": "Diabetes insipidus (HPO)",
},
},
{
"HP:0012469": {
"id": "HP:0012469",
"label": "Infantile spasms (HPO)",
},
},
{
"HP:0007359": {
"id": "HP:0007359",
"label": "Focal-onset seizure (HPO)",
},
},
{
"HP:0000821": {
"id": "HP:0000821",
"label": "Hypothyroidism (HPO)",
},
},
{
"HP:0002046": {
"id": "HP:0002046",
"label": "Heat intolerance (HPO)",
},
},
{
"HP:0011468": {
"id": "HP:0011468",
"label": "Facial tics (HPO)",
},
},
{
"HP:0002197": {
"id": "HP:0002197",
"label": "Generalized-onset seizure (HPO)",
},
},
],
object_best_matches: {
"HP:0000821": {
"match_source": "HP:0000821",
"match_source_label": "Hypothyroidism (HPO)",
"match_target": "HP:0001250",
"match_target_label": "Seizure (HPO)",
"score": "0.45098039215686275",
"score_metric": "jaccard_similarity",
},
"HP:0000873": {
"match_source": "HP:0000873",
"match_source_label": "Diabetes insipidus (HPO)",
"match_target": "HP:0001250",
"match_target_label": "Seizure (HPO)",
"score": "0.5",
"score_metric": "jaccard_similarity",
},
"HP:0001250": {
"match_source": "HP:0001250",
"match_source_label": "Seizure (HPO)",
"match_target": "HP:0001250",
"match_target_label": "Seizure (HPO)",
"score": "1",
"score_metric": "jaccard_similarity",
},
"HP:0001252": {
"match_source": "HP:0001252",
"match_source_label": "Hypotonia (HPO)",
"match_target": "HP:0001250",
"match_target_label": "Seizure (HPO)",
"score": "0.5227272727272727",
"score_metric": "jaccard_similarity",
},
"HP:0001290": {
"match_source": "HP:0001290",
"match_source_label": "Generalized hypotonia (HPO)",
"match_target": "HP:0001250",
"match_target_label": "Seizure (HPO)",
"score": "0.5111111111111111",
"score_metric": "jaccard_similarity",
},
"HP:0002046": {
"match_source": "HP:0002046",
"match_source_label": "Heat intolerance (HPO)",
"match_target": "HP:0001250",
"match_target_label": "Seizure (HPO)",
"score": "0.3953488372093023",
"score_metric": "jaccard_similarity",
},
"HP:0002059": {
"match_source": "HP:0002059",
"match_source_label": "Cerebral atrophy (HPO)",
"match_target": "HP:0001250",
"match_target_label": "Seizure (HPO)",
"score": "0.3157894736842105",
"score_metric": "jaccard_similarity",
},
"HP:0002069": {
"match_source": "HP:0002069",
"match_source_label": "Bilateral tonic-clonic seizure (HPO)",
"match_target": "HP:0001250",
"match_target_label": "Seizure (HPO)",
"score": "0.9666666666666667",
"score_metric": "jaccard_similarity",
},
"HP:0002197": {
"match_source": "HP:0002197",
"match_source_label": "Generalized-onset seizure (HPO)",
"match_target": "HP:0001250",
"match_target_label": "Seizure (HPO)",
"score": "0.9666666666666667",
"score_metric": "jaccard_similarity",
},
"HP:0002383": {
"match_source": "HP:0002383",
"match_source_label": "Infectious encephalitis (HPO)",
"match_target": "HP:0001250",
"match_target_label": "Seizure (HPO)",
"score": "0.5306122448979592",
"score_metric": "jaccard_similarity",
},
"HP:0002444": {
"match_source": "HP:0002444",
"match_source_label": "Hypothalamic hamartoma (HPO)",
"match_target": "HP:0001250",
"match_target_label": "Seizure (HPO)",
"score": "0.26666666666666666",
"score_metric": "jaccard_similarity",
},
"HP:0007359": {
"match_source": "HP:0007359",
"match_source_label": "Focal-onset seizure (HPO)",
"match_target": "HP:0001250",
"match_target_label": "Seizure (HPO)",
"score": "0.9666666666666667",
"score_metric": "jaccard_similarity",
},
"HP:0011468": {
"match_source": "HP:0011468",
"match_source_label": "Facial tics (HPO)",
"match_target": "HP:0001250",
"match_target_label": "Seizure (HPO)",
"score": "0.40425531914893614",
"score_metric": "jaccard_similarity",
},
"HP:0012469": {
"match_source": "HP:0012469",
"match_source_label": "Infantile spasms (HPO)",
"match_target": "HP:0001250",
"match_target_label": "Seizure (HPO)",
"score": "0.90625",
"score_metric": "jaccard_similarity",
},
},
object_best_matches_similarity_map: {
"HP:0000821": {
"ancestor_id": "UPHENO:0002332",
"ancestor_information_content": "5.031418387926112",
"ancestor_label": "abnormality of anatomical entity physiology",
"cosine_similarity": "NaN",
"jaccard_similarity": "0.45098039215686275",
"object_id": "HP:0001250",
"phenodigm_score": "1.5063435988154126",
"subject_id": "HP:0000821",
},
"HP:0000873": {
"ancestor_id": "HP:0000118",
"ancestor_information_content": "4.010992078140006",
"ancestor_label": "Phenotypic abnormality (HPO)",
"cosine_similarity": "NaN",
"jaccard_similarity": "0.5",
"object_id": "HP:0001250",
"phenodigm_score": "1.4161553725033151",
"subject_id": "HP:0000873",
},
"HP:0001250": {
"ancestor_id": "HP:0001250",
"ancestor_information_content": "9.565933710120389",
"ancestor_label": "Seizure (HPO)",
"cosine_similarity": "NaN",
"jaccard_similarity": "1",
"object_id": "HP:0001250",
"phenodigm_score": "3.092884367402116",
"subject_id": "HP:0001250",
},
"HP:0001252": {
"ancestor_id": "UPHENO:0002332",
"ancestor_information_content": "5.031418387926112",
"ancestor_label": "abnormality of anatomical entity physiology",
"cosine_similarity": "NaN",
"jaccard_similarity": "0.5227272727272727",
"object_id": "HP:0001250",
"phenodigm_score": "1.621745853045559",
"subject_id": "HP:0001252",
},
"HP:0001290": {
"ancestor_id": "UPHENO:0002332",
"ancestor_information_content": "5.031418387926112",
"ancestor_label": "abnormality of anatomical entity physiology",
"cosine_similarity": "NaN",
"jaccard_similarity": "0.5111111111111111",
"object_id": "HP:0001250",
"phenodigm_score": "1.6036252189080185",
"subject_id": "HP:0001290",
},
"HP:0002046": {
"ancestor_id": "HP:0000118",
"ancestor_information_content": "4.010992078140006",
"ancestor_label": "Phenotypic abnormality (HPO)",
"cosine_similarity": "NaN",
"jaccard_similarity": "0.3953488372093023",
"object_id": "HP:0001250",
"phenodigm_score": "1.2592621070088523",
"subject_id": "HP:0002046",
},
"HP:0002059": {
"ancestor_id": "HP:0000707",
"ancestor_information_content": "6.560244638238565",
"ancestor_label": "Abnormality of the nervous system (HPO)",
"cosine_similarity": "NaN",
"jaccard_similarity": "0.3157894736842105",
"object_id": "HP:0001250",
"phenodigm_score": "1.4393249117377982",
"subject_id": "HP:0002059",
},
"HP:0002069": {
"ancestor_id": "HP:0001250",
"ancestor_information_content": "9.565933710120389",
"ancestor_label": "Seizure (HPO)",
"cosine_similarity": "NaN",
"jaccard_similarity": "0.9666666666666667",
"object_id": "HP:0001250",
"phenodigm_score": "3.040899415159333",
"subject_id": "HP:0002069",
},
"HP:0002197": {
"ancestor_id": "HP:0001250",
"ancestor_information_content": "9.565933710120389",
"ancestor_label": "Seizure (HPO)",
"cosine_similarity": "NaN",
"jaccard_similarity": "0.9666666666666667",
"object_id": "HP:0001250",
"phenodigm_score": "3.040899415159333",
"subject_id": "HP:0002197",
},
"HP:0002383": {
"ancestor_id": "HP:0000707",
"ancestor_information_content": "6.560244638238565",
"ancestor_label": "Abnormality of the nervous system (HPO)",
"cosine_similarity": "NaN",
"jaccard_similarity": "0.5306122448979592",
"object_id": "HP:0001250",
"phenodigm_score": "1.865729384068216",
"subject_id": "HP:0002383",
},
"HP:0002444": {
"ancestor_id": "HP:0000707",
"ancestor_information_content": "6.560244638238565",
"ancestor_label": "Abnormality of the nervous system (HPO)",
"cosine_similarity": "NaN",
"jaccard_similarity": "0.26666666666666666",
"object_id": "HP:0001250",
"phenodigm_score": "1.322648316899451",
"subject_id": "HP:0002444",
},
"HP:0007359": {
"ancestor_id": "HP:0001250",
"ancestor_information_content": "9.565933710120389",
"ancestor_label": "Seizure (HPO)",
"cosine_similarity": "NaN",
"jaccard_similarity": "0.9666666666666667",
"object_id": "HP:0001250",
"phenodigm_score": "3.040899415159333",
"subject_id": "HP:0007359",
},
"HP:0011468": {
"ancestor_id": "HP:0000118",
"ancestor_information_content": "4.010992078140006",
"ancestor_label": "Phenotypic abnormality (HPO)",
"cosine_similarity": "NaN",
"jaccard_similarity": "0.40425531914893614",
"object_id": "HP:0001250",
"phenodigm_score": "1.2733675363587462",
"subject_id": "HP:0011468",
},
"HP:0012469": {
"ancestor_id": "HP:0001250",
"ancestor_information_content": "9.565933710120389",
"ancestor_label": "Seizure (HPO)",
"cosine_similarity": "NaN",
"jaccard_similarity": "0.90625",
"object_id": "HP:0001250",
"phenodigm_score": "2.944338198100993",
"subject_id": "HP:0012469",
},
},
average_score: 0.8108479042000829,
best_score: 1.0,
metric: JaccardSimilarity,
}
So we get 1.0 for HP:0001250 in both subject_best_matches and object_best_matches:
object_best_matches: {
...
"HP:0001250": {
"match_source": "HP:0001250",
"match_source_label": "Seizure (HPO)",
"match_target": "HP:0001250",
"match_target_label": "Seizure (HPO)",
"score": "1",
"score_metric": "jaccard_similarity",
},
...
}
subject_best_matches: {
"HP:0001250": {
"match_source": "HP:0001250",
"match_source_label": "Seizure (HPO)",
"match_target": "HP:0001250",
"match_target_label": "Seizure (HPO)",
"score": "1",
"score_metric": "jaccard_similarity",
},
},
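As a sanity check on the output above, the reported `average_score` is numerically consistent with averaging the mean of the subject best-match scores and the mean of the object best-match scores. This is an observation from the numbers in this thread, not a statement about how semsimian computes it internally; the `mean` and `average_score` helpers are illustrative names.

```rust
/// Jaccard scores from `object_best_matches` in the output above.
const OBJECT_BEST: [f64; 14] = [
    0.45098039215686275, // HP:0000821
    0.5,                 // HP:0000873
    1.0,                 // HP:0001250
    0.5227272727272727,  // HP:0001252
    0.5111111111111111,  // HP:0001290
    0.3953488372093023,  // HP:0002046
    0.3157894736842105,  // HP:0002059
    0.9666666666666667,  // HP:0002069
    0.9666666666666667,  // HP:0002197
    0.5306122448979592,  // HP:0002383
    0.26666666666666666, // HP:0002444
    0.9666666666666667,  // HP:0007359
    0.40425531914893614, // HP:0011468
    0.90625,             // HP:0012469
];

/// Arithmetic mean of a non-empty slice.
fn mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

/// Average of the two directional means (subject -> object, object -> subject).
fn average_score(subject_best: &[f64], object_best: &[f64]) -> f64 {
    (mean(subject_best) + mean(object_best)) / 2.0
}

fn main() {
    // The only subject term, HP:0001250, has a best-match score of 1.0.
    let avg = average_score(&[1.0], &OBJECT_BEST);
    println!("average_score = {avg}");
}
```

The result agrees with the reported 0.8108479042000829 to within floating-point rounding, which would only be the case if Seizure is matched to itself at 1.0 on the subject side.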
@irbraun, would it be possible to share your complete output? Or @kevinschaper, could it be that the API endpoint is not showing the full result? (I've never used it, so I'm asking out of curiosity.)
> The portion of the response for the 'subject best match' looks like this, with a score of 0.95, where Seizure in the Subjects is not matched up with Seizure in the Objects:
The comparison here is between match_source and match_target, and the match target is Generalized-onset seizure. So it makes sense for it not to be 1.0.
> The comparison here is between match_source and match_target and the match target is Generalized-onset seizure. So it makes sense for it not to be 1.0.
Since it's in the set, wouldn't Seizure be a better match at 1.0?
While messing with it, I started to notice that sometimes I got HP:0001250 and other times I didn't. So I tried it 100 times, and something odd is happening:
# Call the endpoint 100 times and record which object term is returned as the subject's best match.
for i in {1..100}; do
  curl -s -X 'GET' \
    'http://api-v3.monarchinitiative.org/v3/api/semsim/compare/HP%3A0001250/HP%3A0002383%2CHP%3A0002197%2CHP%3A0007359%2CHP%3A0012469%2CHP%3A0000873%2CHP%3A0002046%2CHP%3A0000821%2CHP%3A0011468%2CHP%3A0002444%2CHP%3A0001250%2CHP%3A0002069%2CHP%3A0001252%2CHP%3A0002059%2CHP%3A0001290?metric=jaccard_similarity' \
    -H 'accept: application/json' | jq '.subject_best_matches[].match_target' >> 1250_match_targets.txt
done
# Count how often each match target was returned, most frequent first.
sort 1250_match_targets.txt | uniq -c | sort -nr
25 "HP:0002069"
21 "HP:0002197"
20 "HP:0007359"
19 "HP:0001250"
15 "HP:0012469"
> Since it's in the set, wouldn't Seizure be a better match at 1.0?
Yes, and it probably is in the list of matches. For some odd reason, only one item is shown in the subject_matches list when there are many.
> While messing with it, I started to notice that sometimes I got HP:0001250 and other times I didn't. So I tried it 100 times, and something odd is happening
My guess is each time it is showing one random element from the list.
The matches are a list of dicts, so they aren't sorted or ordered in any way. Their sequence is effectively random, so my guess that it picks an arbitrary one from the list may be true.
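If the cause is indeed that one arbitrary element is taken from an unordered collection, one way to make the result stable is to select the maximum by score with an explicit tie-break. The sketch below is a hypothetical illustration of that idea, not the actual semsimian or API code; `Candidate`, `candidates`, and `best_match` are invented names, and the scores are the Jaccard values for Seizure against the five terms seen in the 100-request experiment, taken from the semsimian output earlier in this thread.

```rust
/// One candidate match for a subject term: (object term ID, similarity score).
type Candidate = (&'static str, f64);

/// Return the highest-scoring candidate. Ties are broken by the smaller term
/// ID, so the result does not depend on the order of the input.
fn best_match(candidates: &[Candidate]) -> Option<Candidate> {
    candidates
        .iter()
        .copied()
        .max_by(|a, b| a.1.total_cmp(&b.1).then_with(|| b.0.cmp(a.0)))
}

/// Jaccard scores of HP:0001250 (Seizure) against five object terms.
fn candidates() -> Vec<Candidate> {
    vec![
        ("HP:0002069", 0.9666666666666667),
        ("HP:0002197", 0.9666666666666667),
        ("HP:0007359", 0.9666666666666667),
        ("HP:0001250", 1.0),
        ("HP:0012469", 0.90625),
    ]
}

fn main() {
    let mut cands = candidates();
    println!("{:?}", best_match(&cands));
    // Reordering the input does not change the selected match.
    cands.reverse();
    println!("{:?}", best_match(&cands));
}
```

Under this selection rule, Seizure is returned as its own best match regardless of input order, which is what the original question expected.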
I had a question about the behavior of the semsim/multicompare endpoint in the API.
I've noticed that sometimes when the same HPO term is present in the Subject and Object lists, that term from the Object is not returned as the 'best match' for that term in the Subject.
Here's an example payload where I've found this to be the case (tested in the FastAPI Swagger UI; 8/16/2024).
The portion of the response for the 'subject best match' looks like this, with a score of 0.95, where Seizure in the Subjects is not matched up with Seizure in the Objects:
However, the 'object best matches' contains the expected match, with a score of 1.00:
I'm wondering if this is because of a known limitation of the semsimian algorithm used, or because the ancestor_information_content is the same for both pairs, since (Seizure, Seizure) and (Seizure, Generalized-onset seizure) share the same common ancestor, even though the Jaccard similarity is lower for the second pair? Or is 'Seizure' in the Object set not a candidate for the best match, since a more specific subclass of it is also present in the Object set?
Please let me know as well if it would make more sense for me to duplicate or move this question to the semsimian repository instead.
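On the question of ancestor information content versus Jaccard: in the semsimian output shown in this thread, both pairs do report the same ancestor (HP:0001250, IC ≈ 9.566), so IC alone cannot separate them, while Jaccard can. The reported `phenodigm_score` values are also numerically consistent with sqrt(jaccard × ancestor IC). The short sketch below checks that arithmetic; the `phenodigm` helper is an illustrative name, and the formula is inferred from the numbers rather than taken from semsimian's source.

```rust
/// Information content of the shared ancestor HP:0001250 (Seizure),
/// as reported in the semsimian output in this thread.
const SEIZURE_IC: f64 = 9.565933710120389;

/// Geometric mean of Jaccard similarity and ancestor information content.
fn phenodigm(jaccard: f64, ancestor_ic: f64) -> f64 {
    (jaccard * ancestor_ic).sqrt()
}

fn main() {
    // (Seizure, Seizure): Jaccard 1.0, ancestor IC 9.566.
    let same_term = phenodigm(1.0, SEIZURE_IC);
    // (Seizure, Generalized-onset seizure): Jaccard 0.9667, same ancestor IC.
    let subclass = phenodigm(0.9666666666666667, SEIZURE_IC);
    println!("Seizure vs Seizure:                   {same_term}");
    println!("Seizure vs Generalized-onset seizure: {subclass}");
}
```

Because the ancestor IC is identical for both pairs, only a metric that incorporates Jaccard ranks the exact match above the subclass.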