monarch-initiative / monarch-app

Monarch Initiative website and API
https://monarchinitiative.org/
BSD 3-Clause "New" or "Revised" License

Unexpected 'best match' returned for term set comparison #793

Closed (irbraun closed 2 months ago)

irbraun commented 2 months ago

I had a question about the behavior of the semsim/multicompare endpoint in the API.

I've noticed that sometimes when the same HPO term is present in the Subject and Object lists, that term from the Object is not returned as the 'best match' for that term in the Subject.

Here's an example payload where I've found this to be the case (tested in the FastAPI Swagger UI; 8/16/2024).

{
    "metric": "jaccard_similarity",
    "subjects": ["HP:0001250"],
    "object_sets": [
        {
            "id": "Example",
            "label": "Example",
            "phenotypes": [
                "HP:0002383",
                "HP:0002197",
                "HP:0007359",
                "HP:0012469",
                "HP:0000873",
                "HP:0002046",
                "HP:0000821",
                "HP:0011468",
                "HP:0002444",
                "HP:0001250",
                "HP:0002069",
                "HP:0001252",
                "HP:0002059",
                "HP:0001290"
            ]
        }
    ]
}

The portion of the response for the 'subject best match' looks like this, with a score of 0.95, where Seizure in the Subjects is not matched up with Seizure in the Objects:

      "subject_best_matches": {
        "HP:0001250": {
          "match_source": "HP:0001250",
          "match_source_label": "Seizure",
          "match_target": "HP:0002197",
          "match_target_label": "Generalized-onset seizure",
          "score": 0.95,
          "match_subsumer": null,
          "match_subsumer_label": null,
          "similarity": {
            "subject_id": "HP:0001250",
            "subject_label": null,
            "subject_source": null,
            "object_id": "HP:0002197",
            "object_label": null,
            "object_source": null,
            "ancestor_id": "HP:0001250",
            "ancestor_label": "Seizure",
            "ancestor_source": null,
            "object_information_content": null,
            "subject_information_content": null,
            "ancestor_information_content": 9.749166272783704,
            "jaccard_similarity": 0.95,
            "cosine_similarity": null,
            "dice_similarity": null,
            "phenodigm_score": 3.043305433101403
          },
          "score_metric": "jaccard_similarity"
        }
      },

However, the 'object best matches' contains the expected match, with a score of 1.00:

...
       "HP:0001250": {
          "match_source": "HP:0001250",
          "match_source_label": "Seizure",
          "match_target": "HP:0001250",
          "match_target_label": "Seizure",
          "score": 1,
          "match_subsumer": null,
          "match_subsumer_label": null,
          "similarity": {
            "subject_id": "HP:0001250",
            "subject_label": null,
            "subject_source": null,
            "object_id": "HP:0001250",
            "object_label": null,
            "object_source": null,
            "ancestor_id": "HP:0001250",
            "ancestor_label": "Seizure",
            "ancestor_source": null,
            "object_information_content": null,
            "subject_information_content": null,
            "ancestor_information_content": 9.749166272783704,
            "jaccard_similarity": 1,
            "cosine_similarity": null,
            "dice_similarity": null,
            "phenodigm_score": 3.122365493145174
          },
          "score_metric": "jaccard_similarity"
        },
...

I'm wondering whether this is a known limitation of the semsimian algorithm, or whether it happens because the ancestor_information_content is identical in both cases: (Seizure, Seizure) and (Seizure, Generalized-onset seizure) share the same common ancestor, even though the Jaccard similarity is lower for the second pair.

Or is 'Seizure' in the Object Set not a candidate for the best match, since there is a more specific subclass of it also present in the Object Set?

Please also let me know if it would make more sense for me to duplicate or move this question to the semsimian repository instead.
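
For reference, the phenodigm_score values in the two excerpts above are numerically consistent with the square root of jaccard_similarity times ancestor_information_content. The Rust sketch below only checks that arithmetic against the reported numbers. The names `phenodigm` and `close` are hypothetical, and the formula is inferred from these values, not taken from the semsimian source. Under that reading, an identical ancestor_information_content would not by itself make the two pairs tie, because the Jaccard term still differs (0.95 versus 1).

```rust
/// Assumed relationship, inferred from the responses above:
/// phenodigm_score = sqrt(jaccard_similarity * ancestor_information_content).
fn phenodigm(jaccard: f64, ancestor_ic: f64) -> f64 {
    (jaccard * ancestor_ic).sqrt()
}

/// True when `a` and `b` agree to within `tol`.
fn close(a: f64, b: f64, tol: f64) -> bool {
    (a - b).abs() < tol
}
```

With the API values, `phenodigm(0.95, 9.749166272783704)` lands close to the reported 3.0433 for (Seizure, Generalized-onset seizure), and `phenodigm(1.0, 9.749166272783704)` close to the reported 3.1224 for (Seizure, Seizure).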

kevinschaper commented 2 months ago

I wonder if the set used to compute Jaccard is missing HP:0001250 itself on one side and only includes its parents (for example, by accidentally using a non-reflexive closure in one spot)?

What do you think @hrshdhgd @justaddcoffee @caufieldjh?
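
To make the non-reflexive closure hypothesis concrete, here is a minimal Rust sketch on a toy three-term hierarchy. The `EX:` identifiers, the `parents` table, and the function names are all hypothetical and are not taken from HPO or semsimian. It illustrates how mixing a reflexive closure on the subject side with a non-reflexive closure on the object side could make a child term score higher than the identical term. It does not claim that this is what the API actually does.

```rust
use std::collections::HashSet;

/// Direct parents in a toy hierarchy (hypothetical IDs):
/// EX:generalized_seizure -> EX:seizure -> EX:abnormality
fn parents(term: &str) -> &'static [&'static str] {
    match term {
        "EX:generalized_seizure" => &["EX:seizure"],
        "EX:seizure" => &["EX:abnormality"],
        _ => &[],
    }
}

/// Ancestor closure of `term`. When `reflexive` is true, the term itself is included.
fn ancestors(term: &'static str, reflexive: bool) -> HashSet<&'static str> {
    let mut seen = HashSet::new();
    if reflexive {
        seen.insert(term);
    }
    let mut stack: Vec<&'static str> = parents(term).to_vec();
    while let Some(t) = stack.pop() {
        // Only expand a term the first time it is seen.
        if seen.insert(t) {
            stack.extend_from_slice(parents(t));
        }
    }
    seen
}

/// Jaccard similarity: |A ∩ B| / |A ∪ B| (0.0 when both sets are empty).
fn jaccard(a: &HashSet<&'static str>, b: &HashSet<&'static str>) -> f64 {
    let union = a.union(b).count();
    if union == 0 {
        return 0.0;
    }
    a.intersection(b).count() as f64 / union as f64
}
```

With reflexive closures on both sides, `EX:seizure` compared with itself gives 1.0. If the object side drops the term itself, the same comparison gives 0.5, while `EX:seizure` against `EX:generalized_seizure` gives 1.0, so the child would look like the better match.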

caufieldjh commented 2 months ago

@justaddcoffee and I are discussing this now and we think it's a semsimian bug.

See https://github.com/monarch-initiative/semsimian/issues/136

hrshdhgd commented 2 months ago

We ran a test here with the same parameters as above in semsimian directly and this is what we are seeing:

Input:

use std::collections::HashSet;

// `db_path` (presumably a PathBuf), `predicates`, and `db` are set up earlier
// in the test and are not part of this snippet.
db_path.push(".data/oaklib/phenio.db");
let mut rss = RustSemsimian::new(None, predicates, None, db, None);
rss.update_closure_and_ic_map();
// Subject termset: the single query term (Seizure).
let entity1 = HashSet::from([
    "HP:0001250".to_string(),
]);
// Object termset: the same phenotypes as in the API payload above.
let entity2 = HashSet::from([
    "HP:0002383".to_string(),
    "HP:0002197".to_string(),
    "HP:0007359".to_string(),
    "HP:0012469".to_string(),
    "HP:0000873".to_string(),
    "HP:0002046".to_string(),
    "HP:0000821".to_string(),
    "HP:0011468".to_string(),
    "HP:0002444".to_string(),
    "HP:0001250".to_string(),
    "HP:0002069".to_string(),
    "HP:0001252".to_string(),
    "HP:0002059".to_string(),
    "HP:0001290".to_string(),
]);
let score_metric = MetricEnum::JaccardSimilarity;
let tsps = rss.termset_pairwise_similarity(&entity1, &entity2, &score_metric);

The output we get is:

&tsps = TermsetPairwiseSimilarity {
    subject_termset: [
        {
            "HP:0001250": {
                "id": "HP:0001250",
                "label": "Seizure (HPO)",
            },
        },
    ],
    subject_best_matches: {
        "HP:0001250": {
            "match_source": "HP:0001250",
            "match_source_label": "Seizure (HPO)",
            "match_target": "HP:0001250",
            "match_target_label": "Seizure (HPO)",
            "score": "1",
            "score_metric": "jaccard_similarity",
        },
    },
    subject_best_matches_similarity_map: {
        "HP:0001250": {
            "ancestor_id": "HP:0001250",
            "ancestor_information_content": "9.565933710120389",
            "ancestor_label": "Seizure (HPO)",
            "cosine_similarity": "NaN",
            "jaccard_similarity": "1",
            "object_id": "HP:0001250",
            "phenodigm_score": "3.092884367402116",
            "subject_id": "HP:0001250",
        },
    },
    object_termset: [
        {
            "HP:0001252": {
                "id": "HP:0001252",
                "label": "Hypotonia (HPO)",
            },
        },
        {
            "HP:0002069": {
                "id": "HP:0002069",
                "label": "Bilateral tonic-clonic seizure (HPO)",
            },
        },
        {
            "HP:0001290": {
                "id": "HP:0001290",
                "label": "Generalized hypotonia (HPO)",
            },
        },
        {
            "HP:0002059": {
                "id": "HP:0002059",
                "label": "Cerebral atrophy (HPO)",
            },
        },
        {
            "HP:0002444": {
                "id": "HP:0002444",
                "label": "Hypothalamic hamartoma (HPO)",
            },
        },
        {
            "HP:0002383": {
                "id": "HP:0002383",
                "label": "Infectious encephalitis (HPO)",
            },
        },
        {
            "HP:0001250": {
                "id": "HP:0001250",
                "label": "Seizure (HPO)",
            },
        },
        {
            "HP:0000873": {
                "id": "HP:0000873",
                "label": "Diabetes insipidus (HPO)",
            },
        },
        {
            "HP:0012469": {
                "id": "HP:0012469",
                "label": "Infantile spasms (HPO)",
            },
        },
        {
            "HP:0007359": {
                "id": "HP:0007359",
                "label": "Focal-onset seizure (HPO)",
            },
        },
        {
            "HP:0000821": {
                "id": "HP:0000821",
                "label": "Hypothyroidism (HPO)",
            },
        },
        {
            "HP:0002046": {
                "id": "HP:0002046",
                "label": "Heat intolerance (HPO)",
            },
        },
        {
            "HP:0011468": {
                "id": "HP:0011468",
                "label": "Facial tics (HPO)",
            },
        },
        {
            "HP:0002197": {
                "id": "HP:0002197",
                "label": "Generalized-onset seizure (HPO)",
            },
        },
    ],
    object_best_matches: {
        "HP:0000821": {
            "match_source": "HP:0000821",
            "match_source_label": "Hypothyroidism (HPO)",
            "match_target": "HP:0001250",
            "match_target_label": "Seizure (HPO)",
            "score": "0.45098039215686275",
            "score_metric": "jaccard_similarity",
        },
        "HP:0000873": {
            "match_source": "HP:0000873",
            "match_source_label": "Diabetes insipidus (HPO)",
            "match_target": "HP:0001250",
            "match_target_label": "Seizure (HPO)",
            "score": "0.5",
            "score_metric": "jaccard_similarity",
        },
        "HP:0001250": {
            "match_source": "HP:0001250",
            "match_source_label": "Seizure (HPO)",
            "match_target": "HP:0001250",
            "match_target_label": "Seizure (HPO)",
            "score": "1",
            "score_metric": "jaccard_similarity",
        },
        "HP:0001252": {
            "match_source": "HP:0001252",
            "match_source_label": "Hypotonia (HPO)",
            "match_target": "HP:0001250",
            "match_target_label": "Seizure (HPO)",
            "score": "0.5227272727272727",
            "score_metric": "jaccard_similarity",
        },
        "HP:0001290": {
            "match_source": "HP:0001290",
            "match_source_label": "Generalized hypotonia (HPO)",
            "match_target": "HP:0001250",
            "match_target_label": "Seizure (HPO)",
            "score": "0.5111111111111111",
            "score_metric": "jaccard_similarity",
        },
        "HP:0002046": {
            "match_source": "HP:0002046",
            "match_source_label": "Heat intolerance (HPO)",
            "match_target": "HP:0001250",
            "match_target_label": "Seizure (HPO)",
            "score": "0.3953488372093023",
            "score_metric": "jaccard_similarity",
        },
        "HP:0002059": {
            "match_source": "HP:0002059",
            "match_source_label": "Cerebral atrophy (HPO)",
            "match_target": "HP:0001250",
            "match_target_label": "Seizure (HPO)",
            "score": "0.3157894736842105",
            "score_metric": "jaccard_similarity",
        },
        "HP:0002069": {
            "match_source": "HP:0002069",
            "match_source_label": "Bilateral tonic-clonic seizure (HPO)",
            "match_target": "HP:0001250",
            "match_target_label": "Seizure (HPO)",
            "score": "0.9666666666666667",
            "score_metric": "jaccard_similarity",
        },
        "HP:0002197": {
            "match_source": "HP:0002197",
            "match_source_label": "Generalized-onset seizure (HPO)",
            "match_target": "HP:0001250",
            "match_target_label": "Seizure (HPO)",
            "score": "0.9666666666666667",
            "score_metric": "jaccard_similarity",
        },
        "HP:0002383": {
            "match_source": "HP:0002383",
            "match_source_label": "Infectious encephalitis (HPO)",
            "match_target": "HP:0001250",
            "match_target_label": "Seizure (HPO)",
            "score": "0.5306122448979592",
            "score_metric": "jaccard_similarity",
        },
        "HP:0002444": {
            "match_source": "HP:0002444",
            "match_source_label": "Hypothalamic hamartoma (HPO)",
            "match_target": "HP:0001250",
            "match_target_label": "Seizure (HPO)",
            "score": "0.26666666666666666",
            "score_metric": "jaccard_similarity",
        },
        "HP:0007359": {
            "match_source": "HP:0007359",
            "match_source_label": "Focal-onset seizure (HPO)",
            "match_target": "HP:0001250",
            "match_target_label": "Seizure (HPO)",
            "score": "0.9666666666666667",
            "score_metric": "jaccard_similarity",
        },
        "HP:0011468": {
            "match_source": "HP:0011468",
            "match_source_label": "Facial tics (HPO)",
            "match_target": "HP:0001250",
            "match_target_label": "Seizure (HPO)",
            "score": "0.40425531914893614",
            "score_metric": "jaccard_similarity",
        },
        "HP:0012469": {
            "match_source": "HP:0012469",
            "match_source_label": "Infantile spasms (HPO)",
            "match_target": "HP:0001250",
            "match_target_label": "Seizure (HPO)",
            "score": "0.90625",
            "score_metric": "jaccard_similarity",
        },
    },
    object_best_matches_similarity_map: {
        "HP:0000821": {
            "ancestor_id": "UPHENO:0002332",
            "ancestor_information_content": "5.031418387926112",
            "ancestor_label": "abnormality of anatomical entity physiology",
            "cosine_similarity": "NaN",
            "jaccard_similarity": "0.45098039215686275",
            "object_id": "HP:0001250",
            "phenodigm_score": "1.5063435988154126",
            "subject_id": "HP:0000821",
        },
        "HP:0000873": {
            "ancestor_id": "HP:0000118",
            "ancestor_information_content": "4.010992078140006",
            "ancestor_label": "Phenotypic abnormality (HPO)",
            "cosine_similarity": "NaN",
            "jaccard_similarity": "0.5",
            "object_id": "HP:0001250",
            "phenodigm_score": "1.4161553725033151",
            "subject_id": "HP:0000873",
        },
        "HP:0001250": {
            "ancestor_id": "HP:0001250",
            "ancestor_information_content": "9.565933710120389",
            "ancestor_label": "Seizure (HPO)",
            "cosine_similarity": "NaN",
            "jaccard_similarity": "1",
            "object_id": "HP:0001250",
            "phenodigm_score": "3.092884367402116",
            "subject_id": "HP:0001250",
        },
        "HP:0001252": {
            "ancestor_id": "UPHENO:0002332",
            "ancestor_information_content": "5.031418387926112",
            "ancestor_label": "abnormality of anatomical entity physiology",
            "cosine_similarity": "NaN",
            "jaccard_similarity": "0.5227272727272727",
            "object_id": "HP:0001250",
            "phenodigm_score": "1.621745853045559",
            "subject_id": "HP:0001252",
        },
        "HP:0001290": {
            "ancestor_id": "UPHENO:0002332",
            "ancestor_information_content": "5.031418387926112",
            "ancestor_label": "abnormality of anatomical entity physiology",
            "cosine_similarity": "NaN",
            "jaccard_similarity": "0.5111111111111111",
            "object_id": "HP:0001250",
            "phenodigm_score": "1.6036252189080185",
            "subject_id": "HP:0001290",
        },
        "HP:0002046": {
            "ancestor_id": "HP:0000118",
            "ancestor_information_content": "4.010992078140006",
            "ancestor_label": "Phenotypic abnormality (HPO)",
            "cosine_similarity": "NaN",
            "jaccard_similarity": "0.3953488372093023",
            "object_id": "HP:0001250",
            "phenodigm_score": "1.2592621070088523",
            "subject_id": "HP:0002046",
        },
        "HP:0002059": {
            "ancestor_id": "HP:0000707",
            "ancestor_information_content": "6.560244638238565",
            "ancestor_label": "Abnormality of the nervous system (HPO)",
            "cosine_similarity": "NaN",
            "jaccard_similarity": "0.3157894736842105",
            "object_id": "HP:0001250",
            "phenodigm_score": "1.4393249117377982",
            "subject_id": "HP:0002059",
        },
        "HP:0002069": {
            "ancestor_id": "HP:0001250",
            "ancestor_information_content": "9.565933710120389",
            "ancestor_label": "Seizure (HPO)",
            "cosine_similarity": "NaN",
            "jaccard_similarity": "0.9666666666666667",
            "object_id": "HP:0001250",
            "phenodigm_score": "3.040899415159333",
            "subject_id": "HP:0002069",
        },
        "HP:0002197": {
            "ancestor_id": "HP:0001250",
            "ancestor_information_content": "9.565933710120389",
            "ancestor_label": "Seizure (HPO)",
            "cosine_similarity": "NaN",
            "jaccard_similarity": "0.9666666666666667",
            "object_id": "HP:0001250",
            "phenodigm_score": "3.040899415159333",
            "subject_id": "HP:0002197",
        },
        "HP:0002383": {
            "ancestor_id": "HP:0000707",
            "ancestor_information_content": "6.560244638238565",
            "ancestor_label": "Abnormality of the nervous system (HPO)",
            "cosine_similarity": "NaN",
            "jaccard_similarity": "0.5306122448979592",
            "object_id": "HP:0001250",
            "phenodigm_score": "1.865729384068216",
            "subject_id": "HP:0002383",
        },
        "HP:0002444": {
            "ancestor_id": "HP:0000707",
            "ancestor_information_content": "6.560244638238565",
            "ancestor_label": "Abnormality of the nervous system (HPO)",
            "cosine_similarity": "NaN",
            "jaccard_similarity": "0.26666666666666666",
            "object_id": "HP:0001250",
            "phenodigm_score": "1.322648316899451",
            "subject_id": "HP:0002444",
        },
        "HP:0007359": {
            "ancestor_id": "HP:0001250",
            "ancestor_information_content": "9.565933710120389",
            "ancestor_label": "Seizure (HPO)",
            "cosine_similarity": "NaN",
            "jaccard_similarity": "0.9666666666666667",
            "object_id": "HP:0001250",
            "phenodigm_score": "3.040899415159333",
            "subject_id": "HP:0007359",
        },
        "HP:0011468": {
            "ancestor_id": "HP:0000118",
            "ancestor_information_content": "4.010992078140006",
            "ancestor_label": "Phenotypic abnormality (HPO)",
            "cosine_similarity": "NaN",
            "jaccard_similarity": "0.40425531914893614",
            "object_id": "HP:0001250",
            "phenodigm_score": "1.2733675363587462",
            "subject_id": "HP:0011468",
        },
        "HP:0012469": {
            "ancestor_id": "HP:0001250",
            "ancestor_information_content": "9.565933710120389",
            "ancestor_label": "Seizure (HPO)",
            "cosine_similarity": "NaN",
            "jaccard_similarity": "0.90625",
            "object_id": "HP:0001250",
            "phenodigm_score": "2.944338198100993",
            "subject_id": "HP:0012469",
        },
    },
    average_score: 0.8108479042000829,
    best_score: 1.0,
    metric: JaccardSimilarity,
}

So we get 1.0 for HP:0001250 in both subject_best_matches and object_best_matches:

object_best_matches: {
...
        "HP:0001250": {
            "match_source": "HP:0001250",
            "match_source_label": "Seizure (HPO)",
            "match_target": "HP:0001250",
            "match_target_label": "Seizure (HPO)",
            "score": "1",
            "score_metric": "jaccard_similarity",
        },
...
}

subject_best_matches: {
        "HP:0001250": {
            "match_source": "HP:0001250",
            "match_source_label": "Seizure (HPO)",
            "match_target": "HP:0001250",
            "match_target_label": "Seizure (HPO)",
            "score": "1",
            "score_metric": "jaccard_similarity",
        },
    },

@irbraun, would it be possible to share your complete output? Or @kevinschaper, could it be that the API endpoint is not showing the full result? (I've never used it, so I'm asking out of curiosity.)

hrshdhgd commented 2 months ago

> The portion of the response for the 'subject best match' looks like this, with a score of 0.95, where Seizure in the Subjects is not matched up with Seizure in the Objects:

The comparison here is between match_source and match_target, and the match target is Generalized-onset seizure, so it makes sense for the score not to be 1.0.

kevinschaper commented 2 months ago

> The comparison here is between match_source and match_target and the match target is Generalized-onset seizure. So it makes sense for it not to be 1.0.

Since it's in the set, wouldn't Seizure be a better match at 1.0?

While messing with it, I started to notice that sometimes I got HP:0001250 and other times I didn't. So I tried it 100 times, and something odd is happening

# Repeat the same request 100 times and record the reported best match target.
for i in $(seq 1 100); do
  curl -s -X 'GET' \
    'http://api-v3.monarchinitiative.org/v3/api/semsim/compare/HP%3A0001250/HP%3A0002383%2CHP%3A0002197%2CHP%3A0007359%2CHP%3A0012469%2CHP%3A0000873%2CHP%3A0002046%2CHP%3A0000821%2CHP%3A0011468%2CHP%3A0002444%2CHP%3A0001250%2CHP%3A0002069%2CHP%3A0001252%2CHP%3A0002059%2CHP%3A0001290?metric=jaccard_similarity' \
    -H 'accept: application/json' | jq '.subject_best_matches[].match_target' >> 1250_match_targets.txt
done
# Count how often each term was returned as the best match.
sort 1250_match_targets.txt | uniq -c | sort -nr
  25 "HP:0002069"
  21 "HP:0002197"
  20 "HP:0007359"
  19 "HP:0001250"
  15 "HP:0012469"

hrshdhgd commented 2 months ago

> Since it's in the set, wouldn't Seizure be a better match at 1.0?

Yes, and it probably is in the list of matches. For some odd reason, the subject_matches list shows just one item when there are many.

> While messing with it, I started to notice that sometimes I got HP:0001250 and other times I didn't. So I tried it 100 times, and something odd is happening

My guess is that each time it is showing one random element from the list.

The matches are a list of dicts, so they aren't sorted or ordered in any way. Their sequence is effectively random, so my guess that it picks an arbitrary one from the list may be right.
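
If the candidates really are held in an unordered collection, one way to get a stable answer is to pick the maximum score explicitly and break ties on the term ID. The Rust sketch below illustrates that idea only. The function name `best_match` is hypothetical and this is not the actual semsimian or monarch-app code. The scores in the usage note are the Jaccard values reported earlier in this thread for HP:0001250 against each object term.

```rust
/// Return the candidate with the highest score.
/// Ties on score are broken by choosing the lexicographically smallest term ID,
/// so the result does not depend on the order of `candidates`.
fn best_match<'a>(candidates: &[(&'a str, f64)]) -> Option<(&'a str, f64)> {
    candidates.iter().copied().max_by(|a, b| {
        a.1.total_cmp(&b.1)               // higher score wins
            .then_with(|| b.0.cmp(a.0))   // on a tie, the smaller ID wins
    })
}
```

Given the five targets seen in the 100-request tally, with HP:0001250 at 1.0, HP:0002069, HP:0002197, and HP:0007359 at 0.9666666666666667, and HP:0012469 at 0.90625, this selection returns HP:0001250 regardless of input order.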