monarch-initiative / monarch-semantic-similarity-profiles

MIT License

Why is HP:phenotypic abnormality so different from MP:phenotypic abnormality? #7

Open matentzn opened 1 year ago

matentzn commented 1 year ago

In https://github.com/monarch-initiative/semsimian/issues/82#issuecomment-1658950359

@caufieldjh showed us that HP:phenotypic abnormality has very different parents than MP:phenotypic abnormality.

Can we determine why? In particular, why does the HP term have Uberon ancestors?

@caufieldjh I will assign you for now, but feel free to talk to Chris and assign someone else. It is easier for me to work if I can assign while creating the ticket, so I am sure it's not dropping off the radar.

caufieldjh commented 1 year ago

I think those other sets of ancestors may have been from a previous version of Phenio (the following is from v2023-07-11).

$ runoak -i sqlite:obo:phenio ancestors -p i,p MP:0000001
id      label
BFO:0000001     entity
BFO:0000002     continuant
BFO:0000020     specifically dependent continuant
MP:0000001      mammalian phenotype (MPO)
PATO:0000001    quality
UPHENO:0001001  Phenotype
UPHENO:0001001  phenotype
UPHENO:0001003  phenotype by ontology source

$ runoak -i sqlite:obo:phenio ancestors -p i,p HP:0000118
id      label
BFO:0000001     entity
BFO:0000002     continuant
BFO:0000020     specifically dependent continuant
HP:0000001      All
HP:0000001      All (HPO)
HP:0000118      Phenotypic abnormality
HP:0000118      Phenotypic abnormality (HPO)
PATO:0000001    quality
UPHENO:0001001  Phenotype
UPHENO:0001001  phenotype
UPHENO:0001002  Phenotypic abnormality
UPHENO:0001003  phenotype by ontology source
UPHENO:0001005  abnormal phenotype by ontology source

None of those pesky CARO or UBERON terms. Still not completely parallel, since HP:0000118 -> BFO:0000001 has to traverse UPHENO:0001005 and UPHENO:0001002. The tree view:

$ runoak -i sqlite:obo:phenio tree -p i,p MP:0000001
* [] BFO:0000001 ! entity
    * [i] BFO:0000002 ! continuant
        * [i] BFO:0000020 ! specifically dependent continuant
            * [i] PATO:0000001 ! quality
                * [i] UPHENO:0001001 ! phenotype
                    * [i] UPHENO:0001003 ! phenotype by ontology source
                        * [i] **MP:0000001 ! mammalian phenotype (MPO)**

$ runoak -i sqlite:obo:phenio tree -p i,p HP:0000118
* [] BFO:0000001 ! entity
    * [i] BFO:0000002 ! continuant
        * [i] BFO:0000020 ! specifically dependent continuant
            * [i] PATO:0000001 ! quality
                * [i] UPHENO:0001001 ! phenotype
                    * [i] UPHENO:0001002 ! Phenotypic abnormality
                        * [i] UPHENO:0001005 ! abnormal phenotype by ontology source
                            * [i] **HP:0000118 ! Phenotypic abnormality (HPO)**
                    * [i] UPHENO:0001003 ! phenotype by ontology source
                        * [i] UPHENO:0001005 ! abnormal phenotype by ontology source
                            * [i] **HP:0000118 ! Phenotypic abnormality (HPO)**
* [] HP:0000001 ! All (HPO)
    * [i] **HP:0000118 ! Phenotypic abnormality (HPO)**
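
To make the comparison concrete, here is a minimal sketch that treats the two ancestor listings above as sets and reports what they share. The IDs are copied from the `runoak ancestors` output (duplicate label rows collapsed); the Jaccard ratio is only one possible set-based measure, used here as an illustration and not as the score any particular tool reports.

```python
# Reflexive ancestors copied from the two `runoak ancestors -p i,p` listings
# (PHENIO v2023-07-11). Rows that differ only by label are collapsed.
mp_ancestors = {
    "BFO:0000001", "BFO:0000002", "BFO:0000020", "MP:0000001",
    "PATO:0000001", "UPHENO:0001001", "UPHENO:0001003",
}
hp_ancestors = {
    "BFO:0000001", "BFO:0000002", "BFO:0000020", "HP:0000001",
    "HP:0000118", "PATO:0000001", "UPHENO:0001001", "UPHENO:0001002",
    "UPHENO:0001003", "UPHENO:0001005",
}

shared = mp_ancestors & hp_ancestors    # ancestors both terms have
only_mp = mp_ancestors - hp_ancestors   # ancestors unique to MP:0000001
only_hp = hp_ancestors - mp_ancestors   # ancestors unique to HP:0000118

# Illustrative set-overlap score: |intersection| / |union|.
jaccard = len(shared) / len(mp_ancestors | hp_ancestors)

print("shared:", sorted(shared))
print("only MP:", sorted(only_mp))
print("only HP:", sorted(only_hp))
print(f"jaccard: {jaccard:.3f}")
```

The HP-only set is where the asymmetry shows up: HP:0000001, UPHENO:0001002 and UPHENO:0001005 have no counterpart on the MP side.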

The shortest path between the two is still short: ['MP:0000001', 'UPHENO:0001003', 'UPHENO:0001005', 'HP:0000118'], but this is also part of the path between any HP and MP nodes that share no other UPHENO terms in PHENIO. My point is that there is additional distance to cover in cross-phenotype comparisons, semsim or otherwise.
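
The path quoted above can be reproduced without any ontology tooling. This sketch hard-codes the is-a edges visible in the two tree views, treats them as undirected, and runs a breadth-first search. The edge list and the helper names (`build_undirected`, `shortest_path`) are assumptions made for illustration; it is not how OAK computes paths.

```python
from collections import deque

# (child, parent) is-a edges transcribed from the two `runoak tree` outputs.
IS_A = [
    ("BFO:0000002", "BFO:0000001"),
    ("BFO:0000020", "BFO:0000002"),
    ("PATO:0000001", "BFO:0000020"),
    ("UPHENO:0001001", "PATO:0000001"),
    ("UPHENO:0001002", "UPHENO:0001001"),
    ("UPHENO:0001003", "UPHENO:0001001"),
    ("UPHENO:0001005", "UPHENO:0001002"),
    ("UPHENO:0001005", "UPHENO:0001003"),
    ("MP:0000001", "UPHENO:0001003"),
    ("HP:0000118", "UPHENO:0001005"),
    ("HP:0000118", "HP:0000001"),
]


def build_undirected(edges):
    """Return an adjacency map that ignores edge direction."""
    graph = {}
    for child, parent in edges:
        graph.setdefault(child, set()).add(parent)
        graph.setdefault(parent, set()).add(child)
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search; returns one shortest path or None."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in sorted(graph[node]):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None


path = shortest_path(build_undirected(IS_A), "MP:0000001", "HP:0000118")
print(path)
# ['MP:0000001', 'UPHENO:0001003', 'UPHENO:0001005', 'HP:0000118']
```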

matentzn commented 1 year ago

Excellent @caufieldjh. Sorry, I didn't realise you had already made that analysis in this ticket. I guess then CARO/UBERON was not really taken into account, right?

Let's stick with this example here for a minute.

OK, to restate the problem: when using a lattice, we create a lot of parallel hierarchies; for example, HP:All is a distinct parent of HP:PA and not of MP:PA. When using the equivalence model, MP:PA would automatically get HP:PA as a parent.

Now the question is: is that desirable?

I don't know either way right now. If the polyhierarchies are not harmonised (which they are not), the lattice models would result in a large amount of distance between terms which would have been virtually identical under the equivalence model.
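
A toy illustration of that difference, using only the is-a edges from the trees earlier in this thread: under the lattice the two terms keep separate ancestor sets, while a hypothetical equivalence between MP:0000001 and HP:0000118 (modelled here as mutual subclass edges, and NOT an axiom taken from PHENIO) makes the sets identical. The function names and the Jaccard score are assumptions chosen for the sketch.

```python
# Direct is-a parents transcribed from the `runoak tree` outputs in this thread.
PARENTS = {
    "BFO:0000002": {"BFO:0000001"},
    "BFO:0000020": {"BFO:0000002"},
    "PATO:0000001": {"BFO:0000020"},
    "UPHENO:0001001": {"PATO:0000001"},
    "UPHENO:0001002": {"UPHENO:0001001"},
    "UPHENO:0001003": {"UPHENO:0001001"},
    "UPHENO:0001005": {"UPHENO:0001002", "UPHENO:0001003"},
    "MP:0000001": {"UPHENO:0001003"},
    "HP:0000118": {"UPHENO:0001005", "HP:0000001"},
}


def ancestors(term, parents):
    """Reflexive ancestor closure; the `seen` set makes cycles safe."""
    seen = set()
    stack = [term]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(parents.get(node, ()))
    return seen


def jaccard(a, b, parents):
    """Overlap of the two ancestor sets: |intersection| / |union|."""
    anc_a, anc_b = ancestors(a, parents), ancestors(b, parents)
    return len(anc_a & anc_b) / len(anc_a | anc_b)


def with_equivalence(parents, a, b):
    """Copy the graph and add a <-> b as mutual subclass edges (hypothetical)."""
    merged = {term: set(ps) for term, ps in parents.items()}
    merged.setdefault(a, set()).add(b)
    merged.setdefault(b, set()).add(a)
    return merged


lattice_score = jaccard("MP:0000001", "HP:0000118", PARENTS)
equivalence_score = jaccard(
    "MP:0000001",
    "HP:0000118",
    with_equivalence(PARENTS, "MP:0000001", "HP:0000118"),
)
print(f"lattice: {lattice_score:.3f}, equivalence: {equivalence_score:.3f}")
```

On this small graph the lattice score comes out at 6/11 and the equivalence score at 1.0; real profiles would of course depend on the full ontology and on the similarity measure actually used.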

cc @souzadevinicius this is going to be the first scientific question we will have to answer!