Non-exact equivalences

bgyori commented 3 years ago

I found that a large number of equivalences in equivalences.csv are not exact matches, for instance among the InterPro mappings. As an example, take FPLX:Hedgehog, which is mapped to 6 different InterPro entries. One that looks exact is https://www.ebi.ac.uk/interpro/entry/InterPro/IPR001657/ (Hedgehog protein), but the others include, e.g., https://www.ebi.ac.uk/interpro/entry/InterPro/IPR001767/ (Hedgehog protein, Hint domain) and https://www.ebi.ac.uk/interpro/entry/InterPro/IPR003586/ (Hint domain C-terminal), which I don't think should be considered equivalences. I suspect these were added with the goal of including as many IP->FPLX mappings as possible from sources that produce various groundings in InterPro. Still, they are misleading if interpreted in the opposite direction.
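A quick way to surface such cases is to group the InterPro rows by FamPlex ID and list every ID with more than one mapping. This is only a minimal sketch: it assumes each row of equivalences.csv has the form (namespace, external ID, FamPlex ID), and it uses a few inline rows rather than the real file. Only the three Hedgehog entries named above are included, and the last row is a made-up placeholder.

```python
from collections import defaultdict

# Assumed row layout: (namespace, external ID, FamPlex ID).
ROWS = [
    ("IP", "IPR001657", "Hedgehog"),
    ("IP", "IPR001767", "Hedgehog"),
    ("IP", "IPR003586", "Hedgehog"),
    ("IP", "IPR000000", "ExampleFamily"),  # made-up placeholder row
]


def multi_mapped(rows, namespace="IP"):
    """Return FamPlex IDs mapped to more than one ID in `namespace`."""
    by_fplx = defaultdict(list)
    for ns, ext_id, fplx_id in rows:
        if ns == namespace:
            by_fplx[fplx_id].append(ext_id)
    # Keep only the one-to-many cases, which deserve a manual look.
    return {k: v for k, v in by_fplx.items() if len(v) > 1}


for fplx_id, ids in multi_mapped(ROWS).items():
    print(fplx_id, len(ids), ids)
```

To run this against the actual file, the rows could be read with `csv.reader`, again assuming the three-column layout holds.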

johnbachman commented 3 years ago

Interesting. In the case of InterPro mappings, most (all?) of them were automatically generated using the script import/interpro_mappings.py, which looked at the gene-level members and created a mapping only if the members were an exact match between FamPlex and InterPro. In the code:

https://github.com/sorgerlab/famplex/blob/ca2f88585f596d2a4ef0e411137f2e4d00eb2208/import/interpro_mappings.py#L89

(When the mappings were added, we used the default Jaccard index threshold of 1.)
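For reference, here is a minimal sketch of a Jaccard comparison on member sets. It is not the code from interpro_mappings.py; the function name and the gene symbols are illustrative assumptions. It shows why a threshold of 1 accepts any InterPro entry whose members coincide with the FamPlex family, regardless of whether that entry describes a family or a domain.

```python
def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two collections of members."""
    a, b = set(a), set(b)
    if not a and not b:
        # Choice made for this sketch: two empty sets do not count as a match.
        return 0.0
    return len(a & b) / len(a | b)


# Illustrative member sets (gene symbols), not read from either resource.
famplex_members = {"SHH", "IHH", "DHH"}
interpro_family_members = {"SHH", "IHH", "DHH"}
interpro_domain_members = {"SHH", "IHH", "DHH"}

THRESHOLD = 1.0
# Both comparisons pass: the index only sees member sets, not entry semantics.
print(jaccard(famplex_members, interpro_family_members) >= THRESHOLD)
print(jaccard(famplex_members, interpro_domain_members) >= THRESHOLD)
```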

bgyori commented 3 years ago

I see, that makes sense: the same set of Hedgehog proteins could all have a "Hint domain C-terminal". Still, semantically, that probably shouldn't be curated as an equivalence. What if we differentiated family and domain entries in InterPro and only added family equivalences?

johnbachman commented 3 years ago

That would definitely make sense if we could get that information systematically.
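One possible way to get the entry type systematically is sketched below. It assumes a tab-separated entry listing with ENTRY_AC, ENTRY_TYPE and ENTRY_NAME columns (the layout of InterPro's downloadable entry list, as far as I know). The sample rows are inline, and the types assigned to them are assumptions for illustration rather than values checked against InterPro.

```python
import csv
import io

# Inline stand-in for the entry listing; the entry types are assumed.
SAMPLE = (
    "ENTRY_AC\tENTRY_TYPE\tENTRY_NAME\n"
    "IPR001657\tFamily\tHedgehog protein\n"
    "IPR001767\tDomain\tHedgehog protein, Hint domain\n"
    "IPR003586\tDomain\tHint domain C-terminal\n"
)


def family_entries(handle):
    """Return the accessions whose entry type is 'Family'."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["ENTRY_AC"] for row in reader if row["ENTRY_TYPE"] == "Family"}


families = family_entries(io.StringIO(SAMPLE))
candidates = ["IPR001657", "IPR001767", "IPR003586"]
# Keep only candidate mappings that point at family-type entries.
kept = [ac for ac in candidates if ac in families]
print(kept)
```

Applied before the member-set comparison, a filter like this would keep the Hedgehog protein entry and drop the two Hint domain entries, provided the types are as assumed.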

sorgerlab / famplex

Non-exact equivalences #159