Open bgyori opened 3 years ago
Interesting. In the case of InterPro mappings, most (all?) of them were automatically generated using the script import/interpro_mappings.py, which looked at the gene-level members and created mappings only if the members were an exact match between FamPlex and InterPro. In the code:
(when the mappings were added we used the default Jaccard index threshold of 1).
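To make the matching criterion concrete, here is a minimal sketch of member-set matching with a Jaccard threshold. It is not the contents of import/interpro_mappings.py; the function names, the input dictionaries, and the toy member sets and placeholder entry keys are all assumptions made for illustration.

```python
def jaccard(a, b):
    """Jaccard index of two collections: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # treat two empty sets as non-matching
    return len(a & b) / len(a | b)


def propose_mappings(famplex_members, interpro_members, threshold=1.0):
    """Return (FamPlex ID, InterPro ID) pairs whose gene-level member
    sets reach the given Jaccard threshold. With threshold=1.0 only
    identical member sets are paired."""
    mappings = []
    for fplx_id, fplx_genes in famplex_members.items():
        for ip_id, ip_genes in interpro_members.items():
            if jaccard(fplx_genes, ip_genes) >= threshold:
                mappings.append((fplx_id, ip_id))
    return mappings


# Toy data (hypothetical): a family entry and a domain entry that
# happen to cover exactly the same genes, plus a broader entry.
famplex_members = {"Hedgehog": {"SHH", "IHH", "DHH"}}
interpro_members = {
    "family_entry": {"SHH", "IHH", "DHH"},
    "domain_entry": {"SHH", "IHH", "DHH"},
    "broader_entry": {"SHH", "IHH", "DHH", "OTHER"},
}

result = propose_mappings(famplex_members, interpro_members)
```

In this toy case both the family entry and the domain entry pass the threshold of 1, because the criterion compares member sets only and knows nothing about what kind of entry it is looking at.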
I see, that makes sense, since the same set of Hedgehog proteins could all have a "Hint domain C-terminal". Still, semantically, that probably shouldn't be curated as an equivalence. What if we differentiated family and domain entries in InterPro and only added family equivalences?
That would definitely make sense if we could get that information systematically.
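A rough sketch of what that filter could look like, assuming we can obtain a lookup from InterPro ID to entry type (InterPro labels entries with types such as Family and Domain). The function name, the input shapes, and the specific types assigned to the three IDs below are assumptions for illustration and would need to be checked against InterPro itself.

```python
def keep_family_equivalences(equivalences, entry_types):
    """Split (InterPro ID, FamPlex ID) pairs into those whose InterPro
    entry is typed "Family" and everything else (domains, unknown IDs)."""
    kept, dropped = [], []
    for ip_id, fplx_id in equivalences:
        if entry_types.get(ip_id) == "Family":
            kept.append((ip_id, fplx_id))
        else:
            dropped.append((ip_id, fplx_id))
    return kept, dropped


# Entry types assumed here for illustration only.
entry_types = {
    "IPR001657": "Family",  # Hedgehog protein
    "IPR001767": "Domain",  # Hedgehog protein, Hint domain
    "IPR003586": "Domain",  # Hint domain C-terminal
}

equivalences = [
    ("IPR001657", "Hedgehog"),
    ("IPR001767", "Hedgehog"),
    ("IPR003586", "Hedgehog"),
]

kept, dropped = keep_family_equivalences(equivalences, entry_types)
```

Returning the dropped pairs as well would let us review them by hand before removing anything from the equivalences file.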
I found that there are a large number of equivalences in equivalences.csv that are not exact matches, for instance, in the case of InterPro mappings. As an example, take FPLX:Hedgehog, which is mapped to 6 different InterPro entries. One that looks exact is https://www.ebi.ac.uk/interpro/entry/InterPro/IPR001657/ (Hedgehog protein), but the others include, e.g., https://www.ebi.ac.uk/interpro/entry/InterPro/IPR001767/ (Hedgehog protein, Hint domain) and https://www.ebi.ac.uk/interpro/entry/InterPro/IPR003586/ (Hint domain C-terminal), which I don't think should be considered equivalences. I suspect that these might have been added with the goal of adding as many IP->FPLX mappings as possible from sources that produce various groundings in InterPro. Still, they are misleading if interpreted in the opposite direction.