uniprot / enzymeportal

The EBI Enzyme Portal
http://www.ebi.ac.uk/enzymeportal/
Apache License 2.0

Redundant diseases #187

Open rafael-alcantara opened 10 years ago

rafael-alcantara commented 10 years ago

Some entries are linked to more than one disease, but some of those diseases are actually the same one listed twice (e.g. "Alzheimer disease" [MeSH D000544] and "Alzheimer's disease" [EFO_0000249]) and are shown with (almost) identical names and descriptions.

This happens because, to build our mega-map, we query BioPortal with MeSH terms to get the EFO identifiers that have identical names.

We may store equivalent identifiers (unification cross-references) in our database, but then we should also store that equivalence as metadata and use it in the presentation layer to avoid redundancy.
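To make the idea concrete, here is a minimal sketch of how the presentation layer could collapse equivalent diseases once the equivalences are stored. Everything in it is an assumption for illustration: the function name `collapse_equivalent`, the dict-based record shape, and the `xrefs` field are hypothetical and do not reflect the actual Enzyme Portal schema.

```python
def collapse_equivalent(diseases, equivalences):
    """Merge disease records that are declared equivalent.

    diseases:     list of dicts with at least an "id" key (hypothetical shape).
    equivalences: iterable of (id_a, id_b) pairs, i.e. unification xrefs.
    Returns one record per group; the first record seen is kept as the
    primary one and the others are listed under "xrefs".
    """
    # Union-find over the known disease IDs.
    parent = {d["id"]: d["id"] for d in diseases}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in equivalences:
        # Ignore pairs that mention IDs not present in this entry.
        if a in parent and b in parent:
            ra, rb = find(a), find(b)
            if ra != rb:
                parent[rb] = ra

    # Group records by their representative, preserving input order.
    groups = {}
    for d in diseases:
        groups.setdefault(find(d["id"]), []).append(d)

    return [
        {**members[0], "xrefs": [m["id"] for m in members[1:]]}
        for members in groups.values()
    ]


# The two identifiers below are the ones quoted in this issue.
entry_diseases = [
    {"id": "D000544", "name": "Alzheimer disease"},
    {"id": "EFO_0000249", "name": "Alzheimer's disease"},
]
shown = collapse_equivalent(entry_diseases, [("D000544", "EFO_0000249")])
```

With this approach the database keeps both identifiers, and only the display step decides which one to show.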

rafael-alcantara commented 10 years ago

Each line in the UniMed file corresponds to one disease line in a UniProt entry.

From the MeSH IDs we try to get EFO IDs using the BioPortal web services, by exact match of the disease name. This does not always work, so we have fewer EFO IDs than MeSH IDs, but we can be sure that the ones we have are equivalent. The advantage of EFO entries is that they provide a definition of the disease. However, discarding them would not be very harmful, as their MeSH equivalents are already there.
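As a rough illustration of the exact-match step, the sketch below does the lookup against an in-memory table instead of calling BioPortal. The function name `map_mesh_to_efo`, the input shapes, and the case-insensitive comparison are all assumptions; the real lookup goes through the BioPortal web services and may behave differently.

```python
def map_mesh_to_efo(mesh_names, efo_names):
    """Map MeSH IDs to EFO IDs by exact disease-name match.

    mesh_names: dict of MeSH ID -> disease name.
    efo_names:  dict of EFO ID -> list of names for that term
                (hypothetical shape, not the BioPortal response format).
    Names are compared after trimming and case-folding (an assumption).
    MeSH IDs without a match are simply left out of the result.
    """
    index = {}
    ambiguous = set()
    for efo_id, names in efo_names.items():
        for name in names:
            key = name.strip().casefold()
            # A name claimed by two different EFO terms cannot be trusted.
            if key in index and index[key] != efo_id:
                ambiguous.add(key)
            index[key] = efo_id

    mapping = {}
    for mesh_id, name in mesh_names.items():
        key = name.strip().casefold()
        if key in index and key not in ambiguous:
            mapping[mesh_id] = index[key]
    return mapping
```

Because unmatched and ambiguous names are dropped rather than guessed, the result has fewer EFO IDs than MeSH IDs, but each pair it does return is a same-name match.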

I suggest ignoring EFO IDs, trying to ignore only those MeSH IDs which are equivalent to existing MIM IDs (risky), and perhaps retrieving disease descriptions from somewhere other than BioPortal (the new version of their web service lacks some definitions which used to be there).

Any other ideas?