Pharmacogenomics - mapping ChEBI identifiers to ChEMBL

opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal

https://platform.opentargets.org https://genetics.opentargets.org

Apache License 2.0


Pharmacogenomics - mapping ChEBI identifiers to ChEMBL #3116

Closed DSuveges closed 3 months ago

DSuveges commented 9 months ago

When ingesting PharmGKB data, the EVA team uses OLS to map drug labels to the ChEBI molecule ontology. They have been using OLS to map disease labels to EFO before, so using the applicable drug ontology makes sense for them. The problem is that we might not have ChEBI cross-references, so we have to investigate how to normalize to ChEMBL.

buniello commented 9 months ago

Thanks @DSuveges, I am assigning this to Manuel as a placeholder for discussion on Monday.

d0choa commented 9 months ago

UniChem contains the cross-references. I can't remember if we use UniChem in production. @ireneisdoomed will know more about this for sure.

DSuveges commented 9 months ago

> UniChem contains the cross-references. I can't remember if we use UniChem in production. @ireneisdoomed will know more about this for sure

I checked our molecule dataset and it has no xrefs to UniChem (neither does the UI). Unfortunately, it seems OLS can't provide the cross-references either. So either EVA or we have to go to ChEBI or, I assume, we can incorporate this xref when the molecule index is built.

However, I think this problem is tangentially related to an issue we have raised multiple times: the ETL could (should) normalize all main entities (disease/target/drug). The infrastructure is there; we already use it to normalize literature entities. I don't know much about the implications, though.

jdhayhurst commented 9 months ago

Hi @DSuveges, do you have the ChEBI terms that require mapping? I may have misunderstood the task here, but you can map any ChEBI to ChEMBL using the UniChem API, e.g. for NIRAPARIB (ChEBI: 176844) you'd do:

curl -X POST "https://www.ebi.ac.uk/unichem/api/v1/compounds" -H "accept: application/json" -H "Content-Type: application/json" -d "{ \"compound\": \"176844\", \"sourceID\": 7, \"type\": \"sourceID\"}" | jq '.compounds[0].sources[0].compoundId'
#>>> "CHEMBL1094636"

Or assuming we have many ids to map, a better option is probably to use the 'src1src7.txt.gz' source mapping file from here, which is just two columns, one for ChEBI ids, the other for ChEMBL.

If the mapping returns multiple compounds, I'm not sure how to resolve them to a single identifier, though that scenario may not occur with our inputs; we'd need to check.
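A minimal Python sketch of the mapping-file route, which also flags ChEBI ids that resolve to more than one ChEMBL id. The layout is an assumption: tab-separated columns with the ChEMBL id (source 1) first and the ChEBI id (source 7) second, plus a header line. The function names and the local file path are hypothetical.

```python
import gzip
from collections import defaultdict


def parse_mapping(lines):
    """Build {ChEBI id: set of ChEMBL ids} from tab-separated mapping lines.

    Assumption: column 1 holds the ChEMBL id and column 2 the ChEBI id; any
    line whose first field does not start with "CHEMBL" is treated as a header.
    """
    chebi_to_chembl = defaultdict(set)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2 or not fields[0].startswith("CHEMBL"):
            continue  # skip header or malformed rows
        chembl_id, chebi_id = fields
        chebi_to_chembl[chebi_id].add(chembl_id)
    return dict(chebi_to_chembl)


def load_mapping(path):
    """Read a gzipped mapping file from a local path (hypothetical location)."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return parse_mapping(handle)


def ambiguous(mapping):
    """Return only the ChEBI ids that map to more than one ChEMBL id."""
    return {chebi: ids for chebi, ids in mapping.items() if len(ids) > 1}
```

Running `ambiguous(load_mapping("src1src7.txt.gz"))` on a downloaded copy would show whether the multiple-compound scenario actually occurs for our inputs.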

DSuveges commented 9 months ago

The UniChem mapping file is available from FTP and can be used to generate cross-references to ChEMBL. However, the coverage is not 100%: 54 ChEBI identifiers from the pgx dataset cannot be mapped to ChEMBL, which means losing 7643 entries (~30%).

A manual check of some of these molecules indicates they do exist in ChEMBL. It needs to be explored.

Notebook is here

~~Also, there are 4k pgx evidence with no ChEBI identifier either. I'm double-checking this with EVA.~~ It's known and expected.

DSuveges commented 9 months ago

About 30% of the pgx entries carry ChEBI IDs that the mapping file cannot resolve. I was wondering whether the API would work better, but it seems it doesn't:

CHEBI_ID=87715

curl -X POST "https://www.ebi.ac.uk/unichem/api/v1/compounds" \
    -H "accept: application/json" \
    -H "Content-Type: application/json" \
    -d "{ \"compound\": \"${CHEBI_ID}\", \"sourceID\": 7, \"type\": \"sourceID\"}" | 
    jq '.compounds[0].sources[0].compoundId'

returns nothing. However CHEBI_ID=87715 is a
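For scripting many lookups, the same request can be sketched in Python so that an empty reply becomes an explicit empty list rather than silent output. The response shape is an assumption taken from the jq filter above (`compounds[].sources[].compoundId`). Matching on the `CHEMBL` prefix is a choice made here, not a documented UniChem feature, and the function names are hypothetical.

```python
import json
import urllib.request

UNICHEM_URL = "https://www.ebi.ac.uk/unichem/api/v1/compounds"


def build_payload(chebi_id):
    """Request body mirroring the curl call: look up by ChEBI (sourceID 7)."""
    return {"compound": str(chebi_id), "sourceID": 7, "type": "sourceID"}


def chembl_ids(response):
    """Collect every ChEMBL id in a parsed response; empty list if none.

    Assumption: the JSON has the shape implied by the jq filter, i.e.
    response["compounds"][i]["sources"][j]["compoundId"].
    """
    found = []
    for compound in response.get("compounds") or []:
        for source in compound.get("sources") or []:
            compound_id = str(source.get("compoundId", ""))
            if compound_id.startswith("CHEMBL") and compound_id not in found:
                found.append(compound_id)
    return found


def query_unichem(chebi_id, timeout=30):
    """POST the lookup and return the ChEMBL ids (requires network access)."""
    request = urllib.request.Request(
        UNICHEM_URL,
        data=json.dumps(build_payload(chebi_id)).encode("utf-8"),
        headers={"accept": "application/json", "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return chembl_ids(json.load(reply))
```

Unlike taking `sources[0]`, this scans all sources of all returned compounds, so it does not depend on ChEMBL being listed first.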

DSuveges commented 9 months ago

Yes, I could verify that our drug index does have cross-references to ChEBI; however, the two sets of ChEBI-ChEMBL pairs are not comparable:

+--------------------+--------+--------------+
|mapping_source      |unmapped|unmapped chEBI|
+--------------------+--------+--------------+
|drug_index          |7346    |91            |
|UniChem             |7643    |52            |
|drug index + UniChem|6462    |45            |
+--------------------+--------+--------------+

(see details under the gist)
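As a rough illustration of how such a comparison can be computed, here is a plain-Python sketch that counts unmapped evidence rows and distinct unmapped ChEBI ids per mapping source and for the union. It is not the code from the gist; the function names and the toy inputs are invented for illustration only.

```python
def coverage(evidence_chebi_ids, mapping):
    """Return (unmapped evidence rows, distinct unmapped ChEBI ids)."""
    unmapped = [chebi for chebi in evidence_chebi_ids if chebi not in mapping]
    return len(unmapped), len(set(unmapped))


def combine(*mappings):
    """Union several {ChEBI: ChEMBL} dicts; earlier sources take precedence."""
    merged = {}
    for mapping in mappings:
        for chebi, chembl in mapping.items():
            merged.setdefault(chebi, chembl)
    return merged


# Toy inputs, invented for illustration only (one id per evidence row).
evidence = ["CHEBI_A", "CHEBI_A", "CHEBI_B", "CHEBI_C"]
drug_index = {"CHEBI_A": "CHEMBL_X"}
unichem = {"CHEBI_B": "CHEMBL_Y"}

for label, mapping in [
    ("drug_index", drug_index),
    ("UniChem", unichem),
    ("drug index + UniChem", combine(drug_index, unichem)),
]:
    rows, ids = coverage(evidence, mapping)
    print(f"{label}: {rows} unmapped rows, {ids} unmapped ChEBI ids")
```

Counting rows and distinct ids separately matters because a handful of frequently used ChEBI ids can account for a large share of the evidence.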

DSuveges commented 9 months ago

ChEBI IDs with no mapping in any source:

+------------+
|      drugId|
+------------+
| CHEBI_33234|
| CHEBI_31899|
| CHEBI_31859|
|CHEBI_145221|
|  CHEBI_9648|
|  CHEBI_8656|
|        null|
| CHEBI_87715|
| CHEBI_48432|
| CHEBI_31941|
|  CHEBI_9011|
| CHEBI_37988|
|  CHEBI_3723|
|  CHEBI_6887|
| CHEBI_91749|
| CHEBI_22198|
| CHEBI_82978|
| CHEBI_48923|
| CHEBI_28384|
| CHEBI_17833|
| CHEBI_18723|
|  CHEBI_3160|
|  CHEBI_6046|
| CHEBI_22984|
| CHEBI_22907|
| CHEBI_23965|
|  CHEBI_9513|
|  CHEBI_5938|
| CHEBI_15847|
|  CHEBI_5264|
+------------+

ireneisdoomed commented 9 months ago

Resolving drugs to a drugId in the ETL will happen in 23.04. For the time being, mapping to ChEMBL is handled by the data team:

drugFromSourceId: ChEBI identifier from PharmGKB
drugId: ChEMBL identifier using crossreferences from the latest molecule dataset
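A small sketch of how the two fields could relate, assuming the cross-references are available as a dict keyed by the numeric part of the ChEBI id. ChEBI ids appear in this thread as `CHEBI_33234`, `CHEBI:10033` and bare `176844`, so the sketch normalizes them first. The helper names and the dict-based record are hypothetical; the production logic lives in PySpark.

```python
import re

# Accepts "CHEBI_123", "CHEBI:123" or a bare "123".
_CHEBI_PATTERN = re.compile(r"^(?:CHEBI[:_])?(\d+)$", re.IGNORECASE)


def normalise_chebi(raw):
    """Return the numeric part of a ChEBI id, or None if it does not parse."""
    if raw is None:
        return None
    match = _CHEBI_PATTERN.match(raw.strip())
    return match.group(1) if match else None


def annotate(evidence, xrefs):
    """Copy a record and add drugId looked up from {numeric ChEBI: ChEMBL}.

    drugId is None when the ChEBI id is missing or has no cross-reference.
    """
    key = normalise_chebi(evidence.get("drugFromSourceId"))
    return {**evidence, "drugId": xrefs.get(key)}
```

For example, `annotate({"drugFromSourceId": "CHEBI_176844"}, {"176844": "CHEMBL1094636"})` would fill `drugId` with the niraparib identifier quoted earlier in the thread.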

ireneisdoomed commented 7 months ago

With the ChEBI to ChEMBL solution, we have mapped 90% of the high-confidence evidence.

The outstanding cases are due to 7 ChEBI IDs. This is an example shared by @d0choa and Ellie:

In this case we do have a ChEBI cross-reference, but the two resources point to slightly different molecules: PharmGKB points to CHEBI:10033, the parent structure for warfarin, whereas ChEMBL points to CHEBI:87732, a stereoisomer. PharmGKB's mapping looks more sensible to me, so I'll ask ChEMBL about it.

Independently, my hypothesis is that we can get full coverage just by using the drug name.
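One way the name-based hypothesis could be tested is sketched below: a case-insensitive lookup against molecule names and synonyms, returning a ChEMBL id only when the match is unambiguous. The record keys `id`, `name` and `synonyms` are assumptions about the molecule dataset, and the function names are hypothetical.

```python
def name_key(name):
    """Lower-case a drug name and collapse whitespace for matching."""
    return " ".join(name.lower().split())


def build_name_index(molecules):
    """Map every name and synonym to the set of ChEMBL ids that carry it.

    `molecules` is an iterable of dicts with the hypothetical keys
    'id', 'name' and 'synonyms'.
    """
    index = {}
    for molecule in molecules:
        labels = [molecule.get("name")] + list(molecule.get("synonyms") or [])
        for label in labels:
            if label:
                index.setdefault(name_key(label), set()).add(molecule["id"])
    return index


def resolve_by_name(drug_name, index):
    """Return a ChEMBL id only when the name matches exactly one molecule."""
    hits = index.get(name_key(drug_name), set()) if drug_name else set()
    return next(iter(hits)) if len(hits) == 1 else None
```

Returning `None` for ambiguous names keeps a wrong match from being silently accepted; those cases would still need manual review.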

jdhayhurst commented 6 months ago

The PySpark code that needs to be implemented in the ETL is here