opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Remove crystallising molecules from the drug dataset obtained from the OT Drug parquet #1973

Closed MarineGirardey closed 1 year ago

MarineGirardey commented 2 years ago

Background

From the OT Drug parquet files, I've obtained the structures/targets that interact with each drug, along with the associated target gene IDs. Now I need to filter out the ions, small molecules, etc. that are crystallizing agents (for X-ray crystallography). Even if they are drugs, their interactions with targets are not specific and not interesting to display on the OT Platform.

How can we determine in an automated way which drugs are crystallizing agents?

Extract of the data

+------------------+---------------+--------------------+--------------------+
|MOLECULE_CHEMBL_ID|MOLECULE_PDB_ID|        STRUCTURE_ID|NB_OF_STRUCT_PER_MOL|
+------------------+---------------+--------------------+--------------------+
|         CHEMBL692|            GOL|[1ah8, 1ahp, 1ayf...|               20302|
|     CHEMBL1236970|             ZN|[12ca, 1a0b, 1a0q...|               18839|
|      CHEMBL457299|            EDO|[1atg, 1azo, 1b63...|               12891|
|      CHEMBL113178|            MSE|[1a62, 1a7a, 1a8o...|                9800|
|         CHEMBL504|            DMS|[1bju, 1bjv, 1c1p...|                3175|
|     CHEMBL1232653|            FAD|[1a8p, 1ahv, 1ahz...|                2449|
|       CHEMBL14249|            ATP|[1a0i, 1a49, 1a5u...|                1856|
|     CHEMBL1234276|            MES|[10gs, 11gs, 12gs...|                1773|
|     CHEMBL1235259|            PGE|[1a9z, 1ksf, 1lrj...|                1712|
|      CHEMBL284377|            SEP|[1a37, 1apm, 1atp...|                1621|
|     CHEMBL1234613|            NAD|[1a4z, 1a5z, 1a71...|                1620|
|      CHEMBL384759|            GDP|[1a2k, 1a4r, 1aa9...|                1603|
|      CHEMBL116736|            FMT|[1a5n, 1a5o, 1an5...|                1562|
|      CHEMBL295069|            NAP|[1a27, 1ads, 1ae1...|                1529|
|     CHEMBL1201794|            FMN|[1ag9, 1ahn, 1akq...|                1345|
|     CHEMBL1614854|            BGC|[1aa5, 1abr, 1au1...|                1336|
|        CHEMBL1261|            CIT|[1a59, 1afl, 1agr...|                1246|
|     CHEMBL1233147|            GTP|[1a8r, 1a9c, 1c1y...|                1239|
|       CHEMBL82202|            PLP|[1a3g, 1a50, 1a5a...|                1183|
|      CHEMBL418052|            SAH|[10mh, 1af7, 1aqi...|                1061|
+------------------+---------------+--------------------+--------------------+
only showing top 20 rows

Total number of structures: 93819

Observation

It seems that the first 4 drugs are linked to a large number of targets, which may indicate that they are crystallizing agents.
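
The observation above can be turned into a crude automated check. Below is a minimal sketch in plain Python, using counts copied from the extract. The cut-off of 5000 structures is an arbitrary assumption chosen for illustration, and `flag_promiscuous` is a hypothetical helper name. A count-based cut-off alone would likely not separate crystallizing agents from abundant natural ligands such as ATP or FAD, so it is only a first pass.

```python
# Structure counts per PDB ligand ID, copied from the top rows of the extract.
STRUCTURE_COUNTS = {
    "GOL": 20302,
    "ZN": 18839,
    "EDO": 12891,
    "MSE": 9800,
    "DMS": 3175,
    "FAD": 2449,
    "ATP": 1856,
}

# Assumption: arbitrary cut-off for illustration, not a validated threshold.
MAX_STRUCTURES = 5000


def flag_promiscuous(counts, max_structures=MAX_STRUCTURES):
    """Return ligand IDs found in more than `max_structures` structures,
    sorted from the most to the least frequent."""
    flagged = [ligand for ligand, n in counts.items() if n > max_structures]
    return sorted(flagged, key=lambda ligand: counts[ligand], reverse=True)


flagged = flag_promiscuous(STRUCTURE_COUNTS)
print(flagged)
```

With this cut-off, only the first four rows of the extract are flagged.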

ireneisdoomed commented 2 years ago

Thanks for sharing this!

I was hoping that we could filter these molecules by keeping only those that have an indication, but as cases like CHEMBL1236970 or CHEMBL692 show, this strategy wouldn't be sufficient.

A different approach is to filter out drugs without a known mechanism of action, so that we are more confident that the structures we display are relevant. @MarineGirardey, could you please attach here a file with this dataset that includes all ChEMBL IDs for which we have a target? This information comes from the linkedTargets field of the molecule dataset. Please keep the list of these targets in the final dataframe.

Once we have this, we can see whether this is a useful way of filtering the molecules.
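
The filter proposed above could be sketched as follows. This is a minimal illustration in plain Python rather than the Spark code used for the real dataset. It assumes `linkedTargets` can be modelled as a dictionary with a `rows` list of target IDs. The molecule and target IDs are placeholders, and `with_linked_targets` is a hypothetical helper name.

```python
# Toy molecule records; IDs and targets are placeholders, not real annotations.
# Assumption: `linkedTargets` is modelled as a dict holding a `rows` list.
molecules = [
    {"id": "CHEMBL_A", "linkedTargets": {"rows": ["ENSG_1", "ENSG_2"]}},
    {"id": "CHEMBL_B", "linkedTargets": {"rows": []}},
    {"id": "CHEMBL_C", "linkedTargets": None},
]


def with_linked_targets(records):
    """Keep molecules with at least one linked target, retaining the target list."""
    kept = []
    for record in records:
        # Treat a missing or null field the same as an empty target list.
        targets = (record.get("linkedTargets") or {}).get("rows") or []
        if targets:
            kept.append({"id": record["id"], "targets": list(targets)})
    return kept


filtered = with_linked_targets(molecules)
print(filtered)
```

Only the first toy record survives, and its target list is kept alongside the ID as requested.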

DSuveges commented 2 years ago

This is a complicated issue. Looking at the data, 3675 molecules from our molecule index have a cross-reference to PDB (that is, that many molecules can be found in at least one structure). However, if we look at the linked indications and targets, we see that ~3k of them have no associated disease or target. That's why I think applying such a filter would not be good.

I think we can exclude some of the obvious ones (e.g. glycol, glycerine), then take a look at the targets and keep just one structure for each target. But this decision depends heavily on how the analysis of the structures goes.

The attached JSON contains the molecules and some annotations: molecules_w_pdb.json.gz
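
The two-step idea in this comment (drop the obvious agents, then keep one structure per target) could be sketched as below. The exclusion list is a hand-picked assumption (GOL and EDO are the PDB ligand IDs for glycerol and ethylene glycol), the rows are placeholders, and keeping the first structure seen is an arbitrary choice; a real selection might prefer another criterion, such as resolution.

```python
# Assumption: hand-curated exclusion list of PDB ligand IDs
# (GOL = glycerol, EDO = ethylene glycol).
OBVIOUS_AGENTS = {"GOL", "EDO"}

# Toy rows of (pdb_ligand_id, target_id, structure_id); placeholders, not real data.
rows = [
    ("GOL", "TARGET_1", "struct_a"),
    ("LIG1", "TARGET_1", "struct_b"),
    ("LIG1", "TARGET_1", "struct_c"),
    ("LIG2", "TARGET_2", "struct_d"),
]


def one_structure_per_target(records, excluded=OBVIOUS_AGENTS):
    """Drop excluded ligands, then keep the first structure seen for each target."""
    chosen = {}
    for ligand, target, structure in records:
        if ligand in excluded:
            continue
        # setdefault only stores the structure if the target is not yet present.
        chosen.setdefault(target, structure)
    return chosen


representatives = one_structure_per_target(rows)
print(representatives)
```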

ireneisdoomed commented 2 years ago

I have just had a conversation about this issue with @madasme, from ChEMBL, who is also working on mining more information on the ligands bound to proteins in PDB.

She uses a list of artifacts from BioLiP (publication) to discard ligands that are considered crystallization artifacts. She filters these 465 PDB IDs out of the initial table where structures are linked to molecules.

Note that this list includes cofactors like lithium, which we know are valuable.
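
One way to combine the artifact list with the caveat about valuable ligands is an explicit allow-list. The sketch below is a hedged illustration: `ARTIFACT_IDS` is a tiny stand-in and not the actual BioLiP list of 465 PDB IDs, `ALLOW_LIST` and `drop_artifacts` are hypothetical names, and the structure IDs for LI are placeholders.

```python
# Assumption: tiny stand-in for the BioLiP artifact list (the real one has 465 PDB IDs).
ARTIFACT_IDS = {"GOL", "EDO", "LI"}

# Assumption: ligands we want to keep even though the artifact list contains them
# (LI is the PDB ligand ID for the lithium ion).
ALLOW_LIST = {"LI"}

# Ligand ID -> structure IDs. GOL and ATP entries come from the extract above;
# the LI entry is a placeholder.
ligand_structures = {
    "GOL": ["1ah8", "1ahp", "1ayf"],
    "ATP": ["1a0i", "1a49", "1a5u"],
    "LI": ["placeholder_1"],
}


def drop_artifacts(table, artifacts=ARTIFACT_IDS, allow=ALLOW_LIST):
    """Remove ligands on the artifact list unless they are explicitly allowed."""
    blocked = artifacts - allow
    return {ligand: ids for ligand, ids in table.items() if ligand not in blocked}


kept = drop_artifacts(ligand_structures)
print(sorted(kept))
```

Here GOL is removed while ATP and LI are retained.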

d0choa commented 2 years ago

This is starting to look like a Venn diagram with multiple sets: all drugs, drugs with MoA including the protein, co-crystallising artefacts, natural ligands, etc. There is probably a sweet spot with the things we are interested in.
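
The Venn-diagram view maps naturally onto set operations. The sketch below uses placeholder molecule IDs with invented membership, and the definition of the "sweet spot" is only one possible assumption (drugs with a MoA on the protein that are not artefacts), not an agreed rule.

```python
# Toy sets of placeholder molecule IDs; membership is invented for illustration.
all_drugs = {"M1", "M2", "M3", "M4", "M5"}
moa_on_protein = {"M1", "M2", "M3"}
artefacts = {"M3", "M4"}
natural_ligands = {"M2", "M5"}

# Assumption: one candidate "sweet spot" is drugs whose MoA includes the protein,
# minus anything flagged as a co-crystallising artefact.
sweet_spot = (all_drugs & moa_on_protein) - artefacts

# Molecules in the sweet spot that are also natural ligands may need manual review.
needs_review = sweet_spot & natural_ligands

print(sorted(sweet_spot), sorted(needs_review))
```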