ikmckenz / target-pred-py

A simple machine learning model for small-molecule target prediction in Python.
GNU General Public License v3.0

Implement Swiss Predict query of ChEMBL #1

Closed ikmckenz closed 5 years ago

ikmckenz commented 5 years ago

We should implement the dataset query from the original theoretical paper behind Swiss Predict. It should be a new query in src/data/chembl_etl.py. The existing query should be moved to its own function, and the class slightly refactored so that it is easy to switch between the different queries we might implement (eventually there will be several).
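One way the refactor could look, as a minimal sketch only: each query lives in its own function and is registered under a name, and the ETL class picks one by name. The names `QUERIES`, `register_query`, `ChemblETL` and `get_sql`, as well as the placeholder SQL strings, are hypothetical and are not taken from the existing src/data/chembl_etl.py.

```python
from typing import Callable, Dict

# Hypothetical registry mapping a query name to a function that returns SQL text.
QUERIES: Dict[str, Callable[[], str]] = {}


def register_query(name: str):
    """Decorator that stores a query-building function under `name`."""
    def decorator(func: Callable[[], str]) -> Callable[[], str]:
        QUERIES[name] = func
        return func
    return decorator


@register_query("default")
def default_query() -> str:
    # Placeholder: the query that already exists would be moved here.
    return "SELECT 'existing query placeholder'"


@register_query("swiss")
def swiss_query() -> str:
    # Placeholder: the new query described in this issue would go here.
    return "SELECT 'swiss query placeholder'"


class ChemblETL:
    """Hypothetical ETL class that selects one registered query by name."""

    def __init__(self, query_name: str = "default"):
        if query_name not in QUERIES:
            raise ValueError(f"Unknown query: {query_name!r}")
        self.query_name = query_name

    def get_sql(self) -> str:
        # Look up the registered function and build the SQL text.
        return QUERIES[self.query_name]()
```

With this layout, adding another query later only means writing one more decorated function; callers switch with something like `ChemblETL("swiss")`.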

From the paper:

Release 15 of the ChEMBL database (Gaulton et al., 2012) was used throughout this work. Interactions were selected according to the following criteria: they should (i) involve human proteins, (ii) be annotated as direct binding ('assay_type'='B') with an activity (Ki, Kd, IC50 or EC50) <10 µM, (iii) involve molecules consisting of <80 heavy atoms and (iv) involve targets that are single proteins or protein complexes (e.g. excluding targets corresponding to protein families and assays with a confidence level <4). We further discarded ambiguous interactions that had reported activity values both below and above 10 µM in different assays. This was done to address the observed uncertainty of many protein–small molecule interaction datasets (Kramer et al., 2012). This results in a set of 347 889 interactions involving 1700 human proteins (1627) or protein complexes (73) and 224 412 molecules. As an additional benchmark, we also retrieved all ChEMBL molecules interacting with human proteins only with activities between 10 µM and 100 µM (i.e. none of them are part of the previous set of molecules). This consists of 79 682 molecules involved in 94 672 interactions (see Section 3.4). For all molecules, SMILES were retrieved from ChEMBL using the parent form. To compute the fraction of molecules with functional activity but without direct target, we retrieved all molecules involved in assays with assay_type='F' in human using the same threshold of 10 µM (340 256 molecules in total). In all, 59 311 of them (17.4%) do not have direct targets in ChEMBL based on the two criteria: (i) no binding data or only binding activity >1000 µM, and (ii) target_type equal to 'ORGANISM', 'CELL-LINE', 'TISSUE' or 'ADMET'. When determining whether two molecules have been tested in the same assay, all ChEMBL assays involving a human target were considered.
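The selection criteria (i) to (iv) and the removal of ambiguous interactions could be expressed roughly as follows. This is a sketch over plain Python dictionaries, not a ChEMBL query: the record keys (`organism`, `assay_type`, `activity_type`, `heavy_atoms`, `target_type`, `confidence`, `molecule`, `target`, `value`) and the function name `select_interactions` are hypothetical, the target type strings are assumed to match ChEMBL's labels, and the activity cutoff is passed in as a parameter in whatever unit the activity values are stored in.

```python
from typing import Dict, Iterable, Set, Tuple

ACTIVITY_TYPES = {"Ki", "Kd", "IC50", "EC50"}
# Assumed labels for "single proteins or protein complexes".
ALLOWED_TARGET_TYPES = {"SINGLE PROTEIN", "PROTEIN COMPLEX"}


def select_interactions(
    records: Iterable[Dict],
    cutoff: float,
    max_heavy_atoms: int = 80,
    min_confidence: int = 4,
) -> Set[Tuple[str, str]]:
    """Return (molecule, target) pairs that pass the filters and are unambiguous."""
    below: Set[Tuple[str, str]] = set()  # pairs with a measurement under the cutoff
    above: Set[Tuple[str, str]] = set()  # pairs with a measurement at or over it
    for r in records:
        if r["organism"] != "Homo sapiens":              # (i) human proteins
            continue
        if r["assay_type"] != "B":                       # (ii) direct binding
            continue
        if r["activity_type"] not in ACTIVITY_TYPES:     # (ii) Ki, Kd, IC50 or EC50
            continue
        if r["heavy_atoms"] >= max_heavy_atoms:          # (iii) <80 heavy atoms
            continue
        if r["target_type"] not in ALLOWED_TARGET_TYPES: # (iv) target kind
            continue
        if r["confidence"] < min_confidence:             # (iv) confidence >= 4
            continue
        pair = (r["molecule"], r["target"])
        if r["value"] < cutoff:
            below.add(pair)
        else:
            above.add(pair)
    # Drop pairs that were reported both below and above the cutoff.
    return below - above
```

A value exactly equal to the cutoff is treated here as "above", which is a choice made for the sketch; the quoted text does not say how that edge case was handled.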