jyaacoub / MutDTA

Improving the precision oncology pipeline by providing binding affinity perturbation predictions on a priori identified cancer driver genes.

Build Downloader for getting binding pocket sequences from KLIFS #109

Closed jyaacoub closed 1 week ago

jyaacoub commented 1 week ago

Getting pockets for kiba:

  1. Using the UniProt ID, we can get the pocket from the KLIFS database with the /kinase_ID API.
    1. For example: https://klifs.net/api/kinase_ID?kinase_name=O00141&species=HUMAN returns: [image]
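
As a minimal sketch of how the request URL in the example above is formed, the helper below builds the /kinase_ID query from a kinase name and a species. The function name `build_kinase_id_url` is hypothetical and not part of the repository. It only constructs the URL; sending the request and parsing the response are left out because the response format is not shown here.

```python
from urllib.parse import urlencode

# Endpoint taken from the example URL in this issue.
KLIFS_KINASE_ID_URL = "https://klifs.net/api/kinase_ID"


def build_kinase_id_url(kinase_name: str, species: str = "HUMAN") -> str:
    """Build the KLIFS /kinase_ID query URL (hypothetical helper).

    kinase_name is passed through as-is: a UniProt ID for kiba,
    or a cleaned gene name for davis.
    """
    query = urlencode({"kinase_name": kinase_name, "species": species})
    return f"{KLIFS_KINASE_ID_URL}?{query}"


print(build_kinase_id_url("O00141"))
```

Using `urlencode` rather than string concatenation keeps the query valid if an identifier ever contains characters that need escaping.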

Getting pockets for davis:

Same as for kiba, but we query with the raw gene name instead of the UniProt ID. Any mutation or phosphorylation information must be removed first, e.g. ABL1(F317I)p -> ABL1

  1. For example: https://klifs.net/api/kinase_ID?kinase_name=ABL1&species=HUMAN returns: [image]
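
The clean-up step described above (ABL1(F317I)p -> ABL1) could be sketched as follows. The function name `strip_davis_annotations` is hypothetical, and the rules are assumptions based only on the example in this issue: everything from the first `(` or `-` onward is dropped, then a single trailing `p` is removed. The inputs other than `ABL1(F317I)p` are made up for illustration.

```python
import re


def strip_davis_annotations(name: str) -> str:
    """Reduce a davis protein label to its bare gene name (hypothetical helper)."""
    # Drop everything from the first '(' (mutation) or '-' (assumed phospho-specifier) onward.
    base = re.split(r"[(-]", name, maxsplit=1)[0]
    # Remove a single trailing 'p' only; a 'p' elsewhere in the name is kept.
    if len(base) > 1 and base.endswith("p"):
        base = base[:-1]
    return base


for label in ["ABL1(F317I)p", "ABL1-phosphorylated", "ABL1p", "p38"]:
    print(label, "->", strip_davis_annotations(label))
```

Checking only the end of the string for `p` avoids truncating names that contain a lowercase `p` in another position.
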
jyaacoub commented 1 week ago

Number of pocket sequences found from KLIFS, as found/total (missing):

PDBbindDataset: 177/3785 (3608)
davis: 321/442 (121)
kiba: 221/228 (7)
Code

```python
# %%
import pandas as pd
from src.data_prep.downloaders import Downloader

df = pd.read_csv('../data/all_prots.csv')

id_status = {}
for db in df.db.unique():
    id = Downloader.download_pocket_seq(df[df.db == db].prot_id.to_list(),
                                        f"../data/pocket_seq/{db}/",
                                        tqdm_desc=f"Downloading {db} pocket sequences")
    id_status[db] = id
#%%
import json
# json.dump(id_status, open('../data/pocket_seq/seq_out.json', 'w'))
# id_status = json.load(open('../data/pocket_seq/seq_out.json', 'r'))
for db, st in id_status.items():
    total_ids = len(st)
    missing = list(id_status[db].values()).count(400)
    print(f"{db}: {total_ids - missing}/{total_ids} ({missing})")
```

import pandas as pd
from src.data_prep.downloaders import Downloader

df = pd.read_csv('../data/all_prots.csv')

id_status = {}
for db in ['davis']:#df.db.unique():
    if db == 'davis':
        gene_names = df[df.db == db].prot_id.to_list()
        ids = [gene.split('(')[0] for gene in gene_names] # get rid of mutation specifiers
        # get rid of phospho-specifiers
        ids = [gene.split('-')[0] for gene in ids]
        # get rid of trailing p (only at the end, so a 'p' elsewhere in the name is kept):
        ids = [gene[:-1] if gene.endswith('p') else gene for gene in ids]
    else:
        ids = df[df.db == db].prot_id.to_list()

    id = Downloader.download_pocket_seq(ids, 
                                        f"../data/pocket_seq/{db}/",
                                        tqdm_desc=f"Downloading {db} pocket sequences")
    id_status[db] = id
#%%
import json
# json.dump(id_status, open('../data/pocket_seq/seq_out.json', 'w'))
# id_status = json.load(open('../data/pocket_seq/seq_out.json', 'r'))
for db, st in id_status.items():
    total_ids = len(st)
    missing = list(id_status[db].values()).count(400)
    print(f"{db}: {total_ids - missing}/{total_ids} ({missing})")
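
The commented-out `json.dump` / `json.load` lines above could be written with context managers so the file handles are closed, as in the sketch below. The helper names `save_status`, `load_status` and `summarize` are hypothetical, and the status values are made up. The only assumption taken from the code above is that a value of 400 marks an ID whose pocket sequence was not found; the value 200 for found IDs is a placeholder.

```python
import json
import tempfile
from pathlib import Path


def save_status(id_status, path):
    """Write the per-database status dict to a JSON file."""
    with open(path, "w") as f:
        json.dump(id_status, f, indent=2)


def load_status(path):
    """Read the per-database status dict back from a JSON file."""
    with open(path) as f:
        return json.load(f)


def summarize(id_status):
    """Return 'db: found/total (missing)' lines, treating 400 as missing."""
    lines = []
    for db, statuses in id_status.items():
        total = len(statuses)
        missing = sum(1 for status in statuses.values() if status == 400)
        lines.append(f"{db}: {total - missing}/{total} ({missing})")
    return lines


# Made-up statuses for illustration only (200 is a placeholder for "found").
example_status = {
    "davis": {"ABL1": 200, "NOT_A_KINASE": 400},
    "kiba": {"O00141": 200},
}

# Round-trip through a temporary file instead of ../data/pocket_seq/seq_out.json.
with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "seq_out.json"
    save_status(example_status, out_path)
    restored = load_status(out_path)

for line in summarize(restored):
    print(line)
```

Saving the statuses once means the summary can be recomputed later without querying KLIFS again.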