Build Downloader for getting binding pocket sequences from KLIFS

Number of pocket sequences found in KLIFS, reported as found/total (missing):

PDBbindDataset: 177/3785 (3608)
davis: 321/442 (121)
kiba: 221/228 (7)

Code

```python
# %%
import pandas as pd
from src.data_prep.downloaders import Downloader

df = pd.read_csv('../data/all_prots.csv')

# Download pocket sequences per database and keep the per-protein status.
id_status = {}
for db in df.db.unique():
    status = Downloader.download_pocket_seq(df[df.db == db].prot_id.to_list(),
                                            f"../data/pocket_seq/{db}/",
                                            tqdm_desc=f"Downloading {db} pocket sequences")
    id_status[db] = status

# %%
import json
# json.dump(id_status, open('../data/pocket_seq/seq_out.json', 'w'))
# id_status = json.load(open('../data/pocket_seq/seq_out.json', 'r'))
for db, st in id_status.items():
    total_ids = len(st)
    missing = list(st.values()).count(400)  # a status of 400 is counted as missing
    print(f"{db}: {total_ids - missing}/{total_ids} ({missing})")
```
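The tally and the commented-out JSON caching can be sketched on their own, without hitting KLIFS. This is only a minimal sketch: `summarize` is a hypothetical helper, the example statuses are made up, and the only thing taken from the snippet above is that a value of `400` is counted as a missing pocket sequence (any other value is assumed to mean success).

```python
import json
import tempfile
from pathlib import Path

MISSING = 400  # status value the snippet above counts as "not found"


def summarize(id_status):
    """Return {db: (found, total, missing)} from a {db: {prot_id: status}} mapping."""
    summary = {}
    for db, statuses in id_status.items():
        total = len(statuses)
        missing = sum(1 for s in statuses.values() if s == MISSING)
        summary[db] = (total - missing, total, missing)
    return summary


# Made-up statuses for illustration only (not real KLIFS results).
example = {"davis": {"ABL1": "OK", "ABL1(E55K)": MISSING}}

# Cache the statuses to JSON and load them back, closing the files properly.
with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "seq_out.json"
    with open(out_path, "w") as f:
        json.dump(example, f)
    with open(out_path) as f:
        reloaded = json.load(f)

for db, (found, total, missing) in summarize(reloaded).items():
    print(f"{db}: {found}/{total} ({missing})")
```

Caching the statuses this way would let the summary be recomputed later without re-running the downloads.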

Note that for davis, some gene names include mutation-specific info that prevents them from matching (e.g. ABL1(E55K) instead of just ABL1).
Accounting for this by removing everything from the first '(' onward (along with things like phosphorylation info) leaves only 39 missing proteins.
Code for just davis

```python
# %%
import pandas as pd
from src.data_prep.downloaders import Downloader

df = pd.read_csv('../data/all_prots.csv')

id_status = {}
for db in ['davis']:  # df.db.unique():
    if db == 'davis':
        gene_names = df[df.db == db].prot_id.to_list()
        # get rid of mutation specifiers, e.g. ABL1(E55K) -> ABL1
        ids = [gene.split('(')[0] for gene in gene_names]
        # get rid of phospho-specifiers (anything after the first '-')
        ids = [gene.split('-')[0] for gene in ids]
        # get rid of trailing p (note: this cuts at the first lowercase 'p' anywhere in the name)
        ids = [gene.split('p')[0] for gene in ids]
    else:
        ids = df[df.db == db].prot_id.to_list()

    status = Downloader.download_pocket_seq(ids,
                                            f"../data/pocket_seq/{db}/",
                                            tqdm_desc=f"Downloading {db} pocket sequences")
    id_status[db] = status

# %%
import json
# json.dump(id_status, open('../data/pocket_seq/seq_out.json', 'w'))
# id_status = json.load(open('../data/pocket_seq/seq_out.json', 'r'))
for db, st in id_status.items():
    total_ids = len(st)
    missing = list(st.values()).count(400)  # a status of 400 is counted as missing
    print(f"{db}: {total_ids - missing}/{total_ids} ({missing})")
```
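One caveat with the clean-up above: `gene.split('p')[0]` cuts at the first lowercase 'p' anywhere in the name, not just at a trailing one. A possible stricter variant is sketched below. `clean_gene_name` is a hypothetical helper, not part of `Downloader`, and apart from ABL1(E55K) the example names are assumptions about what davis-style identifiers might look like. It has not been checked against the 39-missing count.

```python
import re


def clean_gene_name(name: str) -> str:
    """Reduce a davis-style protein name to a bare gene name (sketch only)."""
    base = name.split('(')[0]  # drop mutation specifiers, e.g. "(E55K)"
    base = base.split('-')[0]  # drop "-phosphorylated"-style suffixes
    # Drop a single trailing 'p' only when it directly follows an uppercase
    # letter or digit, so names that merely start with 'p' are left alone.
    return re.sub(r'(?<=[A-Z0-9])p$', '', base)


# Illustrative inputs; only "ABL1(E55K)" appears in the note above.
names = ["ABL1(E55K)", "ABL1-phosphorylated", "ABL1p", "p38-alpha"]
for name in names:
    naive = name.split('(')[0].split('-')[0].split('p')[0]
    print(f"{name!r}: naive={naive!r}, stricter={clean_gene_name(name)!r}")
```

For a name beginning with a lowercase 'p', the naive chain returns an empty string while the stricter version keeps the prefix. Whether that changes any KLIFS matches would need to be checked against the actual davis names.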

jyaacoub / MutDTA

Build Downloader for getting binding pocket sequences from KLIFS #109

Getting pockets for Kiba

Getting pockets for davis: