jyaacoub / MutDTA

Improving the precision oncology pipeline by providing binding affinity purtubations predictions on a pirori identified cancer driver genes.
1 stars 2 forks source link

Get input data for TCGA (MSAs and aflow confirmations + ligands from our dbs) #99

Closed jyaacoub closed 2 months ago

jyaacoub commented 4 months ago

As mentioned in #95 we should only focus on datasets that already have a GVPL dataset created

jyaacoub commented 4 months ago

Getting right protein sequences to focus on:

Using the TCGA analysis script (at this commit for playground.py) we only get protein sequences for the dbs that we have GVPL datasets for (not kiba).

This gives us the following stats for just **davis** and **pdbbind**

``` Filter #1 (seq_len) : 724 - 562 = 162 Filter #2 (ref_AA match): 160 - 42 = 118 davis 104 PDBbind 14 Name: db, dtype: int64 ``` ![image](https://github.com/jyaacoub/MutDTA/assets/50300488/3aed7b45-e672-4fbc-b05f-752261350f42)

jyaacoub commented 4 months ago

Get mutated sequences

This is simple and the code for it is just one .apply method on the dataframe that comes out:

Code to get 'mt_seq' col

```python def apply_mut(row): ref_seq = list(row['prot_seq']) ref_seq[row['mt_loc']-1] = row['mt_AA'] return ''.join(ref_seq) dfm['mt_seq'] = dfm.apply(apply_mut, axis=1) ```

jyaacoub commented 4 months ago

Run MSA

Using HHBlits via our MSARunner class we can get sequence alignments with our reference UniRef30_2020_06 database

submit as batch array job

Details

```python # %% from src.utils.seq_alignment import MSARunner from tqdm import tqdm import pandas as pd import os DATA_DIR = '/cluster/home/t122995uhn/projects/data/tcga' CSV = f'{DATA_DIR}/tcga_maf_davis_pdbbind.csv' N_CPUS=6 NUM_ARRAYS = 10 array_idx = 0#${SLURM_ARRAY_TASK_ID} df = pd.read_csv(CSV, index_col=0) df.sort_values(by='seq_len_y', inplace=True) # %% for DB in df.db.unique(): print('DB', DB) RAW_DIR = f'{DATA_DIR}/{DB}' # should already be unique if these are proteins mapped form tcga! unique_df = df[df['db'] == DB] ########################## Get job partition partition_size = len(unique_df) / NUM_ARRAYS start, end = int(array_idx*partition_size), int((array_idx+1)*partition_size) unique_df = unique_df[start:end] #################################### create fastas fa_dir = os.path.join(RAW_DIR, f'{DB}_fa') fasta_fp = lambda idx,pid: os.path.join(fa_dir, f"{idx}-{pid}.fasta") os.makedirs(fa_dir, exist_ok=True) for idx, (prot_id, pro_seq) in tqdm( unique_df[['prot_id', 'prot_seq']].iterrows(), desc='Creating fastas', total=len(unique_df)): with open(fasta_fp(idx,prot_id), "w") as f: f.write(f">{prot_id},{idx},{DB}\n{pro_seq}") ##################################### Run hhblits aln_dir = os.path.join(RAW_DIR, f'{DB}_aln') aln_fp = lambda idx,pid: os.path.join(aln_dir, f"{idx}-{pid}.a3m") os.makedirs(aln_dir, exist_ok=True) # finally running for idx, (prot_id, pro_seq) in tqdm( unique_df[['prot_id', 'mt_seq']].iterrows(), desc='Running hhblits', total=len(unique_df)): in_fp = fasta_fp(idx,prot_id) out_fp = aln_fp(idx,prot_id) if not os.path.isfile(out_fp): MSARunner.hhblits(in_fp, out_fp, n_cpus=N_CPUS) print('DONE!') ```