jyaacoub commented 4 months ago

As mentioned in #95 we should only focus on datasets that already have a GVPL dataset created

Once the dataset is created DO NOT retrain on a new dataset, otherwise this will mess up which proteins are considered as the "test set" proteins and so it would mess up our analysis.
Specifically we are talking about the kiba dataset since we are still getting aflow confirmations for that one.

jyaacoub commented 4 months ago

Getting right protein sequences to focus on:

Using the TCGA analysis script (at this commit for playground.py) we only get protein sequences for the dbs that we have GVPL datasets for (not kiba).

This gives us the following stats for just **davis** and **pdbbind**

``` Filter #1 (seq_len) : 724 - 562 = 162 Filter #2 (ref_AA match): 160 - 42 = 118 davis 104 PDBbind 14 Name: db, dtype: int64 ``` ![image](https://github.com/jyaacoub/MutDTA/assets/50300488/3aed7b45-e672-4fbc-b05f-752261350f42)

jyaacoub commented 4 months ago

Get mutated sequences

This is simple and the code for it is just one .apply method on the dataframe that comes out:

Code to get 'mt_seq' col

```python def apply_mut(row): ref_seq = list(row['prot_seq']) ref_seq[row['mt_loc']-1] = row['mt_AA'] return ''.join(ref_seq) dfm['mt_seq'] = dfm.apply(apply_mut, axis=1) ```

jyaacoub commented 4 months ago

Run MSA

Using HHBlits via our MSARunner class we can get sequence alignments with our reference UniRef30_2020_06 database

See #78 for examples of how this was run on davis.

submit as batch array job

Details

```python # %% from src.utils.seq_alignment import MSARunner from tqdm import tqdm import pandas as pd import os DATA_DIR = '/cluster/home/t122995uhn/projects/data/tcga' CSV = f'{DATA_DIR}/tcga_maf_davis_pdbbind.csv' N_CPUS=6 NUM_ARRAYS = 10 array_idx = 0#${SLURM_ARRAY_TASK_ID} df = pd.read_csv(CSV, index_col=0) df.sort_values(by='seq_len_y', inplace=True) # %% for DB in df.db.unique(): print('DB', DB) RAW_DIR = f'{DATA_DIR}/{DB}' # should already be unique if these are proteins mapped form tcga! unique_df = df[df['db'] == DB] ########################## Get job partition partition_size = len(unique_df) / NUM_ARRAYS start, end = int(array_idx*partition_size), int((array_idx+1)*partition_size) unique_df = unique_df[start:end] #################################### create fastas fa_dir = os.path.join(RAW_DIR, f'{DB}_fa') fasta_fp = lambda idx,pid: os.path.join(fa_dir, f"{idx}-{pid}.fasta") os.makedirs(fa_dir, exist_ok=True) for idx, (prot_id, pro_seq) in tqdm( unique_df[['prot_id', 'prot_seq']].iterrows(), desc='Creating fastas', total=len(unique_df)): with open(fasta_fp(idx,prot_id), "w") as f: f.write(f">{prot_id},{idx},{DB}\n{pro_seq}") ##################################### Run hhblits aln_dir = os.path.join(RAW_DIR, f'{DB}_aln') aln_fp = lambda idx,pid: os.path.join(aln_dir, f"{idx}-{pid}.a3m") os.makedirs(aln_dir, exist_ok=True) # finally running for idx, (prot_id, pro_seq) in tqdm( unique_df[['prot_id', 'mt_seq']].iterrows(), desc='Running hhblits', total=len(unique_df)): in_fp = fasta_fp(idx,prot_id) out_fp = aln_fp(idx,prot_id) if not os.path.isfile(out_fp): MSARunner.hhblits(in_fp, out_fp, n_cpus=N_CPUS) print('DONE!') ```

jyaacoub / MutDTA

Get input data for TCGA (MSAs and aflow confirmations + ligands from our dbs) #99

Getting right protein sequences to focus on:

Get mutated sequences

Run MSA

submit as batch array job