jyaacoub / MutDTA

Improving the precision oncology pipeline by providing binding affinity perturbation predictions on a priori identified cancer driver genes.

Unify cross validation splits to use consistent sets #113

Closed jyaacoub closed 2 months ago

jyaacoub commented 3 months ago

Rebuilding existing datasets is unnecessary since the XY.csv is what is used to get items: https://github.com/jyaacoub/MutDTA/blob/dd324d33ae001c0126c3c2aceba895a16944fb5a/src/data_prep/datasets.py#L255-L260

We need to define a new `resplit` function that takes the dataset, deletes all the old train, test, and val subsets, and replaces them with new subsets defined by the following constraints:

jyaacoub commented 3 months ago

resplit function:

Should be defined in `MutDTA/src/train_test/splitting.py`.
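A rough sketch of the mechanics such a function could have, not the actual implementation. It assumes the dataset folder holds `full/XY.csv` with sample codes as the index and that subsets live in sibling `train`/`val`/`test` folders. The name `resplit_sketch`, the `split_codes` argument, and the folder layout are hypothetical, and the split constraints themselves are not encoded here.

```python
import shutil
from pathlib import Path

import pandas as pd


def resplit_sketch(dataset_dir, split_codes, subsets=("train", "val", "test")):
    """Replace old subset folders with newly defined ones (hypothetical helper).

    dataset_dir : folder assumed to contain full/XY.csv (sample codes as index).
    split_codes : dict mapping subset name -> iterable of sample codes.
    """
    dataset_dir = Path(dataset_dir)
    full_df = pd.read_csv(dataset_dir / "full" / "XY.csv", index_col=0)

    # Validate everything before deleting anything on disk:
    # every code must exist in the full dataset and subsets must be disjoint.
    new_codes, seen = {}, set()
    for name in subsets:
        codes = set(split_codes[name])
        missing = codes - set(full_df.index)
        if missing:
            raise ValueError(f"{name}: {len(missing)} codes not in full/XY.csv")
        if seen & codes:
            raise ValueError(f"{name} overlaps with another subset")
        seen |= codes
        new_codes[name] = sorted(codes)

    for name in subsets:
        subset_dir = dataset_dir / name
        # Remove the old subset (if any), then write the new XY.csv for it.
        shutil.rmtree(subset_dir, ignore_errors=True)
        subset_dir.mkdir(parents=True)
        full_df.loc[new_codes[name]].to_csv(subset_dir / "XY.csv")

    return {name: len(new_codes[name]) for name in subsets}
```

Since only `XY.csv` is rewritten, nothing in `full/` needs to be rebuilt, which matches the point above about not rebuilding existing datasets.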

jyaacoub commented 3 months ago

What's left:

jyaacoub commented 3 months ago

test split for kiba

Test set size goal of $118083 \times 0.1 \approx 11808$

Code

```python
# %%
import pandas as pd
import logging

DATA_ROOT = '../data'
biom_df = pd.read_csv(f'{DATA_ROOT}/tcga/mart_export.tsv', sep='\t')
biom_df.rename({'Gene name': 'gene'}, axis=1, inplace=True)

# %% Specific to kiba:
kiba_df = pd.read_csv(f'{DATA_ROOT}/DavisKibaDataset/kiba/nomsa_binary_original_binary/full/XY.csv')
kiba_df = kiba_df.merge(biom_df.drop_duplicates('UniProtKB/Swiss-Prot ID'),
                        left_on='prot_id', right_on="UniProtKB/Swiss-Prot ID", how='left')
kiba_df.drop(['PDB ID', 'UniProtKB/Swiss-Prot ID'], axis=1, inplace=True)

if kiba_df.gene.isna().sum() != 0:
    logging.warning("Some proteins failed to get their gene names!")

# %% making sure to add any matching davis prots to the kiba test set
davis_df = pd.read_csv('/cluster/home/t122995uhn/projects/MutDTA/splits/davis_test.csv')
davis_test_prots = set(davis_df.prot_id.str.split('(').str[0])
kiba_davis_gene_overlap = kiba_df[kiba_df.gene.isin(davis_test_prots)].gene.value_counts()
print("Total # of gene overlap with davis TEST set:", len(kiba_davis_gene_overlap))
print(" # of entries in kiba:", kiba_davis_gene_overlap.sum())

# Starting off with davis test set as the initial test set:
kiba_test_df = kiba_df[kiba_df.gene.isin(davis_test_prots)]

# %% using previous kiba test db:
kiba_test_old_df = pd.read_csv('/cluster/home/t122995uhn/projects/downloads/test_prots_gene_names.csv')
kiba_test_old_df = kiba_test_old_df[kiba_test_old_df['db'] == 'kiba']
kiba_test_old_prots = set(kiba_test_old_df.gene_name)

kiba_test_df = pd.concat([kiba_test_df, kiba_df[kiba_df.gene.isin(kiba_test_old_prots)]],
                         axis=0).drop_duplicates(['prot_id', 'lig_id'])
print("Combined kiba test set with davis matching genes size:", len(kiba_test_df))

#%% NEXT STEP IS TO ADD MORE PROTS FROM ONCOKB IF AVAILABLE.
onco_df = pd.read_csv("/cluster/home/t122995uhn/projects/downloads/oncoKB_DrugGenePairList.csv")
kiba_join_onco = set(kiba_test_df.merge(onco_df.drop_duplicates("gene"), on="gene", how="left")['gene'])

#%%
remaining_onco = onco_df[~onco_df.gene.isin(kiba_join_onco)].drop_duplicates('gene')

# match with remaining kiba:
remaining_onco_kiba_df = kiba_df.merge(remaining_onco, on='gene', how="inner")
counts = remaining_onco_kiba_df.value_counts('gene')
print(counts)
# this gives us 3680 which still falls short of our 11,808 goal for the test set size
print("total entries in kiba with remaining (not already in test set) onco genes", counts.sum())

#%%
# drop_duplicates is redundant but just in case.
kiba_test_df = pd.concat([kiba_test_df, remaining_onco_kiba_df], axis=0).drop_duplicates(['prot_id', 'lig_id'])
print("Combined kiba test set with remaining OncoKB genes:", len(kiba_test_df))

# %% For the remaining 2100 entries we will just choose those randomly until we reach our target of 11808 entries
# code is from balanced_kfold_split function
from collections import Counter
import numpy as np

# Get size for each dataset and indices
dataset_size = len(kiba_df)
test_size = int(0.1 * dataset_size)  # 11808
indices = list(range(dataset_size))

# getting counts for each unique protein
prot_counts = kiba_df['prot_id'].value_counts().to_dict()
prots = list(prot_counts.keys())
np.random.shuffle(prots)

# manually selected prots:
test_prots = set(kiba_test_df.prot_id)
# increment count by number of samples in test_prots
count = sum([prot_counts[p] for p in test_prots])

#%%
## Sampling remaining proteins for test set (if we are under the test_size)
for p in prots:  # O(k); k = number of proteins
    if count + prot_counts[p] < test_size:
        test_prots.add(p)
        count += prot_counts[p]

additional_prots = test_prots - set(kiba_test_df.prot_id)
print('additional prot_ids to add:', len(additional_prots))
print(' count:', count)

#%% ADDING FINAL PROTS
rand_sample_df = kiba_df[kiba_df.prot_id.isin(additional_prots)]
kiba_test_df = pd.concat([kiba_test_df, rand_sample_df], axis=0).drop_duplicates(['prot_id', 'lig_id'])

kiba_test_df.drop(['cancerType', 'drug'], axis=1, inplace=True)
print('final test dataset for kiba:')
kiba_test_df

#%% saving
kiba_test_df.to_csv('/cluster/home/t122995uhn/projects/MutDTA/splits/kiba_test.csv', index=False)
```

OR:

https://github.com/jyaacoub/MutDTA/blob/c7fdc86c79dba16b663618fcd44feb02cababdfe/playground.py#L1-L95
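The random fill step at the end of the script can be isolated as a small helper, sketched below. The function name, its arguments, and the seeded shuffle are assumptions made for illustration and are not from the repo. Unlike the loop in the script, it shuffles only proteins that are not already selected, so their rows are not counted a second time, and it uses `<=` so the target size can be hit exactly.

```python
import random


def fill_test_proteins(prot_counts, preselected, target_size, seed=0):
    """Greedily add whole proteins until the test set is near target_size.

    prot_counts : dict mapping prot_id -> number of rows for that protein.
    preselected : prot_ids already placed in the test set (e.g. Davis/OncoKB matches).
    Returns the selected prot_ids and the total number of rows they cover.
    """
    selected = set(preselected)
    count = sum(prot_counts[p] for p in selected)

    # Only proteins that are not already selected are candidates.
    # Sorting first makes the seeded shuffle reproducible.
    candidates = sorted(p for p in prot_counts if p not in selected)
    random.Random(seed).shuffle(candidates)

    for p in candidates:
        # Add a protein only if all of its rows fit under the target.
        if count + prot_counts[p] <= target_size:
            selected.add(p)
            count += prot_counts[p]
    return selected, count
```

Selecting whole proteins rather than individual rows keeps every protein entirely inside or entirely outside the test set.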

jyaacoub commented 3 months ago

Test set for PDBbind

Test set size goal of $16265 \times 0.1 \approx 1626$

Initial stats after getting gene names by matching with BioMart:

                            match on PDB ID: 1120
                           match on prot_id: 1039

Combined match (not accounting for aliases): 1216
 pdb_df.gene_x.combine_first(pdb_df.gene_y): 1138

           num genes where gene_x != gene_y: 237

   Total number of entries with a gene name: 8624/16265

https://github.com/jyaacoub/MutDTA/blob/256563cca5540787541e4c37ab0c8966bc08abd1/playground.py#L1-L124
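For reference, the two-way matching behind these stats can be sketched roughly as follows. This is a sketch under assumptions, not the code in `playground.py`: the `pdb_id` column name and the helper name are hypothetical, while `PDB ID`, `UniProtKB/Swiss-Prot ID`, and `gene` follow the BioMart export used for kiba above.

```python
import pandas as pd


def match_gene_names(pdb_df, biomart_df):
    """Attach gene names by PDB ID first, falling back to the UniProt ID.

    pdb_df     : needs 'pdb_id' (hypothetical name) and 'prot_id' columns.
    biomart_df : needs 'PDB ID', 'UniProtKB/Swiss-Prot ID' and 'gene' columns.
    Returns the merged frame and the number of rows where the two matches disagree.
    """
    by_pdb = (biomart_df.dropna(subset=["PDB ID"])
              .drop_duplicates("PDB ID")[["PDB ID", "gene"]])
    by_prot = (biomart_df.dropna(subset=["UniProtKB/Swiss-Prot ID"])
               .drop_duplicates("UniProtKB/Swiss-Prot ID")
               [["UniProtKB/Swiss-Prot ID", "gene"]])

    merged = pdb_df.merge(by_pdb, left_on="pdb_id", right_on="PDB ID", how="left")
    # Both frames now carry a 'gene' column, so pandas suffixes them:
    # gene_x = match on PDB ID, gene_y = match on prot_id.
    merged = merged.merge(by_prot, left_on="prot_id",
                          right_on="UniProtKB/Swiss-Prot ID", how="left")
    merged = merged.drop(columns=["PDB ID", "UniProtKB/Swiss-Prot ID"])

    # Prefer the PDB ID match and fill gaps from the prot_id match.
    merged["gene"] = merged["gene_x"].combine_first(merged["gene_y"])

    # Count rows where both matches exist but name different genes.
    both = merged["gene_x"].notna() & merged["gene_y"].notna()
    n_conflicts = int((both & (merged["gene_x"] != merged["gene_y"])).sum())
    return merged, n_conflicts
```

Rows where `gene_x` and `gene_y` differ may simply be aliases of the same gene, which this sketch does not resolve.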

jyaacoub commented 2 months ago

This is resolved by https://github.com/jyaacoub/MutDTA/commit/69add713bda16c9f39fc8f874a0fff6f92b94314, and we now have constant validation sets for each CV training run.
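As an illustration of what constant validation sets can mean in practice, here is a minimal sketch that assigns proteins to folds deterministically. It is not the code from the commit above; the function name, `k=5`, and the seed are all hypothetical.

```python
import random


def make_protein_folds(prot_ids, k=5, seed=0):
    """Split unique protein IDs into k disjoint validation folds.

    Sorting before the seeded shuffle makes the result independent of
    input order, so every training run sees the same folds.
    """
    prots = sorted(set(prot_ids))
    random.Random(seed).shuffle(prots)
    # Fold i takes every k-th protein starting at position i.
    return [set(prots[i::k]) for i in range(k)]
```

Writing the resulting folds to disk once and loading them in every run is another way to get the same guarantee.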