jyaacoub commented 3 months ago

Rebuilding existing datasets is unnecessary since the XY.csv is what is used to get items: https://github.com/jyaacoub/MutDTA/blob/dd324d33ae001c0126c3c2aceba895a16944fb5a/src/data_prep/datasets.py#L255-L260

We need to define a new `resplit` function that takes the dataset, deletes all the old train, test, and val subsets, and replaces them with new subsets that satisfy the following constraints:

- The new test set must contain all proteins from the test_gene_names.csv file that was used for existing analyses (#95, #94).
- The test set must also include at least 2-3 "heavily targeted" proteins so that we can do a deeper analysis focusing on just those proteins. This resource should be helpful for identifying such heavily targeted proteins by the number of times they appear in the DataFrame for the interactions.tsv file.
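As a rough illustration of the "heavily targeted" criterion, counting rows per gene is enough to rank candidates. This is only a sketch: the `gene` column name, the `top_targeted` helper, and the toy rows are assumptions, since the actual layout of interactions.tsv is not shown in this thread.

```python
import pandas as pd


def top_targeted(interactions: pd.DataFrame, gene_col: str = "gene", n: int = 3) -> pd.Series:
    """Return the n genes that appear in the most rows of the interactions table."""
    # value_counts() sorts by frequency in descending order.
    return interactions[gene_col].value_counts().head(n)


# Toy stand-in for the interactions.tsv DataFrame (hypothetical columns and values).
toy = pd.DataFrame({
    "gene": ["EGFR", "EGFR", "EGFR", "BRAF", "BRAF", "KIT"],
    "drug": ["d1", "d2", "d3", "d1", "d4", "d2"],
})

top = top_targeted(toy, n=2)
print(top)
```

With the real file, the same call would be made on the DataFrame loaded with `pd.read_csv(..., sep='\t')`, using whatever column actually holds the gene name.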

jyaacoub commented 3 months ago

`resplit` function:

- Should be defined in MutDTA/src/train_test/splitting.py
- Takes as input the target dataset path (or list of options that define that dataset), and a list defining the splits for all 5 folds + 1 test set.
- Deletes existing splits
- Builds new splits (this is already defined in `Dataset.save_subset()`)
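A minimal sketch of the planning half of such a function, assuming each split is given as a set of protein IDs and that XY.csv has a `prot_id` column (as in the script further down). The `plan_resplit` name and the `val0`/`test` subset labels are hypothetical; deleting the old subsets and writing the new ones would be left to the existing dataset code, which is not reproduced here.

```python
import pandas as pd


def plan_resplit(xy: pd.DataFrame, split_prots: dict[str, set[str]],
                 prot_col: str = "prot_id") -> dict[str, list[int]]:
    """Map each subset name to the XY.csv row indices whose protein belongs to it."""
    # A protein must not appear in more than one subset, otherwise folds would leak.
    seen: set[str] = set()
    for name, prots in split_prots.items():
        overlap = seen & prots
        if overlap:
            raise ValueError(f"{name} shares proteins with another subset: {sorted(overlap)}")
        seen |= prots

    plan = {}
    for name, prots in split_prots.items():
        mask = xy[prot_col].isin(prots).to_numpy()
        plan[name] = xy.index[mask].tolist()
    return plan


# Toy example with one validation fold and the test set (hypothetical IDs).
xy = pd.DataFrame({"prot_id": ["P1", "P1", "P2", "P3", "P4"],
                   "lig_id": ["a", "b", "a", "c", "a"]})
plan = plan_resplit(xy, {"val0": {"P1"}, "test": {"P3", "P4"}})
print(plan)
```

Rows that fall in neither the current validation fold nor the test set (here the `P2` row) would form the training subset for that fold.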

jyaacoub commented 3 months ago

What's left:

- [x] Define resplit function which accepts the list of csvs defining our split and re-splits the target dataset using its "full" db.
- [x] Optionally on top of this function we can add a wrapper that takes as input a path to the dataset it wants to be "like"
- [x] Define test sets for Kiba and PDBbind with OncoKB file

jyaacoub commented 3 months ago

Test split for kiba

Test set size goal of $118083\times 0.1 \approx 11808$ entries.

- This means all the proteins from the davis test set can be added to the kiba test set.
- Combining the entire old kiba test set with the davis genes gives us a total of 6028 entries.
- The next step is to add more proteins from OncoKB (we need $11808-6028=5780 \text{ entries}$).
- We can add all of the remaining matching genes from OncoKB, since they only give us an additional 3680 entries (we still need $11808-6028-3680=2100 \text{ entries}$).
- For the last 2100 we just randomly sample until we arrive at the final test set below:

Code

```python
# %%
import pandas as pd
import logging

DATA_ROOT = '../data'
biom_df = pd.read_csv(f'{DATA_ROOT}/tcga/mart_export.tsv', sep='\t')
biom_df.rename({'Gene name': 'gene'}, axis=1, inplace=True)

# %% Specific to kiba:
kiba_df = pd.read_csv(f'{DATA_ROOT}/DavisKibaDataset/kiba/nomsa_binary_original_binary/full/XY.csv')
kiba_df = kiba_df.merge(biom_df.drop_duplicates('UniProtKB/Swiss-Prot ID'),
                        left_on='prot_id', right_on="UniProtKB/Swiss-Prot ID", how='left')
kiba_df.drop(['PDB ID', 'UniProtKB/Swiss-Prot ID'], axis=1, inplace=True)

if kiba_df.gene.isna().sum() != 0:
    logging.warning("Some proteins failed to get their gene names!")

# %% making sure to add any matching davis prots to the kiba test set
davis_df = pd.read_csv('/cluster/home/t122995uhn/projects/MutDTA/splits/davis_test.csv')
# strip the "(mutation)" suffix so only the gene name remains
davis_test_prots = set(davis_df.prot_id.str.split('(').str[0])
kiba_davis_gene_overlap = kiba_df[kiba_df.gene.isin(davis_test_prots)].gene.value_counts()
print("Total # of gene overlap with davis TEST set:", len(kiba_davis_gene_overlap))
print(" # of entries in kiba:", kiba_davis_gene_overlap.sum())

# Starting off with davis test set as the initial test set:
kiba_test_df = kiba_df[kiba_df.gene.isin(davis_test_prots)]

# %% using previous kiba test db:
kiba_test_old_df = pd.read_csv('/cluster/home/t122995uhn/projects/downloads/test_prots_gene_names.csv')
kiba_test_old_df = kiba_test_old_df[kiba_test_old_df['db'] == 'kiba']
kiba_test_old_prots = set(kiba_test_old_df.gene_name)

kiba_test_df = pd.concat([kiba_test_df, kiba_df[kiba_df.gene.isin(kiba_test_old_prots)]],
                         axis=0).drop_duplicates(['prot_id', 'lig_id'])
print("Combined kiba test set with davis matching genes size:", len(kiba_test_df))

# %% NEXT STEP IS TO ADD MORE PROTS FROM ONCOKB IF AVAILABLE.
onco_df = pd.read_csv("/cluster/home/t122995uhn/projects/downloads/oncoKB_DrugGenePairList.csv")
kiba_join_onco = set(kiba_test_df.merge(onco_df.drop_duplicates("gene"), on="gene", how="left")['gene'])

# %%
remaining_onco = onco_df[~onco_df.gene.isin(kiba_join_onco)].drop_duplicates('gene')

# match with remaining kiba:
remaining_onco_kiba_df = kiba_df.merge(remaining_onco, on='gene', how="inner")
counts = remaining_onco_kiba_df.value_counts('gene')
print(counts)
# this gives us 3680 which still falls short of our 11,808 goal for the test set size
print("total entries in kiba with remaining (not already in test set) onco genes", counts.sum())

# %%
# drop_duplicates is redundant but just in case.
kiba_test_df = pd.concat([kiba_test_df, remaining_onco_kiba_df], axis=0).drop_duplicates(['prot_id', 'lig_id'])
print("Combined kiba test set with remaining OncoKB genes:", len(kiba_test_df))

# %% For the remaining 2100 entries we will just choose those randomly until we reach our target of 11808 entries
# code is from balanced_kfold_split function
from collections import Counter
import numpy as np

# Get size for each dataset and indices
dataset_size = len(kiba_df)
test_size = int(0.1 * dataset_size)  # 11808
indices = list(range(dataset_size))

# getting counts for each unique protein
prot_counts = kiba_df['prot_id'].value_counts().to_dict()
prots = list(prot_counts.keys())
np.random.shuffle(prots)

# manually selected prots:
test_prots = set(kiba_test_df.prot_id)
# increment count by number of samples in test_prots
count = sum([prot_counts[p] for p in test_prots])

# %%
## Sampling remaining proteins for test set (if we are under the test_size)
for p in prots:  # O(k); k = number of proteins
    if p in test_prots:
        continue  # already counted above; skipping avoids counting its rows twice
    if count + prot_counts[p] < test_size:
        test_prots.add(p)
        count += prot_counts[p]

additional_prots = test_prots - set(kiba_test_df.prot_id)
print('additional prot_ids to add:', len(additional_prots))
print(' count:', count)

# %% ADDING FINAL PROTS
rand_sample_df = kiba_df[kiba_df.prot_id.isin(additional_prots)]
kiba_test_df = pd.concat([kiba_test_df, rand_sample_df], axis=0).drop_duplicates(['prot_id', 'lig_id'])
# these two columns only exist because of the OncoKB merge
kiba_test_df.drop(['cancerType', 'drug'], axis=1, inplace=True)
print('final test dataset for kiba:')
kiba_test_df

# %% saving
kiba_test_df.to_csv('/cluster/home/t122995uhn/projects/MutDTA/splits/kiba_test.csv', index=False)
```
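The random top-up step from the script above can also be isolated into a small function, which makes it easier to test and to reuse for PDBbind. This is a sketch under assumptions: `top_up_test_prots` and its `seed` parameter are hypothetical names, it uses the standard-library `random` module rather than `np.random`, and it skips proteins that are already selected so their rows are only counted once.

```python
import random


def top_up_test_prots(prot_counts: dict[str, int], preselected: set[str],
                      test_size: int, seed: int = 0) -> tuple[set[str], int]:
    """Greedily add randomly ordered proteins while the test set stays under test_size."""
    test_prots = set(preselected)
    count = sum(prot_counts[p] for p in test_prots)

    # Only proteins that are not already selected are candidates.
    candidates = sorted(p for p in prot_counts if p not in test_prots)
    random.Random(seed).shuffle(candidates)  # seeded so the split is reproducible

    for p in candidates:
        # Same strict "<" check as the script: a protein is added only if
        # the running total stays below the target size.
        if count + prot_counts[p] < test_size:
            test_prots.add(p)
            count += prot_counts[p]
    return test_prots, count


# Toy example: protein "A" (5 entries) was manually selected, target size is 10.
toy_counts = {"A": 5, "B": 3, "C": 2, "D": 4}
chosen, total = top_up_test_prots(toy_counts, preselected={"A"}, test_size=10)
print(chosen, total)
```

Because whole proteins are added at a time, the final count usually lands a little under the target rather than exactly on it.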

OR:

https://github.com/jyaacoub/MutDTA/blob/c7fdc86c79dba16b663618fcd44feb02cababdfe/playground.py#L1-L95

jyaacoub commented 3 months ago

Test set for PDBbind

Test set size goal of $16265\times 0.1 \approx 1626$

Initial stats after getting gene names by matching with biomart:

                            match on PDB ID: 1120
                           match on prot_id: 1039

Combined match (not accounting for aliases): 1216
 pdb_df.gene_x.combine_first(pdb_df.gene_y): 1138

           num genes where gene_x != gene_y: 237

   Total number of entries with a gene name: 8624/16265
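For reference, the `combine_first` step listed in the stats above can be reproduced on a toy frame. This is only a sketch: the `code` column and the example gene names are made up, and whether the 237 figure counts only rows where both matches exist (as done here) is an assumption.

```python
import pandas as pd

# gene_x: gene name matched on PDB ID; gene_y: gene name matched on prot_id.
pdb_df = pd.DataFrame({
    "code": ["1abc", "2xyz", "3foo", "4bar"],      # hypothetical PDB codes
    "gene_x": ["EGFR", None, "KIT", None],
    "gene_y": ["EGFR", "BRAF", "PDGFRA", None],
})

# Prefer the PDB ID match and fall back to the prot_id match where it is missing.
pdb_df["gene"] = pdb_df.gene_x.combine_first(pdb_df.gene_y)

# Rows where both matches exist but disagree (aliases or genuinely different genes).
both = pdb_df.gene_x.notna() & pdb_df.gene_y.notna()
n_conflict = int((both & (pdb_df.gene_x != pdb_df.gene_y)).sum())

# Rows that end up with some gene name after combining.
n_named = int(pdb_df.gene.notna().sum())
print(n_conflict, n_named, len(pdb_df))
```

Conflicting rows keep the `gene_x` value, so any alias handling would have to happen before the combine.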

- Number of entries after merging gene names with kiba test set: 171
  - Number of genes: 13
- Total # of gene overlap with davis TEST set: 6
  - Entries in pdb: 60
  - This entirely overlaps with kiba so there is no change in test set size.
- Adding the remaining proteins that match OncoKB gives us an additional 93 genes for a total of 264 entries.
- The remaining $1626-264=1362$ entries will be randomly sampled to arrive at our final test dataset with 1603 entries.

https://github.com/jyaacoub/MutDTA/blob/256563cca5540787541e4c37ab0c8966bc08abd1/playground.py#L1-L124

jyaacoub commented 2 months ago

This is resolved by https://github.com/jyaacoub/MutDTA/commit/69add713bda16c9f39fc8f874a0fff6f92b94314, and we now have constant validation sets for each CV training run.

jyaacoub / MutDTA

Unify cross validation splits to use consistent sets #113
