jyaacoub / MutDTA

Improving the precision oncology pipeline by providing binding affinity perturbation predictions on a priori identified cancer driver genes.
https://drive.google.com/drive/folders/1mdiA1gf1IjPZNhk79I2cYUu6pwcH0OTD

Consolidate all results #147

Open jyaacoub opened 4 weeks ago

jyaacoub commented 4 weeks ago

![image](https://github.com/user-attachments/assets/6809f25b-8d21-448e-bbdc-43aac3527b4a)

```python
import logging
from matplotlib import pyplot as plt

from src.analysis.figures import prepare_df, fig_combined, custom_fig

dft = prepare_df('./results/model_media/model_stats.csv')
dfv = prepare_df('./results/model_media/model_stats_val.csv')

models = {
    'DG': ('nomsa', 'binary', 'original', 'binary'),
    'esm': ('ESM', 'binary', 'original', 'binary'), # esm model
    'aflow': ('nomsa', 'aflow', 'original', 'binary'),
    # 'gvpP': ('gvp', 'binary', 'original', 'binary'),
    'gvpL': ('nomsa', 'binary', 'gvp', 'binary'),
    # 'aflow_ring3': ('nomsa', 'aflow_ring3', 'original', 'binary'),
    'gvpL_aflow': ('nomsa', 'aflow', 'gvp', 'binary'),
    # 'gvpl_esm':('ESM', 'binary', 'gvp', 'binary'),
    # 'gvpL_aflow_rng3': ('nomsa', 'aflow_ring3', 'gvp', 'binary'),
    #GVPL_ESMM_davis3D_nomsaF_aflowE_48B_0.00010636872718329864LR_0.23282479481785903D_2000E_gvpLF_binaryLE
    # 'gvpl_esm_aflow': ('ESM', 'aflow', 'gvp', 'binary'),
}

fig, axes = fig_combined(dft, datasets=['davis', 'kiba', 'PDBbind'], fig_callable=custom_fig,
                         models=models, metrics=['cindex', 'mse'],
                         fig_scale=(10,5), add_stats=True, title_postfix=" test set performance", box=True)
plt.xticks(rotation=45)

# fig, axes = fig_combined(dfv, datasets=['davis'], fig_callable=custom_fig,
#                          models=models, metrics=['cindex', 'mse'],
#                          fig_scale=(10,5), add_stats=True, title_postfix=" validation set performance", box=True, fold_labels=True)
# plt.xticks(rotation=45)
```

Final models - these are the ones we will show in the paper.


    models = {
        'DG': ('nomsa', 'binary', 'original', 'binary'),
        'esm': ('ESM', 'binary', 'original', 'binary'), # esm model
        'aflow': ('nomsa', 'aflow', 'original', 'binary'),
        # 'gvpP': ('gvp', 'binary', 'original', 'binary'),
        'gvpL': ('nomsa', 'binary', 'gvp', 'binary'),
        # 'aflow_ring3': ('nomsa', 'aflow_ring3', 'original', 'binary'),
        'gvpL_aflow': ('nomsa', 'aflow', 'gvp', 'binary'),
        # 'gvpl_esm':('ESM', 'binary', 'gvp', 'binary'),
        # 'gvpL_aflow_rng3': ('nomsa', 'aflow_ring3', 'gvp', 'binary'),
        #GVPL_ESMM_davis3D_nomsaF_aflowE_48B_0.00010636872718329864LR_0.23282479481785903D_2000E_gvpLF_binaryLE
        # 'gvpl_esm_aflow': ('ESM', 'aflow', 'gvp', 'binary'),
    }
jyaacoub commented 1 week ago

FIG 1

DATASET INFO

TABLE COUNTS

#### FULL TABLE COUNTS:
```
| Dataset   | Protein   | Compound   | Total Binding Entities  |
|-----------|-----------|------------|-------------------------|
| davis     | 442       | 68         | 30056                   |
| kiba      | 229       | 2111       | 118254                  |
| pdbbind   | 3889      | 12639      | 19443                   |
```

#### USED TABLE COUNTS:
Due to memory limitations, some records were excluded from our runs. These are the counts that were actually used.
```
   Dataset  Protein  Compound  Total Binding Entities
0    davis      439        68                   29852
1     kiba      226      2111                  117590
2  pdbbind     3785     10950                   16265
```
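For reference, counts like the ones above can be produced with a few lines of pandas. This is only a minimal sketch: the column names `prot_id` and `lig_id` and the toy records are assumptions for illustration, not the repo's actual schema or data.

```python
import pandas as pd


def count_entities(df, prot_col="prot_id", lig_col="lig_id"):
    """Count unique proteins, unique compounds, and total records.

    `prot_col` and `lig_col` are hypothetical column names.
    """
    return {
        "Protein": int(df[prot_col].nunique()),
        "Compound": int(df[lig_col].nunique()),
        "Total Binding Entities": len(df),
    }


# Toy records standing in for one dataset (not real data).
toy = pd.DataFrame({
    "prot_id": ["P1", "P1", "P2", "P3"],
    "lig_id": ["L1", "L2", "L1", "L1"],
    "pkd": [5.0, 6.2, 7.1, 5.5],
})

# One row per dataset, in the same layout as the tables above.
counts = pd.DataFrame([{"Dataset": "toy", **count_entities(toy)}])
print(counts)
```

Running the same function on the full and the filtered records of each dataset would give the two tables side by side.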

SEQUENCE LENGTH DISTRIBUTION

#### Plot without overlay or normalization
![image](https://github.com/user-attachments/assets/0ff6f518-1ab1-4bc2-9a96-f434c483e723)

#### Normalized and overlaid plot
![image](https://github.com/user-attachments/assets/518b7abf-4a63-4f84-9277-46e31df39da2)

MODEL RESULTS

All (except pocket) results - 2x3 - MSE and Cindex

![image](https://github.com/user-attachments/assets/c5ed6522-2c5e-47d5-9950-9bd5c47c7ed9)

Stratified with pocket results

jyaacoub commented 1 week ago

FIG 3 - Platinum Dataset

DATASET INFO

TABLE COUNTS

```
Unique protein sequence counts: 860
Unique protein IDs: 361
Unique ligand counts: 197
Total records: 1962
```

Distribution for the number of mutations per protein

![image](https://github.com/user-attachments/assets/2af6d4bf-1bac-4238-8907-8d37136f6fd4)

pkd distributions (stratified by # of mutations)

![image](https://github.com/user-attachments/assets/e376d64d-7bb2-47a7-8ecc-41f0f368943e) ![image](https://github.com/user-attachments/assets/79c6796b-7711-4220-a8f7-dfcc70b202c0)

Model results

Raw predictive performance

This plot shows how well the model predicts the absolute pkd given only the protein sequence and the ligand SMILES.
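As a reminder of what the two metrics used throughout this issue (`cindex` and `mse`) measure, here is a minimal pure-Python sketch using the standard definitions. It is not the repo's implementation, which is not shown here, and the toy pkd values are made up.

```python
from itertools import combinations


def mse(y_true, y_pred):
    """Mean squared error between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def concordance_index(y_true, y_pred):
    """Fraction of comparable pairs whose predicted order matches the true order.

    Pairs with equal true values are skipped; tied predictions count as 0.5.
    """
    concordant = 0.0
    comparable = 0
    for (t_i, p_i), (t_j, p_j) in combinations(zip(y_true, y_pred), 2):
        if t_i == t_j:
            continue  # order is undefined for this pair
        comparable += 1
        if p_i == p_j:
            concordant += 0.5
        elif (p_i > p_j) == (t_i > t_j):
            concordant += 1.0
    return concordant / comparable if comparable else float("nan")


# Toy pkd values (illustrative only).
y_true = [5.0, 6.0, 7.0, 8.0]
y_pred = [5.5, 5.8, 7.4, 7.0]
print(mse(y_true, y_pred), concordance_index(y_true, y_pred))
```

The pairwise loop is O(n^2), which is fine for a sanity check but slow on something the size of kiba.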

Delta predictive performance

Instead of looking at absolute predictive performance, this plot shows how well the model predicts the delta in pkd between a mutated and an unmutated sequence.
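One way the delta evaluation could be set up is sketched below: pair each mutant record with the wild-type record for the same protein ID and ligand, then compare true and predicted deltas. The record schema (`prot_id`, `lig_id`, `is_mutant`, `true_pkd`, `pred_pkd`), the pairing rule, and the use of Pearson correlation are all assumptions for illustration, not necessarily what the actual analysis does.

```python
import math


def delta_pairs(records):
    """Return (true_deltas, pred_deltas) for mutants that have a wild-type match.

    Each record is a dict with the hypothetical keys
    prot_id, lig_id, is_mutant, true_pkd, pred_pkd.
    """
    wild = {(r["prot_id"], r["lig_id"]): r for r in records if not r["is_mutant"]}
    true_d, pred_d = [], []
    for r in records:
        if not r["is_mutant"]:
            continue
        wt = wild.get((r["prot_id"], r["lig_id"]))
        if wt is None:
            continue  # no wild-type reference, so no delta can be formed
        true_d.append(r["true_pkd"] - wt["true_pkd"])
        pred_d.append(r["pred_pkd"] - wt["pred_pkd"])
    return true_d, pred_d


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(sxx * syy)


def rec(prot_id, lig_id, is_mutant, true_pkd, pred_pkd):
    """Small helper to build a toy record."""
    return {"prot_id": prot_id, "lig_id": lig_id, "is_mutant": is_mutant,
            "true_pkd": true_pkd, "pred_pkd": pred_pkd}


# Toy records (illustrative only); P3 has no wild-type entry and is skipped.
records = [
    rec("P1", "L1", False, 6.0, 6.2), rec("P1", "L1", True, 5.0, 5.6),
    rec("P2", "L2", False, 7.0, 6.8), rec("P2", "L2", True, 7.5, 7.0),
    rec("P4", "L1", False, 6.5, 6.0), rec("P4", "L1", True, 6.0, 6.3),
    rec("P3", "L3", True, 4.0, 4.0),
]
true_d, pred_d = delta_pairs(records)
print(pearson(true_d, pred_d))
```

Subtracting the wild-type prediction means a constant per-protein offset in the model cancels out, so this isolates how well the mutation effect itself is captured.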