primary tasks

[x] download TCGA MAF files
[x] Map TCGA rows to proteins from our datasets
- DONT INCLUDE KIBA UNTIL WE GET ALL AFLOW CONFS (to ensure consistent test sets and so we don't have to redo the subsequent tasks)
[ ] #99
[ ] Run the same script as #94 to get results for davis, kiba, and pdbbind pretrained models.
[ ] #111

Downloading and getting TCGA MAF files

Downloading using *TCGAbiolinks*

### What project to use? *"TCGA projects are organized by cancer type or subtype."* Updated projects can be found [here](https://portal.gdc.cancer.gov/analysis_page?app=Projects), but lets just focus on [*TCGA-BRCA*](https://portal.gdc.cancer.gov/v1/projects/TCGA-BRCA) for now - using the [legacy version](https://portal.gdc.cancer.gov/v1/) of the data portal we can gain access to the open version of TCGA-BRCA instead of the newer but closed version ### How to download TCGA-BRCA mafs? Update sys packages ``` sudo apt update sudo apt upgrade -y ``` #### Install R [README](https://cran.r-project.org/bin/linux/ubuntu/fullREADME.html) Add apt repo: ```bash sudo add-apt-repository "deb https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/" ``` Install R ```bash sudo apt update sudo apt install r-base ``` Install sys packages required by R ```bash sudo apt install libcurl4-openssl-dev libssl-dev libxml2-dev -y ``` #### Install [TCGABiolinks package](https://www.bioconductor.org/packages/release/bioc/html/TCGAbiolinks.html) make sure to run in sudo mode ``` sudo -i R ``` Then install: ```r if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("TCGAbiolinks") ``` #### download TCGA-BRCA | sort(harmonized.data.type) | | :----------------------------- | | Aggregated Somatic Mutation | | ... | | **Masked Somatic Mutation** | | **Masked Somatic Mutation** | | ... | | Methylation Beta Value | | Splice Junction Quantification | ```R library(TCGAbiolinks) query <- GDCquery(project = "TCGA-BRCA", data.category = "Simple Nucleotide Variation", data.type = "Masked Somatic Mutation", file.type = "maf.gz", access = "open") GDCdownload(query) data <- GDCprepare(query) ``` to exit R: ```r q() ``` #### Save TCGAbiolinks R file as CSV: ```R write.csv(data, "TCGA_BRCA_Mutations.csv", row.names = FALSE) ```

Another way is to just use the TCGA portal and download the entire cohort for each project

Matching mutations in TCGA to davis

Best way is to use Hugo_Symbol from the MAF file which is the gene name.

Using Biomart to get UniProtIDs

## Map davis protein names to uniprot IDs with Biomart - Can't map any mutated proteins since those are unique and don't have unique uniprot IDs to identify them. Since davis are all human we can just get the entire db from biomart and filter using pandas ![Pasted image 20240509134605](https://github.com/jyaacoub/MutDTA/assets/50300488/0e64c400-de69-4b94-a5ef-ba2916b80876) Biomart URL query for our case would be like: ``` http://useast.ensembl.org/biomart/martview/bd870a98ae9d3290205dae6651366761?VIRTUALSCHEMANAME=default&ATTRIBUTES=hsapiens_gene_ensembl.default.feature_page.ensembl_gene_id|hsapiens_gene_ensembl.default.feature_page.ensembl_gene_id_version|hsapiens_gene_ensembl.default.feature_page.ensembl_transcript_id|hsapiens_gene_ensembl.default.feature_page.ensembl_transcript_id_version|hsapiens_gene_ensembl.default.feature_page.external_gene_name|hsapiens_gene_ensembl.default.feature_page.uniprotswissprot|hsapiens_gene_ensembl.default.feature_page.uniprot_gn_symbol&FILTERS=&VISIBLEPANEL=resultspanel ``` This matches 266 proteins from davis!

Code

```python #%% import os import pandas as pd from src.data_prep.processors import Processor root_dir = '../data/DavisKibaDataset/davis' df = pd.read_csv(f"{root_dir}/nomsa_binary_original_binary/full/XY.csv") df_unique = df.loc[df[['code']].drop_duplicates().index] df_unique.drop(['SMILE', 'pkd', 'prot_id'], axis=1, inplace=True) df_unique['code'] = df_unique['code'].str.upper() df_unique.columns = ['Gene name', 'prot_seq'] #%% df_mart = pd.read_csv('../downloads/biomart_hsapiens.tsv', sep='\t') df_mart = df_mart.loc[df_mart[['Gene name', 'UniProtKB/Swiss-Prot ID']].dropna().index] df_mart['Gene name'] = df_mart['Gene name'].str.upper() df_mart = df_mart.drop_duplicates(subset=['UniProtKB/Swiss-Prot ID']) #%% dfm = df_unique.merge(df_mart, on='Gene name', how='left') dfm[['Gene name', 'UniProtKB/Swiss-Prot ID']].to_csv('../downloads/davis_biomart_matches.csv') ```

- proteins with no matches were mutated or phosphorylated.

Using `Hugo_Symbol`

## using Hugo_symbol: TCGA MAF files have ["Hugo_Symbol"](https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/#protected-maf-file-structure) which corresponds to the gene name for that mutation ([HUGO](https://www.genenames.org/))...

Code

```python #%% import pandas as pd df_pid = pd.read_csv("../downloads/davis_biomart_matches.csv", index_col=0).dropna() #%% df = pd.read_csv("../downloads/TCGA_BRCA_Mutations.csv") df = df.loc[df['SWISSPROT'].dropna().index] df[['Gene', 'SWISSPROT']].head() df['uniprot'] = df['SWISSPROT'].str.split('.').str[0] #%% dfm = df_pid.merge(df, on='uniprot', how='inner') # %% dfm[~(dfm.code == dfm.Hugo_Symbol)] # %% df_pid = pd.read_csv('../downloads/davis_pids.csv')[['code']] dfm_cd = df_pid.merge(df, left_on='code', right_on='Hugo_Symbol', how='left') ```

Difference:

using biomart matched UniProts:    2324 total
Using raw davis HUGO protein name: 2321 total
biomart finds   3 extra

Code

```python #%% import pandas as pd df_pid = pd.read_csv("../downloads/davis_biomart_matches.csv", index_col=0).dropna() #%% df = pd.read_csv("../downloads/TCGA_BRCA_Mutations.csv") df = df.loc[df['SWISSPROT'].dropna().index] df[['Gene', 'SWISSPROT']].head() df['uniprot'] = df['SWISSPROT'].str.split('.').str[0] #%% dfm = df_pid.merge(df, on='uniprot', how='inner') # %% df_pid = pd.read_csv('../downloads/davis_pids.csv').drop_duplicates(subset='code')[['code']] dfmh= df_pid.merge(df, left_on='code', right_on='Hugo_Symbol', how='inner') # %% print(f"using biomart matched UniProts: {len(dfm)} total") print(f"Using raw davis HUGO protein name: {len(dfmh)} total") print(f"biomart finds {len(dfm[~(dfm.code == dfm.Hugo_Symbol)]):3d} extra") ```

Things to consider for matching:

10-Variant_Classification Only focus on missense mutations
56-Protein_position tells us <mutation location>/seqlen
- Match only proteins that have matching seqlen
57-Amino_acids tells us <reference AA>/<mutated AA>.
- match only proteins with the same reference AA at mutation location (see above) NOTE: If not enough matching then we need to get the canonical sequence from Uniprot!

Columns for validating results:

These come from pathogenicy predictions from ensembl

77-SIFT a predictive model that returns the impact of mutation on protein function (SIFT).
78-PolyPhen another predictive model similar to SIFT, (PolyPhen-2)

not as useful:

123-IMPACT only looks at the Variant_Classification to deduce impact see docs

steps for matching dataset uniprot ids to TCGA mutations

TCGABiolinks uses https://gdc.cancer.gov/about-data/publications/mc3-2017 to get the MAF files.

according to their docs - https://www.bioconductor.org/packages/release/bioc/vignettes/TCGAbiolinks/inst/doc/mutation.html

1. Gather data for davis, kiba and PDBbind datasets

Need the following for matching, see above comment

Original index id from XY.csv
UniprotID
Protein Sequence

Code to combine all datasets into a single csv: https://github.com/jyaacoub/MutDTA/commit/5696a7aff4d0b1ed52a3a29ee77b6db258b62e23

cols are db_idx,db,code,prot_id,seq_len,prot_seq

2. Download ALL TCGA projects as a single MAF

Go to GDC portal
Click dropdown for cases
Enter selection view for "program"
Check only TCGA
Close out of the dropdown
scroll do the bottom and hit Download 49.25 MB compressed MAF data
3. Prefiltering TCGA
- Filter out by Variant_Classification (only focus on Missense_Mutation for now)
- Maybe also filter by Variant_Type to focus only on SINGLE nucleotide variants (SNP) -> doesnt matter since there is literally only 2 other rows with non-SNP variants:![[Pasted image 20240510144758.png|300]]
- Filter out sequences longer than 1200, practically speaking any sequence longer than this is not useful since it would take forever to run.
  4. Match Uniprot IDs with Mutations
- For davis we have to use hugo_symbols but the others should be fine
  5. Post filtering
- filter for only those sequences with matching sequence length (to get rid of nonmatched isoforms)
- Filter #1 (seq_len) : 7495 - 5054 = 2441
- Filter out those that don't have the same reference seq according to the "Protein_position" and "Amino_acids" col
- Filter #2 (ref_AA match): 2441 - 4 = 2437

Code

```python #%% 1.Gather data for davis,kiba and pdbbind datasets import os import pandas as pd import matplotlib.pyplot as plt from src.analysis.utils import combine_dataset_pids from src import config as cfg df_prots = combine_dataset_pids(dbs=[cfg.DATA_OPT.davis, cfg.DATA_OPT.PDBbind], # WARNING: just davis and pdbbind for now subset='test') #%% 2. Load TCGA data df_tcga = pd.read_csv('../downloads/TCGA_ALL.maf', sep='\t') #%% 3. Pre filtering df_tcga = df_tcga[df_tcga['Variant_Classification'] == 'Missense_Mutation'] df_tcga['seq_len'] = pd.to_numeric(df_tcga['Protein_position'].str.split('/').str[1]) df_tcga = df_tcga[df_tcga['seq_len'] < 5000] df_tcga['seq_len'].plot.hist(bins=100, title="sequence length histogram capped at 5K") plt.show() df_tcga = df_tcga[df_tcga['seq_len'] < 1200] df_tcga['seq_len'].plot.hist(bins=100, title="sequence length after capped at 1.2K") #%% 4. Merging df_prots with TCGA df_tcga['uniprot'] = df_tcga['SWISSPROT'].str.split('.').str[0] dfm = df_tcga.merge(df_prots[df_prots.db != 'davis'], left_on='uniprot', right_on='prot_id', how='inner') # for davis we have to merge on HUGO_SYMBOLS dfm_davis = df_tcga.merge(df_prots[df_prots.db == 'davis'], left_on='Hugo_Symbol', right_on='prot_id', how='inner') dfm = pd.concat([dfm,dfm_davis], axis=0) del dfm_davis # to save mem # %% 5. Post filtering step # 5.1. Filter for only those sequences with matching sequence length (to get rid of nonmatched isoforms) # seq_len_x is from tcga, seq_len_y is from our dataset tmp = len(dfm) # allow for some error due to missing amino acids from pdb file in PDBbind dataset # - assumption here is that isoforms will differ by more than 50 amino acids dfm = dfm[(dfm.seq_len_y <= dfm.seq_len_x) & (dfm.seq_len_x<= dfm.seq_len_y+50)] print(f"Filter #1 (seq_len) : {tmp:5d} - {tmp-len(dfm):5d} = {len(dfm):5d}") # 5.2. Filter out those that dont have the same reference seq according to the "Protein_position" and "Amino_acids" col # Extract mutation location and reference amino acid from 'Protein_position' and 'Amino_acids' columns dfm['mt_loc'] = pd.to_numeric(dfm['Protein_position'].str.split('/').str[0]) dfm = dfm[dfm['mt_loc'] < dfm['seq_len_y']] dfm[['ref_AA', 'mt_AA']] = dfm['Amino_acids'].str.split('/', expand=True) dfm['db_AA'] = dfm.apply(lambda row: row['prot_seq'][row['mt_loc']-1], axis=1) # Filter #2: Match proteins with the same reference amino acid at the mutation location tmp = len(dfm) dfm = dfm[dfm['db_AA'] == dfm['ref_AA']] print(f"Filter #2 (ref_AA match): {tmp:5d} - {tmp-len(dfm):5d} = {len(dfm):5d}") print('\n',dfm.db.value_counts()) # %% final seq len distribution n_bins = 25 lengths = dfm.seq_len_x fig, ax = plt.subplots(1, 1, figsize=(10, 5)) # Plot histogram n, bins, patches = ax.hist(lengths, bins=n_bins, color='blue', alpha=0.7) ax.set_title('TCGA final filtering for db matches') # Add counts to each bin for count, x, patch in zip(n, bins, patches): ax.text(x + 0.5, count, str(int(count)), ha='center', va='bottom') ax.set_xlabel('Sequence Length') ax.set_ylabel('Frequency') plt.tight_layout() plt.show() # %% Getting updated sequences def apply_mut(row): ref_seq = list(row['prot_seq']) ref_seq[row['mt_loc']-1] = row['mt_AA'] return ''.join(ref_seq) dfm['mt_seq'] = dfm.apply(apply_mut, axis=1) # %% dfm.to_csv("/cluster/home/t122995uhn/projects/data/tcga/tcga_maf_davis_pdbbind.csv") ```

Mapped TCGA mutations counts:

Note PDBbind doesn't match any proteins from TCGA when limited to just the test set!
- This is likely because of how we got the sequences for PDBbind (from the pdb files).
To get the correct sequences we need to use the FASTA seq?
- but even this has its own issues since some will not include the full sequence and it wouldn't match with the check we do for mutation location
mt_loc doesnt match reference and db amino acid!
- This means even if we added some wiggle room for pdbbind by allowing at most 50 missing AA we would still fail to match if the mutation location is after those missing AA

1. RAW TCGA counts (from prefiltering step 3)

Sequence length histogram capped at 5K and 1.2K

![image](https://github.com/jyaacoub/MutDTA/assets/50300488/b808fbf5-1880-458f-a7f0-f672f0ecf7fc) ![image](https://github.com/jyaacoub/MutDTA/assets/50300488/66cb0022-a454-41f2-9155-def3fdc2b6d4)

2. Post filtering counts (step 5 from above)

All proteins:

Filter #1 (seq_len)     :  7495 -  5054 =  2441
Filter #2 (ref_AA match):  2441 -     4 =  2437

Test set:

Filter #1 (seq_len)     :  1047 -   791 =   256
Filter #2 (ref_AA match):   256 -     0 =   256

3. Final counts after all filters:

ALL PROTEINS - Sequence length histogram

![image](https://github.com/jyaacoub/MutDTA/assets/50300488/c91af6a1-b2ca-433d-a98f-6ea3749a7ddd)

TEST SET - Sequence length histogram

![image](https://github.com/jyaacoub/MutDTA/assets/50300488/1a733401-30cf-47d1-8c9c-79168cfe42e5)

jyaacoub / MutDTA

TCGA analysis #95

primary tasks

Downloading and getting TCGA MAF files

Matching mutations in TCGA to davis

Difference:

Things to consider for matching:

Columns for validating results:

steps for matching dataset uniprot ids to TCGA mutations

1. Gather data for davis, kiba and PDBbind datasets

Code to combine all datasets into a single csv: https://github.com/jyaacoub/MutDTA/commit/5696a7aff4d0b1ed52a3a29ee77b6db258b62e23

2. Download ALL TCGA projects as a single MAF

3. Prefiltering TCGA

4. Match Uniprot IDs with Mutations

5. Post filtering

Mapped TCGA mutations counts:

1. RAW TCGA counts (from prefiltering step 3)

2. Post filtering counts (step 5 from above)

All proteins:

Test set:

3. Final counts after all filters: