jyaacoub / MutDTA

Improving the precision oncology pipeline by providing binding affinity purtubations predictions on a pirori identified cancer driver genes.
1 stars 2 forks source link

TCGA analysis #95

Open jyaacoub opened 1 month ago

jyaacoub commented 1 month ago

primary tasks

Downloading and getting TCGA MAF files

Downloading using *TCGAbiolinks*

### What project to use? *"TCGA projects are organized by cancer type or subtype."* Updated projects can be found [here](https://portal.gdc.cancer.gov/analysis_page?app=Projects), but lets just focus on [*TCGA-BRCA*](https://portal.gdc.cancer.gov/v1/projects/TCGA-BRCA) for now - using the [legacy version](https://portal.gdc.cancer.gov/v1/) of the data portal we can gain access to the open version of TCGA-BRCA instead of the newer but closed version ### How to download TCGA-BRCA mafs? Update sys packages ``` sudo apt update sudo apt upgrade -y ``` #### Install R [README](https://cran.r-project.org/bin/linux/ubuntu/fullREADME.html) Add apt repo: ```bash sudo add-apt-repository "deb https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/" ``` Install R ```bash sudo apt update sudo apt install r-base ``` Install sys packages required by R ```bash sudo apt install libcurl4-openssl-dev libssl-dev libxml2-dev -y ``` #### Install [TCGABiolinks package](https://www.bioconductor.org/packages/release/bioc/html/TCGAbiolinks.html) make sure to run in sudo mode ``` sudo -i R ``` Then install: ```r if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("TCGAbiolinks") ``` #### download TCGA-BRCA | sort(harmonized.data.type) | | :----------------------------- | | Aggregated Somatic Mutation | | ... | | **Masked Somatic Mutation** | | **Masked Somatic Mutation** | | ... | | Methylation Beta Value | | Splice Junction Quantification | ```R library(TCGAbiolinks) query <- GDCquery(project = "TCGA-BRCA", data.category = "Simple Nucleotide Variation", data.type = "Masked Somatic Mutation", file.type = "maf.gz", access = "open") GDCdownload(query) data <- GDCprepare(query) ``` to exit R: ```r q() ``` #### Save TCGAbiolinks R file as CSV: ```R write.csv(data, "TCGA_BRCA_Mutations.csv", row.names = FALSE) ```

Another way is to just use the TCGA portal and download the entire cohort for each project

jyaacoub commented 1 month ago

Matching mutations in TCGA to davis

Best way is to use Hugo_Symbol from the MAF file which is the gene name.

Using Biomart to get UniProtIDs

## Map davis protein names to uniprot IDs with Biomart - Can't map any mutated proteins since those are unique and don't have unique uniprot IDs to identify them. Since davis are all human we can just get the entire db from biomart and filter using pandas ![Pasted image 20240509134605](https://github.com/jyaacoub/MutDTA/assets/50300488/0e64c400-de69-4b94-a5ef-ba2916b80876) Biomart URL query for our case would be like: ``` http://useast.ensembl.org/biomart/martview/bd870a98ae9d3290205dae6651366761?VIRTUALSCHEMANAME=default&ATTRIBUTES=hsapiens_gene_ensembl.default.feature_page.ensembl_gene_id|hsapiens_gene_ensembl.default.feature_page.ensembl_gene_id_version|hsapiens_gene_ensembl.default.feature_page.ensembl_transcript_id|hsapiens_gene_ensembl.default.feature_page.ensembl_transcript_id_version|hsapiens_gene_ensembl.default.feature_page.external_gene_name|hsapiens_gene_ensembl.default.feature_page.uniprotswissprot|hsapiens_gene_ensembl.default.feature_page.uniprot_gn_symbol&FILTERS=&VISIBLEPANEL=resultspanel ``` This matches 266 proteins from davis!

Code

```python #%% import os import pandas as pd from src.data_prep.processors import Processor root_dir = '../data/DavisKibaDataset/davis' df = pd.read_csv(f"{root_dir}/nomsa_binary_original_binary/full/XY.csv") df_unique = df.loc[df[['code']].drop_duplicates().index] df_unique.drop(['SMILE', 'pkd', 'prot_id'], axis=1, inplace=True) df_unique['code'] = df_unique['code'].str.upper() df_unique.columns = ['Gene name', 'prot_seq'] #%% df_mart = pd.read_csv('../downloads/biomart_hsapiens.tsv', sep='\t') df_mart = df_mart.loc[df_mart[['Gene name', 'UniProtKB/Swiss-Prot ID']].dropna().index] df_mart['Gene name'] = df_mart['Gene name'].str.upper() df_mart = df_mart.drop_duplicates(subset=['UniProtKB/Swiss-Prot ID']) #%% dfm = df_unique.merge(df_mart, on='Gene name', how='left') dfm[['Gene name', 'UniProtKB/Swiss-Prot ID']].to_csv('../downloads/davis_biomart_matches.csv') ```

- proteins with no matches were mutated or phosphorylated.

Using `Hugo_Symbol`

## using Hugo_symbol: TCGA MAF files have ["Hugo_Symbol"](https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/#protected-maf-file-structure) which corresponds to the gene name for that mutation ([HUGO](https://www.genenames.org/))...

Code

```python #%% import pandas as pd df_pid = pd.read_csv("../downloads/davis_biomart_matches.csv", index_col=0).dropna() #%% df = pd.read_csv("../downloads/TCGA_BRCA_Mutations.csv") df = df.loc[df['SWISSPROT'].dropna().index] df[['Gene', 'SWISSPROT']].head() df['uniprot'] = df['SWISSPROT'].str.split('.').str[0] #%% dfm = df_pid.merge(df, on='uniprot', how='inner') # %% dfm[~(dfm.code == dfm.Hugo_Symbol)] # %% df_pid = pd.read_csv('../downloads/davis_pids.csv')[['code']] dfm_cd = df_pid.merge(df, left_on='code', right_on='Hugo_Symbol', how='left') ```

Difference:

using biomart matched UniProts:    2324 total
Using raw davis HUGO protein name: 2321 total
biomart finds   3 extra
Code

```python #%% import pandas as pd df_pid = pd.read_csv("../downloads/davis_biomart_matches.csv", index_col=0).dropna() #%% df = pd.read_csv("../downloads/TCGA_BRCA_Mutations.csv") df = df.loc[df['SWISSPROT'].dropna().index] df[['Gene', 'SWISSPROT']].head() df['uniprot'] = df['SWISSPROT'].str.split('.').str[0] #%% dfm = df_pid.merge(df, on='uniprot', how='inner') # %% df_pid = pd.read_csv('../downloads/davis_pids.csv').drop_duplicates(subset='code')[['code']] dfmh= df_pid.merge(df, left_on='code', right_on='Hugo_Symbol', how='inner') # %% print(f"using biomart matched UniProts: {len(dfm)} total") print(f"Using raw davis HUGO protein name: {len(dfmh)} total") print(f"biomart finds {len(dfm[~(dfm.code == dfm.Hugo_Symbol)]):3d} extra") ```

jyaacoub commented 1 month ago

Things to consider for matching:

Columns for validating results:

These come from pathogenicy predictions from ensembl

not as useful:

jyaacoub commented 1 month ago

steps for matching dataset uniprot ids to TCGA mutations

TCGABiolinks uses https://gdc.cancer.gov/about-data/publications/mc3-2017 to get the MAF files.

1. Gather data for davis, kiba and PDBbind datasets

Need the following for matching, see above comment

Code to combine all datasets into a single csv: https://github.com/jyaacoub/MutDTA/commit/5696a7aff4d0b1ed52a3a29ee77b6db258b62e23

2. Download ALL TCGA projects as a single MAF

  1. Go to GDC portal
  2. Click dropdown for cases
  3. Enter selection view for "program"
  4. Check only TCGA
  5. Close out of the dropdown
  6. scroll do the bottom and hit Download 49.25 MB compressed MAF data

    3. Prefiltering TCGA

    • Filter out by Variant_Classification (only focus on Missense_Mutation for now)
    • Maybe also filter by Variant_Type to focus only on SINGLE nucleotide variants (SNP) -> doesnt matter since there is literally only 2 other rows with non-SNP variants:![[Pasted image 20240510144758.png|300]]
    • Filter out sequences longer than 1200, practically speaking any sequence longer than this is not useful since it would take forever to run.

      4. Match Uniprot IDs with Mutations

    • For davis we have to use hugo_symbols but the others should be fine

      5. Post filtering

    • filter for only those sequences with matching sequence length (to get rid of nonmatched isoforms)
    • Filter #1 (seq_len) : 7495 - 5054 = 2441
    • Filter out those that don't have the same reference seq according to the "Protein_position" and "Amino_acids" col
    • Filter #2 (ref_AA match): 2441 - 4 = 2437
Code

```python #%% 1.Gather data for davis,kiba and pdbbind datasets import os import pandas as pd import matplotlib.pyplot as plt from src.analysis.utils import combine_dataset_pids from src import config as cfg df_prots = combine_dataset_pids(dbs=[cfg.DATA_OPT.davis, cfg.DATA_OPT.PDBbind], # WARNING: just davis and pdbbind for now subset='test') #%% 2. Load TCGA data df_tcga = pd.read_csv('../downloads/TCGA_ALL.maf', sep='\t') #%% 3. Pre filtering df_tcga = df_tcga[df_tcga['Variant_Classification'] == 'Missense_Mutation'] df_tcga['seq_len'] = pd.to_numeric(df_tcga['Protein_position'].str.split('/').str[1]) df_tcga = df_tcga[df_tcga['seq_len'] < 5000] df_tcga['seq_len'].plot.hist(bins=100, title="sequence length histogram capped at 5K") plt.show() df_tcga = df_tcga[df_tcga['seq_len'] < 1200] df_tcga['seq_len'].plot.hist(bins=100, title="sequence length after capped at 1.2K") #%% 4. Merging df_prots with TCGA df_tcga['uniprot'] = df_tcga['SWISSPROT'].str.split('.').str[0] dfm = df_tcga.merge(df_prots[df_prots.db != 'davis'], left_on='uniprot', right_on='prot_id', how='inner') # for davis we have to merge on HUGO_SYMBOLS dfm_davis = df_tcga.merge(df_prots[df_prots.db == 'davis'], left_on='Hugo_Symbol', right_on='prot_id', how='inner') dfm = pd.concat([dfm,dfm_davis], axis=0) del dfm_davis # to save mem # %% 5. Post filtering step # 5.1. Filter for only those sequences with matching sequence length (to get rid of nonmatched isoforms) # seq_len_x is from tcga, seq_len_y is from our dataset tmp = len(dfm) # allow for some error due to missing amino acids from pdb file in PDBbind dataset # - assumption here is that isoforms will differ by more than 50 amino acids dfm = dfm[(dfm.seq_len_y <= dfm.seq_len_x) & (dfm.seq_len_x<= dfm.seq_len_y+50)] print(f"Filter #1 (seq_len) : {tmp:5d} - {tmp-len(dfm):5d} = {len(dfm):5d}") # 5.2. Filter out those that dont have the same reference seq according to the "Protein_position" and "Amino_acids" col # Extract mutation location and reference amino acid from 'Protein_position' and 'Amino_acids' columns dfm['mt_loc'] = pd.to_numeric(dfm['Protein_position'].str.split('/').str[0]) dfm = dfm[dfm['mt_loc'] < dfm['seq_len_y']] dfm[['ref_AA', 'mt_AA']] = dfm['Amino_acids'].str.split('/', expand=True) dfm['db_AA'] = dfm.apply(lambda row: row['prot_seq'][row['mt_loc']-1], axis=1) # Filter #2: Match proteins with the same reference amino acid at the mutation location tmp = len(dfm) dfm = dfm[dfm['db_AA'] == dfm['ref_AA']] print(f"Filter #2 (ref_AA match): {tmp:5d} - {tmp-len(dfm):5d} = {len(dfm):5d}") print('\n',dfm.db.value_counts()) # %% final seq len distribution n_bins = 25 lengths = dfm.seq_len_x fig, ax = plt.subplots(1, 1, figsize=(10, 5)) # Plot histogram n, bins, patches = ax.hist(lengths, bins=n_bins, color='blue', alpha=0.7) ax.set_title('TCGA final filtering for db matches') # Add counts to each bin for count, x, patch in zip(n, bins, patches): ax.text(x + 0.5, count, str(int(count)), ha='center', va='bottom') ax.set_xlabel('Sequence Length') ax.set_ylabel('Frequency') plt.tight_layout() plt.show() # %% Getting updated sequences def apply_mut(row): ref_seq = list(row['prot_seq']) ref_seq[row['mt_loc']-1] = row['mt_AA'] return ''.join(ref_seq) dfm['mt_seq'] = dfm.apply(apply_mut, axis=1) # %% dfm.to_csv("/cluster/home/t122995uhn/projects/data/tcga/tcga_maf_davis_pdbbind.csv") ```

jyaacoub commented 1 month ago

Mapped TCGA mutations counts:

1. RAW TCGA counts (from prefiltering step 3)

Sequence length histogram capped at 5K and 1.2K

![image](https://github.com/jyaacoub/MutDTA/assets/50300488/b808fbf5-1880-458f-a7f0-f672f0ecf7fc) ![image](https://github.com/jyaacoub/MutDTA/assets/50300488/66cb0022-a454-41f2-9155-def3fdc2b6d4)

2. Post filtering counts (step 5 from above)

All proteins:

Filter #1 (seq_len)     :  7495 -  5054 =  2441
Filter #2 (ref_AA match):  2441 -     4 =  2437

Test set:

Filter #1 (seq_len)     :  1047 -   791 =   256
Filter #2 (ref_AA match):   256 -     0 =   256

3. Final counts after all filters:

ALL PROTEINS - Sequence length histogram

![image](https://github.com/jyaacoub/MutDTA/assets/50300488/c91af6a1-b2ca-433d-a98f-6ea3749a7ddd)

TEST SET - Sequence length histogram

![image](https://github.com/jyaacoub/MutDTA/assets/50300488/1a733401-30cf-47d1-8c9c-79168cfe42e5)