Open jyaacoub opened 1 month ago
Best way is to use Hugo_Symbol
from the MAF file which is the gene name.
## Map davis protein names to uniprot IDs with Biomart
- Can't map any mutated proteins since those are unique and don't have unique uniprot IDs to identify them.
Since davis are all human we can just get the entire db from biomart and filter using pandas
![Pasted image 20240509134605](https://github.com/jyaacoub/MutDTA/assets/50300488/0e64c400-de69-4b94-a5ef-ba2916b80876)
Biomart URL query for our case would be like:
```
http://useast.ensembl.org/biomart/martview/bd870a98ae9d3290205dae6651366761?VIRTUALSCHEMANAME=default&ATTRIBUTES=hsapiens_gene_ensembl.default.feature_page.ensembl_gene_id|hsapiens_gene_ensembl.default.feature_page.ensembl_gene_id_version|hsapiens_gene_ensembl.default.feature_page.ensembl_transcript_id|hsapiens_gene_ensembl.default.feature_page.ensembl_transcript_id_version|hsapiens_gene_ensembl.default.feature_page.external_gene_name|hsapiens_gene_ensembl.default.feature_page.uniprotswissprot|hsapiens_gene_ensembl.default.feature_page.uniprot_gn_symbol&FILTERS=&VISIBLEPANEL=resultspanel
```
This matches 266 proteins from davis!
```python
#%%
import os
import pandas as pd
from src.data_prep.processors import Processor
root_dir = '../data/DavisKibaDataset/davis'
df = pd.read_csv(f"{root_dir}/nomsa_binary_original_binary/full/XY.csv")
df_unique = df.loc[df[['code']].drop_duplicates().index]
df_unique.drop(['SMILE', 'pkd', 'prot_id'], axis=1, inplace=True)
df_unique['code'] = df_unique['code'].str.upper()
df_unique.columns = ['Gene name', 'prot_seq']
#%%
df_mart = pd.read_csv('../downloads/biomart_hsapiens.tsv', sep='\t')
df_mart = df_mart.loc[df_mart[['Gene name', 'UniProtKB/Swiss-Prot ID']].dropna().index]
df_mart['Gene name'] = df_mart['Gene name'].str.upper()
df_mart = df_mart.drop_duplicates(subset=['UniProtKB/Swiss-Prot ID'])
#%%
dfm = df_unique.merge(df_mart, on='Gene name', how='left')
dfm[['Gene name', 'UniProtKB/Swiss-Prot ID']].to_csv('../downloads/davis_biomart_matches.csv')
```
Code
## using Hugo_symbol:
TCGA MAF files have ["Hugo_Symbol"](https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/#protected-maf-file-structure) which corresponds to the gene name for that mutation ([HUGO](https://www.genenames.org/))...
```python
#%%
import pandas as pd
df_pid = pd.read_csv("../downloads/davis_biomart_matches.csv", index_col=0).dropna()
#%%
df = pd.read_csv("../downloads/TCGA_BRCA_Mutations.csv")
df = df.loc[df['SWISSPROT'].dropna().index]
df[['Gene', 'SWISSPROT']].head()
df['uniprot'] = df['SWISSPROT'].str.split('.').str[0]
#%%
dfm = df_pid.merge(df, on='uniprot', how='inner')
# %%
dfm[~(dfm.code == dfm.Hugo_Symbol)]
# %%
df_pid = pd.read_csv('../downloads/davis_pids.csv')[['code']]
dfm_cd = df_pid.merge(df, left_on='code', right_on='Hugo_Symbol', how='left')
```
Code
using biomart matched UniProts: 2324 total
Using raw davis HUGO protein name: 2321 total
biomart finds 3 extra
```python #%% import pandas as pd df_pid = pd.read_csv("../downloads/davis_biomart_matches.csv", index_col=0).dropna() #%% df = pd.read_csv("../downloads/TCGA_BRCA_Mutations.csv") df = df.loc[df['SWISSPROT'].dropna().index] df[['Gene', 'SWISSPROT']].head() df['uniprot'] = df['SWISSPROT'].str.split('.').str[0] #%% dfm = df_pid.merge(df, on='uniprot', how='inner') # %% df_pid = pd.read_csv('../downloads/davis_pids.csv').drop_duplicates(subset='code')[['code']] dfmh= df_pid.merge(df, left_on='code', right_on='Hugo_Symbol', how='inner') # %% print(f"using biomart matched UniProts: {len(dfm)} total") print(f"Using raw davis HUGO protein name: {len(dfmh)} total") print(f"biomart finds {len(dfm[~(dfm.code == dfm.Hugo_Symbol)]):3d} extra") ```
10-Variant_Classification
Only focus on missense mutations56-Protein_position
tells us <mutation location>/seqlen
seqlen
57-Amino_acids
tells us <reference AA>/<mutated AA>
.
reference AA
at mutation location
(see above)
NOTE: If not enough matching then we need to get the canonical sequence from Uniprot!These come from pathogenicy predictions from ensembl
77-SIFT
a predictive model that returns the impact of mutation on protein function (SIFT).78-PolyPhen
another predictive model similar to SIFT, (PolyPhen-2)not as useful:
123-IMPACT
only looks at the Variant_Classification
to deduce impact see docsTCGABiolinks uses https://gdc.cancer.gov/about-data/publications/mc3-2017 to get the MAF files.
Need the following for matching, see above comment
XY.csv
db_idx,db,code,prot_id,seq_len,prot_seq
Download 49.25 MB compressed MAF data
Variant_Classification
(only focus on Missense_Mutation
for now)Variant_Type
to focus only on SINGLE nucleotide variants (SNP
) -> doesnt matter since there is literally only 2 other rows with non-SNP variants:![[Pasted image 20240510144758.png|300]]Filter #1 (seq_len) : 7495 - 5054 = 2441
Filter #2 (ref_AA match): 2441 - 4 = 2437
```python #%% 1.Gather data for davis,kiba and pdbbind datasets import os import pandas as pd import matplotlib.pyplot as plt from src.analysis.utils import combine_dataset_pids from src import config as cfg df_prots = combine_dataset_pids(dbs=[cfg.DATA_OPT.davis, cfg.DATA_OPT.PDBbind], # WARNING: just davis and pdbbind for now subset='test') #%% 2. Load TCGA data df_tcga = pd.read_csv('../downloads/TCGA_ALL.maf', sep='\t') #%% 3. Pre filtering df_tcga = df_tcga[df_tcga['Variant_Classification'] == 'Missense_Mutation'] df_tcga['seq_len'] = pd.to_numeric(df_tcga['Protein_position'].str.split('/').str[1]) df_tcga = df_tcga[df_tcga['seq_len'] < 5000] df_tcga['seq_len'].plot.hist(bins=100, title="sequence length histogram capped at 5K") plt.show() df_tcga = df_tcga[df_tcga['seq_len'] < 1200] df_tcga['seq_len'].plot.hist(bins=100, title="sequence length after capped at 1.2K") #%% 4. Merging df_prots with TCGA df_tcga['uniprot'] = df_tcga['SWISSPROT'].str.split('.').str[0] dfm = df_tcga.merge(df_prots[df_prots.db != 'davis'], left_on='uniprot', right_on='prot_id', how='inner') # for davis we have to merge on HUGO_SYMBOLS dfm_davis = df_tcga.merge(df_prots[df_prots.db == 'davis'], left_on='Hugo_Symbol', right_on='prot_id', how='inner') dfm = pd.concat([dfm,dfm_davis], axis=0) del dfm_davis # to save mem # %% 5. Post filtering step # 5.1. Filter for only those sequences with matching sequence length (to get rid of nonmatched isoforms) # seq_len_x is from tcga, seq_len_y is from our dataset tmp = len(dfm) # allow for some error due to missing amino acids from pdb file in PDBbind dataset # - assumption here is that isoforms will differ by more than 50 amino acids dfm = dfm[(dfm.seq_len_y <= dfm.seq_len_x) & (dfm.seq_len_x<= dfm.seq_len_y+50)] print(f"Filter #1 (seq_len) : {tmp:5d} - {tmp-len(dfm):5d} = {len(dfm):5d}") # 5.2. Filter out those that dont have the same reference seq according to the "Protein_position" and "Amino_acids" col # Extract mutation location and reference amino acid from 'Protein_position' and 'Amino_acids' columns dfm['mt_loc'] = pd.to_numeric(dfm['Protein_position'].str.split('/').str[0]) dfm = dfm[dfm['mt_loc'] < dfm['seq_len_y']] dfm[['ref_AA', 'mt_AA']] = dfm['Amino_acids'].str.split('/', expand=True) dfm['db_AA'] = dfm.apply(lambda row: row['prot_seq'][row['mt_loc']-1], axis=1) # Filter #2: Match proteins with the same reference amino acid at the mutation location tmp = len(dfm) dfm = dfm[dfm['db_AA'] == dfm['ref_AA']] print(f"Filter #2 (ref_AA match): {tmp:5d} - {tmp-len(dfm):5d} = {len(dfm):5d}") print('\n',dfm.db.value_counts()) # %% final seq len distribution n_bins = 25 lengths = dfm.seq_len_x fig, ax = plt.subplots(1, 1, figsize=(10, 5)) # Plot histogram n, bins, patches = ax.hist(lengths, bins=n_bins, color='blue', alpha=0.7) ax.set_title('TCGA final filtering for db matches') # Add counts to each bin for count, x, patch in zip(n, bins, patches): ax.text(x + 0.5, count, str(int(count)), ha='center', va='bottom') ax.set_xlabel('Sequence Length') ax.set_ylabel('Frequency') plt.tight_layout() plt.show() # %% Getting updated sequences def apply_mut(row): ref_seq = list(row['prot_seq']) ref_seq[row['mt_loc']-1] = row['mt_AA'] return ''.join(ref_seq) dfm['mt_seq'] = dfm.apply(apply_mut, axis=1) # %% dfm.to_csv("/cluster/home/t122995uhn/projects/data/tcga/tcga_maf_davis_pdbbind.csv") ```
mt_loc
doesnt match reference and db amino acid!
![image](https://github.com/jyaacoub/MutDTA/assets/50300488/b808fbf5-1880-458f-a7f0-f672f0ecf7fc) ![image](https://github.com/jyaacoub/MutDTA/assets/50300488/66cb0022-a454-41f2-9155-def3fdc2b6d4)
Filter #1 (seq_len) : 7495 - 5054 = 2441
Filter #2 (ref_AA match): 2441 - 4 = 2437
Filter #1 (seq_len) : 1047 - 791 = 256
Filter #2 (ref_AA match): 256 - 0 = 256
![image](https://github.com/jyaacoub/MutDTA/assets/50300488/c91af6a1-b2ca-433d-a98f-6ea3749a7ddd)
![image](https://github.com/jyaacoub/MutDTA/assets/50300488/1a733401-30cf-47d1-8c9c-79168cfe42e5)
primary tasks
Downloading and getting TCGA MAF files
Downloading using *TCGAbiolinks*
### What project to use? *"TCGA projects are organized by cancer type or subtype."* Updated projects can be found [here](https://portal.gdc.cancer.gov/analysis_page?app=Projects), but lets just focus on [*TCGA-BRCA*](https://portal.gdc.cancer.gov/v1/projects/TCGA-BRCA) for now - using the [legacy version](https://portal.gdc.cancer.gov/v1/) of the data portal we can gain access to the open version of TCGA-BRCA instead of the newer but closed version ### How to download TCGA-BRCA mafs? Update sys packages ``` sudo apt update sudo apt upgrade -y ``` #### Install R [README](https://cran.r-project.org/bin/linux/ubuntu/fullREADME.html) Add apt repo: ```bash sudo add-apt-repository "deb https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/" ``` Install R ```bash sudo apt update sudo apt install r-base ``` Install sys packages required by R ```bash sudo apt install libcurl4-openssl-dev libssl-dev libxml2-dev -y ``` #### Install [TCGABiolinks package](https://www.bioconductor.org/packages/release/bioc/html/TCGAbiolinks.html) make sure to run in sudo mode ``` sudo -i R ``` Then install: ```r if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("TCGAbiolinks") ``` #### download TCGA-BRCA | sort(harmonized.data.type) | | :----------------------------- | | Aggregated Somatic Mutation | | ... | | **Masked Somatic Mutation** | | **Masked Somatic Mutation** | | ... | | Methylation Beta Value | | Splice Junction Quantification | ```R library(TCGAbiolinks) query <- GDCquery(project = "TCGA-BRCA", data.category = "Simple Nucleotide Variation", data.type = "Masked Somatic Mutation", file.type = "maf.gz", access = "open") GDCdownload(query) data <- GDCprepare(query) ``` to exit R: ```r q() ``` #### Save TCGAbiolinks R file as CSV: ```R write.csv(data, "TCGA_BRCA_Mutations.csv", row.names = FALSE) ```
Another way is to just use the TCGA portal and download the entire cohort for each project