awesome-TCGA - Curated list of TCGA resources. For more cancer-related notes, see my Cancer_notes
Scripts are being transitioned to use the curatedTCGAData and TCGAutils packages. See also cBioPortalData R interface to TCGA and the cBioPortal API.
Ramos, Marcel, Ludwig Geistlinger, Sehyun Oh, Lucas Schiffer, Rimsha Azhar, Hanish Kodali, Ino de Bruijn et al. "Multiomic Integration of Public Oncology Databases in Bioconductor", JCO Clinical Cancer Informatics 1 (2020), https://doi.org/10.1200/cci.19.00119
Survival analysis in genomics R tutorial/workflow. Cox-type penalized regression (Lasso, adaptive Lasso, Elastic Net, Group-Lasso, Sparse Group-Lasso, SCAD, SIS) and hierarchical Bayesian models for feature selection. Feature stability analysis. TCGA, BRCA, code on GitHub.
Zhao, Zhi, John Zobolas, Manuela Zucknick, and Tero Aittokallio. “Tutorial on Survival Modeling with Applications to Omics Data.” Edited by Jonathan Wren. Bioinformatics, March 5, 2024, btae132. https://doi.org/10.1093/bioinformatics/btae132.
TCGAplot - R package for pan-cancer TCGA analysis. DEG analysis, correlation analysis between gene expression and TMB, MSI, TIME, and promoter methylation. Visualization. Links to other online TCGA analysis tools. Paper
Public data is available through the TCGA2STAT R package, GitHub repo. First, install CNTools with BiocManager::install("CNTools"), then clone the repository with git clone https://github.com/zhandong/TCGA2STAT, and install from source with install.packages("TCGA2STAT_1.2.tar.gz", repos = NULL, type = "source").
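The installation steps above could be scripted roughly as follows. This is only a sketch: the run_install flag is a hypothetical guard added here so that nothing is installed by accident, and the tarball path assumes that TCGA2STAT_1.2.tar.gz is in the working directory after cloning the repository.

```r
# Packages named in the text above.
needed <- c("CNTools", "TCGA2STAT")

# Which of them are not installed yet?
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!installed]

# Hypothetical guard (not part of the repository): set to TRUE to install.
run_install <- FALSE

if (run_install && "CNTools" %in% missing_pkgs) {
  # CNTools is a Bioconductor package; BiocManager itself comes from CRAN.
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  BiocManager::install("CNTools")
}

if (run_install && "TCGA2STAT" %in% missing_pkgs) {
  # Assumption: the tarball from the cloned repository is in the working directory.
  install.packages("TCGA2STAT_1.2.tar.gz", repos = NULL, type = "source")
}
```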
First, get the data locally using the misc/TCGA_preprocessing.R script. Set the data_dir variable to the path where the downloaded data (*.rda files) is stored.
Same as TNMplot.Rmd, but for miRNA. Additionally, the PAM50-specific expression is plotted. The data is saved into an Excel file (TCGA_BRCA_miRNA.xlsx), with PAM50 annotations.
In all other scripts, change the data_dir variable to the path where the downloaded data is stored.
survival.Rmd
- a pipeline to run survival analyses for all cancers. Adjust the settings cancer = "BRCA" and selected_genes = "IGFBP3" to the desired cancer and gene IDs. These IDs should be the same in TCGA_summary.Rmd, which summarizes the output into Survival analysis summary. Note that if subcategories_in_all_cancers <- TRUE, survival analysis is done for all subcategories in all cancers, which is time-consuming.
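As a rough illustration of what a gene-centric survival comparison involves, the sketch below splits samples into high and low expression groups and runs a log-rank test. It is not the code from survival.Rmd: the data are simulated, and the median split is an assumption made for this example.

```r
library(survival)

set.seed(1)
n <- 200

# Simulated stand-ins for TCGA data: log2 expression of the selected gene,
# follow-up time, and event indicator (1 = death, 0 = censored).
expr   <- rnorm(n, mean = 8, sd = 1)
group  <- factor(ifelse(expr > median(expr), "high", "low"))  # assumed median split
time   <- rexp(n, rate = ifelse(group == "high", 0.10, 0.05))
status <- rbinom(n, size = 1, prob = 0.7)

# Log-rank test for a survival difference between the two groups.
fit <- survdiff(Surv(time, status) ~ group)
p_value <- pchisq(fit$chisq, df = length(fit$n) - 1, lower.tail = FALSE)

# Kaplan-Meier estimates per group, e.g., for plotting with plot(km).
km <- survfit(Surv(time, status) ~ group)

print(p_value)
```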
Analysis 1
- Selected genes, selected cancers, no clinical annotations. Results are in the <selected_genes>.<cancer>.Analysis1 folder.
Exploratory
- All genes, selected cancers, no clinical annotations. Not run by default.
Analysis 2
- Selected genes, all (or selected) cancers, no clinical annotations. Results are in the <selected_genes>.<cancer>.Analysis2 folder.
Analysis 3
- Selected genes, all (or selected) cancers, all unique clinical (sub)groups. Results are in the <selected_genes>.<cancer>.Analysis3 folder. Open the file global_stats.txt in Excel, sort by p-value (log-rank test), and explore in which clinical (sub)groups expression of the selected gene affects survival the most.
Analysis 4
- Selected genes, selected cancers, all combinations of clinical annotations. Not run by default.
Analysis 5
- Clinical-centric analysis. Selected cancer, selected clinical subcategory, survival difference between all pairs of subcategories. Only run for BRCA and OV cancers. Results are in the <selected_genes>.<cancer>.Analysis5 folder.
Analysis 6
- Dimensionality reduction of a gene signature across all cancers using NMF, PCA, or FA. For each cancer, extracts gene expression of a signature, reduces its dimensionality, plots a heatmap sorted by the first component, plots biplots, and saves eigenvectors in files named after cancer, signature, and method. They are used in correlations.Rmd. Not run by default.
survival_Neuroblastoma.Rmd
- survival analysis for Neuroblastoma samples from the TARGET database. Prepare the data with misc/cgdsr_preprocessing.R, see the Methods section for data description.
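The dimensionality reduction step described under Analysis 6 can be pictured with a small PCA example. This is a sketch on simulated data, not the repository's code; only PCA is shown (not NMF or FA), and whether the scripts save sample scores or gene loadings as "eigenvectors" is not stated here, so both are computed.

```r
set.seed(1)

# Simulated log2 expression: 10 signature genes (rows) x 50 samples (columns).
expr <- matrix(rnorm(10 * 50), nrow = 10,
               dimnames = list(paste0("gene", 1:10), paste0("sample", 1:50)))

# PCA with samples as observations, hence the transpose.
pca <- prcomp(t(expr), center = TRUE, scale. = TRUE)

# Sample scores on the first component and the resulting sample order,
# which could be used to sort heatmap columns.
pc1 <- pca$x[, "PC1"]
sample_order <- order(pc1)
expr_sorted <- expr[, sample_order]

# Gene loadings; these could be written to a file for later use.
loadings <- pca$rotation
```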
TCGA_summary.Rmd
- summary report for the survival.Rmd output: in which cancers and clinical subgroups expression of the selected gene affects survival the most. Change cancer = "BRCA" and selected_genes = "IGFBP3" to the desired cancer and gene IDs. Uses results from the <selected_genes>.<cancer>.Analysis* folders. See Survival analysis summary.
TCGA_CNV.Rmd
- Separate samples based on copy number variation of one or several genes, do survival and differential expression analysis on the two groups, and KEGG enrichment. An ad hoc analysis, requires manual intervention.
TCGA_stemness.Rmd
- correlation of a selected gene with stemness indices, for details, see Malta, Tathiane M., Artem Sokolov, Andrew J. Gentles, Tomasz Burzykowski, Laila Poisson, John N. Weinstein, Bożena Kamińska, et al. “Machine Learning Identifies Stemness Features Associated with Oncogenic Dedifferentiation.” Cell 173, no. 2 (April 2018): 338-354.e15. https://doi.org/10.1016/j.cell.2018.03.034. Results example PDF
TCGA_expression.Rmd
- Expression of selected genes across all TCGA cancers. Used for comparing expression of two or more genes. Change selected_genes <- "XXXX" (can be multiple genes). Generates a PDF file with a barplot of log2-expression of selected genes across all cancers, with standard errors. Example
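A barplot of mean log2 expression with standard errors per cancer could be produced along these lines. The sketch uses simulated values and placeholder cancer labels, and it writes to a temporary PDF; the actual script's data structures and plotting code may differ.

```r
set.seed(1)

# Simulated long-format data; cancer labels and values are placeholders.
dat <- data.frame(
  cancer    = rep(c("BRCA", "OV", "LUAD"), times = c(30, 20, 25)),
  log2_expr = rnorm(75, mean = 8, sd = 1.5)
)

# Mean and standard error of the mean per cancer.
means <- tapply(dat$log2_expr, dat$cancer, mean)
ses   <- tapply(dat$log2_expr, dat$cancer, function(x) sd(x) / sqrt(length(x)))

# Barplot with error bars, written to a temporary PDF file.
pdf_file <- tempfile(fileext = ".pdf")
pdf(pdf_file)
mids <- barplot(means, ylim = c(0, max(means + ses) * 1.1),
                ylab = "log2 expression")
arrows(mids, means - ses, mids, means + ses,
       angle = 90, code = 3, length = 0.05)
invisible(dev.off())
```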
TCGA_correlations.Rmd
- Co-expression analysis of selected gene vs. all others, in selected cancers. Genes best correlating with the selected gene may share common functions, described in the KEGG canonical pathway analysis section. Gene counts are converted to TPM. Multiple cancers can be analyzed, with the ComBat batch correction for the cohort effect. Change the selected_genes <- "XXXX" and cancer <- "YYYY" variables. The run saves two RData objects, data/Expression_YYYY.Rda and data/Correlation_XXXX_YYYY.Rda. This speeds up re-runs with the same settings. The full output is saved in results/Results_XXXX_YYYY.xlsx. Example PDF, Example Excel
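The compute-once-then-reload pattern described above can be sketched as follows. The expression matrix is simulated, the cache file name is hypothetical and placed in a temporary directory, and the use of Pearson correlation on log2(TPM + 1) values is an assumption of this example rather than a statement about the script.

```r
set.seed(1)

# Simulated TPM-like matrix: genes in rows, samples in columns.
expr <- matrix(rexp(5 * 40, rate = 0.1), nrow = 5,
               dimnames = list(c("XXXX", "G1", "G2", "G3", "G4"), NULL))
selected_gene <- "XXXX"

# Hypothetical cache location for this example.
cache_file <- file.path(tempdir(), "Correlation_XXXX_demo.Rda")

if (file.exists(cache_file)) {
  load(cache_file)  # restores the `correlations` data frame
} else {
  x <- log2(expr[selected_gene, ] + 1)
  others <- setdiff(rownames(expr), selected_gene)
  # Correlate the selected gene with every other gene.
  correlations <- do.call(rbind, lapply(others, function(g) {
    ct <- cor.test(x, log2(expr[g, ] + 1), method = "pearson")
    data.frame(gene = g, r = unname(ct$estimate), p_value = ct$p.value)
  }))
  # Strongest correlations first.
  correlations <- correlations[order(-abs(correlations$r)), ]
  save(correlations, file = cache_file)
}
```

On a second run with the same settings, the saved object is loaded instead of being recomputed.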
TCGA_correlations_BRCA.Rmd
- Co-expression analysis of selected gene vs. all others, in BRCA stratified by PAM50 annotations. The full output is saved in results/Results_XXXX_BRCA_PAM50.xlsx.
correlations_one_vs_one.Rmd
- Co-expression analysis of two genes across all cancers. The knitted HTML contains a table with correlation coefficients and p-values.
TCGA_DEGs.Rmd
- differential expression analysis of TCGA cohorts separated into groups with high/low expression of selected genes. The results are similar to the correlation results: most of the differentially expressed genes are also best correlated with the selected genes. This analysis explicitly looks at the extremes of the selected gene expression and identifies KEGG pathways that may be affected. Change selected_genes = "XXXX" and cancer = "YYYY". Manually run the script through line 254 to see which KEGG pathways are enriched. Then, run the code chunk on line 379 to generate a picture of the selected KEGG pathway (Example), and adjust the ![](hsa0YYYY.XXXX.png) image link accordingly. Then, recompile the whole document. Example PDF, Example Excel
TCGA_DEGs_clin_subcategories.Rmd
- differential expression analysis between pairs of clinical subgroups, e.g., within the "race" clinical category, between pairs of subcategories such as "black or african american" vs. "white". Output is saved in one Excel file, CANCERTYPE_DEGs_clin_subcategories.xlsx, with pairs of worksheets, one containing DEGs and another containing enrichment results. Data tables have headers describing individual comparisons and results.
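Enumerating all pairs of subcategories within one clinical category, as described above, can be done with combn. The annotation vector below is a placeholder (the "asian" value is added only to make three subcategories), and the differential expression step itself is left out.

```r
# Placeholder clinical annotation, one value per sample.
race <- c("white", "black or african american", "asian", "white", "asian")

# All unordered pairs of the observed subcategories.
pairs <- combn(sort(unique(race)), 2, simplify = FALSE)

for (p in pairs) {
  group1 <- which(race == p[1])  # sample indices in the first subcategory
  group2 <- which(race == p[2])  # sample indices in the second subcategory
  cat(sprintf("%s (n=%d) vs. %s (n=%d)\n",
              p[1], length(group1), p[2], length(group2)))
}
```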
PPI_Networks.Rmd
- experimenting with extracting and visualizing data from different PPI databases, for a selected gene.
Supplemental_R_script_1.R
- a modified script to run gene-specific or global survival analysis, from http://kmplot.com, Source
TCPA_correlation.Rmd
- experimenting with TCPA data.
misc folder
aracne._networks.R
- experimenting with the aracne.networks R package, https://www.bioconductor.org/packages/release/data/experiment/html/aracne.networks.html
calc_feature_length.R
- get length for gene symbols, resolving aliases
calcTPM.R
- function to calculate TPMs from gene counts, from https://github.com/AmyOlex/RNASeqBits/tree/master/R
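For orientation, the standard TPM calculation divides counts by gene length in kilobases and then scales each sample to sum to one million. The function below is a minimal sketch of that formula with a hypothetical name, not a copy of calcTPM.R, which may handle gene lengths differently.

```r
# counts: genes x samples matrix of raw counts.
# lengths_bp: gene lengths in bases, in the same order as the rows of counts.
counts_to_tpm <- function(counts, lengths_bp) {
  stopifnot(nrow(counts) == length(lengths_bp))
  rate <- counts / (lengths_bp / 1000)   # reads per kilobase, row-wise division
  t(t(rate) / colSums(rate)) * 1e6       # scale each sample (column) to 1e6
}

# Toy input: three genes, two samples.
counts <- matrix(c(100, 200, 300, 50, 50, 400), nrow = 3,
                 dimnames = list(c("A", "B", "C"), c("s1", "s2")))
lengths_bp <- c(A = 1000, B = 2000, C = 3000)

tpm <- counts_to_tpm(counts, lengths_bp)
```

In sample s1 the three genes have equal length-normalized counts, so each receives one third of the million.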
clinical_annotation_merge_BRCA.R
- merging XENA_classification.csv and BRCA_with_TP53_mutation.tsv into BRCA_XENA_clinical.csv
featureCounts2TPM.Rmd
- convert featureCounts output to gene symbol-annotated TPMs
cgdsr.R
- exploring the Cancer Genomic Data Server, http://www.cbioportal.org/study?id=msk_impact_2017, http://www.cbioportal.org/cgds_r.jsp, https://cran.r-project.org/web/packages/cgdsr/vignettes/cgdsr.pdf
cgdsr_preprocessing.R
- preprocessing the data to the format used in the scripts. Currently, processes TARGET Neuroblastoma data
overlap_significance.R
- simple example of Fisher's exact test
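A typical use of Fisher's exact test in this context is testing whether two gene lists overlap more than expected by chance. The numbers below are made up for illustration and are not taken from overlap_significance.R.

```r
universe <- 20000  # assumed number of genes tested overall
list_a   <- 300    # size of the first gene list
list_b   <- 500    # size of the second gene list
overlap  <- 40     # genes present in both lists

# 2x2 contingency table: membership in list A (rows) vs. list B (columns).
tab <- matrix(c(overlap,          list_a - overlap,
                list_b - overlap, universe - list_a - list_b + overlap),
              nrow = 2, byrow = TRUE,
              dimnames = list(in_A = c("yes", "no"), in_B = c("yes", "no")))

# One-sided test: is the overlap larger than expected by chance?
res <- fisher.test(tab, alternative = "greater")
print(res$p.value)
```

With these numbers the overlap expected by chance is 300 * 500 / 20000 = 7.5 genes, well below the 40 observed.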
PCA.R
- exercises on dimensionality reduction of gene signatures
RTCGA.R
- experimenting with the RTCGA package, https://bioconductor.org/packages/release/bioc/html/RTCGA.html
TCGA_preprocessing.R
- utilities for download and formatting of TCGA data. Use the load_data and summarize_data functions to load cancer-specific expression and clinical data.
survplot_0.0.7.tar.gz
- package needed for survival plots
XENA_BRCA.R
- Exploring data from Xena UCSC genome browser, https://xenabrowser.net/datapages/?cohort=TCGA%20Breast%20Cancer%20(BRCA)
data.TCGA folder. Some data are absent from the repository because of large size - download through links.
BRCA_with_TP53_mutation.tsv
- 355 TCGA samples with TP53 mutations, Source
CCLE_Cell_lines_annotations_20181226.txt
- CCLE cell line annotations, from https://portals.broadinstitute.org/ccle/data
CCR-13-0583tab1.xlsx
- TNBCtype predictions for 163 primary tumors in TCGA considered to be TNBC, classification into six TNBC subtypes. See http://cbc.mc.vanderbilt.edu/tnbc/index.php for details. "UNC" - unclassified. Supplementary table 1 from Mayer, Ingrid A., Vandana G. Abramson, Brian D. Lehmann, and Jennifer A. Pietenpol. “New Strategies for Triple-Negative Breast Cancer--Deciphering the Heterogeneity.” Clinical Cancer Research: An Official Journal of the American Association for Cancer Research 20, no. 4 (February 15, 2014): 782–90. doi:10.1158/1078-0432.CCR-13-0583.
Immune_resistant_program.xlsx
- A gene expression program associated with T cell exclusion and immune evasion. Supplementary Table S4 - genes associated with the immune resistance program, described in Methods. Jerby-Arnon, Livnat, Parin Shah, Michael S. Cuoco, Christopher Rodman, Mei-Ju Su, Johannes C. Melms, Rachel Leeson, et al. “A Cancer Cell Program Promotes T Cell Exclusion and Resistance to Checkpoint Blockade.” Cell 175, no. 4 (November 2018): 984-997.e24. https://doi.org/10.1016/j.cell.2018.09.006.
Lehmann_2019_Data1_BRCA_subtypes.xlsx
- Subtype annotation, ER, PR and HER2 calls for TCGA, CPTAC, METABRIC, MET500 samples. From Lehmann et al. 2019.
Lehmann_2019_Data2_TNBCsubtype.xlsx
- TNBCsubtype clinical information and cell type, mutational, immune signatures, for each TNBCtype subtype. From Lehmann et al. 2019.
gene_signatures_323.xls
- 323 gene signatures from Fan, Cheng, Aleix Prat, Joel S. Parker, Yufeng Liu, Lisa A. Carey, Melissa A. Troester, and Charles M. Perou. “Building Prognostic Models for Breast Cancer Patients Using Clinical Variables and Hundreds of Gene Expression Signatures.” BMC Medical Genomics 4 (January 9, 2011): 3. https://doi.org/10.1186/1755-8794-4-3.
PAM50_classification.txt
- sample classification into PAM50 types
patientsAll.tsv
- TCGA sample clinical information, including PAM50, from https://tcia.at/home
TCGA_489_UE.k4.txt
- Ovarian cancer classification into four subtypes, from https://github.com/aedin/OvarianCancerSubtypes/data/23257362
TCGA_Ancestry.xlsx
- Admixture and Ethnicity Calls of all TCGA samples. Table S1 from Carrot-Zhang, Jian, Nyasha Chambwe, Jeffrey S. Damrauer, Theo A. Knijnenburg, A. Gordon Robertson, Christina Yau, Wanding Zhou, et al. “Comprehensive Analysis of Genetic Ancestry and Its Molecular Correlates in Cancer.” Cancer Cell 37, no. 5 (May 2020): 639-654.e6. https://doi.org/10.1016/j.ccell.2020.04.012.
TCGA_cancer_counts.csv
- number of samples per cancer. Created by misc/TCGA_preprocessing.R
TCGA_cancers.xlsx
- TCGA cancer abbreviations, from http://www.liuzlab.org/TCGA2STAT/CancerDataChecklist.pdf
TCGA_genes.txt
- genes measured in TCGA RNA-seq experiments
TCGA_immune.xlsx
- Table S1 download. PanImmune Feature Matrix of Immune Characteristics. From Supplementary Material, Thorsson, Vésteinn, David L. Gibbs, Scott D. Brown, Denise Wolf, Dante S. Bortone, Tai-Hsien Ou Yang, Eduard Porta-Pardo, et al. “The Immune Landscape of Cancer.” Immunity, April 2018 - TCGA Immune signatures, six immune subtypes. Manually compiled immune gene lists, references in the text. Classification of each TCGA sample in Table S1. M1 macrophages and lymphocyte expression signature in general associated with improved OS.
TCGA_isoforms.xlsx
- Isoform switching analysis of TCGA data, tumor vs. normal. Consequences, survival prediction. Using IsoformSwitchAnalyzeR R package. Supplementary Table 1 - gene- and isoforms differentially expressed in all cancers. From Vitting-Seerup, Kristoffer, and Albin Sandelin. “The Landscape of Isoform Switches in Human Cancers.” Molecular Cancer Research 15, no. 9 (September 2017): 1206–20. https://doi.org/10.1158/1541-7786.MCR-16-0459.
TCGA_purity.xlsx
- Tumor purity estimates for TCGA samples. Tumor purity estimates according to four methods and the consensus method for all TCGA samples with available data. https://www.nature.com/articles/ncomms9971#supplementary-information. Supplementary Data 1 from Aran, Dvir, Marina Sirota, and Atul J. Butte. “Systematic Pan-Cancer Analysis of Tumour Purity.” Nature Communications 6, no. 1 (December 2015). https://doi.org/10.1038/ncomms9971.
TCGA_sample_types.xlsx
- Cancer types and subtypes for all TCGA samples. Includes BRCA subtypes, and subtyping of other cancers, where applicable. PMID: 29625050. Source
TCGA_stemness.xlsx
- Supplementary Table 1 - stemness indices for all TCGA samples. Stemness indices built from various data: mRNAsi - gene expression-based, EREG-miRNAsi - epigenomic- and gene expression-based, mDNAsi, EREG-mDNAsi - same but methylation-based, DMPsi - differentially methylated probes-based, ENHsi - enhancer-based. Each stemness index (si) ranges from low (zero) to high (one) stemness. From Malta, Tathiane M., Artem Sokolov, Andrew J. Gentles, Tomasz Burzykowski, Laila Poisson, John N. Weinstein, Bożena Kamińska, et al. “Machine Learning Identifies Stemness Features Associated with Oncogenic Dedifferentiation.” Cell 173, no. 2 (April 2018): 338-354.e15. https://doi.org/10.1016/j.cell.2018.03.034.
TCGA.bib
- BibTeX of TCGA-related references
TCPA_proteins.txt
- List of 224 proteins profiled by RPPA technology. The Cancer Proteome Atlas, http://tcpaportal.org/tcpa/. Data download: http://tcpaportal.org/tcpa/download.html. Paper: http://cancerres.aacrjournals.org/content/77/21/e51
XENA_classification.csv
- PAM50 and other clinical data, Source
Sample annotations by ovarian cancer subtypes. https://github.com/aedin/OvarianCancerSubtypes
Uhlen, Mathias, Cheng Zhang, Sunjae Lee, Evelina Sjöstedt, Linn Fagerberg, Gholamreza Bidkhori, Rui Benfeitas, et al. “A Pathology Atlas of the Human Cancer Transcriptome.” Science (New York, N.Y.) 357, no. 6352 (August 18, 2017). doi:10.1126/science.aan2507. http://science.sciencemag.org/content/357/6352/eaan2507
Supplementary material http://science.sciencemag.org/content/suppl/2017/08/16/357.6352.eaan2507.DC1
Table S2
- summary of tissue specific expression for each gene, in normal and cancer tissues.
Table S6
- summary of survival prognostic value, with a simple "favorable/unfavorable" label for each gene. Each worksheet corresponds to a different cancer.
Table S8
- per-gene summary, in which cancers it is prognostic of survival.
The Metastatic Breast Cancer Project is a patient-driven initiative. This study includes genomic data, patient-reported data (pre-pended as PRD), medical record data (MedR), and pathology report data (PATH). All of the titles and descriptive text for the clinical data elements have been finalized in partnership with numerous patients in the project. As these data were generated in a research, not a clinical, laboratory, they are for research purposes only and cannot be used to inform clinical decision-making. All annotations have been de-identified. More information is available at www.mbcproject.org.
Data download: http://www.cbioportal.org/study?id=brca_mbcproject_wagle_2017#summary. Data includes 78 patients, 103 samples, sample-specific clinical annotations, putative copy-number from GISTIC, and MutSig regions.