Each data set combines multiple studies with different gene sets and clinical variables. The studies can be analyzed as an ensemble (meta-analysis) or merged into one large matrix. With standard statistical methods, meta-analysis is generally more powerful, because data are lost when variables from different studies are aligned and merged.
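The data loss from merging can be illustrated with a minimal sketch. It assumes the merged matrix keeps only genes measured in every study (no imputation); the study names and gene lists are hypothetical and are not taken from any of the data sets listed here.

```python
# Hypothetical studies: each maps a study name to the set of genes it measured.
studies = {
    "study_a": {"ESR1", "ERBB2", "AURKA", "MKI67", "GATA3"},
    "study_b": {"ESR1", "ERBB2", "AURKA", "MKI67"},
    "study_c": {"ESR1", "ERBB2", "AURKA", "FOXA1"},
}


def shared_genes(studies):
    """Genes measured in every study.

    Assumption: a merged matrix without imputation can only keep these columns.
    """
    gene_sets = list(studies.values())
    return set.intersection(*gene_sets) if gene_sets else set()


def dropped_genes(studies):
    """Per study, the genes lost when the matrix is restricted to shared genes."""
    common = shared_genes(studies)
    return {name: genes - common for name, genes in studies.items()}


common = shared_genes(studies)
lost = dropped_genes(studies)

print("kept in merged matrix:", sorted(common))
for name in sorted(lost):
    print(name, "loses", sorted(lost[name]))
```

A per-study (meta-analysis) approach would instead fit each study on its own full gene set and combine the results afterwards, so none of these genes would need to be discarded up front.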
CoINcIDE is an unsupervised meta-graph clustering algorithm that sub-types tumors using gene expression profiles from multiple patient study cohorts: paper, author's GitHub.
The author's GitHub includes useful but outdated R code for processing and merging microarray data sets from GEO. Updating the code for current use is ongoing; see the cancer branch of the fork of the author's GitHub repository in CoINcIDE.
curatedBreastData: 4,923 breast tumor microarray expression sets from 2,613 patients in 20 studies published as a Bioconductor R package [paper].
Haibe-Kains et al. (2012) develop a Gaussian mixture subtype classification model (SCM) using microarray expression levels of three key genes (ER, HER2, and AURKA) from breast cancer tumor samples. They compare it favorably to two other published SCMs and to three published single sample predictor (SSP) classifiers based on hierarchical clustering, including the commercially available PAM50 molecular subtyping system. These comparison models use dozens to hundreds of genes. An associated Bioconductor package, genefu, and the code to reproduce their findings are available.
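To make the Gaussian mixture idea concrete, here is a minimal one-dimensional sketch: a two-component mixture fitted with EM to the expression of a single gene, which then gives each sample a posterior probability of belonging to the high-expression group. This is an illustration of the general technique only. It is not the authors' SCM and not the genefu implementation, and the function names and toy expression values are hypothetical.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_two_component_gmm(values, n_iter=200, min_var=1e-6):
    """Fit a 1-D, two-component Gaussian mixture with EM.

    Returns (weights, means, variances), each a list of length 2.
    """
    xs = sorted(values)
    half = len(xs) // 2
    # Initialise the means from the lower and upper halves of the sorted data.
    means = [sum(xs[:half]) / half, sum(xs[half:]) / (len(xs) - half)]
    overall_mean = sum(xs) / len(xs)
    overall_var = max(sum((x - overall_mean) ** 2 for x in xs) / len(xs), min_var)
    variances = [overall_var, overall_var]
    weights = [0.5, 0.5]

    for _ in range(n_iter):
        # E-step: responsibility of each component for each value.
        resp = []
        for x in values:
            p = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)]
            total = p[0] + p[1]
            resp.append([0.5, 0.5] if total == 0.0 else [p[0] / total, p[1] / total])
        # M-step: update weight, mean, and variance of each component.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue  # leave an empty component unchanged
            weights[k] = nk / len(values)
            means[k] = sum(r[k] * x for r, x in zip(resp, values)) / nk
            var_k = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, values)) / nk
            variances[k] = max(var_k, min_var)
    return weights, means, variances


def posterior_high(x, weights, means, variances):
    """Posterior probability that x belongs to the higher-mean component."""
    hi = 0 if means[0] > means[1] else 1
    p = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)]
    total = p[0] + p[1]
    return p[hi] / total if total > 0.0 else 0.5


# Hypothetical, clearly bimodal expression values for one gene.
low_group = [-2.0 + 0.1 * i for i in range(-5, 6)]
high_group = [2.0 + 0.1 * i for i in range(-5, 6)]
expression = low_group + high_group

weights, means, variances = fit_two_component_gmm(expression)
print("component means:", sorted(means))
print("P(high | x = 1.8):", posterior_high(1.8, weights, means, variances))
```

A model built on three genes would combine this kind of evidence across genes; how the published SCM does that is described in the paper and is not reproduced here.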
MetaGxBreast: 39 breast cancer microarray expression datasets spanning 10,004 samples. Survival information is available for 6,847 patients, including overall survival (n = 4,425), metastasis free survival (n = 2,695), and relapse free survival (n = 1,858) [package][paper].
PDF copies of the papers are in the lit directory.
1,073 samples are already included in the MetaGxBreast dataset. These samples have other -omics assay data available for data integration analyses (whole genome sequencing, DNA methylation, proteomics, etc.).
Link to a multi-omics breast cancer sub-typing paper with analysis data available from TCGA. This is a good review for understanding current thinking about breast cancer.
TCGA pan-cancer literature index
163 normal tissue samples from breast cancer patients: search table
1,145 blood samples from breast cancer patients: search table
State-of-the-art tumor classification: Dynamic Classification Using Case-Specific Training Cohorts Outperforms Static Gene Expression Signatures in Breast Cancer