Cross-platform normalization enables machine learning model training on microarray and RNA-seq data simultaneously

Published article: Foltz, S. M., Greene, C. S. & Taroni, J. N. Cross-platform normalization enables machine learning model training on microarray and RNA-seq data simultaneously. Commun Biol 6, 222 (2023). https://doi.org/10.1038/s42003-023-04588-6

Table of Contents

Summary
Requirements
- Obtaining and running the Docker container
Download data from The Cancer Genome Atlas (TCGA)
Recreate manuscript results
Methods
- Machine Learning Pipeline
Running individual experiments
- Machine learning
- Other scripts
Manuscript versions
Funding

Summary

We performed a series of supervised and unsupervised machine learning evaluations, as well as differential expression and pathway analyses, to assess which normalization methods are best suited for combining data from microarray and RNA-seq platforms.

We evaluated seven normalization approaches across all of these analyses:

log-transformation (LOG)
non-paranormal transformation (NPN)
quantile normalization (QN)
quantile normalization via CrossNorm
quantile normalization followed by z-scoring (QN-Z)
Training Distribution Matching (TDM)
z-scoring (Z)
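
As a rough illustration of the two simplest entries in this list, the sketch below log-transforms and then z-scores the values of a single gene. It is not the code used in this repository: the function names, the log2(x + 1) form of the transformation, and the use of the population standard deviation are assumptions made for the example.

```python
import math
from statistics import mean, pstdev


def log_transform(values):
    """Apply log2(x + 1) to non-negative expression values (assumed form)."""
    return [math.log2(v + 1.0) for v in values]


def z_score(values):
    """Center values to mean 0 and scale to unit standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population SD; an assumption for this sketch
    if sigma == 0:
        # A constant gene carries no information; return zeros instead of dividing by 0.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical expression values for one gene across five samples.
gene_counts = [0, 1, 3, 7, 15]
logged = log_transform(gene_counts)
scaled = z_score(logged)
print(logged)
print(scaled)
```

Applying z-scoring within each platform separately places both platforms on a common, unitless scale, which is the intuition behind the Z and QN-Z approaches.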

We also explored using Seurat to normalize array and RNA-seq data, but many experimental conditions could not be integrated because too few samples were available at the edges of our titration protocol.

Requirements

We recommend using the Docker image envest/rnaseq_titration_results:R-4.1.2 to handle package and dependency installation. See docker/R-4.1.2/Dockerfile for more information.

Our analysis (v2.3) was run using 7 cores on an AWS instance with 16 cores, 128 GB memory, and an allocated 1 TB of space.

Obtaining and running the Docker container

Pull the Docker image using:

docker pull envest/rnaseq_titration_results:R-4.1.2

Then run the command to start up a container, replacing [PASSWORD] with your own password:

docker run --mount type=bind,target=/home/rstudio,source=$PWD -e PASSWORD=[PASSWORD] -p 8787:8787 envest/rnaseq_titration_results:R-4.1.2

Navigate to http://localhost:8787/ and log in to the RStudio server with the username rstudio and the password you set above.

Download data from The Cancer Genome Atlas (TCGA)

TCGA data from 520 breast cancer (BRCA) patients used for these analyses are available from Zenodo.

Data from 150 glioblastoma (GBM) patients is available from the Genomic Data Commons PanCan Atlas.

To download data, run the data download script in the top directory:

bash download_TCGA_data.sh

Recreate manuscript results

After data has been downloaded, running

bash run_all_analyses_and_plots.sh [cancer type]

where

[cancer type] is BRCA, GBM, or both

with v2.3 of this repository will reproduce the results presented in our manuscript. We recommend running all analyses within the project Docker container.

Methods

Machine Learning Pipeline

Here's a schematic overview of our machine learning experiments:

Figure: Overview of supervised and unsupervised machine learning experiments.

Matched samples run on both microarray and RNA-seq were split into a training set (2/3) and a holdout set (1/3).
RNA-seq samples were "titrated" into the training set, 10% at a time (0-100%), replacing their matched array samples, resulting in eleven training sets for each normalization method.
Machine learning applications:
- Supervised learning: We trained three classifiers (LASSO, linear SVM, and Random Forest) on each training set and tested them on the microarray and RNA-seq holdout sets. The models were trained to predict tumor subtype (both cancer types have 5 subtypes) and the binary mutation status of TP53 and PIK3CA.
- Unsupervised learning: We projected holdout sets onto and back out of the training set space using Principal Components Analysis to obtain reconstructed holdout sets. We then used the trained subtype classifiers to predict on the reconstructed holdout sets. PLIER (Pathway-Level Information ExtractoR) identified coordinated sets of genes in each cancer type.
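
The split and titration steps above can be sketched as follows. This is a minimal illustration rather than the pipeline's implementation: the function names are hypothetical, the split is not stratified by subtype, and the RNA-seq sets are nested (samples swapped in at 10% stay swapped in at 20%), which is an assumption made for simplicity.

```python
import random


def split_train_holdout(sample_ids, seed=0):
    """Randomly split matched samples into training (2/3) and holdout (1/3)."""
    ids = sorted(sample_ids)
    random.Random(seed).shuffle(ids)
    n_train = round(len(ids) * 2 / 3)
    return ids[:n_train], ids[n_train:]


def titrate(train_ids, seed=0):
    """Return {percent RNA-seq: {sample id: platform}} for 0, 10, ..., 100.

    At each level, the chosen samples use their RNA-seq measurement in place
    of the matched array measurement; all other samples stay on array.
    """
    order = list(train_ids)
    random.Random(seed).shuffle(order)
    training_sets = {}
    for pct in range(0, 101, 10):
        n_seq = round(len(order) * pct / 100)
        seq_ids = set(order[:n_seq])
        training_sets[pct] = {
            s: ("rnaseq" if s in seq_ids else "array") for s in train_ids
        }
    return training_sets


# Hypothetical sample identifiers.
samples = [f"S{i}" for i in range(30)]
train, holdout = split_train_holdout(samples)
training_sets = titrate(train)
for pct, assignment in training_sets.items():
    n_seq = sum(platform == "rnaseq" for platform in assignment.values())
    print(pct, n_seq, len(assignment) - n_seq)
```

Each of the eleven assignments would then be normalized with every method before model training, so the total training set size stays constant while only the platform mix changes.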

Running individual experiments

Machine learning

To run the machine learning pipeline, run the following from the top directory:

bash run_machine_learning_experiments.sh [cancer type] [prediction task] [n cores]

where

[cancer type] is BRCA or GBM
[prediction task] is subtype, TP53, or PIK3CA
[n cores] is the number of cores you want to run in parallel
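
To run every combination of cancer type and prediction task, the command lines can be generated programmatically. The helper below is a hypothetical convenience sketch, not part of this repository; it only builds and prints the commands, using the script name and argument values listed above.

```python
import shlex

CANCER_TYPES = ("BRCA", "GBM")
PREDICTION_TASKS = ("subtype", "TP53", "PIK3CA")


def build_commands(n_cores):
    """Build one argument list per (cancer type, prediction task) pair."""
    if n_cores < 1:
        raise ValueError("n_cores must be at least 1")
    return [
        ["bash", "run_machine_learning_experiments.sh", cancer, task, str(n_cores)]
        for cancer in CANCER_TYPES
        for task in PREDICTION_TASKS
    ]


# Print the commands for review instead of executing them.
for cmd in build_commands(n_cores=7):
    print(shlex.join(cmd))
```

Each argument list could be passed to `subprocess.run(cmd, check=True)` from the top directory to execute the experiments one after another.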

Other scripts

To count the publicly available microarray and RNA-seq samples in GEO and ArrayExpress, run

python3 search_geo_arrayexpress.py

and check the output in results/array_rnaseq_ratio.

To identify PLIER pathways that are detected more frequently with full-sample-size data than with half-sample-size data, run

Rscript -e "rmarkdown::render('8-PLIER_pathways_analysis.Rmd', clean = TRUE)"

and examine the results in 8-PLIER_pathways_analysis.nb.html.

Manuscript versions

Version	Relevant links
v2.3	Published article, Figshare+ data, Data for plots
v2.2	Figshare+ data, Data for plots
v2.1	Figshare+ data, Data for plots
v2.0	Figshare+ data, Data for plots
v1.1	Figshare full results
v1.0	Pre-print

Funding

This work was supported by the Gordon and Betty Moore Foundation [GBMF 4552], Alex's Lemonade Stand Foundation [GR-000002471], and the National Institutes of Health [T32-AR007442, U01-TR001263, R01-CA237170, K12GM081259].

FAQ

Can I normalize array data to match RNA-seq data?

We generally do not advise this study design. We expect array data to have less precision at higher expression levels due to saturation, while counts-based RNA-seq data does not have that problem. We recommend reshaping the data expected to have more dynamic range (RNA-seq) to fit the narrower and less precise (array) distribution. See also TDM FAQs.
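
To make the recommended direction concrete, the sketch below maps one sample's RNA-seq values onto a reference distribution taken from array data, using a simple rank-based quantile mapping. It is a simplified illustration under stated assumptions, not the QN or TDM implementation used in this repository: the function name and example values are hypothetical, and ties are broken by input order rather than averaged.

```python
def quantile_map(values, reference):
    """Replace each value with the reference value at the same rank-based quantile."""
    if not reference:
        raise ValueError("reference distribution must not be empty")
    ref_sorted = sorted(reference)
    n, m = len(values), len(ref_sorted)
    # Indices of `values` from smallest to largest (ties keep input order).
    order = sorted(range(n), key=lambda i: values[i])
    mapped = [0.0] * n
    for rank, idx in enumerate(order):
        q = rank / (n - 1) if n > 1 else 0.5  # quantile in [0, 1]
        pos = q * (m - 1)                     # fractional index into the reference
        lo = int(pos)
        hi = min(lo + 1, m - 1)
        frac = pos - lo
        # Linear interpolation between neighboring reference values.
        mapped[idx] = ref_sorted[lo] * (1 - frac) + ref_sorted[hi] * frac
    return mapped


# Hypothetical values: a narrow array reference and a wide-range RNA-seq sample.
array_reference = [2.0, 4.0, 6.0, 8.0]
rnaseq_sample = [0.0, 1000.0, 10.0, 100.0]
print(quantile_map(rnaseq_sample, array_reference))
```

The mapping preserves the rank order of the RNA-seq values while confining them to the array range, which is why it is safer to compress the wider distribution than to stretch the narrower, saturated one.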
