SMAGEXP (Statistical Meta-Analysis for Gene EXPression) for Galaxy is a Galaxy tool suite providing a unified way to carry out meta-analysis of gene expression data while taking the specificities of each data type into account. It handles microarray data from the Gene Expression Omnibus (GEO) database or custom data from Affymetrix microarrays. These data are then combined to carry out a meta-analysis using the metaMA package. SMAGEXP can also combine Next Generation Sequencing (NGS) RNA-seq analyses from DESeq2 results, thanks to the metaRNASeq package. In both cases, key values, independent of the technology type, are reported to judge the quality of the meta-analysis.
SMAGEXP is available on the main Galaxy toolshed.
SMAGEXP dependencies are available through conda, either on the bioconda or the r conda channels.
If you want to install the SMAGEXP dependencies manually, without conda, the following R packages are required.
From bioconductor :
From CRAN :
A dockerized version of Galaxy containing SMAGEXP, based on bgruening's galaxy-stable image, is also available.
First, you need to install Docker. Please follow the very good instructions from the Docker project.
After a successful installation, all you need to do is:
docker run -d -p 8080:80 -p 8021:21 -p 8022:22 sblanck/galaxy-smagexp
If you have already run galaxy-smagexp with Docker and want to fetch the latest Docker image of galaxy-smagexp, type
docker pull sblanck/galaxy-smagexp
docker run -d -p 8080:80 -p 8021:21 -p 8022:22 sblanck/galaxy-smagexp
Then, you just need to open a web browser (Chrome or Firefox are recommended) and type
localhost:8080
into the address bar to access Galaxy running SMAGEXP.
The Galaxy Admin User has the username admin@galaxy.org and the password admin. To use some features of Galaxy, such as importing a history, you have to be logged in with this username and password.
Docker images are "read-only": all changes made inside one session are lost after a restart. This mode is useful to present Galaxy to your colleagues or to run workshops with it. To install Tool Shed repositories or to save your data, you need to export the calculated data to the host computer.
Fortunately, this is as easy as:
docker run -d -p 8080:80 \
-v /home/user/galaxy_storage/:/export/ \
sblanck/galaxy-smagexp
For more information about the parameters and docker usage, please refer to https://github.com/bgruening/docker-galaxy-stable/blob/master/README.md#Usage
SMAGEXP can analyze data from 3 different sources :
SMAGEXP can fetch data directly from the GEO database, thanks to the GEOquery R package.
The inputs for each individual dataset are :
The outputs are :
Example of a .cond file
GSM80460	series of 16 tumors	GSM80460 OSCE-2T SERIES OF 16 TUMORS
GSM80461	series of 16 tumors	GSM80461 OSCE-4T Series of 16 Tumors
GSM80462	series of 16 tumors	GSM80462 OSCE-6T Series of 16 Tumors
GSM80476	series of 4 normals	GSM80476 OSCE-2N Series of 4 Normals
GSM80477	series of 4 normals	GSM80477 OSCE-9N Series of 4 Normals
The .cond file is a text file containing 3 columns, separated by tabs, that summarize the conditions of the experiment.
When extracting data from the GEO database, SMAGEXP automatically generates a .cond file based on the metadata of the experiment.
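The .cond format described above can be checked before it is passed to a tool. The following is a minimal sketch in Python, not part of SMAGEXP: the function names are hypothetical, and the reading of the three columns as sample ID, condition and free-text description is an assumption inferred from the example file.

```python
import csv
import io
from collections import Counter


def read_cond(handle):
    """Read a .cond file: one sample per line, 3 tab-separated columns.

    Assumed column meaning (inferred from the example, not from SMAGEXP):
    sample ID, condition, description.
    """
    rows = []
    for line_no, fields in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        if not fields:  # skip blank lines
            continue
        if len(fields) != 3:
            raise ValueError(f"line {line_no}: expected 3 columns, got {len(fields)}")
        sample, condition, description = (field.strip() for field in fields)
        rows.append((sample, condition, description))
    return rows


def samples_per_condition(rows):
    """Count how many samples are assigned to each condition."""
    return Counter(condition for _, condition, _ in rows)


example = (
    "GSM80460\tseries of 16 tumors\tGSM80460 OSCE-2T SERIES OF 16 TUMORS\n"
    "GSM80476\tseries of 4 normals\tGSM80476 OSCE-2N Series of 4 Normals\n"
)
rows = read_cond(io.StringIO(example))
print(samples_per_condition(rows))
```

A file whose columns were separated by spaces instead of tabs would be rejected by this check, which is the most common mistake when a .cond file is written by hand.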
SMAGEXP handles Affymetrix .CEL files. .CEL files have to be normalized with the QCnormalization tool. This tool normalizes the data and allows the user to check their quality.
The inputs are
The outputs are
The Import custom data tool imports data stored in a tabular text file. A few normalization methods are offered, but the normalization step can be skipped by choosing "none" in the normalization method options. This tool is therefore of special interest when the input dataset has already been normalized.
The inputs are :
The text file first has to be uploaded with Galaxy's Get Data -> Upload File.
Example of the header of an input tabular text file
"" "GSM80460" "GSM80461" "GSM80462" "GSM80463" "GSM80464"
"1007_s_at" -0.0513991525066443 0.306845500314283 0.0854246562526777 -0.142417044615852 0.0854246562526777
"1053_at" -0.187707155126729 -0.488026018218199 -0.282789700980404 0.160920188181103 0.989865622866287
"117_at" 0.814755482809874 -2.15842936260448 -0.125006361067033 -0.256700472111743 0.0114956388378294
"121_at" -0.0558912008920451 -0.0649174766813385 0.49467161164755 -0.0892673380970874 0.113700499164728
"1294_at" 0.961993677420255 -0.0320936297098533 -0.169744675832317 -0.0969617298870879 -0.181149439104566
"1316_at" 0.0454429707611671 0.43616183931445 -0.766111939825723 -0.182786075741673 0.599317793698226
"1405_i_at" 2.23450132056221 0.369606070031838 -1.06190243892591 -0.190997225060914 0.595503660502742
The corresponding .cond file should look like this :
GSM80460	series of 16 tumors	GSM80460 OSCE-2T SERIES OF 16 TUMORS
GSM80461	series of 16 tumors	GSM80461 OSCE-4T Series of 16 Tumors
GSM80462	series of 16 tumors	GSM80462 OSCE-6T Series of 16 Tumors
GSM80476	series of 4 normals	GSM80476 OSCE-2N Series of 4 Normals
GSM80477	series of 4 normals	GSM80477 OSCE-9N Series of 4 Normals
The .cond file is a text file, containing 3 columns, separated by tabs, summarizing the conditions of the experiment.
Although the .cond file is not needed to import data from a custom matrix, it is required in the limma analysis step and has to be created manually by the user.
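Because the .cond file is written by hand in this case, it is easy for its sample names to drift from the column names of the matrix. The sketch below, in Python, compares the two. It is an illustration rather than part of SMAGEXP: the function names are hypothetical, and it assumes that both files are tab-separated and that the first header cell of the matrix is the empty label of the probe-ID column, as in the example header.

```python
import csv
import io


def matrix_samples(handle):
    """Return the sample names found in the header line of the matrix."""
    header = next(csv.reader(handle, delimiter="\t", quotechar='"'))
    # The first header cell is the (empty) label of the probe-ID column.
    return header[1:]


def cond_samples(handle):
    """Return the first column (assumed to be the sample ID) of a .cond file."""
    return [fields[0] for fields in csv.reader(handle, delimiter="\t") if fields]


def compare_samples(matrix_handle, cond_handle):
    """List samples present in one file but absent from the other."""
    in_matrix = matrix_samples(matrix_handle)
    in_cond = cond_samples(cond_handle)
    missing_from_cond = [s for s in in_matrix if s not in in_cond]
    missing_from_matrix = [s for s in in_cond if s not in in_matrix]
    return missing_from_cond, missing_from_matrix


# Illustrative data only: the second GSM80461 line stands for a copy-paste slip.
matrix = '""\t"GSM80460"\t"GSM80461"\t"GSM80462"\n"1007_s_at"\t-0.05\t0.31\t0.09\n'
cond = (
    "GSM80460\tseries of 16 tumors\tGSM80460 OSCE-2T\n"
    "GSM80461\tseries of 16 tumors\tGSM80461 OSCE-4T\n"
    "GSM80461\tseries of 16 tumors\tGSM80462 OSCE-6T\n"
)
print(compare_samples(io.StringIO(matrix), io.StringIO(cond)))
```

Whether a mismatch is fatal depends on the downstream tool; the sketch only reports the differences so they can be reviewed before running the limma analysis.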
The outputs are
The Limma analysis tool performs a single-study analysis of either data previously retrieved from the GEO database or normalized Affymetrix .CEL file data. Given a .cond file, it runs a standard limma differential expression analysis.
The inputs are
The .cond file is either generated by the GEOquery tool or must be created manually by the user for data imported from .CEL files or from a custom matrix.
A .cond file should look like this.
GSM80460	series of 16 tumors	GSM80460 OSCE-2T SERIES OF 16 TUMORS
GSM80461	series of 16 tumors	GSM80461 OSCE-4T Series of 16 Tumors
GSM80462	series of 16 tumors	GSM80462 OSCE-6T Series of 16 Tumors
GSM80476	series of 4 normals	GSM80476 OSCE-2N Series of 4 Normals
GSM80477	series of 4 normals	GSM80477 OSCE-9N Series of 4 Normals
The .cond file is a text file, containing 3 columns, separated by tabs, summarizing the conditions of the experiment.
The outputs are :
Given several .rdata files from the limma analysis tool, the microarray meta-analysis tool runs a meta-analysis using the metaMA R package.
The inputs are :
The outputs are :
Venn Diagram or UpSet diagram (when the number of studies is greater than 3) summarizing the results of the meta-analysis
A list of indicators to evaluate the quality of the performance of the meta-analysis
Fully sortable and searchable table, with gene annotations and hypertext links to the NCBI gene database.
Plots and results generated by the microarray meta-analysis tool
recount2 is an online resource consisting of RNA-seq gene and exon counts as well as coverage bigWig files for 2041 different studies. The recount Galaxy tool wraps the Bioconductor R package recount and fetches gene counts from one experiment.
Input is
Outputs are
recount tool form
Count files retrieved by the recount Galaxy tool can be analyzed with the DESeq2 tool available on the Galaxy toolshed. For more information on how this tool works, see the help section of the tool or refer to the Run DESeq2 section of the Step by step example of a RNA-seq meta-analysis chapter.
The RNA-seq data meta-analysis tool relies on DESeq2 results. It uses the metaRNASeq R package from CRAN.
It outputs a Venn diagram or an UpSet diagram (when the number of studies is greater than 2) and the same indicators as in the microarray meta-analysis tool for both Fisher and inverse normal p-value combinations.
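To make the two p-value combinations mentioned above concrete, here is a minimal sketch in Python (3.8 or later, standard library only) of Fisher's method and of a weighted inverse normal combination for a single gene. It illustrates the general statistical techniques and is not the metaRNASeq implementation; the function names are hypothetical, and the p-values are assumed to lie strictly between 0 and 1.

```python
import math
from statistics import NormalDist


def fisher_combine(pvalues):
    """Fisher's method: X = -2 * sum(ln p) is chi-square with 2k df under H0."""
    k = len(pvalues)
    half_x = -sum(math.log(p) for p in pvalues)  # X / 2
    # Closed-form chi-square survival function for an even number (2k) of df:
    # P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_x / i
        total += term
    return math.exp(-half_x) * total


def inverse_normal_combine(pvalues, weights):
    """Weighted inverse normal (Stouffer-type) combination of one-sided p-values."""
    norm = NormalDist()
    scale = math.sqrt(sum(w * w for w in weights))
    z = sum(w * norm.inv_cdf(1.0 - p) for w, p in zip(weights, pvalues)) / scale
    return 1.0 - norm.cdf(z)


# Hypothetical p-values of one gene in three studies.
pvals = [0.01, 0.20, 0.03]
print(fisher_combine(pvals))
print(inverse_normal_combine(pvals, weights=[1.0, 1.0, 1.0]))
```

With a single study both functions return the input p-value unchanged, which is a quick sanity check. In a real analysis the combined p-values of all genes would then be adjusted for multiple testing before comparison with the FDR threshold.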
The inputs are :
The outputs are :
A list of indicators to evaluate the quality of the performance of the meta-analysis
It also generates a text file containing summarization of the results of each single analysis and meta-analysis. Potential conflicts between single analyses are indicated by zero values in the "signFC" column.
Header of RNA-seq data meta-analysis text results
In order to import histories into Galaxy, you have to be logged in to your Galaxy instance. If you use the dockerized version of Galaxy, the Galaxy Admin User has the username admin@galaxy.org and the password admin.
The full history of this example is available at :
https://github.com/sblanck/smagexp/raw/master/examples/Galaxy-History-Example-of-micro-array-meta-analysis.tar.gz
The .CEL files used in this example are extracted from the GEO dataset GSE13601. To keep the example simple, we picked 6 .CEL files, which can be found here :
https://github.com/sblanck/smagexp/raw/master/examples/GSM342582.CEL
https://github.com/sblanck/smagexp/raw/master/examples/GSM342583.CEL
https://github.com/sblanck/smagexp/raw/master/examples/GSM342584.CEL
https://github.com/sblanck/smagexp/raw/master/examples/GSM342585.CEL
https://github.com/sblanck/smagexp/raw/master/examples/GSM342586.CEL
https://github.com/sblanck/smagexp/raw/master/examples/GSM342587.CEL
We also manually generated a .cond file corresponding to these 6 .CEL files.
https://raw.githubusercontent.com/sblanck/smagexp/master/examples/Celfiles.cond
To easily upload these data on Galaxy, it is possible to load an existing history containing all these data :
https://github.com/sblanck/smagexp/raw/master/examples/Galaxy-History-Example-Data.tar.gz
Download this history to your computer and import it into Galaxy. If you choose to upload these data to Galaxy manually, don't forget to specify the type of each file (.CEL or .cond), as Galaxy won't auto-detect them.
The GSE accession ID is needed (e.g. GSE3524). The log2 transformation is set to auto in this example.
GEOquery tool form
The tool produces
Header of the tabular text file generated by GEOquery tool
Condition file generated by GEOquery tool
The limma analysis tool takes an .rdata and a .cond file as inputs.
Limma analysis tool form
It generates an HTML report with boxplots, a p-value histogram, a volcano plot and a table listing the differentially expressed genes.
Limma analysis tool graphic outputs
Limma analysis tool table output
This table gives access to gene annotation on NCBI and gene ontology websites.
NCBI gene annotations
The QCnormalization tool only needs a list of .CEL files and a normalization method.
QCnormalization tool form
It generates an HTML report showing microarray pseudo-images, boxplots and MA plots for raw and normalized data. It also generates an .rdata file containing the normalized data in an ExpressionSet object for further analysis with limma.
QCnormalization tool (partial) results with microarray pseudo-images, boxplots and MA-plots for raw data
The limma analysis tool takes an .rdata and a .cond file as inputs.
Limma analysis tool form
It generates an HTML report with boxplots, a p-value histogram, a volcano plot and a table listing the differentially expressed genes.
Limma analysis tool graphic outputs
Limma analysis tool table output
The meta-analysis tool only needs the .rdata files produced by the limma analysis tool.
metaMA tool form
The outputs are :
A Venn diagram or an UpSet diagram summarizing the results of the meta-analysis
A list of indicators to evaluate the quality of the performance of the meta-analysis
Fully sortable and searchable table, with gene annotations and hypertext links to the NCBI gene database.
MetaMA tool results
In order to import histories into Galaxy, you have to be logged in to your Galaxy instance. If you use the dockerized version of Galaxy, the Galaxy Admin User has the username admin@galaxy.org and the password admin.
The full history of this example is available at :
https://github.com/sblanck/smagexp/raw/master/examples/Galaxy-History-Example-of-RNA-seq-meta-analysis.tar.gz
Three datasets from the recount database are used in this example :
The first step is to fetch raw count data from recount. The Galaxy recount tool wraps the recount Bioconductor R package. It only needs the accession ID of the experiment.
Recount tool form
The recount tool generates one count file per sample of the experiment, in order to be analysed with DESeq2.
Example of header of a count file generated by recount tool
In this example 17 count files are generated
The DESeq2 tool is available on the Galaxy toolshed. It takes the count files generated by the recount tool as inputs. It also exposes other DESeq2 parameters (see the DESeq2 tool help section for more information). In this example we keep the 6 invasive lung cancer samples to compare with the 5 normal samples.
DESeq2 form
It generates a pdf report and a tabular text results file.
DESeq2 results header
We perform the same kind of analysis on the second recount dataset (SRP028180).
Recount tool form
In this example 24 count files are generated
In this example we keep the 10 tumor samples to compare with the 7 normal samples.
DESeq2 tool form
Finally, we perform the third analysis on the third recount dataset (SRP058237).
Recount tool form
In this example 17 count files are generated
In this example we keep the 7 tumor samples to compare with the 10 adjacent samples.
DESeq2 tool form
The metaRNASeq tool takes several results from the DESeq2 tool and performs a meta-analysis. It requires the text results files from DESeq2 and the number of replicates of each analysis. In this example we have 17, 17, and 11 replicates for the 3 analyses. It also requires an FDR threshold for genes to be declared differentially expressed (default is 0.05).
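The number of replicates matters because, as we understand the inverse normal method described for metaRNASeq, each study is weighted by the square root of its share of the total number of replicates. This is an assumption to check against the package documentation. A minimal Python sketch of that weighting, with a hypothetical function name:

```python
import math


def replicate_weights(nrep):
    """Weight each study by sqrt(replicates in study / total replicates).

    Assumed weighting scheme, see the lead-in; the squared weights sum to 1.
    """
    total = sum(nrep)
    return [math.sqrt(n / total) for n in nrep]


# Replicate counts quoted in this example.
weights = replicate_weights([17, 17, 11])
print([round(w, 3) for w in weights])
print(sum(w * w for w in weights))
```

Under this scheme the two studies with 17 replicates receive the same weight and the study with 11 replicates a smaller one, so larger studies contribute more to the combined statistic.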
MetaRNAseq tool form
The tool outputs 2 datasets :
UpSet diagram and statistical indicators of the meta-analysis
Header of the text file generated by the metaRNAseq tool
It summarizes the results of each single analysis and meta-analysis. Potential conflicts between single analyses are indicated by zero values in the "signFC" column.
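Once the results file is downloaded from Galaxy, the conflicting genes can be listed with a few lines of Python. This is a minimal sketch under stated assumptions: the file is tab-separated with one header name per column, and the only column name taken from this document is "signFC". The "gene" column and the function name are hypothetical.

```python
import csv
import io


def conflicting_rows(handle, column="signFC"):
    """Return the rows whose signFC value is zero (conflict between studies)."""
    reader = csv.DictReader(handle, delimiter="\t")
    if column not in (reader.fieldnames or []):
        raise KeyError(f"no {column!r} column in header")
    conflicts = []
    for row in reader:
        try:
            value = float(row[column])
        except (TypeError, ValueError):
            continue  # skip missing or non-numeric entries such as "NA"
        if value == 0:
            conflicts.append(row)
    return conflicts


# Illustrative data only; the real file contains more columns.
example = "gene\tsignFC\nA\t1\nB\t0\nC\t-1\nD\tNA\n"
for row in conflicting_rows(io.StringIO(example)):
    print(row["gene"])
```

For a real file, the `io.StringIO(example)` handle would be replaced by `open(path, newline="")`, where `path` points to the downloaded text file.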