Rapid and scalable correlation estimation for compositional data.
FastSpar is a C++ implementation of the SparCC algorithm which is up to several thousand times faster than the original Python 2 release and uses much less memory. The FastSpar implementation provides threading support and a p-value estimator which accounts for the possibility of repeated data permutations (see this paper for further details).
If you use this tool, please cite the FastSpar paper and the original SparCC paper:
The pre-compiled static binaries have no requirements on 64-bit Linux distributions. Otherwise, several libraries are required for building and running dynamically linked binaries. For further information, see Compiling from source.
FastSpar can be installed using conda or from source.
To install through conda, use:
conda install -c bioconda -c conda-forge fastspar
Compiling from source requires these libraries and software:
C++11 (gcc-4.9.0+, clang-4.9.0+, etc)
OpenMP 4.0+
Gfortran
Armadillo 6.7+
LAPACK
OpenBLAS
GNU Scientific Library 2.1+
GNU getopt
GNU make
GNU autoconf
GNU autoconf-archive
These dependencies can be installed with the following packages on Ubuntu 20.04:
build-essential
gfortran
dh-autoreconf
libarmadillo-dev
libopenblas-openmp-dev
libgsl-dev
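On Ubuntu these would typically be installed with `apt-get`. The sketch below only prints the command so that it can be reviewed first; removing the leading `echo` would run it, which needs root privileges. The package names are exactly those listed above, while the use of `sudo` and the `--yes` flag are assumptions about the local setup.

```shell
# Packages listed above for Ubuntu 20.04
packages="build-essential gfortran dh-autoreconf libarmadillo-dev libopenblas-openmp-dev libgsl-dev"

# Build and print the install command for review; run it by hand once checked
# (assumes sudo is available and the package index is up to date)
install_cmd="sudo apt-get install --yes $packages"
echo "$install_cmd"
```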
After meeting the above requirements, FastSpar can be compiled and installed from source with:
git clone https://github.com/scwatts/fastspar.git
cd fastspar
./autogen.sh
./configure --prefix=/usr/
make
make install
Once completed, the FastSpar executables can be run from the command line.
To run FastSpar, you must have absolute OTU counts in a BIOM TSV format file (with no metadata). The fake_data.tsv file (from the original SparCC implementation) will be used as an example:
fastspar --otu_table tests/data/fake_data.tsv --correlation median_correlation.tsv --covariance median_covariance.tsv
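For orientation, the sketch below writes a tiny table in the layout that a BIOM TSV export without metadata generally has: a tab-separated header row beginning with `#OTU ID` followed by sample names, then one row per OTU holding integer counts. The file name `toy_otu_table.tsv`, the OTU and sample names, and the counts are all made up for illustration, and the `check_otu_table` helper is a rough sanity test that is not part of FastSpar.

```shell
# Write a toy tab-separated OTU table (hypothetical names and counts)
otu_table="toy_otu_table.tsv"
printf '#OTU ID\tsample_1\tsample_2\tsample_3\n' > "$otu_table"
printf 'otu_1\t12\t0\t7\n' >> "$otu_table"
printf 'otu_2\t3\t25\t1\n' >> "$otu_table"
printf 'otu_3\t0\t4\t9\n' >> "$otu_table"

# Rough sanity check: every row has as many fields as the header,
# and every count is a non-negative integer (absolute counts, not fractions)
check_otu_table() {
    awk -F '\t' '
        BEGIN { bad = 0 }
        NR == 1 { ncol = NF; next }
        NF != ncol { print "line " NR ": expected " ncol " fields, got " NF; bad = 1 }
        {
            for (i = 2; i <= NF; i++)
                if ($i !~ /^[0-9]+$/) { print "line " NR ": non-integer count " $i; bad = 1 }
        }
        END { exit bad }
    ' "$1"
}

check_otu_table "$otu_table" && echo "table looks well formed"
```

A table that passes this check is not guaranteed to be accepted by FastSpar, but relative abundances or ragged rows will be caught early.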
The number of iterations (rounds of SparCC correlation estimation) and exclusion iterations (the number of times highly correlated OTU pairs are discovered and excluded) can also be tweaked:
fastspar --iterations 50 --exclude_iterations 20 --otu_table tests/data/fake_data.tsv --correlation median_correlation.tsv --covariance median_covariance.tsv
Further, the minimum threshold to exclude correlated OTU pairs can be increased:
fastspar --threshold 0.2 --otu_table tests/data/fake_data.tsv --correlation median_correlation.tsv --covariance median_covariance.tsv
There are several methods to calculate p-values for inferred correlations. Here we have elected to use a robust permutation-based approach. This process involves inferring correlations from random permutations of the original OTU count data. The magnitude of each p-value is related to how often a more extreme correlation is observed for randomly permuted data. In the example below, we calculate p-values from 1000 bootstrap correlations.
First we generate the 1000 bootstrap counts:
mkdir bootstrap_counts
fastspar_bootstrap --otu_table tests/data/fake_data.tsv --number 1000 --prefix bootstrap_counts/fake_data
And then infer correlations for each bootstrap count (here using GNU parallel to run on all available processors):
mkdir bootstrap_correlation
parallel fastspar --otu_table {} --correlation bootstrap_correlation/cor_{/} --covariance bootstrap_correlation/cov_{/} -i 5 ::: bootstrap_counts/*
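Before moving on, it may be worth confirming that one correlation file was written per bootstrap table, since the p-value step is told how many permutations to expect. A minimal sketch, assuming the directory names and the `cor_` prefix used in the commands above; the helper `count_files` is hypothetical and not part of FastSpar.

```shell
# Count regular files in a directory whose names start with a given prefix
count_files() {
    find "$1" -maxdepth 1 -type f -name "$2*" | wc -l | tr -d ' '
}

# Compare bootstrap tables against inferred correlation files, if both directories exist
if [ -d bootstrap_counts ] && [ -d bootstrap_correlation ]; then
    n_counts=$(count_files bootstrap_counts fake_data)
    n_cor=$(count_files bootstrap_correlation cor_fake_data)
    if [ "$n_counts" -eq "$n_cor" ]; then
        echo "all $n_counts bootstrap tables have a correlation file"
    else
        echo "expected $n_counts correlation files, found $n_cor" >&2
    fi
fi
```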
From these correlations, the p-values are then calculated:
fastspar_pvalues --otu_table tests/data/fake_data.tsv --correlation median_correlation.tsv --prefix bootstrap_correlation/cor_fake_data_ --permutations 1000 --outfile pvalues.tsv
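To make the permutation logic concrete, the sketch below computes a simple two-sided permutation p-value for a single OTU pair from made-up numbers. It counts how many permuted correlations are at least as extreme as the observed one (b out of m) and applies the common estimator (b + 1) / (m + 1). This is an illustration only: `fastspar_pvalues` works on whole correlation matrices, and its estimator additionally accounts for repeated permutations, so its values may differ. The function name `perm_pvalue` and all input values are hypothetical.

```shell
# perm_pvalue OBSERVED PERMUTED...
# Two-sided permutation p-value for one observed correlation
perm_pvalue() {
    observed="$1"
    shift
    printf '%s\n' "$@" | awk -v obs="$observed" '
        function abs(x) { return x < 0 ? -x : x }
        # m counts permutations; b counts those at least as extreme as the observation
        { m++; if (abs($1) >= abs(obs)) b++ }
        END { printf "%.4f\n", (b + 1) / (m + 1) }
    '
}

# Observed correlation of 0.62 against nine permuted correlations (made-up values)
perm_pvalue 0.62 0.10 -0.35 0.71 0.05 -0.20 0.44 -0.66 0.15 0.02
```

With these inputs two of the nine permuted values (0.71 and -0.66) are at least as extreme as 0.62, giving (2 + 1) / (9 + 1) = 0.3. Adding one to both counts keeps the estimate from ever being exactly zero, which is why more permutations are needed to resolve smaller p-values.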
If FastSpar is compiled with OpenMP, threading can be used by passing --threads <thread_number> at the command line for several tools:
fastspar --otu_table tests/data/fake_data.tsv --correlation median_correlation.tsv --covariance median_covariance.tsv --iterations 50 --threads 10
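The thread count can also be derived from the machine rather than hard-coded. A small sketch, assuming either `nproc` (GNU coreutils) or `getconf _NPROCESSORS_ONLN` is available; the variable name `threads` is arbitrary, and the command is printed rather than executed so that it can be inspected first.

```shell
# Detect the number of available processors, falling back to 1
threads=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)

# Print the resulting command; remove `echo` to run it
echo fastspar --otu_table tests/data/fake_data.tsv \
    --correlation median_correlation.tsv \
    --covariance median_covariance.tsv \
    --iterations 50 --threads "$threads"
```

On shared machines it may be preferable to request fewer threads than the number of processors reported.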
statmod::permp