Sampling Noise based Inference of Transcription ActivitY: Filtering of Poisson noise on a single-cell RNA-seq UMI count matrix
Single-cell RNA sequencing normalization algorithm presented in the publication Bayesian inference of gene expression states from single-cell RNA-seq data - J Breda, M Zavolan, E van Nimwegen - Nature Biotechnology, 2021.
Sanity infers the log expression levels xgc of gene g in cell c by filtering out the Poisson noise on the UMI count matrix ngc of gene g in cell c.
The raw UMI count and normalized datasets mentioned in the benchmarking of the associated publication are available on . Files are named [dataset name]_UMI_counts.txt.gz and [dataset name]_[tool name]_normalization.txt.gz.
The scripts used for running the benchmarked normalization methods and for making the figures of the preprint are in the reproducibility folder.
Input transcript count text file ('path/to/text_file'):

GeneID | Cell 1 | Cell 2 | Cell 3 | ... |
---|---|---|---|---|
Gene 1 | 1.0 | 2.0 | 0.0 | |
Gene 2 | 6.0 | 3.0 | 1.0 | |
... |
(Alternatively) Matrix Market File Format: sparse matrix of UMI counts. Automatically recognized by the .mtx extension of the input file. Named matrix.mtx by cellranger 2.1.0 and 3.1.0 (10x Genomics). ('path/to/text_file.mtx')
Gene name text file (only needed with a .mtx input file): named genes.tsv by cellranger 2.1.0 and features.tsv by cellranger 3.1.0 (10x Genomics). ('path/to/text_file')
Cell name text file (only needed with a .mtx input file): named barcodes.tsv by cellranger 2.1.0 and 3.1.0 (10x Genomics). ('path/to/text_file')
(optional) Destination folder ('path/to/output/folder', default: pwd)
(optional) Number of threads (integer, default: 4)
(optional) Print extended output (Boolean, 'true', 'false', '1' or '0', default: false)
(optional) Minimal and maximal considered values of the variance in log transcription quotients (double, default: vmin=0.001, vmax=50)
(optional) Number of bins for the variance in log transcription quotients (integer, default: 160)
(optional) Option to skip cell size normalization (Boolean, 'true', 'false', '1' or '0', default: false)
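
As an illustration of the text input layout described above (a GeneID header row followed by one row of counts per gene), the following minimal Python sketch writes a small table. The helper name `write_umi_table` is hypothetical, and the tab delimiter is an assumption made for this example, not something stated in this section.

```python
import csv
import io


def write_umi_table(handle, gene_ids, cell_ids, counts, delimiter="\t"):
    """Write a genes-by-cells UMI count table with a GeneID header row.

    The delimiter is an assumption of this sketch (tab by default).
    """
    writer = csv.writer(handle, delimiter=delimiter, lineterminator="\n")
    writer.writerow(["GeneID", *cell_ids])
    for gene, row in zip(gene_ids, counts):
        if len(row) != len(cell_ids):
            raise ValueError(
                f"{gene}: expected {len(cell_ids)} counts, got {len(row)}"
            )
        writer.writerow([gene, *row])


# Reproduce the two example rows from the table above in an in-memory buffer.
buffer = io.StringIO()
write_umi_table(
    buffer,
    gene_ids=["Gene 1", "Gene 2"],
    cell_ids=["Cell 1", "Cell 2", "Cell 3"],
    counts=[[1.0, 2.0, 0.0], [6.0, 3.0, 1.0]],
)
table_text = buffer.getvalue()
print(table_text)
```

To produce a file on disk, pass an open file object (for example `open("counts.txt", "w", newline="")`) instead of the in-memory buffer.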
log_transcription_quotients.txt: This file contains the estimated values of the log-transcription quotients (LTQs) for each gene in each cell. The LTQ xgc of gene g in cell c corresponds to the estimated logarithm of the fraction of mRNAs in cell c that belong to gene g. The LTQs are thus normalized such that Σg exp(xgc) = 1 for each cell c. In order to get an estimate of the number of mRNAs for gene g in cell c one would thus need to multiply exp(xgc) by the estimated total number of mRNAs M in the cell.
GeneID | Cell 1 | Cell 2 | Cell 3 | ... |
---|---|---|---|---|
Gene 1 | -13.7227 | -13.722 | -13.729 | |
Gene 2 | -9.96744 | -10.2522 | -10.1453 | |
... |
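
The conversion from an LTQ to an estimated mRNA number described above can be sketched in Python as follows. This is only an illustration: the function names are hypothetical, the tab delimiter of the output file is an assumption, and the total mRNA number M used here is a made-up value, since Sanity does not provide it.

```python
import math


def read_ltq_table(lines, delimiter="\t"):
    """Parse a table with a header row and one row of LTQs per gene.

    The delimiter is an assumption of this sketch.
    """
    rows = [line.rstrip("\n").split(delimiter) for line in lines if line.strip()]
    cells = rows[0][1:]
    ltq = {row[0]: [float(value) for value in row[1:]] for row in rows[1:]}
    return cells, ltq


def estimated_mrna(ltq_value, total_mrna):
    """Multiply the expression fraction exp(x_gc) by the total mRNA number M."""
    return total_mrna * math.exp(ltq_value)


# Two genes and two cells taken from the example table above.
example = [
    "GeneID\tCell 1\tCell 2\n",
    "Gene 1\t-13.7227\t-13.722\n",
    "Gene 2\t-9.96744\t-10.2522\n",
]
cells, ltq = read_ltq_table(example)

total_mrna = 200_000  # hypothetical M, chosen only for illustration
estimate = estimated_mrna(ltq["Gene 2"][0], total_mrna)
print(f"Gene 2 in {cells[0]}: about {estimate:.1f} mRNAs")
```

Note that the normalization Σg exp(xgc) = 1 holds over all genes of a cell, so it cannot be checked on a two-gene excerpt like this one.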
ltq_error_bars.txt : Table with the error-bars on the estimates of the LTQs xgc for each gene g in each cell c.
GeneID | Cell 1 | Cell 2 | Cell 3 | ... |
---|---|---|---|---|
Gene 1 | 0.630111 | 0.630198 | 0.624802 | |
Gene 2 | 0.315551 | 0.325912 | 0.301861 | |
... |
likelihood.txt : This file encodes the posterior distribution of each gene’s true variance in log-expression. For the numerical calculation of this distribution, the variance is a priori assumed to lie in the range [vmin,vmax] and is discretized into Nb bins uniformly on a logarithmic scale. The file contains the matrix with posterior values Pgb for each gene g and each bin b.
Variance | 0.01 | 0.0107 | 0.0114 | ... |
---|---|---|---|---|
Gene 1 | 0.018 | 0.019 | 0.020 | |
Gene 2 | 0.0006 | 0.0051 | 0.0031 | |
... |
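
A discretization that is uniform on a logarithmic scale can be sketched as below, together with a posterior-mean variance computed from one row of Pgb. This is a minimal illustration under assumptions: the function names are hypothetical, the grid is assumed to include both vmin and vmax as its first and last values, and the row is renormalized in case it does not sum to one. In practice the variance values can be read directly from the first row of likelihood.txt.

```python
import math


def log_uniform_grid(vmin, vmax, n_bins):
    """Return n_bins values from vmin to vmax, evenly spaced in log(variance).

    Including both endpoints is an assumption of this sketch.
    """
    step = (math.log(vmax) - math.log(vmin)) / (n_bins - 1)
    return [math.exp(math.log(vmin) + b * step) for b in range(n_bins)]


def posterior_mean_variance(grid, posterior_row):
    """Average the variance grid, weighted by one gene's posterior values."""
    total = sum(posterior_row)
    return sum(v * p for v, p in zip(grid, posterior_row)) / total


# Grid for the default options: vmin=0.001, vmax=50, 160 bins.
grid = log_uniform_grid(0.001, 50.0, 160)
ratio = grid[1] / grid[0]
print(f"{len(grid)} bins, ratio between consecutive values: {ratio:.4f}")
```

With the default options, consecutive grid values differ by a constant factor of about 1.07, which matches the spacing of the variance values in the example header above.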
./Sanity <option(s)> SOURCES
Options:
-h,--help Show this help message
-v,--version Show the current version
-f,--file Specify the input transcript count text file (.mtx for Matrix Market File Format)
-mtx_genes,--mtx_gene_name_file Specify the gene name text file (only needed if .mtx input file)
-mtx_cells,--mtx_cell_name_file Specify the cell name text file (only needed if .mtx input file)
-d,--destination Specify the destination path (default: pwd)
-n,--n_threads Specify the number of threads to be used (default: 4)
-e,--extended_output Option to print extended output (default: false, choice: false,0,true,1)
-vmin,--variance_min Minimal value of variance in log transcription quotient (default: 0.001)
-vmax,--variance_max Maximal value of variance in log transcription quotient (default: 50)
-nbin,--number_of_bins Number of bins for the variance in log transcription quotient (default: 160)
-no_norm,--no_cell_size_normalization Option to skip cell size normalization (default: false, choice: false,0,true,1)
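
For scripted runs, the options above can be assembled into an argument list. The following Python sketch only builds the command and does not execute it. The helper name `sanity_command` and the file names are hypothetical; the flags are the ones listed in the help text above, and the path to the binary is an assumption.

```python
def sanity_command(
    count_file,
    destination=None,
    n_threads=4,
    extended_output=False,
    genes_file=None,
    cells_file=None,
    binary="./Sanity",  # assumed location of the compiled binary
):
    """Build the argument list for a Sanity run from the documented flags."""
    cmd = [binary, "-f", count_file]
    if count_file.endswith(".mtx"):
        # Gene and cell name files are only needed with a .mtx input file.
        if genes_file is None or cells_file is None:
            raise ValueError(".mtx input needs both a gene and a cell name file")
        cmd += ["-mtx_genes", genes_file, "-mtx_cells", cells_file]
    if destination is not None:
        cmd += ["-d", destination]
    cmd += ["-n", str(n_threads)]
    if extended_output:
        cmd += ["-e", "1"]
    return cmd


cmd = sanity_command("counts.txt", destination="out", extended_output=True)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the binary has been compiled.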
git clone https://github.com/jmbreda/Sanity.git
Install OpenMP library
On Linux
If not already installed (check with ldconfig -p | grep libgomp, no output if not installed), do
sudo apt-get update
sudo apt-get install libgomp1
On macOS using MacPorts
Install the gcc9 package
port install gcc9
Change the first line of src/Makefile from CC=g++ to CC=g++-mp-9
On macOS using Homebrew
Install the gcc9 package
brew install gcc9
Change the first line of src/Makefile from CC=g++ to CC=g++-9
cd Sanity/src
make
Sanity/bin/Sanity
Sanity/bin/Sanity_macOS
Compute cell-cell distances from Sanity output files. Needs the extended output of Sanity (-e 1 option).
The output folder of the Sanity run, specified with the -d option in Sanity ('path/to/folder')
(optional) The gene signal-to-noise ratio used as gene cut-off (double, default: 1.0)
(optional) Compute distances with or without error bars (Boolean, default: 1 or true)
(optional) Number of threads (integer, default: 4)
Cell-cell distance: (Nc(Nc-1)/2) vector of cell to cell distances dist(celli,cellj), i=1,...,Nc-1, j=i+1,...,Nc, with Nc the number of cells. |
---|
dist(cell1,cell2) |
dist(cell1,cell3) |
dist(cell1,cell4) |
... |
dist(cellNc-2,cellNc-1) |
dist(cellNc-2,cellNc) |
dist(cellNc-1,cellNc) |
The file is located in the Sanity output folder (specified with the -f option) and is named cell_cell_distance_[...].txt, depending on the -err and -s2n options.
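
The ordering of the distance vector described above (i=1,...,Nc-1, j=i+1,...,Nc) can be unpacked into a symmetric Nc x Nc matrix. The following Python sketch illustrates this with hypothetical helper names; it assumes the distances have already been read into a list of floats, one per line of the file.

```python
import math


def n_cells_from_length(length):
    """Recover Nc from the vector length Nc(Nc-1)/2."""
    n = (1 + math.isqrt(1 + 8 * length)) // 2
    if n * (n - 1) // 2 != length:
        raise ValueError(f"{length} is not of the form Nc(Nc-1)/2")
    return n


def condensed_to_square(distances):
    """Expand the pairwise distance vector into a symmetric matrix."""
    n_cells = n_cells_from_length(len(distances))
    square = [[0.0] * n_cells for _ in range(n_cells)]
    k = 0
    for i in range(n_cells - 1):
        for j in range(i + 1, n_cells):
            # Entry k holds dist(cell i+1, cell j+1) in the 1-based notation above.
            square[i][j] = square[j][i] = distances[k]
            k += 1
    return square


# Three cells: dist(cell1,cell2), dist(cell1,cell3), dist(cell2,cell3).
square = condensed_to_square([1.0, 2.0, 3.0])
for row in square:
    print(row)
```

The diagonal is set to zero, since the file only stores distances between distinct cells.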
./Sanity_distance <option(s)> SOURCES
Options:
-h,--help Show this help message
-v,--version Show the current version
-f,--folder Specify the input folder with extended output from Sanity
-s2n,--signal_to_noise_cutoff Minimal signal/noise of genes to include in the distance calculation (default: 1.0)
-err,--with_error_bars Compute cell-cell distance taking the error bar epsilon into account (default: true)
-n,--n_threads Specify the number of threads to be used (default: 4)
Same dependencies as Sanity (see above).
Move to the source code directory and compile.
cd Sanity/src
make Sanity_distance
The binary file is located in
Sanity/bin/Sanity_distance
For any questions or assistance regarding Sanity, please post your question in the issues section.