
SPLASH 2

Introduction

SPLASH is an unsupervised, reference-free unifying framework for discovering regulated sequence variation through statistical analysis of k-mer composition in both DNA and RNA sequencing data. SPLASH leverages our observation that detecting sample-regulated sequence variation, such as alternative splicing, RNA editing, gene fusions, V(D)J recombination, transposable element mobilization, allele-specific splicing, genetic variation in a population, and many other regulated events, can be unified, in theory and in practice. This is achieved with a simple model that analyzes the k-mer composition of raw sequencing reads (Chaung et al. 2022). SPLASH finds constant sequences (anchors) that are followed by a set of sequences (targets) exhibiting sample-specific variation, and it provides valid p-values. Because SPLASH is reference-free, it sidesteps the computational challenges associated with alignment, making it significantly faster and more efficient than alignment-based approaches and enabling discovery and statistical precision not currently available, even from pseudo-alignment.

The first version of the SPLASH pipeline proved its usefulness. It was implemented mainly in Python, with the use of Nextflow. Here we provide a new and improved implementation based on C++ and Python (Kokot et al. 2023). This new version is much more efficient and allows for the analysis of datasets larger than 1 TB in hours on a workstation or even a laptop.

How does it work

A key concept of SPLASH is the analysis of the composition of pairs of substrings, anchor and target, across many samples. The two substrings can be adjacent in reads or can be separated by a gap.
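As an illustration, the following Python sketch (not SPLASH's implementation; k and gap are illustrative parameters) enumerates (anchor, target) pairs from a single read:

# Sketch: enumerate (anchor, target) substring pairs from one read.
# k and gap are illustrative; SPLASH exposes its own switches for these.
def anchor_target_pairs(read, k=27, gap=0):
    step = k + gap  # offset from anchor start to target start
    for i in range(len(read) - step - k + 1):
        yield read[i:i + k], read[i + step:i + step + k]

for anchor, target in anchor_target_pairs("ACGTACGTACGTACG", k=5, gap=2):
    print(anchor, target)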

The figure below presents the SPLASH pipeline at a high level. [figure: high-level overview of the SPLASH pipeline]

Installation

Precompiled binaries

The easiest way to get SPLASH is to use a precompiled release. To get version 2.3.0 and run the example, it is sufficient to do the following:

curl -L https://github.com/refresh-bio/SPLASH/releases/download/v2.3.0/splash-2.3.0.linux.x64.tar.gz | tar xz
cd example
./run-example.sh

Docker container

It is possible to run splash using the Docker container available through GitHub Packages (https://github.com/orgs/refresh-bio/packages/container/package/splash).

To pull the image run:

sudo docker pull ghcr.io/refresh-bio/splash:2.3.0 # replace version number if needed

An example of how to run splash with Docker is shown below.
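A minimal sketch of such an invocation (the mount point and working directory are illustrative; adjust the paths to your data):

sudo docker run -v $(pwd):/data -w /data ghcr.io/refresh-bio/splash:2.3.0 splash input.txt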

Singularity container

Sometimes sudo is unavailable (for example, on HPC systems). In such cases, the Docker container may be converted into a Singularity container (https://docs.sylabs.io/guides/latest/user-guide/).

To pull the singularity version of splash use:

singularity pull docker://ghcr.io/refresh-bio/splash:2.3.0 # replace version as needed

This will result in a splash_2.3.0.sif file being created in the current directory. To execute splash using this file, run:

./splash_2.3.0.sif splash input.txt

It is also possible to run without pulling first:

singularity run docker://ghcr.io/refresh-bio/splash:2.3.0 splash input.txt

Depending on your configuration, it may be necessary to set up bind mounts for Singularity (the -B parameter).
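For example (a sketch; /data stands in for an actual host path containing input.txt and the samples):

singularity run -B /data:/data docker://ghcr.io/refresh-bio/splash:2.3.0 splash /data/input.txt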

Compile from sources

SPLASH is implemented as several applications written in C++ together with a Python wrapper that runs the whole pipeline. Currently, the software may be used only under Linux. A compiler supporting C++17 is needed to compile the code. You can use the following snippet to compile SPLASH.

git clone https://github.com/refresh-bio/splash
cd splash
make -j

After this, splash may be run as follows:

bin/splash # this will print help

Install the compiled splash

The simplest way of installing splash after compilation is to run:

sudo make install

If the PREFIX environment variable is not defined, the above will install splash in /usr/local/bin, which requires sudo. It is possible to override the install location so that sudo is not required. For example, to install into the user's home directory, one may run:

make install PREFIX=~/splash

or

export PREFIX=~/splash
make install

Uninstall splash

To uninstall splash one may run:

sudo make uninstall

The same PREFIX should be used as for installation (depending on its value, sudo may not be required).

Running the example

To verify the installation on a small example, one may perform the following:

cd example
./download.py # download example data
splash input.txt # run the pipeline with default parameters

The result consists of two TSV files, namely:

  1. result.after_correction.all_anchors.tsv
  2. result.after_correction.scores.tsv

The first file contains all unfiltered anchors found by the pipeline. The second file contains only anchors whose corrected p-value is below 0.05.

Inputs

There are many parameters allowing customization of the pipeline. They can be grouped into several categories. The parameters are displayed when running splash without arguments (or with --help).

Input reads parameters:

Anchor filtering parameters:

Statistical significance parameters:

Reporting parameters:

Directory parameters:

High Performance Computing parameters:

Optimization parameters:

10X SC-RNAseq parameters:

Supervised testing parameters:

Understanding the output

The resulting TSV files contain the following columns:

anchor: the anchor sequence.

pval_opt: p-value from alternating maximization. The number of iterations and other parameters can be configured with switches; the optimization formulation and statistical exposition are given in (Baharav et al. 2023).

effect_size_bin: measure of the anchor's effect size, bounded in [0,1]; indicates how well the data is divided between two groups of columns. 0 indicates no difference between the groups; 1 indicates two groups of columns with disjoint row supports (Baharav et al. 2023).

pval_asymp_opt: asymptotically valid p-value based on alternating maximization. Computes the optimizing c and f used for pval_opt and evaluates the approximate p-value given by asymptotic normality (Baharav et al. 2023).

pval_base: p-value based on random partitionings. Computes a base p-value (using num_rand_cf random c,f pairs), with a Bonferroni correction to yield a valid p-value.

M: anchor count, i.e., the number of occurrences of the anchor in the input (the sum of the elements of the contingency table).

anch_uniqTargs: the number of unique targets for the anchor (the number of rows of the contingency table).

number_nonzero_samples: the number of samples containing the anchor (the number of columns of the contingency table).

target_entropy: measure of the diversity of the target distribution. Compute the empirical target distribution $(p_1,\ldots,p_T)$ over the $T$ observed targets by summing counts across all samples and normalizing, and output the entropy of this distribution, $H = -\sum_i p_i \log_2 (p_i)$.

avg_hamming_distance_max_target: average Hamming distance to the most abundant target. For each observed target (weighted by count), compute the Hamming distance between it and the most abundant target (highest count), and average over all targets. Useful for identifying/filtering SNPs (for which this measure will be low).

avg_hamming_distance_all_pairs: average Hamming distance between all pairs of targets. For the M counts in the matrix, compute the Hamming distance between all pairs of targets and output the average.

avg_edit_distance_max_target: average Levenshtein distance to the most abundant target. Same as avg_hamming_distance_max_target, but with Levenshtein instead of Hamming distance. If an anchor has a high avg_hamming_distance_max_target but a low avg_edit_distance_max_target, the discrepancy between the top targets is likely due to an insertion or deletion; if the two quantities are similar, the discrepancy is most likely due to a SNP.

avg_edit_distance_all_pairs: average Levenshtein distance between all pairs of targets. Same as avg_hamming_distance_all_pairs, but with Levenshtein instead of Hamming distance.

pval_opt_corrected: Benjamini-Yekutieli-corrected pval_opt. Only present in the *.scores.tsv file.
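As an illustration of two of these statistics, the following Python sketch (with hypothetical target counts; not part of the SPLASH code base) computes target_entropy and avg_hamming_distance_max_target for a single anchor:

import math

# Hypothetical summed target counts for one anchor (equal-length targets).
counts = {"ACGT": 10, "ACGA": 5, "TCGT": 1}
total = sum(counts.values())

# target_entropy: entropy of the empirical target distribution.
entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())

# avg_hamming_distance_max_target: count-weighted mean Hamming distance
# from each target to the most abundant target.
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

top = max(counts, key=counts.get)
avg_ham = sum(c * hamming(t, top) for t, c in counts.items()) / total
print(entropy, avg_ham)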

Input format

In the example, the input.txt file was used. This file defines the set of input samples for the algorithm. Its format is one sample per line; each line should contain the name of a sample and, after a space, the path to the input sample.

Important note: if a relative path is specified, it is relative to the current working directory, not the directory of input.txt.
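For example, an input.txt for two samples could look as follows (sample names and paths are illustrative):

sample1 data/sample1.fastq.gz
sample2 data/sample2.fastq.gz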

Example Applications

Given the broad applications of SPLASH, we provide, in the analysis_notebooks folder, downstream analyses showing how to interpret SPLASH results for a few major applications in which SPLASH has been used so far. Note that SPLASH is quite general and continues to be applied to new genomics problems.

Splicing analysis for RNA-Seq

The notebook (analysis_notebooks/SPLASH_splicing_analysis_notebook.Rmd) provides detailed step-by-step instructions on how SPLASH results can be interpreted for an alternative splicing analysis. The reference genome in this example is human, but it could be replaced with any other organism, with any quality of transcriptome annotation.

Additional output

Most frequent targets per each anchor

By default, SPLASH stores the 2 most frequent targets for each anchor in the resulting TSV files. This should be sufficient for splicing, but for RNA editing/mismatches 4 may be a better choice. It may be set with the --n_most_freq_targets switch. If the number of targets for a given anchor is lower than the specified value, a single '-' is reported for each missing target.
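For instance, to report up to 4 most frequent targets per anchor (a sketch; combine with your other options as needed):

splash --n_most_freq_targets 4 input.txt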

SATC format

SPLASH stores intermediate and optional output files in SATC format (Sample Anchor Target Count).

Sample representation

A unique id is assigned to each sample. The ids are consecutive numbers starting from 0: the first sample in the input file gets id 0, the second gets 1, and so on. By default, SPLASH stores the mapping in the sample_name_to_id.mapping.txt file, but this may be redefined with the --sample_name_to_id parameter. The mapping file may be useful for accessing data stored in SATC format.
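A minimal Python sketch for loading this mapping (the exact file layout is an assumption here, taken as one whitespace-separated name/id pair per line; check the generated file):

# Sketch: load the sample mapping; layout assumed to be "name id" per line.
id_to_name = {}
with open("sample_name_to_id.mapping.txt") as f:
    for line in f:
        name, sample_id = line.split()
        id_to_name[int(sample_id)] = name
print(id_to_name)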

Output sample, anchor, target, count

Textual

When the --dump_sample_anchor_target_count_txt switch is used, there will be an additional output directory (named result_dump by default; the result part may be redefined with the --outname_prefix switch). This directory will contain a number of files equal to the number of bins (default 128; may be redefined with the --n_bins switch). The extension of these files is .satc.dump. Each line of these files is a tab-separated list of sample, anchor, target, and count. This is the easiest way to reproduce the contingency tables used during computation. Each file contains some number of anchors, but it is assured that any given anchor is present in only a single file. Since these text files can be large, it may be preferable to use the binary (SATC) files instead.
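For example, a contingency table for one anchor can be rebuilt from such a dump with a few lines of Python (a sketch; the dump file name is illustrative):

import pandas as pd

# Load one textual dump; columns are sample, anchor, target, count.
df = pd.read_csv("result_dump/result.0.satc.dump", sep="\t",
                 names=["sample", "anchor", "target", "count"])

# Rebuild the contingency table (targets x samples) for one anchor.
one_anchor = df[df["anchor"] == df["anchor"].iloc[0]]
table = one_anchor.pivot_table(index="target", columns="sample",
                               values="count", fill_value=0)
print(table)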

Binary

When the --dump_sample_anchor_target_count_binary switch is used, there will be an additional output directory (named result_satc by default; the result part may be redefined with the --outname_prefix switch). This directory will contain a number of files equal to the number of bins (default 128; may be redefined with the --n_bins switch). The extension of these files is .satc. These are binary files in SATC format; their content may be converted to a textual representation with the satc_dump program (part of the SPLASH package). Each file contains some number of anchors, but it is assured that any given anchor is present in only a single file. Since the textual dumps can be large, the binary (SATC) files may be preferable, especially if one wants to investigate only a subset of the anchors.

satc_dump

To convert SATC files into a textual representation, one may use the satc_dump program. The simplest usage is:

 satc_dump input.satc output.satc.dump

There are also additional parameters that may be useful, namely:

The generated plot output is one file per anchor, each containing 4 subplots. At the bottom left is the raw I x J contingency table, where each column is a sample and each row represents a target (low-abundance targets are not shown); each sample's target counts are normalized by $n_j$ and plotted. At the bottom right, the targets for the contingency table are displayed in an I x k table (corresponding to the rows of the contingency table, visually aligned); each target of length k is shown with base pairs colored according to the colorbar below, and the target sequence for each row is displayed on the right. At the top left, the sample metadata is shown in a 1 x J table; each entry corresponds to the column of the contingency table directly below it, with a color bar below providing the legend. At the top right, the column counts ($n_j$) are shown in a 1 x J table, where the colorbar below provides the scale; again, the columns are sorted as in the contingency table.

The script c_analysis.ipynb shows how the saved c vectors can be loaded for further analysis. The --dump_Cjs switch must be enabled for this.

Biological interpretation and classification of anchors

To facilitate downstream analysis of anchors, we provide a postprocessing script, SPLASH_extendor_classification.R, that can be run on the anchors file generated by a SPLASH run to classify anchors into biologically meaningful events, such as alternative splicing and base-pair changes. SPLASH_extendor_classification.R needs the following inputs:

The script will generate a file, classified_anchors.tsv, in the directory used for the SPLASH run, containing significant anchors along with their biological classification and alignment information.

Building index and annotation files needed for running classification script

To run the classification script for a given reference genome/transcriptome, you first need to obtain a FASTA file for the reference genome and a GTF file for the transcriptome annotation. You then need to perform the following 3 steps (note that all index/annotation files from these 3 steps should be generated from the same FASTA and GTF files):

Downloading pre-built annotation files for human and mouse genomes:

The human files were built for both the T2T and GRCh38 assemblies; the mouse files were built based on the mm39 assembly. The annotation files can be downloaded using the following links:

References

Marek Kokot, Roozbeh Dehghannasiri, Tavor Baharav, Julia Salzman, and Sebastian Deorowicz. SPLASH2 provides ultra-efficient, scalable, and unsupervised discovery on raw sequencing reads. bioRxiv (2023).

Kaitlin Chaung, Tavor Baharav, Ivan Zheludev, and Julia Salzman. A statistical, reference-free algorithm subsumes myriad problems in genome science and enables novel discovery. bioRxiv (2022).

Tavor Baharav, David Tse, and Julia Salzman. An Interpretable, Finite Sample Valid Alternative to Pearson's X2 for Scientific Discovery. bioRxiv (2023).