Code to calculate the Splicing Z Score (SpliZ) for single cell RNA-seq splicing analysis
GNU General Public License v2.0

SpliZ Pipeline

Pipeline

This repository contains code to perform the analyses in the paper "The SpliZ generalizes “Percent Spliced In” to reveal regulated splicing at single-cell resolution" (Olivieri, Dehghannasiri, and Salzman 2021).

This pipeline takes the output from SICILIAN and returns the SpliZ for each gene and cell, as well as analyses of differential alternative splicing.

Installation and setup

Clone this repository:

$ git clone https://github.com/juliaolivieri/SpliZ_pipeline.git

$ cd SpliZ_pipeline/

Ensure that conda is working on your system. Then set up the conda environment from the environment.yml file:

$ conda env create --name spliz_env --file=environment.yml

and activate it:

$ source activate spliz_env

If this activation step doesn't work, run conda env list and look for the path that ends with spliz_env, then run source activate <full path>. Newer conda releases use conda activate in place of source activate.

This whole process should take less than 5 minutes on a normal computer.

Running the pipeline on test data

Use the following command to run the pipeline on the small test dataset (labeled test in the data folder):

snakemake -p --config datasets="test" --restart-times 0

This should take less than 5 minutes to run on a local computer with at least 3 GB of free space.

After the pipeline has completed, you can check your results by comparing the file scripts/output/final_summary/summary_test_compartment-tissue_100_S_0.1_z_0.0_b_5.tsv with the sample output file test_pvals_compartment-tissue_100_S_0.1_z_0.0_b_5.tsv in the main directory.
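The comparison described above can be scripted. The following is a minimal sketch, not part of the pipeline: the helper names (read_tsv, cells_match, tsv_files_match) and the numeric tolerance are assumptions made for illustration, and it assumes both files list their rows in the same order. Numeric cells are compared with a small tolerance in case floating-point output differs slightly between systems.

```python
import csv
import math


def read_tsv(path):
    """Read a tab-separated file into a header list and a list of rows."""
    with open(path, newline="") as handle:
        rows = list(csv.reader(handle, delimiter="\t"))
    if not rows:
        return [], []
    return rows[0], rows[1:]


def cells_match(a, b, rel_tol=1e-6):
    """Compare two cells numerically when both parse as floats, else as text."""
    try:
        x, y = float(a), float(b)
    except ValueError:
        return a == b
    if math.isnan(x) and math.isnan(y):
        return True
    return math.isclose(x, y, rel_tol=rel_tol, abs_tol=1e-12)


def tsv_files_match(path_a, path_b, rel_tol=1e-6):
    """Return True when both files have the same header and matching rows."""
    header_a, rows_a = read_tsv(path_a)
    header_b, rows_b = read_tsv(path_b)
    if header_a != header_b or len(rows_a) != len(rows_b):
        return False
    for row_a, row_b in zip(rows_a, rows_b):
        if len(row_a) != len(row_b):
            return False
        if not all(cells_match(a, b, rel_tol) for a, b in zip(row_a, row_b)):
            return False
    return True
```

Calling tsv_files_match with the path of the generated summary file and the path of the sample output file returns True when the two agree within the tolerance.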

Downloading data from paper

Download the following files from figshare and place them in the "data" directory:

Running the pipeline

Names of datasets to run on are specified in the config.yaml file. To run, use snakemake -p. To run on different datasets, either change the values in the config.yaml file, or override them at the command line: snakemake -p --config datasets="my_data_name". You can run snakemake -np first to see what jobs will be run. Each job automatically re-submits itself two times if it fails, so if you want to run without these resubmissions you can run snakemake -p --restart-times 0.
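If you launch runs from a script, the options above can be assembled programmatically. This is a hedged sketch rather than part of the pipeline: build_snakemake_command and run_pipeline are hypothetical helper names, and the only snakemake flags used are the ones mentioned in this README (-n, -p, --config, --restart-times).

```python
import subprocess


def build_snakemake_command(dataset=None, dry_run=False, restart_times=None):
    """Assemble the snakemake argument list for the options described above."""
    # -n asks for a dry run; -p prints the shell command of each job.
    command = ["snakemake", "-np" if dry_run else "-p"]
    if dataset is not None:
        # Overrides the datasets value from config.yaml for this run only.
        command += ["--config", f"datasets={dataset}"]
    if restart_times is not None:
        # 0 disables the automatic resubmission of failed jobs.
        command += ["--restart-times", str(restart_times)]
    return command


def run_pipeline(dataset=None, restart_times=None):
    """Dry-run first, then start the real run only if the dry run succeeds."""
    subprocess.run(build_snakemake_command(dataset, dry_run=True), check=True)
    subprocess.run(
        build_snakemake_command(dataset, restart_times=restart_times),
        check=True,
    )
```

For example, build_snakemake_command("test", restart_times=0) yields the same arguments as the test-data command shown earlier; run_pipeline must be called from the repository directory with the conda environment active.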

The terminal window you submit from will be unavailable until the full pipeline finishes. You can use tmux to split your terminal into panes so that snakemake occupies only one of them (this also lets you detach the session so it keeps running even when the terminal isn't open). With the tmux approach, you must always log in to the same node in order to reconnect to the same session.

The pipeline should take around one hour to run on the full dataset.

To set up snakemake to run on slurm, you can follow the directions here: https://github.com/Snakemake-Profiles/slurm. All of the time and memory requirements for the SpliZ pipeline are specified in the script itself, so you don't need to change these variables if you're only running this pipeline.

Input file format

This pipeline works with the "class input file" output of the SICILIAN pipeline. To run the pipeline without running SICILIAN first, your data must be a tsv or parquet file with one row per cell+junction with the following columns:

An example input file is given in data/test.tsv.
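If you prepare an input file yourself, the "one row per cell+junction" requirement can be checked before running the pipeline. The sketch below is an illustration under stated assumptions: duplicated_keys is a hypothetical helper, it handles tsv input only (not parquet), and the key column names are supplied by the caller because this sketch does not assume which columns identify the cell and the junction in your file.

```python
import csv
from collections import Counter


def duplicated_keys(tsv_path, key_columns):
    """Return the key tuples that appear on more than one row of a TSV file."""
    with open(tsv_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        # Fail early if a requested key column is absent from the header.
        missing = [c for c in key_columns if c not in (reader.fieldnames or [])]
        if missing:
            raise ValueError(f"missing columns: {missing}")
        counts = Counter(tuple(row[c] for c in key_columns) for row in reader)
    return [key for key, n in counts.items() if n > 1]
```

An empty list means every combination of the chosen key columns occurs on exactly one row. Pass the names of the columns that hold the cell identifier and the junction identifier in your file as key_columns.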

Output files

SpliZ values

The SpliZ values for the dataset can be found in this output file:

This output file has one line per cell per gene with enough spliced reads to calculate a SpliZ value. The column values are:

Differential SpliZ analysis

Results of the differential SpliZ analysis can be found in this file:

There is one row per gene per group and cell type. The columns of the file are defined as follows:

SpliZsites

A separate file is created based on each of the first three eigenvectors:

Each of these files contains the following columns:

Software dependencies

These are also found in the environment.yml file.

    - python=3.6.7
    - pandas=1.0.4
    - tqdm=4.46.0
    - numpy=1.18.4
    - snakemake-minimal=5.4.5=py_0
    - pyarrow=0.15.1
    - r-base=4.0.2
    - r-data.table=1.14.0
    - r-rfast=2.0.3
    - scipy=1.4.1
    - statsmodels=0.11.1

References

Olivieri, Dehghannasiri, and Salzman. "The SpliZ generalizes “Percent Spliced In” to reveal regulated splicing at single-cell resolution." bioRxiv (2020). https://www.biorxiv.org/content/10.1101/2020.11.10.377572v2

Dehghannasiri, Olivieri, and Salzman. "Specific splice junction detection in single cells with SICILIAN." bioRxiv (2020). https://www.biorxiv.org/content/10.1101/2020.04.14.041905v1