mckennalab / SingleCellLineage

Updated scripts and pipelines for processing GESTALT data at single-cell resolution

Single-cell GESTALT pipeline

This pipeline processes 2nd-generation sequencing of GESTALT lineage tracing barcodes. You can either set up the tools manually (following the directions in the Dockerfile) or use the Docker installation option. The pipeline merges reads by UMI, aligns the resulting consensus to a reference file, and aggregates the results into plots describing the overall editing, as well as trees describing the relative lineage relationships within your data. This code is a moving target and may contain buggy implementations or changing code.

Processing data using the GESTALT pipeline

Before you begin

Things you need to have: a machine with Docker installed (if you're using the Docker route) and your raw sequencing data as compressed FASTQ files.

Then download the docker container:

docker pull aaronmck/single_cell_gestalt:latest

Run and connect to the container, remapping port 80 of the container to port 8080 on your local machine:

 docker run -it -p 8080:80 aaronmck/single_cell_gestalt:latest /bin/bash
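
If you want to double-check that the image pulled correctly and that the port mapping is active, you can do so from another terminal on the host. These are just generic Docker checks, not part of the pipeline itself:

docker images aaronmck/single_cell_gestalt
docker ps --filter ancestor=aaronmck/single_cell_gestalt:latest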

Set up an individual run

The pipeline is driven by a tearsheet: a sample spreadsheet with one line per sample (its columns are described in the 10X example below). Save this as a tab-delimited file on the same filesystem you plan to run the pipeline on.

Set up a run script

The run script specifies which parameters should be used with the tearsheet above, and executes the full processing pipeline. The idea is to separate what's run (the samples in the tearsheet above) from how they're run (this script). Generally this is a shell script that looks like the following. Items in brackets <like_so> are paths that are specific to your installation:

java -Xmx4g \
    -jar <path_to_queue.jar> \
    -S <pipeline_path_plus>/CRISPR_analysis_PE_V2.scala  \
    -i <your_tearsheet_path>/crispr_tearsheet_04_16_2020.txt \
    --aggLocation <some_path_here>/data/pipeline_output/ \
    --expName 2020_04_16_new_libs \
    --web <where_to_put_webfiles> \
    -s <path_to_github_clone_on_filesystem>/SingleCellLineage/scripts/ \
    -b <path_to_external_tools>/dartfs-hpc/rc/lab/M/McKennaLab/resources/tools/bin/ \
    -run

We generally run java with ~4g of memory; this is overkill but usually not a big issue on large compute environments. The -jar parameter is the path to the Queue execution manager, which runs the -S script for us. This script takes the -i tearsheet information, parses it out, and creates a number of compute jobs to transform the raw reads into GESTALT barcode calls. You also need to specify the aggregate data location with the --aggLocation parameter; this is where the pipeline will put a couple of files about the run as a whole (across samples). The --expName parameter gives a name to the experiment as a whole (used for web output and run statistics), and the --web directory is where the pipeline will put web-visible plots for later analysis. The -s and -b parameters describe the location of tools we'll need to run the pipeline, and finally -run tells the script to actually start jobs (without -run it will just list the commands it would run).
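
Since -run is the switch that actually submits jobs, a useful sanity check is a dry run: execute the same command as above with the trailing -run removed, and Queue will only list the commands it would run rather than starting them. For example, using the same placeholder paths:

java -Xmx4g -jar <path_to_queue.jar> \
    -S <pipeline_path_plus>/CRISPR_analysis_PE_V2.scala \
    -i <your_tearsheet_path>/crispr_tearsheet_04_16_2020.txt \
    --aggLocation <some_path_here>/data/pipeline_output/ \
    --expName 2020_04_16_new_libs --web <where_to_put_webfiles> \
    -s <path_to_github_clone_on_filesystem>/SingleCellLineage/scripts/ \
    -b <path_to_external_tools>/dartfs-hpc/rc/lab/M/McKennaLab/resources/tools/bin/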

Run the example script

If you're running a locally installed version of the pipeline, you just need to run the script we created above:

sh <example_shell_script.sh>

Otherwise, if you're running from within the Docker container, run the example script:

sh /app/sc_GESTALT/tear_sheet_examples/run_crispr_pipeline.sh

More Docker information

You can adjust the input files by changing filenames in the tearsheet: /app/sc_GESTALT/tear_sheet_examples/basic_example.txt. To mount data from a local disk to a location within the Docker filesystem, use the -v option:

docker run -m 8g -it -p 8080:80 -v /Users/aaronmck/Desktop/gel_images/:/my_data aaronmck/single_cell_gestalt:SC_GESTALT /bin/bash
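
Once inside the container, it's worth confirming the mount is actually visible before pointing a tearsheet at it. A quick check against the example mount point used above (nothing pipeline-specific here):

ls /my_data
df -h /my_data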

You can then rerun the example script or create your own:

sh /app/sc_GESTALT/tear_sheet_examples/run_crispr_pipeline.sh

Output

A run of the GESTALT pipeline creates a number of files in the output directory of each sample. This location is specified in the input sample spreadsheet, in the output.dir column. See our example spreadsheet used in the script here:

/app/sc_GESTALT/tear_sheet_examples/basic_example.txt

In that file the base output location is set to /app/data/, which should have a sub-directory for the dome_4_1X sample.

Within the example run output directory there are a couple of important output files; the main one for each sample is the stats file, described in the 10X example run below.

The pipeline also generates visualization output: plots of the editing pattern over the target. By default these go into the /var/www/html/ directory on the Docker container, but the location can be changed with the --web parameter to the pipeline. If you're using Docker and have remapped your ports as above (using -p 8080:80), you should be able to view the output by opening localhost:8080.

Within the base output directory there will be a directory for each run. In the example case, our run is called testdata, and there will be a directory matching this name. Within that, our sample is called dome_4_1X. This folder contains the HTML output to visualize editing patterns over your sample. If you've set up your port (the -p 8080:80 option when running Docker) you can open a browser and view your output here:

http://localhost:8080/testdata/dome_4_1X/read_editing_mutlihistogram.html

Each sample processed should have a read_editing_mutlihistogram.html file in its base output directory.
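
If you'd rather confirm which samples produced web output without opening a browser, you can list the generated HTML files from inside the container. This assumes you left the web output in the default /var/www/html/ location:

find /var/www/html/ -name 'read_editing_mutlihistogram.html'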

10X example run

A question that often comes up is how to set up a 10X sequencing barcode run. I've generated a bit of fake lineage data from our fish barcode to demonstrate the process. First run the docker container (add the optional disk mounting options to your run as seen above, and be sure to update the container, as there have been changes):

 docker run -it -p 8080:80 aaronmck/single_cell_gestalt:latest /bin/bash

Now we'll make a directory for our run data:

cd /app/
mkdir my_test_run
cd my_test_run/
mkdir data
mkdir data/raw

Download the test reads:

wget -O data/raw/read1.fq.gz https://downloads.mckennalab.org/public/read1.fq.gz
wget -O data/raw/read2.fq.gz https://downloads.mckennalab.org/public/read2.fq.gz
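
It's worth taking a quick look at the downloaded reads before running anything; this is just standard FASTQ inspection with common unix tools:

zcat data/raw/read1.fq.gz | head -8      # first two FASTQ records of read 1
zcat data/raw/read2.fq.gz | head -8      # first two FASTQ records of read 2
echo $(( $(zcat data/raw/read1.fq.gz | wc -l) / 4 )) reads   # total read count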

And download the reference files (making some folders we'll need as well):

mkdir data/reference/
mkdir data/pipeline_output
wget -O data/reference/tol2.fa https://downloads.mckennalab.org/public/tol2.fa 
wget -O data/reference/tol2.fa.cutSites https://downloads.mckennalab.org/public/tol2.fa.cutSites 
wget -O data/reference/tol2.fa.primers https://downloads.mckennalab.org/public/tol2.fa.primers 

The first reference file is the sequence itself, the second is a tab-delimited file describing the targets and their cut-sites (17bp into the target), and the third describes any primers we expect to be on either end of the construct. We can tell the pipeline to use neither, both, or just one of these primers to filter out unintended PCR products.
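
These reference files are small plain-text files, so the quickest way to see the exact layout the pipeline expects is to look at the downloaded examples themselves; the descriptions above are a summary, and the files are authoritative:

head -2 data/reference/tol2.fa          # FASTA header plus the start of the sequence
cat data/reference/tol2.fa.cutSites     # tab-delimited targets and cut-sites
cat data/reference/tol2.fa.primers      # expected primers at each end of the construct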

Now we'll need to set up a tearsheet to tell the pipeline all about our data. I've preconfigured a test tearsheet, which uses the exact directories we set up above; you can download it directly:

wget -O data/tol2_simulated_data_tear_sheet.txt https://downloads.mckennalab.org/public/tol2_simulated_data_tear_sheet.txt

You can browse the columns in the file with standard unix commands, or download it yourself and open it in Excel. Generally there's one line in this file for each sample. The first column lists the sample name, the second indicates if it has a UMI sequence (10X does, so set this to TRUE), the third is where to find the reference file, the fourth is where to put the output of the pipeline, and the rest of the columns describe the compressed FASTQ files and which indexes to use. In this case we don't have files with indexes mixed together, so we can set the barcode FASTQ columns to NA and the barcode columns to ALL, which tells the pipeline to just use all the reads in the input files for that sample. It's important to note: for 10X data the sequence data, often read2, needs to be in the first FASTQ slot, and the combined 10X barcode/UMI read should be in the second FASTQ slot.
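
Since the tearsheet is tab-delimited, one easy way to browse it per-column from inside the container is to turn the tabs into newlines. The sketch below assumes the downloaded file has a header row followed by one line per sample:

head -1 data/tol2_simulated_data_tear_sheet.txt | tr '\t' '\n' > /tmp/columns.txt
sed -n '2p' data/tol2_simulated_data_tear_sheet.txt | tr '\t' '\n' > /tmp/sample1.txt
paste /tmp/columns.txt /tmp/sample1.txt      # each column name next to its value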

Lastly we want to set up a script to run everything; we can pull an example I set up. It sets a number of parameters that are important to a 10X run:

wget -O test_run.sh https://downloads.mckennalab.org/public/test_run.sh
bash test_run.sh

This should kick off a process that generates a large amount of output, terminating with something that looks like:

INFO  00:40:51,659 QCommandLine - Script completed successfully with 5 total jobs
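
Once the run reports success, you can confirm that the per-sample output appeared where the tearsheet pointed it; the path below matches the stats file location described next:

ls data/pipeline_output/simulated_sample_1/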

The main output file you're interested in for each of your samples is the stats file, located in the sample-specific output folder. In this case it's located in data/pipeline_output/simulated_sample_1/simulated_sample1.stats for our one sample. This data was derived from ~8200 simulated cell lineages, which you can browse in that file. When you go to match it up against your 10X barcodes, one of the most important parts is the read name. For example, consider the line starting with:

NS500488:132:H7JCTAFXX:1:11101:8025:2_TATTAGATCTCGCAGCCCTCGATAGACAFWD_18_1_ConsensusFWD__

This read name describes a consensus read collapsed down from 18 reads with the combined barcode TATTAGATCTCGCAGCCCTCGATAGACA, of which the first 16 bases are the 10X cell barcode and the remaining 12 bases are the UMI.
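
When matching these calls back to your 10X cell barcodes, it can be handy to split the combined barcode in each read name into its 16bp cell barcode and 12bp UMI. The sketch below makes two assumptions worth checking against your own stats file: that the read name is the first whitespace-delimited field of each line, and that the 28bp combined barcode starts the second underscore-delimited chunk of that name, as in the example above:

awk '{
    n = split($1, parts, "_");                # break the read name on underscores
    if (n >= 2 && length(parts[2]) >= 28)     # second chunk begins with the 28bp combined barcode
        print substr(parts[2], 1, 16) "\t" substr(parts[2], 17, 12);
}' data/pipeline_output/simulated_sample_1/simulated_sample1.stats | sort -u | head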