pachterlab / sircel

Identify cell barcodes from single-cell genomics sequencing experiments
MIT License
41 stars 14 forks source link

Readme

Akshay Tambe
Pachter and Doudna groups
UC Berkeley

Summary

sircel (pronounced "circle") separates reads in a fastq file based on barcode sequences that occur at known positions of reads. This is an essential first step in analyzing single-cell genomics data from experiments such as Drop-Seq. Barcode sequences often contain deletion and/or mismatch errors that arise during barcode synthesis and sequencing, and we have designed our barcode recovery approach with these issues in mind. In addition to identifying barcodes in an unbiased manner, sircel also quantifies their abundances.

sircel is well-suited for Drop-Seq, in which barcode deletion errors are prevalent. In order to identify barcodes in a deletion-robust manner, sircel begins by counting k-mers in circularized barcodes extracted from the reads. Circularization of the barcode portion ensures that a mismatch-containing barcode will share at least one k-mer with the true barcode sequence. This is not the case when using 'linear' barcode sequences for some larger values of k.

To address insertion and deletion errors, barcodes are also circularized after either appending or deleting one sequenced base. In the case of dropseq barcoded beadas, a deletion will result in the last sequenced position of the barcode actually arising from the first base of the UMI. Conversely an insertion will result in the first base (by position) of the UMI actually arising from the last base of the barcode. As each read might potentially contain an unknown mismatch / insertion / deltion error, each read is converted to three circularized variants. Kmers are then counted from these circularized barcodes, and a De Bruijn graph is constructed where a node represents by a kmer, and edges represent two kmers that were adjacent in at least one read. Edges are weighted by how many reads connect those kmers.

In this graph, a barcode will appear as a circular path of fixed length through this graph. We define the weight of such a path to be that of the lowest-weight edge within this path. We can identify such circular paths in this graph, and such paths represent possible consensus barcode sequences. It should be noted here that reads with either 0 or 1 error can contribute to these weighted paths; reads with 2 or more errors typically do not.

This path weight can be interpreted as the number of reads that support a given barcode at its lowest-confidence kmer. As such we identify the true consensus barcodes by selecting the paths with highest weight. This process is typically straightforward, as we have observed that the cumulative distribution of path weights contains an inflection point corresponding to true paths. This result is seen in a number of simulated and real datasets.

We assign each read in our dataset to a consensus barcode based on the number of kmers the read shares with the consensus barcode. When computing kmers for this assignment, the value of k can differ from the value used to produce the barcode De Bruijn graph. This allows us to assign reads which might include 2 or more errors.

Installation

Run the setup py script below. In order to use kallisto to quantify single-cell expression levels after sircel identifies cell barcodes, and kallisto is not on the PATH the --kallisto argument must point to the kallisto binary (requires kallisto version >= 0.43.0).

python3 setup.py install --kallisto PATH_TO_KALLISTO --install-scripts=./sircel/

The sircel executable should be in the directory ./sircel/sircel. After installation, you can run sircel via sircel.

Requirements

python3
numpy
scipy
scikit-learn (for TCCs)
    Optional. Not needed if only splitting reads by barcode (not quantifying single-cell expression levels)
kallisto (0.43.0 or higher)
    Optional. Not needed if only splitting reads by barcode (not quantifying single-cell expression levels)

Commandline arguments

--help              Prints help message

--dropseq           Use barcode / umi positions from Macosko et al.
--10xgenomics           Use barcodes / umi coordinates from 10xGenomics version2 chemistry
--barcodes          Fastq.gz file from bead barcodes (Reads 1 for dropseq, reads 1 for 10xGenomics)
--reads             Fastq.gz file from RNA-seq / 3' sequence tags (Reads 2 for dropseq, reads 2 for 10xGenomics)
--umis              UMIs file. (Index I1 file, only required for 10xGenomics )
--output_dir            Directory where output files are written
--kmer_size         Size of k-mer to use for building the barcode graph
--depth             Search depth at each node [Default: 4]
--breadth           Search breadth through the graph [Default: 1000]
--threads           Number of threads to use [Default: 1]

Optional arguments

--min_dist      Minimum Hamming distance between barcodes
--kallisto_idx      Path to pre-computed kallisto index.
                If this is provided the script will quantify single-cell expression profiles using kallisto
--barcode_start     First position of cell barcode within read in --barcode file.
                Not needed if --dropseq or --10xgenomics arguments are provided
--barcode_end       Last position of cell barcode within read in --barcode file.
                Not needed if --dropseq or --10xgenomics arguments are provided
--umi_start     First position of unique molecule ID within read in --barcode file.
                Not needed if --dropseq or --10xgenomics arguments are provided
--umi_end       Last position of unique molecule ID within read in --barcode file.
                Not needed if --dropseq or --10xgenomics arguments are provided
--num_cells     Estimated number of cells. Not required

Note about barcode and UMI position indexing Barcode and UMI positions within a read follow the same indexing convension as python strings. For example, the string BARCODEUMI would have coordinates:

barcode_start       0
barcode_end         7
umi_start           7
umi_end             10

This follows from the following python snippet:

>>> seq = 'BARCODEUMI'
>>> seq[0:7]
'BARCODE'
>>> seq[7:10]
'UMI'

Use example

1 million reads from from Macosko 2015

This tutorial will run this software on an example dataset from Macosko et al (SRR1873277). According to the authors' software we expect ~570 single cells from this dataset. To speed things up we only use the first 1 million reads in the file (rather than ~400 million).Note that if you are not using kallisto to quantify expression levels, omit the --kallisto_idx option. This command will run a complete singe-cell RNA-seq analysis, going from raw, un-split reads to single-cell expression profiles. Note that the human / mouse transcriptome fasta can be found at http://bio.math.berkeley.edu/kallisto/transcriptomes/. Build a kallisto index following instrutuions https://pachterlab.github.io/kallisto/starting

sircel \
    --threads 32 \
    --dropseq \
    --output_dir example \
    --reads example/SRR1873277_1m_r2.fastq.gz \
    --barcodes example/SRR1873277_1m_r1.fastq.gz \
    --kallisto_idx kallisto_idx/hgmm_kallisto.idx

The following output files will be produced in the example/ directory:

    all_paths.txt
        Tab-delimited file containing all circular paths (putative barcodes) found in the graph 
    merged_paths.txt
        Tab-delimited file containing circular paths after Hamming correction   
    paths_plotted.png
        Plots / histograms of circular path weights
    fits.txt
        Tab-delimited file containing Gaussian fits for plots above
    kallisto/
        Folder containing kallisto and TCC output files
        See Pachterlab scRNA-seq TCC pipeline for more:
            https://github.com/pachterlab/scRNA-Seq-TCC-prep
    reads_split/
        Folder containing fastq.gz files split by read.
        Also contains umi.txt and batch.txt files for kalisto single-cell
    run_log.txt
    run_outputs.json

The outputs from this command can be visualized with an included ipython notebook. Simply point the notebook to the appropriate run output file [codeblock 3] to re-compute the plots.