mehdiborji / nanoranger

simplified cellranger for long-read data
MIT License

nanoranger

nanoranger is a processing tool for long-read single-cell transcriptomics as described in our Nature Communications paper, and spatial transcriptomics as described in our Immunity paper.

Workflow

The input data can be obtained by sequencing 10x Genomics whole-transcriptome cDNA libraries, or amplicons obtained through targeted amplification, on Oxford Nanopore Technologies (ONT) or Pacific Biosciences devices. A schematic of the workflow is shown below.

[Figure: workflow schematic]

If you have a question about the software, or have any suggestions or ideas for new features or collaborations, feel free to create an issue here on GitHub, or write an email to mborji@broadinstitute.org.

Background

Two of the main challenges of ONT data analysis for single-cell applications have been (i) a higher sequencing error rate compared to Illumina data and (ii) the variable location of cell barcodes and unique molecular identifiers (UMIs) within each sequenced transcript.

To overcome these challenges nanoranger introduces two innovations:

Different quantification 'modes' are available for different library structures and tasks, and the transcriptome reference can be modified accordingly. For whole-transcriptome gene expression analysis, a GENCODE transcriptome reference can be used. For 5' immune profiling, this can be reduced to a reference of V transcripts, and similarly for 3' immune profiling it can be a reference of C transcripts. If a set of targets is enriched from cDNA, analysis can be sped up by using a reference that contains only the transcripts expected to be present.
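As an illustration of reducing a reference to a set of target transcripts, here is a minimal sketch in plain Python. It is not part of nanoranger; the function names are hypothetical, and it assumes GENCODE-style headers in which fields such as the gene name are separated by `|`. The record IDs in the example are made up.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def subset_reference(lines, targets):
    """Keep records whose header contains one of the target names as a field.

    Assumption: header fields are separated by '|' (GENCODE style) or whitespace.
    """
    targets = set(targets)
    kept = []
    for header, seq in read_fasta(lines):
        fields = header.replace("|", " ").split()
        if targets.intersection(fields):
            kept.append((header, seq))
    return kept


# Toy input with made-up identifiers, for illustration only.
example = [
    ">ENST0001|ENSG0001|RUNX1|",
    "ACGT",
    "ACGT",
    ">ENST0002|ENSG0002|GAPDH|",
    "TTTT",
    ">ENST0003|ENSG0003|ABL1|",
    "GGGG",
]

kept = subset_reference(example, {"RUNX1", "ABL1"})
for header, seq in kept:
    print(">" + header)
    print(seq)
```

For a real reference, the same functions can be applied to an open file handle instead of the toy list, and the printed records redirected to a new `.fa` file to pass via `--t`.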

nanoranger has been primarily tested on targeted libraries generated using 10X 5' Chromium and slide-seq 3' platforms. It can be used for immune profiling and genotyping from other library types with minimal modifications.

Support for generating count matrices for whole-transcriptome libraries, and for additional chemistry types, is currently under development.

Software Dependencies

This tool has been tested on Python 3.7.10 under CentOS and Ubuntu systems.

The following programs are also assumed to be on the PATH when running the tool. Install each of them, following its own documentation, before starting your analysis. Alternatively, they are available as Bioconda packages.

STAR is used for barcode correction against a set of known barcodes. With certain parameter changes, STAR is run in a Smith-Waterman-like mode.

minimap2 is used for initial alignment of raw nanopore reads to a transcriptome and, depending on the operation mode, for subsequent alignment to a genome.

SAMtools is used for sorting and indexing BAM files.

pigz is used for compressing output and intermediate fasta and fastq files.

MiXCR is used for VDJ alignment and clonotype extraction. We have strictly used MiXCR v3 in validating and benchmarking the results against Illumina-based data. The latest versions of MiXCR have not been fully tested with our workflow and do not seem to be compatible out of the box without parameter tuning.

SeqKit is used for splitting input FASTQ files in the case of very large libraries or libraries prepared with cDNA concatenation. Deconcatenation is sped up by processing the split input files in parallel. To enable this step, set the optional boolean flag --split.
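Before a run, it can help to confirm that these programs are discoverable on the PATH. The following is a minimal sketch, not part of nanoranger. The executable names in `REQUIRED_TOOLS` are the names these tools are commonly installed under and are an assumption; adjust them if your installation differs.

```python
import shutil

# Assumed executable names; change these if your installation uses others.
REQUIRED_TOOLS = ["STAR", "minimap2", "samtools", "pigz", "mixcr", "seqkit"]


def find_missing(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be located on the PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = find_missing()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```

The `which` argument exists only so the lookup can be replaced in tests; in normal use, call `find_missing()` with no arguments.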

Download and Install

git clone https://github.com/mehdiborji/nanoranger.git
cd nanoranger
chmod -R +x *
pip install -r requirements.txt

Sample Input Commands For Different Modes

The pipeline supports different chemistries through the --mode flag.
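When scripting runs for several samples, the flags used in the sample commands in this section can be assembled programmatically. The sketch below is a hypothetical helper, not part of nanoranger; it only builds the argument list, using the flag names and the 5p10XTCR values that appear in the sample commands.

```python
import os
import shlex


def build_command(pipeline, **flags):
    """Build an argument list such as ['python', pipeline, '--c', '8', ...].

    Flags whose value is None are skipped, so optional flags can be omitted.
    """
    cmd = ["python", pipeline]
    for name, value in flags.items():
        if value is None:
            continue
        cmd += ["--" + name, str(value)]
    return cmd


home = os.path.expanduser("~")
cmd = build_command(
    os.path.join(home, "nanoranger", "pipeline.py"),
    c=8,
    i=os.path.join(home, "nanoranger", "sample_fastq", "TCR3.fastq.gz"),
    o="TCR",
    e="TCR",
    m="5p10XTCR",
    t=os.path.join(home, "nanoranger", "data", "TR_V_human.fa"),
    x="hsa",
    b=None,  # barcode list flag, only used in the slide-seq example
)

# Print a shell-safe version of the command for inspection.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` to launch the run.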

3pXCR_slideseq

python ~/nanoranger/pipeline.py \
        --c 8 \
        --i ~/nanoranger/sample_fastq/slideseq_XCR.fastq.gz \
        --o XCR \
        --e Puck_220509_18 \
        --m 3pXCR_slideseq \
        --b ~/nanoranger/data/slideseq.matched.barcodes.tsv.gz \
        --t ~/nanoranger/data/XR_C_mouse.fa \
        --x mmu

5p10XTCR

python ~/nanoranger/pipeline.py \
        --c 8 \
        --i ~/nanoranger/sample_fastq/TCR3.fastq.gz \
        --o TCR \
        --e TCR \
        --m 5p10XTCR \
        --t ~/nanoranger/data/TR_V_human.fa \
        --x hsa

5p10XGEX

python ~/nanoranger/pipeline.py \
        --c 8 \
        --i ~/nanoranger/sample_fastq/K562_Kasumi1_BCRABL1_RUNX1_RUNX1T1.fastq.gz \
        --o K562_Kasumi1 \
        --e fusion \
        --m 5p10XGEX \
        --t ~/nanoranger/data/RUNX1_RUNX1T1_ABL1_BCR.fa \
        --g ~/nanoranger/data/RUNX1_RUNX1T1_ABL1_BCR.fa

Downstream of this process, you may want to extract the transcript-BC-UMI combination associated with each read and keep only the meaningful fusions, after removing potential chimeras and events with few supporting reads. This can be accomplished by running the following script on the final BAM file:

python ~/nanoranger/scripts/downstream/fusion_gene.py --b fusion_genome_tagged.bam --o fusion_reads.csv

For the RUNX1_RUNX1T1 fusion, we use a primer for the RUNX1T1 transcript close to the fusion site. Reads with a flanking barcode that align to RUNX1 are fusion reads. Such reads will have another (supplementary or even primary) alignment to RUNX1T1; however, the flanking region of such alignments does not contain any barcode, so they are automatically dropped during processing. Reads with a flanking barcode that align to RUNX1T1 are wild-type reads.
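The labeling rule described above can be sketched as follows. This is an illustration only and is not the logic of `fusion_gene.py`: the `(barcode, umi, transcript)` tuple layout, the function name, and the `min_reads` threshold are all assumptions, and chimera removal is not shown.

```python
from collections import Counter, defaultdict

# Transcript that a barcoded alignment maps to -> read label, as described above.
LABELS = {"RUNX1": "fusion", "RUNX1T1": "wild_type"}


def call_fusions(reads, min_reads=2):
    """Count fusion and wild-type UMIs per cell barcode.

    reads: iterable of (barcode, umi, transcript) tuples, one per barcoded
    alignment (hypothetical layout). A UMI is counted only if at least
    min_reads reads support it (hypothetical threshold).
    """
    support = Counter()
    for barcode, umi, transcript in reads:
        label = LABELS.get(transcript)
        if label is not None:
            support[(barcode, umi, label)] += 1

    per_cell = defaultdict(Counter)
    for (barcode, umi, label), n_reads in support.items():
        if n_reads >= min_reads:
            per_cell[barcode][label] += 1
    return {barcode: dict(counts) for barcode, counts in per_cell.items()}


# Toy reads with made-up barcodes and UMIs.
reads = [
    ("AAAC", "U1", "RUNX1"),
    ("AAAC", "U1", "RUNX1"),
    ("AAAC", "U2", "RUNX1T1"),
    ("AAAC", "U2", "RUNX1T1"),
    ("AAAC", "U3", "RUNX1"),  # single read, dropped by the threshold
    ("TTTG", "U9", "RUNX1T1"),
    ("TTTG", "U9", "RUNX1T1"),
    ("TTTG", "U9", "RUNX1T1"),
]
result = call_fusions(reads)
print(result)
```

In this toy input, the cell `AAAC` keeps one fusion UMI and one wild-type UMI, while `TTTG` keeps a single wild-type UMI.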

Coming Soon!

3p10XGEX

Coming Soon!

3p10XTCR

Coming Soon!

Downstream Analysis