Vulture: Scalable microbial calling pipeline on AWS Cloud

Introduction

Vulture is a scalable cloud-based pipeline that performs microbial calling for single-cell RNA sequencing (scRNA-seq) data, enabling meta-analysis of single-cell host-microbial studies from AWS Open Data and other public-domain sources. We named the pipeline Vulture because vultures are birds that fly the highest above the 'cloud' and, as scavengers, can defend themselves against harmful pathogens.

Run Vulture on the AWS cloud

For scalable Vulture usage on the AWS cloud, please refer to our hands-on tutorial page: Vulture tutorial on the cloud

Run Vulture on local machines

Map 10x scRNA-seq reads to the viral (and microbial) host reference set using STARsolo, CellRanger, Kallisto|bustools, or Salmon|Alevin.

Requirements

Input data

10x Chromium scRNA-seq reads

Software dependencies

DropletUtils >= v1.10.2
STAR >= v2.7.9a (default) or
cellranger >= 6.0.0 or
Kallisto|bustools >= 0.25.1 or
salmon|alevin >= v1.4.0
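
Before a run, it can help to confirm which of the tools above are actually on the PATH. The following is a minimal sketch, not part of Vulture: the helper name check_tools is hypothetical, the executable names passed to it are assumptions about how the tools are installed on your machine, and it does not verify version numbers.

```shell
# Hypothetical helper: report which of the given executables are on PATH.
# Prints one "found:" or "missing:" line per name; the return status is the
# number of missing tools (0 means everything was found).
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            missing=$((missing + 1))
        fi
    done
    return "$missing"
}
```

For example, `check_tools perl Rscript STAR` would cover the default STARsolo route; only one of the aligners needs to be present.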

General usages

0. Prerequisites: download the genome files

Download the virus genome, prokaryote genome, combined genome, and virus combined genome files listed below and save them in a single folder. This folder is referred to as "vmh_genome_dir" in the next step.

human_host_viruses_microbes.viruSITE.NCBIprokaryotes.with_hg38.removed_amb_viral_exon.gtf
human_host_viruses_microbes.viruSITE.NCBIprokaryotes.with_hg38.fa
human_host_viruses.viruSITE.with_hg38.removed_amb_viral_exon.gtf
human_host_viruses.viruSITE.with_hg38.fa
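
A quick check that the reference folder is complete can save a failed alignment later. This is a minimal sketch under stated assumptions: the helper name check_vmh_genome_dir is hypothetical, and the mapping from a database name to a file prefix is inferred from the file names above rather than taken from the Vulture scripts.

```shell
# Hypothetical helper: verify that a reference folder holds a non-empty FASTA
# and GTF for the chosen database. The name-to-prefix mapping is an assumption
# based on the downloaded file names.
check_vmh_genome_dir() {
    dir=$1
    db=${2:-viruSITE.NCBIprokaryotes}
    case $db in
        viruSITE)
            prefix=human_host_viruses.viruSITE.with_hg38 ;;
        viruSITE.NCBIprokaryotes)
            prefix=human_host_viruses_microbes.viruSITE.NCBIprokaryotes.with_hg38 ;;
        *)
            echo "unknown database: $db" >&2
            return 2 ;;
    esac
    status=0
    for f in "$prefix.fa" "$prefix.removed_amb_viral_exon.gtf"; do
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $dir/$f" >&2
            status=1
        fi
    done
    return "$status"
}
```

Calling `check_vmh_genome_dir vmh_genome_dir viruSITE` returns 0 only when both virus-only files are present and non-empty.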

1. Map 10x scRNA-seq reads to the viral microbial host reference set:

Usage: scvh_map_reads.pl [Options] <vmh_genome_dir> <R2> <R1> or <vmh_genome_dir> <.bam file>

Options (defaults in square brackets):
-o/--output-dir <string>               the output directory [./]
-t/--threads <int>                     number of threads to run alignment with [<1>]
-d/--database <string>                 select virus or virus and prokaryotes database, can be 'viruSITE' or 'viruSITE.NCBIprokaryotes' [<viruSITE.NCBIprokaryotes>]
-e/--exe <string>                      executable command or stand alone executable path of the alignment tool [<>]
-s/--soloStrand <string>               STARsolo param: Reverse or Forward used for 10x 5' or 3' protocol, respectively [<Reverse>]
-w/--whitelist <string>                STARsolo param --soloCBwhitelist [<"vmh_genome_dir"/737K-august-2016.txt>]
-r/--ram <int>                         limitation of RAM usage. For STARsolo, param: limitGenomeGenerateRAM, limitBAMsortRAM unit by GB [<128>]
-f/--soloFeature <string>              STARsolo param: See --soloFeatures in STARsolo manual [<Gene>]
-ot/--outSAMtype <string>              STARsolo param: See --outSAMtype in STARsolo manual [<BAM SortedByCoordinate>]
-mm/--soloMultiMappers <string>        STARsolo param: See --soloMultiMappers in STARsolo manual [<EM>]
-a/--alignment <string>                Select alignment methods: 'STAR', 'KB', 'Alevin', or 'CellRanger' [<STAR>]
-v/--technology <string>               KB param: Single-cell technology used (`kb --list` to view) [<10XV2>]
--soloCBstart <string>                 STARsolo param: See --soloCBstart in STARsolo manual [<1>]
--soloCBlen <string>                   STARsolo param: See --soloCBlen in STARsolo manual [<16>]
--soloUMIstart <string>                STARsolo param: See --soloUMIstart in STARsolo manual [<17>]
--soloUMIlen <string>                  STARsolo param: See --soloUMIlen in STARsolo manual [<10>]
--soloInputSAMattrBarcodeSeq <string>  STARsolo param: See --soloInputSAMattrBarcodeSeq in STARsolo manual [<CR UR>]
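
The -s/--soloStrand description pairs Reverse with the 10x 5' protocol and Forward with the 3' protocol. A small sketch of that rule follows; the helper name solo_strand_for and the accepted spellings of the protocol are hypothetical, chosen only for illustration.

```shell
# Hypothetical helper: map a 10x protocol to the -s/--soloStrand value,
# following the option description (5' -> Reverse, 3' -> Forward).
solo_strand_for() {
    case $1 in
        "5'"|5p|5) echo Reverse ;;
        "3'"|3p|3) echo Forward ;;
        *) echo "unknown protocol: $1" >&2; return 1 ;;
    esac
}
```

The result can be passed straight to the option, for example `-s "$(solo_strand_for 3p)"`.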

For FASTQ input with the alignment option 'STAR', 'KB', or 'Alevin', run:

perl scvh_map_reads.pl -t num_threads -o output_dir vmh_genome_dir R2.fastq.gz R1.fastq.gz

where num_threads is a user-specified integer giving the number of threads to run with, output_dir is a user-specified directory for the outputs, vmh_genome_dir is the pre-generated viral (and microbial) host (human) reference set directory, and R2.fastq.gz and R1.fastq.gz are the input 10x scRNA-seq reads.
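
When several samples share one folder, the same command can be generated per sample. The sketch below only prints the commands so they can be reviewed first. It assumes files are named <sample>_R1.fastq.gz and <sample>_R2.fastq.gz, which is a naming convention chosen for this example and not a Vulture requirement; the helper name and the out/<sample> output layout are likewise hypothetical.

```shell
# Hypothetical batch helper: for every <sample>_R1.fastq.gz in a directory,
# print the mapping command for its R2/R1 pair (R2 first, as in the usage).
print_batch_commands() {
    fastq_dir=$1
    genome_dir=$2
    threads=${3:-1}
    out_root=${4:-out}
    for r1 in "$fastq_dir"/*_R1.fastq.gz; do
        [ -e "$r1" ] || continue    # no matches: the glob stays literal
        sample=$(basename "$r1" _R1.fastq.gz)
        r2="$fastq_dir/${sample}_R2.fastq.gz"
        if [ ! -e "$r2" ]; then
            echo "skipping $sample: no R2 file" >&2
            continue
        fi
        echo "perl scvh_map_reads.pl -t $threads -o $out_root/$sample $genome_dir $r2 $r1"
    done
}
```

Running `print_batch_commands fastqs vmh_genome_dir 8` lists one command per complete pair; the output can be piped to `sh` once it looks right.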

For option 'CellRanger', run:

perl scvh_map_reads.pl -t num_threads -o output_dir vmh_genome_dir sample fastqs

where sample and fastqs are the two cellranger arguments --sample and --fastqs. See the cellranger count documentation for the FASTQ and sample naming rules.

For BAM input, only STARsolo is supported. Run:

perl scvh_map_reads.pl -t num_threads -o output_dir vmh_genome_dir your_bam_file.bam

2. Filter the mapped UMIs using EmptyDrops to get the viral (and microbial) host filtered UMI count matrix, and output the viral gene and barcode info files:

Usage: Rscript scvh_filter_matrix.r output_dir sample_name

where sample_name is an optional user-specified tag to be used as a prefix for the output files.

3. (Optional; STARsolo or CellRanger only) Output quality control criteria for the post-EmptyDrops viral and microbial supporting reads in the BAM file:

Usage: perl scvh_analyze_bam.pl output_dir sample_name
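
The three steps can be chained so that a later step runs only if the previous one succeeded. This is a minimal sketch, assuming FASTQ input with the default STAR aligner and that the three scripts sit in the current directory. The names run_step, run_vulture, and DRY_RUN are hypothetical. By default it only prints the commands; set DRY_RUN=0 to execute them.

```shell
# Hypothetical driver: echo each command, and execute it unless DRY_RUN is 1
# (the default). The status of the executed command is passed through.
run_step() {
    echo "+ $*"
    [ "${DRY_RUN:-1}" = 1 ] || "$@"
}

# Run steps 1-3 for one sample; '&&' stops the chain at the first failure.
run_vulture() {
    out_dir=$1
    sample=$2
    genome_dir=$3
    r2=$4
    r1=$5
    threads=${6:-1}
    run_step perl scvh_map_reads.pl -t "$threads" -o "$out_dir" "$genome_dir" "$r2" "$r1" &&
    run_step Rscript scvh_filter_matrix.r "$out_dir" "$sample" &&
    run_step perl scvh_analyze_bam.pl "$out_dir" "$sample"
}
```

For example, `run_vulture output_dir sample_name vmh_genome_dir R2.fastq.gz R1.fastq.gz 8` prints the three commands in order. Drop the last step when using KB or Alevin, since step 3 applies to STARsolo and CellRanger output only.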
