holab-hku / Vulture

AWS Batch based scalable microbial reads calling pipeline
MIT License

Vulture: Scalable microbial calling pipeline on AWS Cloud

Introduction

Vulture is a scalable cloud-based pipeline that performs microbial calling on single-cell RNA sequencing data, enabling meta-analysis of single-cell host-microbial studies from AWS Open Data and other public sources. We named the pipeline Vulture because vultures are birds that fly the highest above the 'cloud' and, as scavengers, can defend themselves against harmful pathogens.

Run Vulture on the AWS cloud

To run Vulture at scale on the AWS cloud, please refer to our hands-on tutorial page: Vulture tutorial on the cloud

Run Vulture on local machines

Map 10x scRNA-seq reads to the viral (and microbial) host reference set using STARsolo, CellRanger, Kallisto|bustools, or Salmon|Alevin.

Requirements

Input data

General usage

0. Prerequisites: download the genome files

Download the virus genome, prokaryote genome, combined genome, and virus combined genome files listed below and save them together in one folder. That folder is referred to as "vmh_genome_dir" and is used in the next step.

human_host_viruses_microbes.viruSITE.NCBIprokaryotes.with_hg38.removed_amb_viral_exon.gtf
human_host_viruses_microbes.viruSITE.NCBIprokaryotes.with_hg38.fa
human_host_viruses.viruSITE.with_hg38.removed_amb_viral_exon.gtf
human_host_viruses.viruSITE.with_hg38.fa
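
Before starting step 1, it can help to confirm that the four reference files named above are actually present in vmh_genome_dir. The following is a minimal sketch, not part of Vulture: check_vmh_genome_dir is a hypothetical helper name, and the check only tests that each file exists and is non-empty.

```shell
# Hypothetical helper (not shipped with Vulture): returns 0 only if all four
# reference files listed in step 0 exist and are non-empty in the given folder.
# The function body runs in a subshell, so its variables stay local.
check_vmh_genome_dir() (
    dir="$1"
    missing=0
    for f in \
        human_host_viruses_microbes.viruSITE.NCBIprokaryotes.with_hg38.removed_amb_viral_exon.gtf \
        human_host_viruses_microbes.viruSITE.NCBIprokaryotes.with_hg38.fa \
        human_host_viruses.viruSITE.with_hg38.removed_amb_viral_exon.gtf \
        human_host_viruses.viruSITE.with_hg38.fa
    do
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $dir/$f" >&2
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]
)
```

For example, `check_vmh_genome_dir vmh_genome_dir && echo "reference set looks complete"` prints the message only when nothing is missing.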

1. Map 10x scRNA-seq reads to the viral microbial host reference set:

Usage: scvh_map_reads.pl [Options] <vmh_genome_dir> <R2> <R1> or <vmh_genome_dir> <.bam file>

Options:                                                                                                                                Defaults
-o/--output-dir <string>   the output directory                                                                                          [./]   
-t/--threads <int>         number of threads to run alignment with                                                                       [<1>]  
-d/--database <string>     select virus or virus and prokaryotes database, can be 'viruSITE' or 'viruSITE.NCBIprokaryotes'               [<viruSITE.NCBIprokaryotes>]
-e/--exe <string>          executable command or stand alone executable path of the alignment tool                                       [<>]
-s/--soloStrand <string>   STARsolo param: Reverse or Forward used for 10x 5' or 3' protocol, respectively                               [<Reverse>]
-w/--whitelist <string>    STARsolo param --soloCBwhitelist                                                                              [<"vmh_genome_dir"/737K-august-2016.txt>]
-r/--ram <int>             limitation of RAM usage. For STARsolo, param: limitGenomeGenerateRAM, limitBAMsortRAM unit by GB              [<128>]
-f/--soloFeature <string> STARsolo param:  See --soloFeatures in STARsolo manual                                                         [<Gene>]
-ot/--outSAMtype <string>  STARsolo param:  See --outSAMtype in STARsolo manual                                                          [<BAM SortedByCoordinate>]
-mm/--soloMultiMappers <string>  STARsolo param:  See --soloMultiMappers in STARsolo manual                                              [<EM>]
-a/--alignment <string>    Select alignment methods: 'STAR', 'KB', 'Alevin', or 'CellRanger'                                             [<STAR>]
-v/--technology <string>   KB param:  Single-cell technology used (`kb --list` to view)                                                  [<10XV2>]
--soloCBstart <string>  STARsolo param:  See --soloCBstart in STARsolo manual                                                            [<1>]
--soloCBlen <string>  STARsolo param:  See --soloCBlen in STARsolo manual                                                                [<16>]
--soloUMIstart <string>  STARsolo param:  See --soloUMIstart in STARsolo manual                                                          [<17>]
--soloUMIlen <string>  STARsolo param:  See --soloUMIlen in STARsolo manual                                                              [<10>]
--soloInputSAMattrBarcodeSeq <string>  STARsolo param:  See --soloInputSAMattrBarcodeSeq in STARsolo manual                              [<CR UR>]

For FASTQ input with the alignment option 'STAR', 'KB', or 'Alevin', run:

perl scvh_map_reads.pl -t num_threads -o output_dir vmh_genome_dir R2.fastq.gz R1.fastq.gz

where num_threads is a user-specified integer giving the number of threads to run with, output_dir is a user-specified directory for the outputs, vmh_genome_dir is the pre-generated viral (and microbial) host (human) reference set directory from step 0, and R2.fastq.gz and R1.fastq.gz are the input 10x scRNA-seq reads. Note the argument order: R2 comes before R1.
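
As an illustration of combining options from the usage text, the sketch below prints (but does not run) a step 1 command for a FASTQ pair, choosing -s/--soloStrand from the 10x protocol as described in the options table (Reverse for 5', Forward for 3'). The function names and the protocol labels 5prime and 3prime are assumptions made for this example; all paths are placeholders.

```shell
# Hypothetical helper: map a 10x protocol label to the -s/--soloStrand value
# given in the options table (5' -> Reverse, 3' -> Forward).
solo_strand() {
    case "$1" in
        5prime) echo Reverse ;;
        3prime) echo Forward ;;
        *) echo "unknown protocol: $1 (expected 5prime or 3prime)" >&2; return 1 ;;
    esac
}

# Hypothetical helper: print the step 1 command for a FASTQ pair aligned with STARsolo.
# Arguments: protocol, threads, output_dir, vmh_genome_dir, R2, R1.
fastq_map_cmd() (
    strand=$(solo_strand "$1") || exit 1
    echo "perl scvh_map_reads.pl -a STAR -s $strand -t $2 -o $3 $4 $5 $6"
)
```

Calling `fastq_map_cmd 3prime 8 output_dir vmh_genome_dir R2.fastq.gz R1.fastq.gz` prints a command line that can be reviewed and then pasted into the terminal.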

For option 'CellRanger', run:

perl scvh_map_reads.pl -t num_threads -o output_dir vmh_genome_dir sample fastqs

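where sample and fastqs are the values of the two cellranger count arguments --sample and --fastqs. Because the default alignment method is 'STAR', select this mode with -a CellRanger. See the cellranger count documentation for the FASTQ and sample naming rules.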

For BAM input, only STARsolo is supported. Run:

perl scvh_map_reads.pl -t num_threads -o output_dir vmh_genome_dir your_bam_file.bam
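
When several BAM files need to be processed, one possible approach is to generate one command per file, each with its own output directory. The sketch below only prints the commands. bam_map_cmds is a hypothetical helper, and naming each output directory after the BAM file is a choice made for this example, not a Vulture convention.

```shell
# Hypothetical helper: print one step 1 command per *.bam file in a directory.
# Arguments: bam_dir, vmh_genome_dir, output root directory, threads.
bam_map_cmds() (
    bam_dir="$1"
    genome="$2"
    out_root="$3"
    threads="$4"
    for bam in "$bam_dir"/*.bam; do
        [ -e "$bam" ] || continue          # the glob matched nothing
        sample=$(basename "$bam" .bam)     # e.g. sampleA.bam -> sampleA
        echo "perl scvh_map_reads.pl -t $threads -o $out_root/$sample $genome $bam"
    done
)
```

The printed lines can be inspected first and then executed, for example by redirecting them into a script file.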

2. Filter the mapped UMIs using EmptyDrops to get the viral (and microbial) host filtered UMI counts matrix and also output viral genes and barcodes info files:

Usage: Rscript scvh_filter_matrix.r output_dir sample_name

where output_dir is the output directory used in step 1 and sample_name is an optional user-specified tag used as a prefix for the output files.

3. (Optional; STARsolo or CellRanger only) Output quality control metrics for the post-EmptyDrops viral and microbial supporting reads in the BAM file:

Usage: perl scvh_analyze_bam.pl output_dir sample_name
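
Steps 1 to 3 can be chained for a single sample. The following is a rough sketch under stated assumptions: run and vulture_local are hypothetical helpers, the DRY_RUN variable is introduced only for this example, and the default STARsolo aligner is used so that the optional step 3 applies. With DRY_RUN=1 (the default here) the commands are printed rather than executed.

```shell
# Hypothetical helper: print the command when DRY_RUN=1 (default), otherwise run it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: chain steps 1-3 for one sample, stopping at the first failure.
# Arguments: vmh_genome_dir, R2, R1, output_dir, sample_name, [threads, default 1].
vulture_local() (
    genome="$1"
    r2="$2"
    r1="$3"
    outdir="$4"
    sample="$5"
    threads="${6:-1}"
    run perl scvh_map_reads.pl -t "$threads" -o "$outdir" "$genome" "$r2" "$r1" || exit 1
    run Rscript scvh_filter_matrix.r "$outdir" "$sample" || exit 1
    run perl scvh_analyze_bam.pl "$outdir" "$sample" || exit 1
)
```

Running `vulture_local vmh_genome_dir R2.fastq.gz R1.fastq.gz output_dir sample1 8` lists the three commands in order; setting DRY_RUN=0 beforehand executes them instead, assuming the Vulture scripts are in the current directory.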