MIT License

Automated de-novo assembly pipeline (AutDeNovo)

The purpose of this repository is to provide a simple yet state-of-the-art de-novo assembly pipeline for non-human (museum) samples based on high-quality short-read Illumina sequencing data, PacBio long reads based on SMRTcell technology, or Oxford Nanopore long-read sequencing data after high-accuracy basecalling. The workflow allows each data type to be used separately or in any combination. The QC and Illumina-specific steps of this pipeline are largely based on the workflow kindly provided by Tilman Schell from the LOEWE Centre for Translational Biodiversity Genomics in Frankfurt. Note that this pipeline is specifically tailored to our server infrastructure and can thus only be run at the NHM without further modifications.

Input

The pipeline requires these two obligatory input parameters (see the example command below):

Name The sample name (e.g. Name=SomeFish)

OutputFolder The path to the output folder (e.g. OutputFolder=Test/SomeFish)

In addition, you need to provide the path(s) to at least one input dataset, which can be either (1) high-quality short-read Illumina sequencing data, (2) PacBio long reads based on SMRTcell technology, or (3) Oxford Nanopore long-read sequencing data after high-accuracy basecalling:

Illumina Paired-end reads, passed as Fwd= and Rev= (e.g. Fwd=Test/subset/Illumina/Garra474_1.fq.gz and Rev=Test/subset/Illumina/Garra474_2.fq.gz)

Oxford Nanopore Passed as ONT=, the path to the ONT data (e.g. ONT=Test/subset/ONT)

Pacific Biosciences Passed as PB=, the path to the PacBio data (e.g. PB=Test/subset/PacBio)

In addition, there are multiple optional parameters that can be set:

(1) Threads The total number of cores to be used for parallel computation (threads=).

(2) RAM The total amount of RAM to be reserved for the parallel computation of all analyses except the de-novo assembly (RAM=).

(3) RAM for the de-novo assembly The total amount of RAM to be reserved for the de-novo assembly (RAMAssembly=).

(4) Decontamination with Kraken You can optionally remove contamination with bacterial and human DNA from your raw reads using the Kraken pipeline (decont=; see below). However, my experience has shown that this approach results in a high false-positive rate (wrongly detected human contamination) when analysing vertebrate data.

(5) Estimate ploidy with SmudgePlot You can optionally estimate the ploidy of your sample using the program SmudgePlot (SmudgePlot=; see below). However, this step is EXTREMELY memory-hungry and cannot be parallelized. These analyses may therefore require days or even weeks to finish; use with caution!

(6) BUSCO You can optionally provide the name of the BUSCO database to be used for the BUSCO analyses during quality control, which evaluate the quality and completeness of the de-novo assembly (BuscoDB=).

(7) Trimming You can optionally choose between four different trimming programs for Illumina data: (1) Atria, (2) FastP, (3) TrimGalore and (4) UrQt (Trimmer=). By default, TrimGalore is used.

(8) Base Quality You can optionally choose the minimum PHRED-scaled base quality for trimming (BaseQuality=).

(9) Minimum Read Length You can optionally choose the minimum read length after trimming (MinReadLen=). If a read is shorter than this threshold, the whole read-pair is discarded.

(10) Polishing You can optionally polish the raw contigs with Racon (Racon=; see below).

Command

The pipeline is a simple shell script that executes a series of sub-shell scripts, which serially send jobs to OpenPBS. A typical command line looks like this:

## get repository
git clone https://github.com/nhmvienna/AutDeNovo

## change to repository folder
cd AutDeNovo

## run pipeline on test dataset

./AutDeNovo.sh \
  Name=SomeFish \
  OutputFolder=Test/SomeFish \
  Fwd=Test/subset/Illumina/Garra474_1.fq.gz \
  Rev=Test/subset/Illumina/Garra474_2.fq.gz \
  ONT=Test/subset/ONT \
  PB=Test/subset/PacBio \
  threads=10 \
  RAM=20 \
  RAMAssembly=20 \
  decont=no \
  SmudgePlot=no \
  BuscoDB=vertebrata_odb10 \
  Trimmer=TrimGalore \
  MinReadLen=85 \
  BaseQuality=20 \
  Racon=4

Pipeline

Below, you will find a brief description of the consecutive analysis steps in the pipeline, as well as links to further literature.

(1) Trimming and base quality control

In the first step, the pipeline uses trim_galore for quality trimming and quality control if the input is Illumina sequencing data. More specifically, Trim Galore is a Perl-based wrapper around Cutadapt and FastQC that consistently applies adapter and quality trimming to FastQ files. The parameters used by AutDeNovo are (1) automated adapter detection, (2) a minimum base quality of 20 and (3) a minimum read length of 85. This means that terminal bases with base quality <20 are trimmed away and that only intact read-pairs with a minimum length of 85 bp each are retained. After that, FastQC generates an HTML output for a visual inspection of the quality of the filtered and trimmed data. A browser window will automatically load once the pipeline is finished. Please check the trim_galore and Cutadapt documentation for more details about the trimming algorithm. In the case of ONT and PacBio data, the quality control is carried out by NanoPlot. Optionally, three other trimming programs can be used instead: Atria, FastP, or UrQt.
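For illustration, a minimal sketch of what such a Trim Galore call looks like with the parameters described above; the file names and output folder are placeholders, not the pipeline's actual paths:

## Sketch only: adapter and quality trimming of paired-end Illumina reads
## with Trim Galore (file names and output folder are placeholders)
trim_galore --paired --quality 20 --length 85 --fastqc --gzip \
  --output_dir results/trimmed \
  reads_1.fq.gz reads_2.fq.gz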

(2) [Optional] Detection of microbial and human contamination

In a second step, the pipeline invokes kraken2, which was originally conceived for metagenomic analyses, to detect and remove contamination with human or microbial DNA in the filtered reads. In brief, Kraken2 is a taxonomic classification system that uses exact k-mer matches to a reference database for high accuracy and fast classification speeds. This step retains only decontaminated reads, i.e. reads without hits in the reference database. After the pipeline is completed, Firefox will load the browser-based application Pavian, which allows visualizing the Kraken output, i.e. <out>/results/kraken_reads/<name>.report, where <out> is the output directory and <name> is the sample name (see above). However, my experience has shown that this approach results in a high false-positive rate (wrongly detected human contamination) when analysing vertebrate data. Use with caution!
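For illustration, a hedged sketch of a Kraken2 decontamination call along these lines; the database path, file names and thread count are placeholders:

## Sketch only: classify read-pairs against a reference database and keep
## the unclassified (i.e. presumably uncontaminated) reads
## (database path and file names are placeholders)
kraken2 --db /path/to/kraken_db \
  --threads 10 \
  --paired \
  --unclassified-out SomeFish_decont#.fq \
  --report results/kraken_reads/SomeFish.report \
  reads_1.fq.gz reads_2.fq.gz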

(3) Genome-size estimation

After that, the pipeline uses Jellyfish to count k-mers in the filtered and decontaminated reads and GenomeScope to estimate the approximate genome size. Conceptually, the number of unique k-mers and their coverage allow a rough estimation of the genome size, which is a function of the sequencing depth and the number of unique k-mers. Specifically, for a given sequence of length L and a k-mer size of k, the total number of possible k-mers N is given by N = (L - k) + 1. For genomes > 1 Mb, N (theoretically) converges to the true genome size. Since the genome is usually sequenced at more than 1-fold coverage, the total k-mer count further needs to be divided by the coverage. However, the average coverage is usually unknown and is influenced by sequencing errors, heterozygosity and repetitive elements. The average coverage is thus estimated from the empirical coverage distribution at unique k-mers. For a more detailed description, see this great tutorial. GenomeScope calculates the estimated genome size, the prospective ploidy level and the ratio of unique to repetitive sequences in the genome. These are summarized in a histogram plot and a summary text file. Optionally, the program smudgeplot will additionally estimate the ploidy of the dataset based on the distribution of coverages. However, this step is very memory- and time-intensive. Use with caution!
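For illustration, a minimal sketch of how k-mer counting and genome-size estimation can be run along these lines; the k-mer size (21), hash size and file names are assumptions, and GenomeScope 2.0 is used here as an example:

## Sketch only: count canonical 21-mers, export the k-mer coverage
## histogram and estimate the genome size from it
## (k-mer size, hash size and file names are assumptions)
jellyfish count -C -m 21 -s 1G -t 10 -o mer_counts.jf <(zcat reads_1.fq.gz reads_2.fq.gz)
jellyfish histo -t 10 mer_counts.jf > reads.histo
genomescope2 -i reads.histo -k 21 -o results/genomescope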

(4) De-novo assembly

If Illumina high-quality sequencing data are available, the de-novo assembly is based on SPAdes with standard parameters, using the trimmed and decontaminated reads. If a combination of Illumina high-quality sequencing data and long-read data (ONT and/or PacBio) is available, the pipeline will nevertheless employ SPAdes for the de-novo assembly based on the Illumina reads; in this case, the long-read sequencing data are used for scaffolding. See here for more details on how SPAdes works. Conversely, the pipeline employs the Flye assembler with standard parameters for either ONT or PacBio long-read sequencing data alone, or for a combination of both. If ONT and PacBio reads are processed jointly, the pipeline follows the best-practice recommendations and uses the ONT data for the initial assembly and the PacBio data for polishing.
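For illustration, hedged sketches of the two assembly routes; all file names and parameter values are placeholders:

## Sketch only (file names and parameter values are placeholders)

## Illumina data (optionally with long reads for scaffolding): SPAdes
spades.py -1 trimmed_1.fq.gz -2 trimmed_2.fq.gz \
  --nanopore ONT/reads.fq.gz \
  -t 10 -m 20 -o results/assembly

## long-read data only: Flye (here with ONT reads)
flye --nano-raw ONT/reads.fq.gz --threads 10 --out-dir results/assembly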

(5) [Optional] Contig polishing with Racon

As an optional final step, the raw contigs can be polished based on the raw reads using Racon (see the sketch below). This step can be iterated multiple times by setting the parameter Racon=<n>, where <n> is the number of iterations. For example, if you choose Racon=3, the Racon polishing will be repeated three times. If multiple data types are provided for the assembly, only one read type will be used for polishing, in the following order of preference: ONT > PB > ILL.
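For illustration, a minimal sketch of iterative Racon polishing with ONT reads; the file names and the number of iterations are assumptions:

## Sketch only: three rounds of Racon polishing with ONT reads; each round
## re-maps the reads to the current assembly with minimap2 and passes the
## overlaps to Racon (file names are placeholders)
READS=ONT/reads.fq.gz
ASM=results/assembly/contigs.fasta
for i in 1 2 3; do
  minimap2 -x map-ont ${ASM} ${READS} > overlaps_${i}.paf
  racon ${READS} overlaps_${i}.paf ${ASM} > polished_${i}.fasta
  ASM=polished_${i}.fasta
done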

(6) Assembly QC

After the assembly is finished, the quality of the assembled genome will be assessed with a combination of different analysis tools as described below:

(a) Quast

Summary statistics of the assembly, i.e. number of contigs, N50, N90, etc., with QUAST (see the sketch below). Check out the QUAST manual for more details. More details on the pipeline will be added soon.
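For illustration, a sketch of the corresponding QUAST call; file names and thread count are placeholders:

## Sketch only: compute assembly summary statistics such as the number of
## contigs, N50 and N90 (file names are placeholders)
quast.py results/assembly/scaffolds.fasta -t 10 -o results/quast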

(b) BLAST

Each contig is BLASTed against a local copy of the NCBI nt database. Only the top 10 hits with e-value < 1e-25 are retained for each contig, which allows estimating off-target DNA contamination in the library. The result table is quantified with BlobTools (see below).
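For illustration, a hedged sketch of such a BLAST search; the database path and file names are placeholders:

## Sketch only: report the top 10 hits per contig with e-value < 1e-25 in a
## tabular format including taxon IDs, as used by BlobTools
## (database path and file names are placeholders)
blastn -query results/assembly/scaffolds.fasta \
  -db /path/to/nt \
  -evalue 1e-25 \
  -max_target_seqs 10 \
  -outfmt '6 qseqid staxids bitscore std' \
  -num_threads 10 \
  -out results/blast/SomeFish_blastn.out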

(c) BUSCO

Detection of Benchmarking Universal Single-Copy Orthologs (BUSCO) from all vertebrates (by default vertebrata_odb10) in the de-novo assembled scaffolds to estimate the completeness of the assembly. The result table is quantified with BlobTools (see below). Moreover, the resulting FASTA files containing the amino-acid sequences of the orthologs can be used for other downstream analyses (see below).
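For illustration, a sketch of a corresponding BUSCO run; file and output names are placeholders:

## Sketch only: assess assembly completeness against the vertebrata_odb10
## lineage in genome mode (file and output names are placeholders)
busco -i results/assembly/scaffolds.fasta \
  -l vertebrata_odb10 \
  -m genome \
  -c 10 \
  -o SomeFish_busco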

(d) Mapping

The trimmed reads are remapped to the assembled scaffolds with bwa (Illumina reads) or minimap2 (ONT and PacBio reads) to investigate the variation in read depth among the different scaffolds, which may indicate off-target DNA contamination in the library. These results are quantified with BlobTools (see below).
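For illustration, hedged sketches of the two mapping routes, producing sorted and indexed BAM files; all file names are placeholders:

## Sketch only (file names are placeholders)

## Illumina reads: bwa
bwa index results/assembly/scaffolds.fasta
bwa mem -t 10 results/assembly/scaffolds.fasta trimmed_1.fq.gz trimmed_2.fq.gz | \
  samtools sort -@ 10 -o results/mapping/illumina.bam
samtools index results/mapping/illumina.bam

## ONT/PacBio reads: minimap2 (here with ONT reads)
minimap2 -ax map-ont results/assembly/scaffolds.fasta ONT/reads.fq.gz | \
  samtools sort -@ 10 -o results/mapping/ont.bam
samtools index results/mapping/ont.bam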

(e) BlobTools

BlobTools allows a quantitative analysis of assembly quality based on the variation in read depth, GC content and taxonomic assignment of each scaffold. The summary plots and tables are accessible through an interactive browser window. Note that port 8001 may occasionally already be occupied. In this case, retry loading the HTML page with increasing port numbers, e.g. 8002, 8003, etc.

Output

After the pipeline is finished, the scaffolds of the de-novo assembly and the most important QC outputs will be copied to the /output folder. These output files include:

In addition, various HTML-based results can be loaded in Firefox. The file HTML_outputs.sh in the /output folder contains all commands needed to load the HTML output in Firefox at a later time point. Moreover, the full pipeline, including all commands, is written to a file named pipeline.sh in the /shell folder. In particular, this file allows repeating individual analyses of the whole pipeline.

ChangeLog

v.2.3 (16/01/2023)

Minor update with several improvements

v.2.2 (15/10/2022)

Minor update with several improvements

v.2.1 (10/08/2022)

Minor update with several improvements

v.2.0 (10/06/2022)

Major update with several improvements

v.1.0 (08/03/2022)

This is just the very first version and I will implement more functionality in the near future. Additional features may include:

Potential future updates

Please let me know if you have further ideas or need help by posting an issue in this repository.