mrmckain / Fast-Plast

Automated de novo assembly of whole chloroplast genomes.
MIT License

Fast-Plast: Rapid de novo assembly and finishing for whole chloroplast genomes

Authors: Michael R. McKain, Mark Wilson

Version 1.2.9

Contact: https://github.com/mrmckain

Description

Fast-Plast is a pipeline that leverages existing and novel programs to quickly assemble, orient, and verify whole chloroplast genome sequences. For most datasets with sufficient data, Fast-Plast is able to produce a full-length de novo chloroplast genome assembly in less than 10 minutes with no user mediation. In addition to a chloroplast sequence, Fast-Plast identifies chloroplast genes present in the final assembly.

Currently, Fast-Plast is written to accommodate Illumina data, although most data types could be used.

Fast-Plast uses a de novo assembly approach by combining the De Bruijn graph-based method of SPAdes with an iterative seed-based assembly implemented in afin to close gaps of contigs with low coverage. The pipeline then identifies regions from the quadripartite structure of the chloroplast genome, assigns identity, and orders them according to standard convention. A coverage analysis is then conducted to assess the quality of the final assembly.

Dependencies

Fast-Plast requires a number of commonly used bioinformatics programs. We have included an installation script to help users properly prepare Fast-Plast for use.

Coverage Analysis

Memory requirements will vary with the size of your data set. Expect to need 1.5-2x the size of your read files in memory. If your data set is exceptionally large, we have had success reducing it to 5-10 million reads before running it through Fast-Plast.
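As a rough planning aid, the rule of thumb above can be turned into a quick calculation. This is only a sketch and is not part of Fast-Plast: the function names are hypothetical, and it assumes that the size of the read files means their size on disk as passed to the pipeline.

```python
import os


def total_read_bytes(paths):
    """Sum the on-disk sizes (in bytes) of the given read files."""
    return sum(os.path.getsize(path) for path in paths)


def memory_range_gb(read_bytes, low_factor=1.5, high_factor=2.0):
    """Apply the 1.5-2x rule of thumb; returns (low, high) in GiB."""
    gib = read_bytes / 1024 ** 3
    return gib * low_factor, gib * high_factor
```

For example, `memory_range_gb(total_read_bytes(["R1.fastq.gz", "R2.fastq.gz"]))` (file names are placeholders) gives a range to compare against the memory available on your machine.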

Installation

Fast-Plast has been tested on Linux (CentOS 7) but should be compatible with any distribution that supports the dependencies.

Clone the GitHub repository:

git clone https://github.com/mrmckain/Fast-Plast.git

To install, run the INSTALL.pl script found in the Fast-Plast repository.

 perl INSTALL.pl

The installation script will walk you through installation. If you already have the dependencies installed, you will need to give the full path to the executables. The installation script can also install all dependencies for you. To do this, select "All" when prompted. Dependencies will be installed in Fast-Plast/bin/. This script will also change paths in the control script, compile afin and jellyfish2, and unzip the default chloroplast genomes for mapping.

For advanced users, paths can be set directly in the fast-plast.pl file:

###directories
my $FPROOT = "$FindBin::RealBin";
my $AFIN_DIR = "$FPROOT/afin";
my $COVERAGE_DIR = "$FPROOT/Coverage_Analysis";
my $FPBIN = "$FPROOT/bin";
my $TRIMMOMATIC;
my $BOWTIE2;
my $SPADES;
my $BLAST;
my $SSPACE;
my $BOWTIE1;
my $JELLYFISH;

Instructions for direct compilation of afin can be found here.

Input

Reads

Input files are in FASTQ format. The data do not need to be adapter trimmed or quality filtered beforehand, though pre-trimmed data will not impede the assembly.

Fast-Plast was built for genome survey sequence (aka genome skimming or low-pass genome sequencing) data. Sequence capture data can be used but needs to be normalized first. If your data are from a sequence capture experiment (aka target enrichments or anchored phylogenomics), we suggest using the normalization method packaged with Trinity, khmer, or bbnorm.

Bowtie Index

Fast-Plast is packaged with 1,021 whole chloroplast genomes from GenBank. These cover a wide range of diversity including marine algae, angiosperms, ferns, etc. To access these, use the order of your species in the --bowtie_index option. Fast-Plast will pull all members of that order and create a bowtie index. If your order is not present, Fast-Plast will use a representative sequence from all orders present. This option can be selected using "All" or "GenBank". "All" is the default.

Example:

--bowtie_index Poales

This will use all available Poales plastomes in the data set (118).
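The selection rule described above (all members of a matching order, otherwise one representative per order) can be sketched as follows. This is an illustration only, not Fast-Plast's own code: the `catalogue` mapping and the function name are hypothetical, and taking the first listed member as the representative is an assumption.

```python
def select_references(catalogue, order="All"):
    """Pick reference plastomes for building the bowtie index.

    catalogue: hypothetical dict mapping an order name to a list of plastome IDs.
    """
    if order in catalogue:
        # The requested order is available: use every member of it.
        return list(catalogue[order])
    # "All", "GenBank", or an unknown order: one representative per order.
    # Assumption: the first listed member stands in for its order.
    return [members[0] for _, members in sorted(catalogue.items()) if members]
```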

Orders currently available in Fast-Plast (70):

Alismatales         Dipsacales          Marchantiales
Apiales             Ericales            Monomastigales
Aquifoliales        Euglenales          Myrtales
Araucariales        Eupodiscales        Nymphaeales
Asparagales         Fabales             Orthotrichales
Asterales           Fagales             Pinales
Austrobaileyales    Fragilariales       Poales
Bangiales           Fucales             Polypodiales
Brassicales         Funariales          Proteales
Bryopsidales        Garryales           Pyrenomonadales
Buxales             Gentianales         Ranunculales
Caryophyllales      Geraniales          Rosales
Celastrales         Ginkgoales          Sapindales
Chlamydomonadales   Gracilariales       Saxifragales
Chloranthales       Hypnales            Solanales
Chlorellales        Lamiales            Sphaeropleales
Cornales            Laminariales        Takakiales
Cucurbitales        Laurales            Ulvales
Cupressales         Liliales            Vaucheriales
Cyanidiales         Lycopodiales        Vitales
Cyatheales          Magnoliales         Zingiberales
Cycadales           Malpighiales        Zygnematales
Desmidiales         Malvales
Dioscoreales        Mamiellales

User Provided Bowtie Index

A user-made bowtie index can be provided using the --user_bowtie option. The full path to the index should be given. If this option is used, the --bowtie_index option will be ignored.

Name

The --name option should be used with each run. This simply gives a prefix to all files. The default is "Fast-Plast".

Output

Fast-Plast produces a number of files that allow the user to trace the steps of the pipeline. From the directory where Fast-Plast is called, three files and a new directory will be produced. The directory will be named by the --name option. The three files include:

name_Fast-Plast_Progress.log
Gives time and results for each step in the pipeline. Information regarding parameters chosen based on reads (such as kmer size) and chloroplast gene content will be found here.

name_result_out.log
Contains the STDOUT from all programs.

name__results_error.log
Contains the STDERR from all programs.

Directory Hierarchy

name_Plastome_Summary.txt
Provides information on the number of reads, reads mapped, assembly size, chloroplast region size, and average coverage of each region.

1_Trimmed_Reads

For paired-end data, three trimmed read files will be made. Files ending in trimmed_P1.fq and trimmed_P2.fq are still paired-end. The file ending in trimmed_UP.fq is single-end. If only single-end files are used, then only the trimmed_UP.fq file will be found.

2_Bowtie_Mapping

Bowtie index files created by Fast-Plast will be found in this directory. Reads that mapped to the bowtie index are in the files map_pair_hits.1.fq and map_pair_hits.2.fq for paired-end data and map_hits.fq for single-end data. The file name.sam is the standard bowtie2 mapping output but is not used.

3_Spades_Assembly

The directory "spades_iter1" will contain the SPAdes assembly and standard SPAdes output files.

4_Afin_Assembly

The file "filtered_spades_contigs.fsa" contains contigs from the SPAdes assemblies that fall within the range of minus one standard deviation of the weighted mean coverage to plus 2.5 standard deviations.
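The coverage window used for this filter can be illustrated with a short calculation. The sketch below is not the pipeline's own code: the function names are hypothetical, and it assumes that the mean and standard deviation are weighted by contig length.

```python
import math


def coverage_window(contigs, below_sd=1.0, above_sd=2.5):
    """Return (low, high) coverage bounds for (name, length, coverage) tuples."""
    total_length = sum(length for _, length, _ in contigs)
    # Assumption: each contig's coverage is weighted by its length.
    mean = sum(length * cov for _, length, cov in contigs) / total_length
    variance = sum(length * (cov - mean) ** 2 for _, length, cov in contigs) / total_length
    sd = math.sqrt(variance)
    return mean - below_sd * sd, mean + above_sd * sd


def filter_contigs(contigs):
    """Keep only contigs whose coverage lies inside the window."""
    low, high = coverage_window(contigs)
    return [contig for contig in contigs if low <= contig[2] <= high]
```

The window is asymmetric, which tolerates contigs with higher than average coverage (such as the inverted repeat, which is present in two copies) while discarding low-coverage contigs that are more likely to be non-plastid.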

afin outputs the files _afin_iter0.fa, _afin_iter1.fa, _afin_iter2.fa, and _afin.log. The log file records the steps afin took in the extension and assembly process.

Chloroplast_gene_composition_of_afin_contigs_nested_removed.txt contains information regarding the chloroplast gene content of each contig found in *_afin_iter2.fa.

If only single end data were used and more than one contig is found, Fast-Plast will quit here. The final output will be available in Final_Assembly.

Scaffolding

If more than one contig is found in the final afin assembly and paired-end reads were used, SSPACE will be invoked to attempt scaffolding of contigs. These results will be found here. If more than one contig/scaffold is present in the final output, this will be the last step of the pipeline and results will be found in Final_Assembly.

5_Plastome_Finishing

If a single contig was found through either contig assembly or scaffolding, this directory will be created to contain files associated with identification of the large single copy, small single copy, and inverted repeats. If these are not present, this will be the last step of the pipeline and results of the single (unorientated) contig from the assembly steps will be found in Final_Assembly.

Final_Assembly

The final assembly will be found in this directory. If the pipeline was able to fully assemble and orientate the plastome, the files _CP_pieces.fsa (plastome split into LSC, SSC, and IR), _FULLCP.fsa (final assembly), and Chloroplast_gene_composition_of_final_contigs.txt (all chloroplast genes found in final assembly) will be present. The file Chloroplast_gene_composition_of_final_contigs.txt will always be made for the final assembly regardless of where Fast-Plast stops.
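When processing many samples, it can help to check programmatically how far each run got. The following is a minimal sketch, not a Fast-Plast utility: the function name and the three status labels are hypothetical, and it assumes that Final_Assembly sits directly inside the run directory named by --name.

```python
from pathlib import Path


def assembly_status(run_dir):
    """Classify a run by the files present in its Final_Assembly directory."""
    final_dir = Path(run_dir) / "Final_Assembly"
    if not final_dir.is_dir():
        return "missing"
    # Both files are only expected when the plastome was fully assembled and oriented.
    has_full = any(final_dir.glob("*_FULLCP.fsa"))
    has_pieces = any(final_dir.glob("*_CP_pieces.fsa"))
    return "complete" if has_full and has_pieces else "partial"
```

A "partial" result means the pipeline stopped at an earlier step, so the contents of Final_Assembly should be inspected by hand.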

Coverage_Analysis

The coverage analysis option should also be used to ensure accurate assembly of the plastome. Multiple files associated with the coverage estimation process will be present in this directory. The three most important files are:

name.coverage_25kmer.txt
Contains 25-mer sequence, start position, and coverage across final assembly.

name_coverage.pdf
Graphical representation of name.coverage_25kmer.txt. Red circles indicate a coverage of 0 and potential assembly issue.

name_problem_regions_plastid_assembly.txt
Identifies stretches of the assembly longer than 25 base pairs that have a coverage of 0. If this file is empty, the assembly is accepted.

Other files include the mapped reads from the Bowtie2 run (map_hits*) and those associated with Bowtie2 and Jellyfish.
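The idea behind the problem-region report can be sketched in a few lines. This is an illustration under stated assumptions rather than the code Fast-Plast runs: the function name is hypothetical, coverage is given as a plain list with one value per start position, and a position counts as low when its coverage is 0 or falls below an optional minimum.

```python
def problem_regions(coverage, min_coverage=0.0, min_length=25):
    """Return (start, end) half-open intervals of low coverage longer than min_length."""
    regions = []
    start = None
    for pos, cov in enumerate(coverage):
        low = cov == 0 or cov < min_coverage
        if low and start is None:
            start = pos  # a low-coverage stretch begins here
        elif not low and start is not None:
            if pos - start > min_length:
                regions.append((start, pos))
            start = None
    # Close a stretch that runs to the end of the assembly.
    if start is not None and len(coverage) - start > min_length:
        regions.append((start, len(coverage)))
    return regions
```

An empty list corresponds to an empty problem-regions file, that is, an accepted assembly.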

4.5_Reassemble_Low_Coverage

If regions of low coverage are identified after the Coverage Analysis, then these regions are removed, the contig broken into pieces, and reassembled from the afin step. All of the reassembly steps (Afin Assembly and Plastome Finishing) will be conducted in this directory.

Final_Assembly_Fixed_Low_Coverage

The final reassembled plastome, chloroplast regions, and chloroplast gene recovery will be in this directory.

Coverage_Analysis_Reassembly

Coverage analysis and results of the reassembled plastome will be in this directory.

Usage

General Syntax

fast-plast.pl [-1 -2 || --single ] --name [options]

Example with Paired-End Data

perl fast-plast.pl -1 /home/mmckain/Sequence_Vault/Washburn_Data/37_Urochloa_fusca_42940_RPGH_AGTCAA_L005_R1_001.fastq.gz -2 /home/mmckain/Sequence_Vault/Washburn_Data/37_Urochloa_fusca_42940_RPGH_AGTCAA_L005_R2_001.fastq.gz --name Urochloa_fusca-37 --bowtie_index Poales --coverage_analysis --clean light

In this example, one paired-end library is being used for assembly. The default adapters (NEB) are used for trimming, Poales species are used for the Bowtie2 index, the coverage analysis is invoked, and a light cleaning is done after completion.

Example with Single End Data

perl fast-plast.pl --single /home/mmckain/Sequence_Vault/Andropogoneae_GSS/Chionachne_koenigii-TK057/K17_GTCCGC_L006_R1_001.fastq.gz --name Chionachne_koenigii-TK057 --adapters TruSeq --bowtie_index All --coverage_analysis

In this example, one single end library is being used for assembly. The TruSeq adapters are used for trimming, a representative from each order is used for the Bowtie2 index, the coverage analysis is invoked, and no cleaning is done.

Example with Mixed Libraries

perl fast-plast.pl -1 /home/mmckain/Sequence_Vault/Andropogoneae_GSS/Monocymbium_ceresiiforme-TK203/TK203_GCTACGCT-AGAGTAGA_Both_R1.fastq.gz -2 /home/mmckain/Sequence_Vault/Andropogoneae_GSS/Monocymbium_ceresiiforme-TK203/TK203_GCTACGCT-AGAGTAGA_Both_R2.fastq.gz --single /home/mmckain/Sequence_Vault/Andropogoneae_GSS/Monocymbium_ceresiiforme-TK203/Monocymbium_ceresiiforme-TK203_GA_122949-34_S42_R1_001.fastq --name Monocymbium_ceresiiforme-TK203 --user_bowtie /home/mmckain/Andropogoneae_Plastomes/FINISHED/androplast --coverage_analysis --clean deep

In this example, one single end library and one paired-end library are being used for assembly. The default adapters (NEB) are used for trimming, a user-defined Bowtie2 index (base name given) is used for the Bowtie2 index, the coverage analysis is invoked, and a deep cleaning is done after completion.

Example with Coverage Analysis Only

perl ~/bin/Fast-Plast/fast-plast.pl -1 /home/mmckain/Sequence_Vault/MSU_HiSeq4000_05032017/20170428_DNASeq_PE/20170428_DNASeq_PE/TK686R_S48_L005_R1_001.fastq.gz -2 /home/mmckain/Sequence_Vault/MSU_HiSeq4000_05032017/20170428_DNASeq_PE/20170428_DNASeq_PE/TK686R_S48_L005_R2_001.fastq.gz --name Schizachyrium_scoparium-TK686R --only_coverage /home/mmckain/DASH_Phylogeny/GSS_Plastomes/Schizachyrium_scoparium-TK686R/Schizachyrium_scoparium-TK686R/Final_Assembly/Schizachyrium_scoparium-TK686R_FULLCP.fsa --min_coverage 2 &

In this example, a paired end library is being used. The --only_coverage option is used with the path to the fasta file of the chloroplast genome provided. A minimum coverage of 2 is used.

Definitions:

    -1 <filenames>          File with forward paired-end reads. Multiple files can be designated with a comma-delimited list.
                            Read files should be in matching order with other paired-end files.
    -2 <filenames>          File with reverse paired-end reads. Multiple files can be designated with a comma-delimited list.
                            Read files should be in matching order with other paired-end files.
    --single <filenames>    File with unpaired reads. Multiple files can be designated with a comma-delimited list.

    **PAIRED END AND SINGLE END FILES CAN BE PROVIDED SIMULTANEOUSLY.**

    -n <sample_name>        Name for current assembly. We suggest a species name/accession combination as Fast-Plast will use
                            this name as the FASTA ID in the final assembly. [Default = Fast-Plast]

    Advanced options:

    --min_length_trim       Acceptable minimum length for reads after trimming for adapters and quality. [Default = 140]
    --subsample             Number of reads to subsample. Reads will be evenly pulled from all files.
    --threads               Number of threads used by Fast-Plast. [Default = 4]
    --min_coverage          Lowest acceptable coverage for 25-mer sliding window during coverage analysis. [Default = 0.25 * Average coverage]
    --min_filter_spades     Minimum coverage allowed for SPAdes contig to be passed to afin. Only recommended to change if default
                            is not working. [Default = one standard deviation below the weighted mean coverage of contigs]
    --adapters              [NEB|Nextera|TruSeq] Files of adapters used in making sequencing library. NEB, Nextera, and TruSeq options
                            available. Also accepts the path to a user created FASTA file of adapters. [Default = NEB]
    --bowtie_index          Taxonomic order of the sequenced species to pick references for bowtie2 indices. If the order is in the
                            database, then all available samples for that order will be used. If order does not exist in database or
                            the terms "all" or "GenBank" are given, one exemplar from each available order is used to build the
                            Bowtie2 indices. Users may also specify multiple taxa separated by commas (","). Any of the following
                            taxonomic levels is accepted for bowtie_index: genus, species epithet, tribe (if applicable),
                            subfamily (if applicable), family, and order. [Default = "All"]
    --user_bowtie           User supplied bowtie2 indices. If this option is used, bowtie_index is ignored.
    --coverage_analysis     Flag to run the coverage analysis of a final chloroplast assembly. [Recommended]
    --skip                  Flag to skip trimming. Must include option "trim". [--skip trim]
    --only_coverage         Option allows user to run coverage analysis directly on a provided chloroplast genome.
                            [Requires: read files, chloroplast genome sequence]
    --clean                 [light|deep] The "light" option will remove all bowtie indices, BLAST databases, SAM files,
                            Jellyfish dumps, and Jellyfish kmer files. The "deep" option will remove all directories except for the
                            Final Assembly and Coverage Analysis directories. All files in the "light" option will also be removed.
                            Clean will only be invoked if a fully successful assembly is made.
    --posgenes              User defined genes for identification of single copy/IR regions and orientation. Useful when major
                            rearrangements are present in user plastomes. This is a fairly advanced option; please contact the
                            authors if you are interested in using it.
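For batch runs, it can be convenient to assemble the command line from lists of files. The sketch below is a hypothetical helper, not something shipped with Fast-Plast. It only uses options documented above, joins multiple files into the comma-delimited form, and checks that forward and reverse lists have matching lengths; the location of fast-plast.pl is an assumption you will need to adjust.

```python
def build_command(name, forward=(), reverse=(), single=(), bowtie_index=None,
                  coverage_analysis=True, clean=None, script="fast-plast.pl"):
    """Build an argument list for fast-plast.pl from documented options."""
    if len(forward) != len(reverse):
        raise ValueError("forward and reverse read files must be given in matching pairs")
    if not forward and not single:
        raise ValueError("at least one paired-end or single-end read file is required")
    if clean not in (None, "light", "deep"):
        raise ValueError("clean must be 'light' or 'deep'")

    cmd = ["perl", script]
    if forward:
        # Multiple files are passed as comma-delimited lists in matching order.
        cmd += ["-1", ",".join(forward), "-2", ",".join(reverse)]
    if single:
        cmd += ["--single", ",".join(single)]
    cmd += ["--name", name]
    if bowtie_index:
        cmd += ["--bowtie_index", bowtie_index]
    if coverage_analysis:
        cmd.append("--coverage_analysis")
    if clean:
        cmd += ["--clean", clean]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems with long file paths.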

Changelog

References