hallamlab / FabFos

A python-driven fosmid processing pipeline
GNU General Public License v3.0
1 stars 3 forks source link

FabFos: an automated pipeline for resolving inserts from pooled fosmid DNA

Tony Liu, Mahan Rafiee, Connor Morgan-Lang, Joe Ho, Avery Noonan, Kateryna Ievdokymenko, Zach Armstrong, Steven Hallam

For the impatient

conda install -c hallamlab \
    -c imperial-college-research-computing \
    -c bioconda -c conda-forge fabfos

# minimal
fabfos --output ./example_out \
    --interleaved --reads /.../interleaved.fastq \
    --background /.../host_background_genome.fasta \
    --pool-size 384

# full
fabfos --output ./example_out \
    --threads 8 \
    --verbose \
    --assembler megahit \
    --reads /.../forward_reads.fastq \
    --reverse /.../reverse_reads.fastq \
    --parity pe \
    --background /.../host_background_genome.fasta \
    --vector /.../plasmid_backbone.fasta \
    --ends /.../end_sequences.fq \
    --ends-name-regex "\w+_\d+" \
    --ends-fw-flag "FW"

Overview:

Fabfos is a pipeline for resolving cloned inserts from pooled fosmid libraries using the following steps:

  1. Read QC
    • Filtering of the host background with BWA [7] and Samtools [4]
    • Quality trimming with Trimmomatic [3]
  2. Estimate pool size
    • Reads crossing one of the two junctions between the vector backbone and insert are prepared and fed to VSEARCH [10], which estimates the number of unique sequences as a proxy to the number of clones in the pool.
    • Alternatively, the user can provide an estimate directly.
  3. Assembly
    • Choice of Megahit [6] or Spades [9, 8, 2]
    • (Experimental) Can also use Nanopore long reads with CANU [5]
    • Fabfos calculates assembly statistics and coverage.
  4. Assessing the completeness of inserts using end sequences
    • End sequences covering both junctions between the insert and vector backbone are aligned to the assembled contigs using BLAST [1]. Contigs with mapped end sequences for both junctions are considered complete.

Install

Pick one of Conda, Singularity, Docker, or Manual.

Conda

conda install \
    -c hallamlab \
    -c imperial-college-research-computing \
    -c bioconda \
    -c conda-forge \
    fabfos

Consider using mamba, a drop in multithreaded replacement for conda

Singularity

singularity pull ./fabfos.sif docker://quay.io/hallamlab/fabfos

# example run
singularity exec \
    --bind ./:/ws \
    --workdir /ws \
    ./fabfos.sif fabfos --help

Docker

docker pull quay.io/hallamlab/fabfos

# example run
docker run -it --rm \
    -u $(id -u):$(id -g) \
    --mount type=bind,source="./",target="/ws"\
    --workdir="/ws" \
    quay.io/hallamlab/fabfos fabfos --help

Manual

Clone this repo

Install dependencies from .yml file

./dev.sh --ibase
# or
conda env create --no-default-packages -n fabfos_env -f ./envs/base.yml

Activate the environment

conda activate fabfos_env

and run the source code directly

./dev.sh -r --help
# or
cd ./src
python -m fabfos --help

Usage

Explanation of arguments

fabfos --output ./example_out \
    --threads 8 \
    --verbose \
    --assembler megahit \
    --reads /.../forward_reads.fastq \
    --reverse /.../reverse_reads.fastq \
    --parity pe \
    --background /.../host_background_genome.fasta \
    --vector /.../plasmid_backbone.fasta \
    --ends /.../end_sequences.fq \
    --ends-name-regex "\w+_\d+" \
    --ends-fw-flag "FW"

Explanation of arguments:

Expected outputs

Within the output folder specified in --output /.../NAME there will exist the following...

Some outputs may be ommitted if some inputs are not provided. For example, the end mapping tables will not exist if end sequences were not provided.

Examples

Example 1

Example 2

Example 3

Command line reference

usage: fabfos [-b BACKGROUND] [--vector VECTOR] [-n POOL_SIZE] [--nanopore_reads NANOPORE_READS] [--skip_correction]
              [-r READS] [-2 REVERSE] [-t {B,F}] [-i] [-p {pe,se}] [-a {spades_meta,spades_isolate,spades_sc,megahit}]
              [-o OUTPUT] [-T THREADS] [-e ENDS] [--ends-name-regex ENDS_NAME_REGEX] [--overwrite]
              [--ends-fw-flag ENDS_FW_FLAG] [-v] [--verbose] [-h]

Pipeline for filtering, assembling and organizing fosmid sequence information.

Required arguments:
  -b BACKGROUND, --background BACKGROUND
                        Path to fosmid background fasta
  --vector VECTOR       Path to vector backbone fasta, used for estimating pool size, required if --pool-size not
                        given
  -n POOL_SIZE, --pool-size POOL_SIZE
                        Estimate of number of fosmids in pool, required if --vector not given

Sequence read-specific arguments:
  -r READS, --reads READS
                        Path to the forward strand file or the interleaved paired-end file. Can be in either FastQ or
                        BAM format.
  -2 REVERSE, --reverse REVERSE
                        Path to the reverse-end read file (if applicable)
  -t {B,F}, --type {B,F}
                        Enter B if input type is BAM, F for FastQ. [ DEFAULT = 'F' ]
  -i, --interleaved     Flag indicating the reads are interleaved (i.e. forward and reverse pairs are in the same
                        file). [DEFAULT = False]
  -p {pe,se}, --parity {pe,se}
                        Specifying the sequencing chemistry used, either paired-end (pe) or single-end (se). [DEFAULT
                        = 'pe']

Nanopore-specific [development] options:
  --nanopore_reads NANOPORE_READS
                        A FASTA file containing nanopore reads to be used in assembly.
  --skip_correction     Do not perform error-correction of nanopore reads using proovread

Optional arguments:
  -a {spades_meta,spades_isolate,spades_sc,megahit}, --assembler {spades_meta,spades_isolate,spades_sc,megahit}
                        Genome assembly software to use. [DEFAULT = spades_isolate]
  -o OUTPUT, --output OUTPUT
                        path to temp. workspace [DEFAULT = /tmp]
  -T THREADS, --threads THREADS
                        The number of threads that can be used [DEFAULT = 8]
  -e ENDS, --ends ENDS  FASTA file containing fosmid ends - these will be used for alignment to fosmid contigs.
  --ends-name-regex ENDS_NAME_REGEX
                        regex for getting name of endseq., ex. "\w+_\d+" would get ABC_123 from ABC_123_FW
  --overwrite           overwrite the output directory if it already exists
  --ends-fw-flag ENDS_FW_FLAG
                        string that marks forward ends, ex. "_FW" would cause ABC_123_FW and ABC_123_RE to be assigned
                        to forward and reverse respectively.

Miscellaneous options:
  -v, --version         Print the FabFos version and exit.
  --verbose             Increase the level of verbosity in runtime log.
  -h, --help            Show this help message and exit

References

Please consider citing these along with Fabfos.

  1. Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J., 1990. Basic local alignment search tool. J. Mol. Biol. 215, 403–410. https://doi.org/10.1016/S0022-2836(05)80360-2

  2. Bankevich, A., Nurk, S., Antipov, D., Gurevich, A.A., Dvorkin, M., Kulikov, A.S., Lesin, V.M., Nikolenko, S.I., Pham, S., Prjibelski, A.D., Pyshkin, A.V., Sirotkin, A.V., Vyahhi, N., Tesler, G., Alekseyev, M.A., Pevzner, P.A., 2012. SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing. J. Comput. Biol. 19, 455–477. https://doi.org/10.1089/cmb.2012.0021

  3. Bolger, A.M., Lohse, M., Usadel, B., 2014. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30, 2114–2120. https://doi.org/10.1093/bioinformatics/btu170

  4. Danecek, P., Bonfield, J.K., Liddle, J., Marshall, J., Ohan, V., Pollard, M.O., Whitwham, A., Keane, T., McCarthy, S.A., Davies, R.M., Li, H., 2021. Twelve years of SAMtools and BCFtools. GigaScience 10, giab008. https://doi.org/10.1093/gigascience/giab008

  5. Koren, S., Walenz, B.P., Berlin, K., Miller, J.R., Bergman, N.H., Phillippy, A.M., 2017. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res. 27, 722–736. https://doi.org/10.1101/gr.215087.116

  6. Li, D., Liu, C.-M., Luo, R., Sadakane, K., Lam, T.-W., 2015. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics 31, 1674–1676. https://doi.org/10.1093/bioinformatics/btv033

  7. Li, H., Durbin, R., 2009. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754–1760. https://doi.org/10.1093/bioinformatics/btp324

  8. Nurk, S., Meleshko, D., Korobeynikov, A., Pevzner, P.A., 2017. metaSPAdes: a new versatile metagenomic assembler. Genome Res. 27, 824–834. https://doi.org/10.1101/gr.213959.116

  9. Prjibelski, A., Antipov, D., Meleshko, D., Lapidus, A., Korobeynikov, A., 2020. Using SPAdes De Novo Assembler. Curr. Protoc. Bioinforma. 70, e102. https://doi.org/10.1002/cpbi.102

  10. Rognes, T., Flouri, T., Nichols, B., Quince, C., Mahé, F., 2016. VSEARCH: a versatile open source tool for metagenomics. PeerJ 4, e2584. https://doi.org/10.7717/peerj.2584