FabFos: an automated pipeline for resolving inserts from pooled fosmid DNA

Tony Liu, Mahan Rafiee, Connor Morgan-Lang, Joe Ho, Avery Noonan, Kateryna Ievdokymenko, Zach Armstrong, Steven Hallam

For the impatient

conda install -c hallamlab \
    -c imperial-college-research-computing \
    -c bioconda -c conda-forge fabfos

# minimal
fabfos --output ./example_out \
    --interleaved --reads /.../interleaved.fastq \
    --background /.../host_background_genome.fasta \
    --pool-size 384

# full
fabfos --output ./example_out \
    --threads 8 \
    --verbose \
    --assembler megahit \
    --reads /.../forward_reads.fastq \
    --reverse /.../reverse_reads.fastq \
    --parity pe \
    --background /.../host_background_genome.fasta \
    --vector /.../plasmid_backbone.fasta \
    --ends /.../end_sequences.fq \
    --ends-name-regex "\w+_\d+" \
    --ends-fw-flag "FW"

Overview:

Fabfos is a pipeline for resolving cloned inserts from pooled fosmid libraries using the following steps:

Read QC
- Filtering of the host background with BWA [7] and Samtools [4]
- Quality trimming with Trimmomatic [3]
Estimate pool size
- Reads crossing one of the two junctions between the vector backbone and insert are prepared and fed to VSEARCH [10], which estimates the number of unique sequences as a proxy for the number of clones in the pool.
- Alternatively, the user can provide an estimate directly.
Assembly
- Choice of MEGAHIT [6] or SPAdes [9, 8, 2]
- (Experimental) Can also use Nanopore long reads with Canu [5]
- Fabfos calculates assembly statistics and coverage.
Assessing the completeness of inserts using end sequences
- End sequences covering both junctions between the insert and vector backbone are aligned to the assembled contigs using BLAST [1]. Contigs with mapped end sequences for both junctions are considered complete.
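
To make the pool size idea concrete, here is a toy sketch of counting unique junction-crossing sequences. It is not how Fabfos does it: Fabfos prepares the reads and hands them to VSEARCH, whereas this example only counts exact-distinct insert prefixes that follow the end of the vector. The function name `estimate_pool_size`, the `anchor` sequence, and the prefix length `k` are all hypothetical.

```python
def estimate_pool_size(reads, anchor, k=20):
    """Count distinct k-base insert prefixes found right after `anchor`.

    Simplification: exact matching on one strand only. Real data needs
    error-tolerant clustering (the role VSEARCH plays in Fabfos).
    """
    seen = set()
    for read in reads:
        pos = read.find(anchor)
        if pos == -1:
            continue  # read does not cross this junction
        start = pos + len(anchor)
        prefix = read[start:start + k]
        if len(prefix) == k:  # skip reads that end too early
            seen.add(prefix)
    return len(seen)


# Hypothetical toy data: the vector ends in "GGATCC" and two clones are present.
toy_reads = [
    "TTGGATCCACGTACGT",   # clone 1
    "AGGATCCACGTACGTAA",  # clone 1 again
    "GGATCCTTTTCCCC",     # clone 2
    "ACGTACGTACGT",       # no junction
]
print(estimate_pool_size(toy_reads, anchor="GGATCC", k=8))
```

Sequencing errors would inflate an exact count like this one, which is one reason to let a clustering tool decide what counts as "unique".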

Install

Pick one of Conda, Singularity, Docker, or Manual.

Conda

conda install \
    -c hallamlab \
    -c imperial-college-research-computing \
    -c bioconda \
    -c conda-forge \
    fabfos

Consider using mamba, a drop-in, multithreaded replacement for conda.

Singularity

singularity pull ./fabfos.sif docker://quay.io/hallamlab/fabfos

# example run
singularity exec \
    --bind ./:/ws \
    --workdir /ws \
    ./fabfos.sif fabfos --help

Docker

docker pull quay.io/hallamlab/fabfos

# example run
docker run -it --rm \
    -u $(id -u):$(id -g) \
    --mount type=bind,source="$(pwd)",target="/ws" \
    --workdir="/ws" \
    quay.io/hallamlab/fabfos fabfos --help

Manual

Clone this repo

Install dependencies from .yml file

./dev.sh --ibase
# or
conda env create --no-default-packages -n fabfos_env -f ./envs/base.yml

Activate the environment

conda activate fabfos_env

and run the source code directly

./dev.sh -r --help
# or
cd ./src
python -m fabfos --help

Usage

Explanation of arguments

fabfos --output ./example_out \
    --threads 8 \
    --verbose \
    --assembler megahit \
    --reads /.../forward_reads.fastq \
    --reverse /.../reverse_reads.fastq \
    --parity pe \
    --background /.../host_background_genome.fasta \
    --vector /.../plasmid_backbone.fasta \
    --ends /.../end_sequences.fq \
    --ends-name-regex "\w+_\d+" \
    --ends-fw-flag "FW"

Explanation of arguments:

--output: The folder where Fabfos should store intermediate and output files. If it doesn't exist, Fabfos will create it.
--threads: Maximum number of threads to use. Fabfos itself will only ever use 1, due to the current limitations of Python.
--verbose: Include debug messages in printouts
--assembler: The assembler and preset to use. Options are:
- megahit: significantly faster than SPAdes with comparable assembly performance. However, each option tends to resolve a slightly different set of contigs, so trying them all will produce the most complete set.
- spades_meta
- spades_isolate
- spades_sc
--reads, --reverse: paths to the raw reads of the pooled fosmids
- default is paired-end FASTQs
- Fabfos should handle gzipped reads automatically
- for single-end reads: fabfos ... --reads /.../reads.fq --parity se, leaving out --reverse
- for interleaved reads: fabfos ... --reads /.../reads.fq --interleaved, leaving out --reverse
--background: path to the genomic FASTA of the host background. Fabfos filters out reads that map to the host. An example would be E. coli K-12.
--vector: path to the FASTA of the vector backbone sequence. An example would be pCC1.
- used to estimate the pool size
- can be replaced with a manual estimate, example: --pool-size 384, in which case Fabfos will not perform an estimate
(optional) --ends: end sequences should be sequenced inward from one of the two junctions between the insert and vector backbone
- --ends-name-regex: regex to pull the name of the clone from the header of the end sequence fastq. Example: "\w+_\d+" would get "ABC_123" from ">ABC_123_FW"
- --ends-fw-flag: a token that, if found within the fastq header of the end sequence, indicates that it was sequenced from the "forward" junction. Example: FW would indicate that ">ABC_123_FW" is the end sequence of the "forward" junction.
- if omitted, assembled contigs will not be checked for completeness
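
The two end-sequence options can be tried out on a few headers before a run. The sketch below assumes the regex is applied as a search anywhere in the header and that the forward flag is a plain substring test. How Fabfos applies them internally is not stated here, so treat `parse_end_header` as a hypothetical helper, and the "RV" suffix as an invented example of a non-forward header.

```python
import re


def parse_end_header(header, name_regex=r"\w+_\d+", fw_flag="FW"):
    """Return (clone_name, is_forward) for one end-sequence header."""
    match = re.search(name_regex, header)
    if match is None:
        raise ValueError(f"no clone name found in header: {header!r}")
    # Assumption: the flag counts if it appears anywhere in the header.
    return match.group(0), fw_flag in header


print(parse_end_header(">ABC_123_FW"))  # ('ABC_123', True)
print(parse_end_header(">ABC_123_RV"))  # ('ABC_123', False)
```

Because `\w` also matches digits and underscores, a header such as ">ABC_123_4" would yield "ABC_123_4" rather than "ABC_123". If clone names and suffixes can collide like this, a stricter pattern is worth testing first.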

Expected outputs

Within the output folder specified in --output /.../NAME there will exist the following...

temp_*/ work folders for various steps and tools
fabfos.log main log
NAME_metadata.tsv metadata table including read stats, assembly stats, and pool size estimate
NAME_fosmids_*.fasta resolved fosmids
- NAME_fosmids_all_contigs.fasta all assembled contigs
- NAME_fosmids_both_mapped.fasta inserts (contigs) with end sequences mapped to both ends
- NAME_fosmids_single_mapped.fasta inserts (contigs) with only one end mapped to an end sequence
- NAME_fosmids_not_mapped.fasta contigs that didn't map to any end sequence
NAME_end_mapping.tsv blast results from mapping end sequences to assembled contigs
NAME_end_mapping_failures.tsv table of provided end sequences that didn't map to any contig

Some outputs may be omitted if some inputs are not provided. For example, the end mapping tables will not exist if end sequences were not provided.
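
The split into both_mapped, single_mapped, and not_mapped files can be pictured with a small sketch. It is an illustration under simplifying assumptions, not Fabfos's code: each alignment is reduced to a hypothetical (contig_id, is_forward) pair, and it ignores which clone each end sequence came from.

```python
from collections import defaultdict


def classify_contigs(contigs, hits):
    """Label contigs by how many distinct junctions have an end-sequence hit.

    `hits` is an iterable of (contig_id, is_forward) pairs, one per alignment.
    """
    junctions = defaultdict(set)
    for contig_id, is_forward in hits:
        junctions[contig_id].add(is_forward)

    labels = {0: "not_mapped", 1: "single_mapped", 2: "both_mapped"}
    return {c: labels[len(junctions.get(c, ()))] for c in contigs}


# Hypothetical toy input: c2 is hit twice, but only at the forward junction.
contigs = ["c1", "c2", "c3"]
hits = [("c1", True), ("c1", False), ("c2", True), ("c2", True)]
print(classify_contigs(contigs, hits))
```

Collecting hits into a set per contig means repeated hits on the same junction count only once, so c2 stays single_mapped.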

Examples

Example 1

Paired-end reads
megahit as assembler
Fabfos estimates pool size

End sequences

fabfos --output ./example_out \
    --assembler megahit \
    --reads /.../forward_reads.fastq \
    --reverse /.../reverse_reads.fastq \
    --background /.../host_background_genome.fasta \
    --vector /.../plasmid_backbone.fasta \
    --ends /.../end_sequences.fq \
    --ends-name-regex "\w+_\d+" \
    --ends-fw-flag "FW"

Example 2

Single-end reads
spades_isolate as assembler
Fabfos estimates pool size

No end sequences

fabfos --output ./example_out \
    --assembler spades_isolate \
    --reads /.../se.fastq \
    --parity se \
    --background /.../host_background_genome.fasta \
    --vector /.../plasmid_backbone.fasta

Example 3

Interleaved reads
spades_meta as assembler
User provides pool size estimate