gapless

Combined scaffolding, gap-closing and assembly correction with long reads

Abstract
Requirements
Installation
Bioconda
Quick start examples
Parameter
Output files
FAQ
Publication

Abstract

Continuity, correctness and completeness of genome assemblies are important for many biological projects. Long reads represent a major driver towards delivering high-quality genomes, but not everybody can achieve the necessary coverage for good long-read-only assemblies. Therefore, improving existing assemblies with low-coverage long reads is a promising alternative. The improvements include correction, scaffold- ing and gap filling. However, most tools perform only one of these tasks and the useful information of reads that supported the scaffolding is lost when running sepa- rate programs successively. Therefore, we propose a new tool for combined execution of all three tasks using PacBio or Oxford Nanopore reads.

Requirements

The first list are the python requirements for gapless.py and the second list are the external programs called in gapless.sh.

Requirement	Min. version	Tested version	Installation instructions
Python 3	> 3.6.0	3.9.6
biopython		1.77	https://biopython.org/wiki/Download
matplotlib		3.4.2	https://matplotlib.org/stable/users/installing.html
numpy		1.21.1	https://numpy.org/install/
pandas	> 1.1.0	1.3.1	https://pandas.pydata.org/docs/getting_started/install.html
pillow		8.3.1	https://pillow.readthedocs.io/en/stable/installation.html
scipy	> 1.7.0	1.10.0	https://www.scipy.org/install.html
seaborn		0.11.1	https://seaborn.pydata.org/installing.html
---------------------------	--------------	----------------	-------------------------------------------------------------------
Linux system
minimap2		2.18-r1015	https://github.com/lh3/minimap2
racon		v1.4.13	https://github.com/lbcb-sci/racon
seqtk		1.3-r106	https://github.com/lh3/seqtk

Installation

No installation except for the requirements is necessary. The program can be directly called from its folder after downloading:

cd /where/you/want/to/download/
git clone https://github.com/schmeing/gapless.git
ls gapless

You may want to add the folder to your PATH variable to be able to call it from everywhere:

export PATH=/where/you/downloaded/gapless/:$PATH

If you insert this command into ~/.bashrc it will be automatically called when you login.

An alterantive is to create links to these files:

ln -s /where/you/downloaded/gapless/gapless.py /where/you/want/to/have/gapless/gapless.py
ln -s /where/you/downloaded/gapless/gapless.sh /where/you/want/to/have/gapless/gapless.sh

Bioconda

Gapless can also be downloaded with all python requirements in an automatic fashion via anaconda/miniconda(https://docs.conda.io/projects/continuumio-conda/en/latest/user-guide/install/index.html). However, updates will not be as frequent and the option to switch to the devel branch to get the most recent bugfixes is missing.

conda install -c bioconda gapless

To add the additional software used in gapless.sh from conda use:

conda install -c bioconda minimap2 racon seqtk

Quick start examples

The pipeline can be run with one of the following three commands depending on the type of long reads:

gapless.sh -j 30 -i assembly.fa.gz -t pb_clr data_pacbio_clr.fq.gz
gapless.sh -j 30 -i assembly.fa.gz -t pb_hifi data_pacbio_hifi.fq.gz
gapless.sh -j 30 -i assembly.fa.gz -t nanopore data_oxford_nanopore.fq.gz

The final output is linked to gapless_run/gapless.fa. Depending on the available number of cores you should change the -j parameter. 30 threads can finish a human sized genome with 30x coverage with the default 3 iterations in approximately half a day.

The pipeline essentially makes the following calls for each iteration:

gapless.py split -o gapless_split.fa {input_assembly}
minimap2 -t {threads} -DP -k19 -w19 -m200 gapless_split.fa gapless_split.fa > gapless_split_repeats.paf
minimap2 -t {threads} -x {read_type} -c -N 5 --secondary=no gapless_split.fa {input_reads} > gapless_reads.paf
gapless.py scaffold -p gapless -s gapless_stats.pdf gapless_split.fa gapless_reads.paf gapless_split_repeats.paf
minimap2 -t {threads} -x {read_type2} <(seqtk subseq {input_reads} gapless_extending_reads.lst) <(seqtk subseq {input_reads} gapless_extending_reads.lst) > gapless_extending_reads.paf
gapless.py extend -p gapless gapless_extending_reads.paf
gapless.py finish -o gapless_raw.fa -H 0 -s gapless_extended_scaffold_paths.csv gapless_split.fa <(seqtk subseq {input_reads} gapless_used_reads.lst)
minimap2 -t {threads} -x {read_type} gapless_raw.fa {input_reads} > gapless_consensus.paf
racon -t {threads} {input_reads} gapless_consensus.paf gapless_raw.fa > gapless.fa

{read_type} is map-pb, asm20 or map-ont depending on the type of long reads. {read_type2} is ava-pb, asm20 --min-occ-floor=0 -X -m100 -g10000 --max-chain-skip 25 or ava-ont depending on the type of long reads.

Parameter

gapless.sh [OPTIONS] {long_reads}.fq

Parameter	Default	Description
`-h` `-?`		Display this help and exit
`-i`	(mandatory)	Input assembly (fasta)
`-j`	4	Number of threads
`-n`	3	Number of iterations
`-o`	gapless_run	Output directory (improved assembly is written to gapless.fa in this directory)
`-r`		Restart at the start iteration and overwrite instead of incorporat already present files
`-s`	1	Start iteration (Previous runs must be present in output directory)
`-t`	(mandatory)	Type of long reads (`pb_clr`,`pb_hifi`,`nanopore`)

gapless.py split [OPTIONS] {assembly}.fa

Parameter	Default	Description
`-h` `--help`		Display this help and exit
`-n` `--minN`	1	Minimum number of N's to split at that position
`-o` `--output`	`{assembly}_split.fa`	File to which the split sequences should be written to

gapless.py scaffold [OPTIONS] {assembly}.fa {mapping}.paf {repeat}.paf

Parameter	Default	Description
`-h` `--help`		Display this help and exit
`-p` `--prefix`	`{assembly}`	Prefix for output files
`-s` `--stats`	(no stats)	Output file for plots with statistics regarding input parameters (pdf)
`--minLenBreak`	600	Minimum length for a read to diverge from a contig to consider a contig break
`--minMapLength`	400	Minimum length of individual mappings of reads
`--minMapQ`	20	Minimum mapping quality of reads
`--largeGenome`		Increases int size to handle scaffolds longer than ~ 2147 million bases

gapless.py extend -p {prefix} {all_vs_all}.paf

Parameter	Default	Description
`-h` `--help`		Display this help and exit
`-p` `--prefix`	(mandatory)	Prefix for output files of scaffolding step
`--minLenBreak`	1000	Minimum length for two reads to diverge to consider them incompatible for this contig

gapless.py finish [OPTIONS] -s {scaffolds}.csv {assembly}.fa {reads}.fq

Parameter	Default	Description
`-h` `--help`		Display this help and exit
`-f` `--format`	`fastq` or read ending	Format of `{reads}`.fq (`fasta`/`fastq`)
`-H` `--hap`	mixed	Haplotype starting from 0 written to `--output`
`--hap[1-9]`	mixed	Haplotypes starting from 0 written to `--out[1-9]`
`-o` `--output`	`{assembly}`_gapless.fa	Output file for modified assembly
`--out[1-9]`	(no additional output)	Additional output files for modified assembly
`-p` `--polishing`	(mandatory)	Input file for polishing read information
`-s` `--scaffolds`	(mandatory)	Csv file from previous steps describing the scaffolding

Intermediate and final output files in the pipeline

File	Program	Type	Information
gapless_split.fa	gapless.py split	temporary	Input assembly with the scaffolds split into contigs
gapless_split_repeats.paf	minimap2	temporary	Mapping of the split assembly to itself
gapless_reads.paf	minimap2	temporary	Mapping of the long reads to the split assembly
gapless_scaffold_paths.csv	gapless.py scaffold	intermediate	Table summarising the first and last base included for contigs and reads in the new scaffolds as well as their order and orientation for all haplotypes. Positions start at 0 and end positions are one after the last included position. The orientation is encoded with +/-. The first and last contig/read in a scaffold contain information about the distance to the next scaffold in case this information is present. Otherwise, the fields are encoded with -1. Identical phases mean that contig/reads are phased to be on the same haplotype. Negative phases mark contigs/reads identical to main haplotye (0). Empty contig/read names in combination with positive phases mark deletions.
gapless_extensions.csv	gapless.py scaffold	intermediate	Table summarising the mapping of extending reads to the assembly. How much they extend the new scaffolds, how far from the end they stop aligning (trim), the read distance to the next alignment (unmap_ext).
gapless_extending_reads.lst	gapless.py scaffold	intermediate	List of extending reads with one read name per line
gapless_stats.pdf	gapless.py scaffold	final	File containing plots with information about the run
gapless_extending_reads.paf	minimap2	temporary	All-vs.-all alignment of the extending reads
gapless_extended_scaffold_paths.csv	gapless.py extend	intermediate	Table including the extensions in the same format as gapless_scaffold_paths.csv
gapless_used_reads.lst	gapless.py extend	intermediate	List of reads inlcuded in gapless_extended_scaffold_paths.csv with one read name per line
gapless_raw.fa	gapless.py finish	temporary	Unpolished output assembly
gapless_consensus.paf	minimap2	temporary	Mapping of the long reads to the unpolished assembly
gapless.fa	racon	final	Polished output assembly

Intermediate files are required for the following steps in the pipeline, but are not removed when they are not needed anymore (in contrast to temporary files). Log files are stored in the logs folder and the ressource usage acquired with GNU time is written to the timing folder.

FAQ

Coming soon ...

Publication

Schmeing, S., Robinson, M.D. Gapless provides combined scaffolding, gap filling and assembly correction with long reads. bioRxiv (2022). https://doi.org/10.1101/2022.03.08.483466

schmeing / gapless

readme