PAFScaff v0.7.1
For a better rendering and navigation of this document, please download and open ./docs/pafscaff.docs.html
, or visit https://slimsuite.github.io/pafscaff/.
Documentation can also be generated by running PAFScaff with the dochtml=T
option. (R and pandoc must be installed - see below.)
PAFScaff is designed for mapping genome assembly scaffolds to a closely-related chromosome-level reference genome assembly. It uses (or runs) Minimap2 to perform an efficient (if rough) all- against-all mapping, then parses the output to assign assembly scaffolds to reference chromosomes.
Mapping is based on minimap2-aligned assembly scaffold ("Query") coverage against the reference chromosomes.
Scaffolds are "placed" on the reference scaffold with most coverage. Any scaffolds failing to map onto any
chromosome are rated as "Unplaced". For each reference chromosome, PAFScaff then "anchors" placed assembly
scaffolds starting with the longest assembly scaffold. Each placed scaffold is then assessed in order of
decreasing scaffold length. Any scaffolds that do not overlap with already anchored scaffolds in terms of the
Reference chromosome positions they map onto are also considered "Anchored". if newprefix=X
is set, scaffolds
are renamed with the Reference chromosome they match onto. The original scaffold name and mapping details are
included in the description. Unplaced scaffolds are not renamed.
Finally, Anchored scaffolds are super-scaffolded by inserting gaps of NnNnNnNnNn
sequence between anchored
scaffolds. The lengths of these gaps are determined by the space between the reference positions, modified by
overhanging query scaffold regions (min. length 10). The alternating case of these gaps makes them easy to
identify later.
PAFScaff outputs renamed, sorted and reoriented scaffolds in fasta format, along with mapping details:
*.anchored.fasta
, *.placed.fasta
and *.unplaced.fasta
contain the relevant subsets of assembly scaffolds,
renamed and/or reverse-complemented if appropriate.*.scaffolds.fasta
contains the super-scaffolded anchored scaffolds.*.scaffolds.tdt
contains the details of the PAFScaff mapping of scaffolds to chromosomes.*.log
contains run details, including any warnings or errors encountered.NOTE: The precise ordering, orientation and naming of the output scaffolds depends on the settings for:
refprefix=X newprefix=X sorted=T/F revcomp=T/F
. See main documentation (below) for details.
The main minimap-based PAFScaff approach has been published as part of the German Shepherd Dog genome paper:
Field MA, Rosen BD, Dudchenko O, Chan EKF, Minoche AM, Edwards RJ, Barton K, Lyons RJ, Enosi Tuipulotu D, Hayes VM, Omer AD, Colaric Z, Keilwagen J, Skvortsova K, Bogdanovic O, Smith MA, Lieberman Aiden E, Smith TPL, Zammit RA & Ballard JWO (2020): Canfam_GSD: De novo chromosome-length genome assembly of the German Shepherd Dog (Canis lupus familiaris) using a combination of long reads, optical mapping, and Hi-C. GigaScience 9(4):giaa027. doi: 10.1093/gigascience/giaa027
PAFScaff is written in Python 2.x and can be run directly from the commandline:
python $CODEPATH/pafscaff.py [OPTIONS]
If running as part of SLiMSuite, $CODEPATH
will be the SLiMSuite tools/
directory. If running from the standalone PAFScaff git repo, $CODEPATH
will be the path the to code/
directory. Please see details in the PAFScaff git repo
for running on example data.
For mapping prior to parsing, minimap2 must be installed and either added to the
environment $PATH
or given to PAFScaff with the minimap2=PROG
setting.
A list of commandline options can be generated at run-time using the -h
or help
flags. Please see the general
SLiMSuite documentation for details of how to
use commandline options, including setting default values with INI files.
### ~ Input/Output options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
pafin=PAFFILE : PAF generated from $REFERENCE $ASSEMBLY mapping; or run minimap2, or use busco [minimap2]
basefile=STR : Base for file outputs [PAFIN basefile]
seqin=FASFILE : Input genome assembly to map/scaffold onto $REFERENCE (minimap2 $ASSEMBLY) []
reference=FILE : Fasta (with accession numbers matching Locus IDs) ($REFERENCE) []
assembly=FASFILE: As seqin=FASFILE
busco=TSVFILE : BUSCO v5 full table (pafin=busco) [full_table_$BASEFILE.busco.tsv]
refbusco=TSVFILE: Reference BUSCO v5 full table [full_table_$REFBASE.busco.tsv]
refprefix=X : Reference chromosome prefix. If None, will use all $REFERENCE scaffolds [None]
newprefix=X : Assembly chromosome prefix. If None, will not rename $ASSEMBLY scaffolds [None]
unplaced=X : Unplaced scaffold prefix. If None, will not rename unplaced $ASSEMBLY scaffolds [None]
ctgprefix=X : Unplaced contig prefix. Replaces unplaced=X when 0 gaps. [None]
purechrom=T/F : Whetheer to always output the first hit to any chromosome without the numerical suffix [False]
sorted=X : Criterion for $ASSEMBLY scaffold sorting (QryLen/Coverage/RefStart/None) [QryLen]
minmap=PERC : Minimum percentage mapping to a chromosome for assignment [0.0]
minpurity=PERC : Minimum percentage "purity" for assignment to Ref chromosome [50.0]
revcomp=T/F : Whether to reverse complement relevant scaffolds to maximise concordance [True]
scaffold=T/F : Whether to "anchor" non-overlapping scaffolds by Coverage and then scaffold [True]
dochtml=T/F : Generate HTML PAFScaff documentation (*.info.html) instead of main run [False]
pagsat=T/F : Whether to output sequence names in special PAGSAT-compatible format [False]
newchr=X : Prefix for short PAGSAT sequence identifiers [ctg]
spcode=X : Species code for renaming assembly sequences in PAGSAT mode [PAFSCAFF]
### ~ Mapping/Classification options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
minimap2=PROG : Full path to run minimap2 [minimap2]
mmsecnum=INT : Max. number of secondary alignments to keep (minimap2 -N) [0]
mmpcut=NUM : Minimap2 Minimal secondary-to-primary score ratio to output secondary mappings (minimap2 -p) [0]
mapopt=CDICT : Dictionary of additional minimap2 options to apply (caution: over-rides conflicting settings) []
purebusco=T/F : Whether to keep BUSCO genes separate rather than generating synteny blocks [False]
### ~ Processing options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
forks=X : Number of parallel sequences to process at once [0]
killforks=X : Number of seconds of no activity before killing all remaining forks. [36000]
forksleep=X : Sleep time (seconds) between cycles of forking out more process [0]
### ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
Details of reference genome naming requirements to be added. Chromosomes should start with the prefix given by
refprefix=X
.
The PAF file is generated by running Minimap2 without secondary alignments. See rje_paf.py
for details.
The PAF file is first parsed and the following fields extracted:
If the refprefix=X
setting was provided, hits are then cleaned up to (a) strip the prefix from chromosome names
and convert numerical chromosomes to numbers for proper sorting, and (b) drop any unplaced reference scaffolds.
If no refprefix
is given, no reference scaffold renaming/filtering will be performed.
For each hit, Coverage
is then calculated as QryEnd
-QryStart
+1. This Coverage
stat is ultimately used for
ranking the assignments of assembly scaffolds to reference chromosomes/scaffolds. For each Qry/Ref/Strand
combination, the Coverage
, Identity
and Length
statistics are summed, and a new field N
added with the
number of individual alignments. The top ranked reference strand is selected for each query, and two more fields
added:
Inv
= Query coverage on the opposite strand to the best reference strandTrans
= Summed query coverage for all other reference chromosomes/scaffolds.Together, these are used to establish the total percentage of query coverage that is scaffolded, versus mapping to a different reference sequence. These are converted into:
Purity
= 100 * (Coverage
+ Inv
) / (Coverage
+ Inv
+ Trans
)v0.4.0 has introduced a couple of additional parameters than can be used to increase the stringency of any
mapping. This is mainly for the purpose of reducing situtations where highly repetitive multi-mapping sequences
are assigned to a single chromosome. By default, minpurity=50.0
, meaning that at least half of the chromosome
mapping should be to the main reference chromosome.
Once queries have been assigned to reference scaffolds, they are then ordered according to the reference
scaffold and start position of the match to that scaffold. Queries are also sorted for the purposes of renaming,
using the sorted=X
statistic. By default, this is query length, but could be switched to Coverage
or RefStart
.
NOTE: This ordering is quite crude and assumes the reference start position is part of the main mapping block. If the query has a fragment mapping to an upstream repeat sequence, for example, this ordering could be messed up. Future releases will check and try to fix this. Visualisation of the mapping is recommended.
Scaffolds mapped to a reference chromosome are designated Placed
. Anchored
scaffolds are a special subset of
Placed
scaffolds, which can be combined into a super-scaffold. Any scaffolds failing to map onto a scaffold
will be designated Unplaced
.
Anchor
assignment is only performed if scaffold=T
. Assembly scaffolds are processed in decreasing size and
assigned to the Anchor
set if either (a) no previous scaffold has been assigned to the mapped reference
chromosome, or (b) the scaffold mapping positions on the reference chromosome do not overlap with any previous
Anchor
scaffold for that chromosome. As with the ordering (above), this scaffolding will get messed up if the
earlier mapped scaffolds have fragments a great distance up or downstream of the main mapped region.
If scaffold=T
, the 'Anchorsequences for each reference chromosome will be scaffolded by concatenating them in reference position order, reverse-complementing where the main mapping was to the negative strand. Gaps will be inserted between scaffolds, consisting of a stretch of
Nn` repeats that matches the size of the gap between
reference positions (e.g. the end of the one match and the start of the next), with a minimum gap of 10 nt.
NOTE: This explicitly expects chromosome-level reference scaffolds. It will not scaffold the reference using the assembly to maximise the overall scaffolding. It is purely for chromosome assignment and then additional scaffolding based on assumed synteny.
If newprefix=X
is given, the sequence name will be replaced with the prefix and reference chromosome identifier
(parsed using refprefix=X
). If multiple assembly sequences map to the same reference chromosome, these will be
numbered .1
, .2
.. .N
. (NOTE: these numbers will have zero prefixes to make them sort correctly, e.g. 10+
sequences will start .01
, 100+ sequences will start .001
etc.
Sequence descriptions take the form:
SEQNAME len=SEQLEN COV% REF(STRAND) START:END; [INV% REF(INVSTRAND);] [OTHER% other;]
where:
SEQNAME
= original assembly sequence name.SEQLEN
= assembly sequence length.COV%
= percentage of query mapping onto reference strand.REF(STRAND)
= reference chromosome and strand. If mapped to the -ve strand, the description will have a RevComp
prefix.START:END
= start and end positions on reference chromosome of query mapping.INV% REF(INVSTRAND)
= percentage mapped onto inverse strand of reference chromosome. (Omitted if 0%.)OTHER%
= percentage mapped onto other reference chromosome. (Omitted if 0%.)Unplaced sequences will either not be renamed, or will have a different prefix if unplaced=X
is set. If
ctgprefix=X
is also set, any unplaced sequences without gaps will have a different prefix set by ctgprefix=X
.
Assembly mapping an scaffolding details will be save to $BASEFILE.scaffolds.tdt
.
Sequences will be output to:
$BASEFILE.anchored.fasta
$BASEFILE.placed.fasta
$BASEFILE.unplaced.fasta
If scaffold=T
, the scaffolded assembly will be saved to $BASEFILE.scaffolded.fasta
and the sequences
summaried in $BASEFILE.scaffolded.tdt
.
© 2019 Richard Edwards | richard.edwards@unsw.edu.au