slimsuite / pafscaff

PAFScaff: Pairwise mApping Format reference-based scaffold anchoring and super-scaffolding.
GNU General Public License v3.0
7 stars 0 forks source link

PAFScaff: Pairwise mApping Format reference-based scaffold anchoring and super-scaffolding

PAFScaff v0.7.1

For a better rendering and navigation of this document, please download and open ./docs/pafscaff.docs.html, or visit https://slimsuite.github.io/pafscaff/. Documentation can also be generated by running PAFScaff with the dochtml=T option. (R and pandoc must be installed - see below.)

Introduction

PAFScaff is designed for mapping genome assembly scaffolds to a closely-related chromosome-level reference genome assembly. It uses (or runs) Minimap2 to perform an efficient (if rough) all- against-all mapping, then parses the output to assign assembly scaffolds to reference chromosomes.

Mapping is based on minimap2-aligned assembly scaffold ("Query") coverage against the reference chromosomes. Scaffolds are "placed" on the reference scaffold with most coverage. Any scaffolds failing to map onto any chromosome are rated as "Unplaced". For each reference chromosome, PAFScaff then "anchors" placed assembly scaffolds starting with the longest assembly scaffold. Each placed scaffold is then assessed in order of decreasing scaffold length. Any scaffolds that do not overlap with already anchored scaffolds in terms of the Reference chromosome positions they map onto are also considered "Anchored". if newprefix=X is set, scaffolds are renamed with the Reference chromosome they match onto. The original scaffold name and mapping details are included in the description. Unplaced scaffolds are not renamed.

Finally, Anchored scaffolds are super-scaffolded by inserting gaps of NnNnNnNnNn sequence between anchored scaffolds. The lengths of these gaps are determined by the space between the reference positions, modified by overhanging query scaffold regions (min. length 10). The alternating case of these gaps makes them easy to identify later.

Output

PAFScaff outputs renamed, sorted and reoriented scaffolds in fasta format, along with mapping details:

NOTE: The precise ordering, orientation and naming of the output scaffolds depends on the settings for: refprefix=X newprefix=X sorted=T/F revcomp=T/F. See main documentation (below) for details.

Citing PAFScaff

The main minimap-based PAFScaff approach has been published as part of the German Shepherd Dog genome paper:

Field MA, Rosen BD, Dudchenko O, Chan EKF, Minoche AM, Edwards RJ, Barton K, Lyons RJ, Enosi Tuipulotu D, Hayes VM, Omer AD, Colaric Z, Keilwagen J, Skvortsova K, Bogdanovic O, Smith MA, Lieberman Aiden E, Smith TPL, Zammit RA & Ballard JWO (2020): Canfam_GSD: De novo chromosome-length genome assembly of the German Shepherd Dog (Canis lupus familiaris) using a combination of long reads, optical mapping, and Hi-C. GigaScience 9(4):giaa027. doi: 10.1093/gigascience/giaa027


Running PAFScaff

PAFScaff is written in Python 2.x and can be run directly from the commandline:

python $CODEPATH/pafscaff.py [OPTIONS]

If running as part of SLiMSuite, $CODEPATH will be the SLiMSuite tools/ directory. If running from the standalone PAFScaff git repo, $CODEPATH will be the path the to code/ directory. Please see details in the PAFScaff git repo for running on example data.

For mapping prior to parsing, minimap2 must be installed and either added to the environment $PATH or given to PAFScaff with the minimap2=PROG setting.

Commandline options

A list of commandline options can be generated at run-time using the -h or help flags. Please see the general SLiMSuite documentation for details of how to use commandline options, including setting default values with INI files.

### ~ Input/Output options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
pafin=PAFFILE   : PAF generated from $REFERENCE $ASSEMBLY mapping; or run minimap2, or use busco [minimap2]
basefile=STR    : Base for file outputs [PAFIN basefile]
seqin=FASFILE   : Input genome assembly to map/scaffold onto $REFERENCE (minimap2 $ASSEMBLY) []
reference=FILE  : Fasta (with accession numbers matching Locus IDs) ($REFERENCE) []
assembly=FASFILE: As seqin=FASFILE
busco=TSVFILE   : BUSCO v5 full table (pafin=busco) [full_table_$BASEFILE.busco.tsv]
refbusco=TSVFILE: Reference BUSCO v5 full table [full_table_$REFBASE.busco.tsv]
refprefix=X     : Reference chromosome prefix. If None, will use all $REFERENCE scaffolds [None]
newprefix=X     : Assembly chromosome prefix. If None, will not rename $ASSEMBLY scaffolds [None]
unplaced=X      : Unplaced scaffold prefix. If None, will not rename unplaced $ASSEMBLY scaffolds [None]
ctgprefix=X     : Unplaced contig prefix. Replaces unplaced=X when 0 gaps. [None]
purechrom=T/F   : Whetheer to always output the first hit to any chromosome without the numerical suffix [False]
sorted=X        : Criterion for $ASSEMBLY scaffold sorting (QryLen/Coverage/RefStart/None) [QryLen]
minmap=PERC     : Minimum percentage mapping to a chromosome for assignment [0.0]
minpurity=PERC  : Minimum percentage "purity" for assignment to Ref chromosome [50.0]
revcomp=T/F     : Whether to reverse complement relevant scaffolds to maximise concordance [True]
scaffold=T/F    : Whether to "anchor" non-overlapping scaffolds by Coverage and then scaffold [True]
dochtml=T/F     : Generate HTML PAFScaff documentation (*.info.html) instead of main run [False]
pagsat=T/F      : Whether to output sequence names in special PAGSAT-compatible format [False]
newchr=X        : Prefix for short PAGSAT sequence identifiers [ctg]
spcode=X        : Species code for renaming assembly sequences in PAGSAT mode [PAFSCAFF]
### ~ Mapping/Classification options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
minimap2=PROG   : Full path to run minimap2 [minimap2]
mmsecnum=INT    : Max. number of secondary alignments to keep (minimap2 -N) [0]
mmpcut=NUM      : Minimap2 Minimal secondary-to-primary score ratio to output secondary mappings (minimap2 -p) [0]
mapopt=CDICT    : Dictionary of additional minimap2 options to apply (caution: over-rides conflicting settings) []
purebusco=T/F   : Whether to keep BUSCO genes separate rather than generating synteny blocks [False]
### ~ Processing options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
forks=X         : Number of parallel sequences to process at once [0]
killforks=X     : Number of seconds of no activity before killing all remaining forks. [36000]
forksleep=X     : Sleep time (seconds) between cycles of forking out more process [0]
### ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###

PAFScaff input and output

Reference genome naming

Details of reference genome naming requirements to be added. Chromosomes should start with the prefix given by refprefix=X.

Generating the PAF input file

The PAF file is generated by running Minimap2 without secondary alignments. See rje_paf.py for details.

Anchoring and mapping

The PAF file is first parsed and the following fields extracted:

If the refprefix=X setting was provided, hits are then cleaned up to (a) strip the prefix from chromosome names and convert numerical chromosomes to numbers for proper sorting, and (b) drop any unplaced reference scaffolds. If no refprefix is given, no reference scaffold renaming/filtering will be performed.

For each hit, Coverage is then calculated as QryEnd-QryStart+1. This Coverage stat is ultimately used for ranking the assignments of assembly scaffolds to reference chromosomes/scaffolds. For each Qry/Ref/Strand combination, the Coverage, Identity and Length statistics are summed, and a new field N added with the number of individual alignments. The top ranked reference strand is selected for each query, and two more fields added:

Together, these are used to establish the total percentage of query coverage that is scaffolded, versus mapping to a different reference sequence. These are converted into:

v0.4.0 has introduced a couple of additional parameters than can be used to increase the stringency of any mapping. This is mainly for the purpose of reducing situtations where highly repetitive multi-mapping sequences are assigned to a single chromosome. By default, minpurity=50.0, meaning that at least half of the chromosome mapping should be to the main reference chromosome.

Once queries have been assigned to reference scaffolds, they are then ordered according to the reference scaffold and start position of the match to that scaffold. Queries are also sorted for the purposes of renaming, using the sorted=X statistic. By default, this is query length, but could be switched to Coverage or RefStart.

NOTE: This ordering is quite crude and assumes the reference start position is part of the main mapping block. If the query has a fragment mapping to an upstream repeat sequence, for example, this ordering could be messed up. Future releases will check and try to fix this. Visualisation of the mapping is recommended.

Anchoring and Scaffolding

Scaffolds mapped to a reference chromosome are designated Placed. Anchored scaffolds are a special subset of Placed scaffolds, which can be combined into a super-scaffold. Any scaffolds failing to map onto a scaffold will be designated Unplaced.

Anchor assignment is only performed if scaffold=T. Assembly scaffolds are processed in decreasing size and assigned to the Anchor set if either (a) no previous scaffold has been assigned to the mapped reference chromosome, or (b) the scaffold mapping positions on the reference chromosome do not overlap with any previous Anchor scaffold for that chromosome. As with the ordering (above), this scaffolding will get messed up if the earlier mapped scaffolds have fragments a great distance up or downstream of the main mapped region.

If scaffold=T, the 'Anchorsequences for each reference chromosome will be scaffolded by concatenating them in reference position order, reverse-complementing where the main mapping was to the negative strand. Gaps will be inserted between scaffolds, consisting of a stretch ofNn` repeats that matches the size of the gap between reference positions (e.g. the end of the one match and the start of the next), with a minimum gap of 10 nt.

NOTE: This explicitly expects chromosome-level reference scaffolds. It will not scaffold the reference using the assembly to maximise the overall scaffolding. It is purely for chromosome assignment and then additional scaffolding based on assumed synteny.

Sequence naming

If newprefix=X is given, the sequence name will be replaced with the prefix and reference chromosome identifier (parsed using refprefix=X). If multiple assembly sequences map to the same reference chromosome, these will be numbered .1, .2 .. .N. (NOTE: these numbers will have zero prefixes to make them sort correctly, e.g. 10+ sequences will start .01, 100+ sequences will start .001 etc.

Sequence descriptions take the form:

SEQNAME len=SEQLEN COV% REF(STRAND) START:END; [INV% REF(INVSTRAND);] [OTHER% other;]

where:

Unplaced sequences will either not be renamed, or will have a different prefix if unplaced=X is set. If ctgprefix=X is also set, any unplaced sequences without gaps will have a different prefix set by ctgprefix=X.

Output

Assembly mapping an scaffolding details will be save to $BASEFILE.scaffolds.tdt.

Sequences will be output to:

If scaffold=T, the scaffolded assembly will be saved to $BASEFILE.scaffolded.fasta and the sequences summaried in $BASEFILE.scaffolded.tdt.


© 2019 Richard Edwards | richard.edwards@unsw.edu.au