ruanjue / wtdbg2

Redbean: A fuzzy Bruijn graph approach to long noisy reads assembly

wtpoa-cns polishing details #249

Open V-JJ opened 1 year ago

V-JJ commented 1 year ago

Good morning!

I would like to ask if you can provide a more explicit explanation of what "wtpoa-cns" does. I've been looking around in the literature and on the internet, and I've only been able to find that it is a "Consensuser for wtdbg using PO-MSA".

Thanks in advance

ruanjue commented 1 year ago

Generally, it implements an adaptive-banded SIMD POA on the layout of reads along the raw contig sequences to correct errors. Please specify which detail you are interested in.

V-JJ commented 1 year ago

Hello!

We followed these steps (the wtdbg2 protocol from the main GitHub page) to assemble and polish our genome:

1) ~57x PacBio read coverage (genome size ~1.5 Gb) assembled with wtdbg2
2) Get the consensus
3) Polish with PacBio reads
4) Polish with Illumina reads
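
For reference, the corresponding commands from the wtdbg2 README look roughly like the sketch below (prefixes and read file names are placeholders for our actual files):

```sh
# 1) assemble ~57x PacBio reads with wtdbg2 (preset -x rs/sq/ccs depends on the chemistry)
wtdbg2 -x rs -g 1.5g -t 16 -i pb_reads.fa.gz -fo asm

# 2) derive the consensus from the layout
wtpoa-cns -t 16 -i asm.ctg.lay.gz -fo asm.raw.fa

# 3) polish with the PacBio reads themselves
minimap2 -t 16 -ax map-pb -r2k asm.raw.fa pb_reads.fa.gz | samtools sort -@ 4 > asm.bam
samtools view -F 0x900 asm.bam | wtpoa-cns -t 16 -d asm.raw.fa -i - -fo asm.cns.fa

# 4) additional polishing with Illumina short reads (sam-sr preset)
bwa index asm.cns.fa
bwa mem -t 16 asm.cns.fa sr_1.fq.gz sr_2.fq.gz | samtools sort -O SAM | wtpoa-cns -t 16 -x sam-sr -d asm.cns.fa -i - -fo asm.srp.fa
```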

I understand Illumina polishing as a technique to correct assembly errors at the base level and some small indels.

In addition, I have tried to look through the abPOA paper and, as far as I can understand, it works as follows:

That said, we were a bit surprised to detect deletions of a few hundred bases when comparing a scaffold polished only with PacBio reads against the same scaffold polished with both PacBio reads and Illumina short reads. On the other hand, there were few base-level changes.

Thus, since we are quite beginners in the field of graph theory and related topics, we would like to ask if you could explain how exactly wtpoa-cns works and uses Illumina data, so that we can understand:

Many thanks in advance,

ruanjue commented 1 year ago

wtpoa-cns polishes the contig sequence in a sliding-window manner and concatenates adjacent polished windows using pairwise sequence alignment.

1) This may be caused by a mis-join or another program bug. Please describe those deletions in more detail, e.g. their sizes (one way to tabulate them is sketched after this list).
2) After PacBio polishing, there SHOULD be few base errors left.
3) I tried using base qualities in SMARTdenovo (see fast5q), but I think they are not so important with high-coverage PacBio long reads together with Illumina short reads.
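
For instance (not necessarily how you compared them), a size table for the differences between the two polished versions can be produced from a minimap2/paftools alignment along these lines; file names are placeholders:

```sh
# align the PacBio-only polished contigs against the PacBio+Illumina polished ones
# and call the differences; the output lists substitutions and indels,
# whose sizes can then be binned
minimap2 -cx asm5 --cs pb_only.polished.fa pb_plus_illumina.polished.fa \
  | sort -k6,6 -k8,8n \
  | paftools.js call - > diffs.txt
```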

Hope it helps.

V-JJ commented 1 year ago

Good morning!

Many thanks, it was very helpful!

I forgot to mention that our genome is highly repetitive (~60%).

I have checked what you asked for: the indel sizes. For that, I used ~400 contigs that represent around 50% of the total genome size.

Here you have our results, considering all 400 contigs. What are your thoughts on this?

| Case | Count | Average (per contig) |
| --- | ---: | ---: |
| Number of substitutions | 13928980 | 34822.45 |
| Number of insertions in [0,50) | 3494576 | 8736.44 |
| Number of insertions in [50,100) | 39187 | 97.9675 |
| Number of insertions in [100,300) | 33075 | 82.6875 |
| Number of insertions in [300,400) | 1304 | 3.26 |
| Number of insertions in [400,1000) | 572 | 1.43 |
| Number of insertions in [1000,inf) | 5 | 0.0125 |
| Number of deletions in [0,50) | 2883659 | 7209.1475 |
| Number of deletions in [50,100) | 1366 | 3.415 |
| Number of deletions in [100,300) | 71 | 0.1775 |
| Number of deletions in [300,400) | 0 | 0 |
| Number of deletions in [400,1000) | 0 | 0 |
| Number of deletions in [1000,inf) | 0 | 0 |

Thanks,

ruanjue commented 1 year ago

I am surprised by the results: so many differences after Illumina polishing. Two things might help:

1) Run Merqury to assess the base quality of the two assemblies and see whether the second round of polishing improved it (example commands below). BTW, you can also try GAAP (https://github.com/ruanjue/GAAP), a new evaluation tool developed in my group.
2) Run racon or another polishing tool and compare again to find out which result is correct.
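
For Merqury, the commands look roughly like the following (k-mer size and file names are placeholders; adjust them for your data):

```sh
# build a k-mer database from the Illumina reads
meryl k=21 count output illumina.meryl sr_1.fq.gz sr_2.fq.gz

# QV and completeness for each assembly version
merqury.sh illumina.meryl asm.pb_polished.fa pb_only
merqury.sh illumina.meryl asm.pb_illumina_polished.fa pb_plus_illumina
```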

V-JJ commented 1 year ago

Good morning!

Many thanks for the suggestions!

Here are the Merqury and BUSCO results.

|  | PacBio polishing | PacBio + Illumina polishing |
| --- | --- | --- |
| Merqury assembly_qv | 18.4795 (err: 0.0142) | 22.7936 (err: 0.0053) |
| BUSCO | ~90% in relevant datasets | ~93% in relevant datasets |

|  | PacBio polishing | PacBio + Illumina polishing |
| --- | --- | --- |
| ctg1 length | 5160833 | 5079035 |
| Merqury ctg1_qv | 19.9081 | 26.5608 |
| ctg10 length | 3787805 | 3721571 |
| Merqury ctg10_qv | 19.0716 | 26.3825 |

All things considered, I tend toward the opinion that a couple of factors are involved here: the abPOA method (its design and perhaps some bugs?) as well as the high repetitive content.

Best,

ruanjue commented 1 year ago

I am somewhat confused. What bugs of abPOA? That is a different, unrelated program.

V-JJ commented 1 year ago

Good morning!

At the beginning of this issue, you mentioned that "it implemented an adaptive-banded SIMD POA". I looked that up and found a description (abbreviated as abPOA) here. So, I was referring to that algorithm description rather than to the software itself.

Did I miss something?

Best,

ruanjue commented 1 year ago

OK, I see. Let's call it wtpoa to distinguish the two. If you are studying the consensus step, please have a look at the usage:

WTPOA-CNS: Consensuser for wtdbg using PO-MSA
Author: Jue Ruan <ruanjue@gmail.com>
Version: 2.5 (20190621)
Usage: wtpoa-cns [options]
Options:
 -t <int>    Number of threads, [4]
 -d <string> Reference sequences for SAM input, will invoke sorted-SAM input mode
 -u <int>    XORed flags to handle SAM input. [0]
             0x1: Only process reference regions present in/between SAM alignments
             0x2: Don't filter secondary/supplementary SAM records with flag (0x100 | 0x800)
 -r          Force to use reference mode
 -p <string> Similar with -d, but translate SAM into wtdbg layout file
 -i <string> Input file(s) *.ctg.lay from wtdbg, +, [STDIN]
             Or sorted SAM files when having -d/-p
 -o <string> Output files, [STDOUT]
 -e <string> Output the coordinates of original edges in consensus sequences, [NULL]
 -f          Force overwrite
 -j <int>    Expected max length of node, or say the overlap length of two adjacent units in layout file, [1500] bp
             overlap will be default to 1500(or 150 for sam-sr) when having -d/-p, block size will be 2.5 * overlap
 -b <int>    Bonus for tri-bases match, [0]
 -M <int>    Match score, [2]
 -X <int>    Mismatch score, [-5]
 -I <int>    Insertion score, [-2]
 -D <int>    Deletion score, [-4]
 -H <float>  Homopolymer merge score used in dp-call-cns mode, [-3]
 -B <expr>   Bandwidth in POA, [Wmin[,Wmax[,mat_rate]]], mat_rate = matched_bases/total_bases [64,1024,0.92]
             Program will double bandwidth from Wmin to Wmax when mat_rate is lower than setting
 -W <int>    Window size in the middle of the first read for fast align remaining reads, [200]
             If $W is negative, will disable fast align, but use the abs($W) as Band align score cutoff
 -w <int>    Min size of aligned size in window, [$W * 0.5]
 -A          Abort TriPOA when any read cannot be fast aligned, then try POA
 -S <int>    Shuffle mode, 0: don't shuffle reads, 1: by shared kmers, 2: subsampling. [1]
 -R <int>    Realignment bandwidth, 0: disable, [16]
 -c <int>    Consensus mode: 0, run-length; 1, dp-call-cns, [0]
 -C <int>    Min count of bases to call a consensus base, [3]
 -F <float>  Min frequency of non-gap bases to call a consensus base, [0.5]
 -N <int>    Max number of reads in PO-MSA [20]
             Keep in mind that I am not going to generate high accurate consensus sequences here
 -x <string> Presets, []
             sam-sr: polishes contigs from short-read mappings, accepts sorted SAM files
                     short for '-j 50 -W 0 -R 0 -b 1 -c 1 -N 50 -rS 2'
 -q          Quiet
 -v          Verbose
 -V          Print version information and then exit

The major difference between wtpoa and other POA consensus tools is the Tri-POA strategy; see -W. To get more accurate results, please try increasing -N and -R. Also pay attention to -B; it can disable banded POA.
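
For example, a run that trades speed for accuracy could look roughly like this (the values are only illustrative):

```sh
# more reads per PO-MSA (-N), a wider realignment band (-R),
# and a larger/stricter adaptive POA band (-B Wmin,Wmax,mat_rate)
wtpoa-cns -t 16 -i asm.ctg.lay.gz -N 50 -R 32 -B 128,2048,0.96 -fo asm.cns.fa

# per the -W description above, a negative -W disables the fast-align step used by Tri-POA
```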

Jue