ruanjue / wtdbg2

Redbean: A fuzzy Bruijn graph approach to long noisy reads assembly

wtpoa-cns polishing details #249

Open V-JJ opened 1 year ago

V-JJ commented 1 year ago

Good morning!

I would like to ask if you can provide a more explicit explanation of what "wtpoa-cns" does. I've been looking around in the literature and on the internet, and I've only been able to find that it is a "Consensuser for wtdbg using PO-MSA".

Thanks in advance

ruanjue commented 1 year ago

Generally, it implements an adaptive-banded SIMD POA on the layout of reads along the raw contig sequences to correct errors. Please specify which detail you are interested in.

V-JJ commented 1 year ago

Hello!

We followed these steps (the wtdbg2 protocol from the main GitHub page) to assemble and polish our genome:

1) ~57x PacBio read coverage (genome size ~1.5 Gb) assembled with wtdbg2
2) Get the consensus
3) Polish with PacBio reads
4) Polish with Illumina reads
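
For reference, the corresponding commands from the wtdbg2 README look roughly like the sketch below (prefixes and read file names are placeholders for our actual files):

```sh
# 1) assemble ~57x PacBio reads with wtdbg2 (preset -x rs/sq/ccs depends on the chemistry)
wtdbg2 -x rs -g 1.5g -t 16 -i pb_reads.fa.gz -fo asm

# 2) derive the consensus from the layout
wtpoa-cns -t 16 -i asm.ctg.lay.gz -fo asm.raw.fa

# 3) polish with the PacBio reads themselves
minimap2 -t 16 -ax map-pb -r2k asm.raw.fa pb_reads.fa.gz | samtools sort -@ 4 > asm.bam
samtools view -F 0x900 asm.bam | wtpoa-cns -t 16 -d asm.raw.fa -i - -fo asm.cns.fa

# 4) additional polishing with Illumina short reads (sam-sr preset)
bwa index asm.cns.fa
bwa mem -t 16 asm.cns.fa sr_1.fq.gz sr_2.fq.gz | samtools sort -O SAM | wtpoa-cns -t 16 -x sam-sr -d asm.cns.fa -i - -fo asm.srp.fa
```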

I understand Illumina polishing as a technique to correct assembly errors at the base level and some small indels.

In addition, I have tried to look through the abPOA paper and, as far as I can understand, it works as follows:

That said, we were a bit surprised to detect deletions of a few hundred bases when comparing a scaffold polished only with PacBio reads against the same scaffold polished with both PacBio reads and Illumina short reads. On the other hand, there were few base-level changes.

Thus, since we are quite beginners in the field of graph theory and related topics, we would like to ask if you could explain how exactly wtpoa-cns works and uses Illumina data, so that we can understand:

Many thanks in advance,

ruanjue commented 1 year ago

wtpoa-cns polishes the contig sequence in a sliding-window manner and concatenates adjacent polished windows using pairwise sequence alignment.

1) This may be caused by a mis-join or another program bug. Please describe those deletions in more detail, e.g. their sizes (one way to tabulate them is sketched after this list).
2) After PacBio polishing, there SHOULD be few base errors left.
3) I tried using base qualities in SMARTdenovo (see fast5q), but I think they are not so important with high-coverage PacBio long reads together with Illumina short reads.
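
For instance (not necessarily how you compared them), a size table for the differences between the two polished versions can be produced from a minimap2/paftools alignment along these lines; file names are placeholders:

```sh
# align the PacBio-only polished contigs against the PacBio+Illumina polished ones
# and call the differences; the output lists substitutions and indels,
# whose sizes can then be binned
minimap2 -cx asm5 --cs pb_only.polished.fa pb_plus_illumina.polished.fa \
  | sort -k6,6 -k8,8n \
  | paftools.js call - > diffs.txt
```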

Hope it helps.

V-JJ commented 1 year ago

Good morning!

Many thanks, it was very helpful!

I forgot to mention that our genome is highly repetitive (~60%).

I have checked what you asked for: the indel sizes. For that, I used ~400 contigs that represent around 50% of the total genome size.

Here you have our results, considering all 400 contigs. What are your thoughts on this?

| Case | Count | Average (per contig) |
| --- | ---: | ---: |
| Number of substitutions | 13928980 | 34822.45 |
| Number of insertions in [0,50) | 3494576 | 8736.44 |
| Number of insertions in [50,100) | 39187 | 97.9675 |
| Number of insertions in [100,300) | 33075 | 82.6875 |
| Number of insertions in [300,400) | 1304 | 3.26 |
| Number of insertions in [400,1000) | 572 | 1.43 |
| Number of insertions in [1000,inf) | 5 | 0.0125 |
| Number of deletions in [0,50) | 2883659 | 7209.1475 |
| Number of deletions in [50,100) | 1366 | 3.415 |
| Number of deletions in [100,300) | 71 | 0.1775 |
| Number of deletions in [300,400) | 0 | 0 |
| Number of deletions in [400,1000) | 0 | 0 |
| Number of deletions in [1000,inf) | 0 | 0 |

Thanks,

ruanjue commented 1 year ago

I am surprised by the results: so many differences after Illumina polishing. Two things might help:

1) Run Merqury to assess the base quality of the two assemblies and see whether the second round of polishing improved it (example commands below). BTW, you can also try GAAP (https://github.com/ruanjue/GAAP), a new evaluation tool developed in my group.
2) Run racon or another polishing tool and compare again to find out which result is correct.
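
For Merqury, the commands look roughly like the following (k-mer size and file names are placeholders; adjust them for your data):

```sh
# build a k-mer database from the Illumina reads
meryl k=21 count output illumina.meryl sr_1.fq.gz sr_2.fq.gz

# QV and completeness for each assembly version
merqury.sh illumina.meryl asm.pb_polished.fa pb_only
merqury.sh illumina.meryl asm.pb_illumina_polished.fa pb_plus_illumina
```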

V-JJ commented 1 year ago

Good morning!

Many thanks for the suggestions!

Here are the Merqury and BUSCO results.

|  | PacBio polishing | PacBio + Illumina polishing |
| --- | --- | --- |
| Merqury assembly_qv | 18.4795 (err: 0.0142) | 22.7936 (err: 0.0053) |
| BUSCO | ~90% in relevant datasets | ~93% in relevant datasets |

|  | PacBio polishing | PacBio + Illumina polishing |
| --- | --- | --- |
| ctg1 length | 5160833 | 5079035 |
| Merqury ctg1_qv | 19.9081 | 26.5608 |
| ctg10 length | 3787805 | 3721571 |
| Merqury ctg10_qv | 19.0716 | 26.3825 |

All things considered, I tend toward the opinion that a couple of factors are involved here: the abPOA method (its design and perhaps some bugs?) as well as the high repetitive content.

Best,

ruanjue commented 1 year ago

I am somewhat confused. What bugs of abPOA? That is a different, unrelated program.

V-JJ commented 1 year ago

Good morning!

At the beginning of this issue, you mentioned that "it implemented an adaptive-banded SIMD POA". I looked that up and found a description (abbreviated as abPOA) here. So, I was referring to that algorithm description rather than to the software itself.

Did I miss something?

Best,

ruanjue commented 1 year ago

OK, I see. Let's call it wtpoa to distinguish the two. If you are studying the consensus step, please have a look at the usage:

WTPOA-CNS: Consensuser for wtdbg using PO-MSA
Author: Jue Ruan <ruanjue@gmail.com>
Version: 2.5 (20190621)
Usage: wtpoa-cns [options]
Options:
 -t <int>    Number of threads, [4]
 -d <string> Reference sequences for SAM input, will invoke sorted-SAM input mode
 -u <int>    XORed flags to handle SAM input. [0]
             0x1: Only process reference regions present in/between SAM alignments
             0x2: Don't filter secondary/supplementary SAM records with flag (0x100 | 0x800)
 -r          Force to use reference mode
 -p <string> Similar with -d, but translate SAM into wtdbg layout file
 -i <string> Input file(s) *.ctg.lay from wtdbg, +, [STDIN]
             Or sorted SAM files when having -d/-p
 -o <string> Output files, [STDOUT]
 -e <string> Output the coordinates of original edges in consensus sequences, [NULL]
 -f          Force overwrite
 -j <int>    Expected max length of node, or say the overlap length of two adjacent units in layout file, [1500] bp
             overlap will be default to 1500(or 150 for sam-sr) when having -d/-p, block size will be 2.5 * overlap
 -b <int>    Bonus for tri-bases match, [0]
 -M <int>    Match score, [2]
 -X <int>    Mismatch score, [-5]
 -I <int>    Insertion score, [-2]
 -D <int>    Deletion score, [-4]
 -H <float>  Homopolymer merge score used in dp-call-cns mode, [-3]
 -B <expr>   Bandwidth in POA, [Wmin[,Wmax[,mat_rate]]], mat_rate = matched_bases/total_bases [64,1024,0.92]
             Program will double bandwidth from Wmin to Wmax when mat_rate is lower than setting
 -W <int>    Window size in the middle of the first read for fast align remaining reads, [200]
             If $W is negative, will disable fast align, but use the abs($W) as Band align score cutoff
 -w <int>    Min size of aligned size in window, [$W * 0.5]
 -A          Abort TriPOA when any read cannot be fast aligned, then try POA
 -S <int>    Shuffle mode, 0: don't shuffle reads, 1: by shared kmers, 2: subsampling. [1]
 -R <int>    Realignment bandwidth, 0: disable, [16]
 -c <int>    Consensus mode: 0, run-length; 1, dp-call-cns, [0]
 -C <int>    Min count of bases to call a consensus base, [3]
 -F <float>  Min frequency of non-gap bases to call a consensus base, [0.5]
 -N <int>    Max number of reads in PO-MSA [20]
             Keep in mind that I am not going to generate high accurate consensus sequences here
 -x <string> Presets, []
             sam-sr: polishes contigs from short-read mappings, accepts sorted SAM files
                     short for '-j 50 -W 0 -R 0 -b 1 -c 1 -N 50 -rS 2'
 -q          Quiet
 -v          Verbose
 -V          Print version information and then exit

The major difference between wtpoa and other POA consensus tools is the Tri-POA strategy; see -W. To get more accurate results, please try increasing -N and -R. Also pay attention to -B; it can disable banded POA.
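
For example, a run that trades speed for accuracy could look roughly like this (the values are only illustrative):

```sh
# more reads per PO-MSA (-N), a wider realignment band (-R),
# and a larger/stricter adaptive POA band (-B Wmin,Wmax,mat_rate)
wtpoa-cns -t 16 -i asm.ctg.lay.gz -N 50 -R 32 -B 128,2048,0.96 -fo asm.cns.fa

# per the -W description above, a negative -W disables the fast-align step used by Tri-POA
```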

Jue