Open V-JJ opened 1 year ago
Generally, it implements an adaptive-banded SIMD POA on the layout of reads along the raw contig sequences to correct errors. Please specify which detail you are interested in.
Hello!
We followed these steps (the wtdbg2 protocol from the main GitHub page) to assemble and polish our genome: 1) assemble ~57x PacBio read coverage (genome size ~1.5 Gb) with wtdbg2; 2) get the consensus; 3) polish with PacBio reads; 4) polish with Illumina reads.
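For context, the four steps above can be sketched as the following command sequence, loosely following the recipes in the wtdbg2 README; all file names are placeholders, and the preset/thread choices are illustrative rather than what was actually run:

```shell
# 1) Assembly (~57x PacBio, ~1.5 Gb genome); -x sq is the Sequel preset
wtdbg2 -x sq -g 1.5g -t 16 -i pb_reads.fq.gz -fo dbg

# 2) Consensus from the layout file
wtpoa-cns -t 16 -i dbg.ctg.lay.gz -fo dbg.raw.fa

# 3) Polish with the PacBio reads themselves
minimap2 -ax map-pb -t 16 dbg.raw.fa pb_reads.fq.gz | samtools sort -o pb.srt.bam
samtools view -F 0x900 pb.srt.bam | wtpoa-cns -t 16 -d dbg.raw.fa -i - -fo dbg.cns.fa

# 4) Polish with Illumina short reads (sam-sr preset, sorted SAM on stdin)
bwa index dbg.cns.fa
bwa mem -t 16 dbg.cns.fa sr_1.fq.gz sr_2.fq.gz | samtools sort -O sam - \
  | wtpoa-cns -t 16 -x sam-sr -d dbg.cns.fa -i - -fo dbg.srp.fa
```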
I understand Illumina polishing as a technique to correct assembly errors at the base level, plus some small indels.
In addition, I have tried to look through the abPOA paper and, as far as I can understand, it works as follows:
That said, we were a bit surprised to detect deletions of a few hundred bases when comparing a scaffold polished only with PacBio reads against the same scaffold polished with both PacBio and Illumina short reads. On the other hand, there were few base-level changes.
Thus, since we are beginners in the field of graph theory and related topics, we would like to ask if you could explain how exactly wtpoa-cns works and how it uses the Illumina data.
Many thanks in advance,
wtpoa-cns polishes the contig sequence in a sliding-window manner and concatenates adjacent polished windows using pairwise sequence alignment. 1) This may be caused by a mis-join or another program bug; please describe those deletions in more detail, e.g. their sizes. 2) After PacBio polishing, there SHOULD be few base errors left. 3) I tried using base quality in SMARTdenovo (see fast5q), but I think it is not so important with high-coverage PacBio long reads together with Illumina short reads.
Hope it helps.
Good morning!
Many thanks, it was very helpful!
I forgot to mention that our genome is highly repetitive (~60%).
I have checked what you asked for: the indel sizes. For that, I used ~400 contigs that represent around 50% of the total genome size.
```shell
minimap2 -cx asm5 -t 16 $fasta_dir/$ref_seq $fasta_dir/$query_seq > $out_dir/$output.paf
paftools.js stat $out_dir/$output.paf > $out_dir/$output.paf.stats
```
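To locate the larger deletions directly rather than only counting them, one can walk the CIGAR strings in the PAF; a small awk sketch, assuming minimap2 was run with `-c` so the `cg:Z:` tag is present (the 100 bp cutoff and the file name are placeholders):

```shell
# Walk the cg:Z: CIGAR tag of each PAF record and print contig name,
# 0-based contig position, and length of every deletion >= 100 bp.
scan_dels='
{
  cg = ""; off = 0
  for (i = 13; i <= NF; i++) if ($i ~ /^cg:Z:/) cg = substr($i, 6)
  while (match(cg, /^[0-9]+[MIDNSHP=X]/)) {
    len = substr(cg, 1, RLENGTH - 1) + 0
    op  = substr(cg, RLENGTH, 1)
    if (op == "D" && len >= 100) print $6, $8 + off, len
    if (op ~ /[MDNX=]/) off += len   # ops that consume the target sequence
    cg = substr(cg, RLENGTH + 1)
  }
}'
# Demo on a made-up PAF record: a 150 bp deletion after 40 matched bases
printf "q1\t100\t0\t100\t+\tctg1\t1000\t50\t300\t90\t250\t60\tcg:Z:40M150D60M\n" \
  | awk -F'\t' "$scan_dels"
# prints: ctg1 90 150
# Real use: awk -F'\t' "$scan_dels" output.paf
```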
Here you have our results, considering all 400 contigs. What are your thoughts on this?

| Case | Count | Average (per contig) |
|---|---|---|
| Number of substitutions | 13928980 | 34822.45 |
| Number of insertions in [0,50) | 3494576 | 8736.44 |
| Number of insertions in [50,100) | 39187 | 97.9675 |
| Number of insertions in [100,300) | 33075 | 82.6875 |
| Number of insertions in [300,400) | 1304 | 3.26 |
| Number of insertions in [400,1000) | 572 | 1.43 |
| Number of insertions in [1000,inf) | 5 | 0.0125 |
| Number of deletions in [0,50) | 2883659 | 7209.1475 |
| Number of deletions in [50,100) | 1366 | 3.415 |
| Number of deletions in [100,300) | 71 | 0.1775 |
| Number of deletions in [300,400) | 0 | 0 |
| Number of deletions in [400,1000) | 0 | 0 |
| Number of deletions in [1000,inf) | 0 | 0 |
Thanks,
I am surprised by the results: so many differences after Illumina polishing. Two things might help: 1) Merqury, to assess the base quality of those two assemblies and check whether the second polishing improved it. BTW, you can try a new evaluation tool, GAAP (https://github.com/ruanjue/GAAP), developed in my group. 2) racon or other polishing tools, then compare again to find which is correct.
Good morning!
Many thanks for the suggestions!
Here are the Merqury and BUSCO results.
| | PacBio polishing | PacBio + Illumina polishing |
|---|---|---|
| Merqury assembly_qv | 18.4795 (err: 0.0142) | 22.7936 (err: 0.0053) |
| BUSCO | ~90% in relevant datasets | ~93% in relevant datasets |
| | PacBio polishing | PacBio + Illumina polishing |
|---|---|---|
| ctg1 length | 5160833 | 5079035 |
| Merqury ctg1_qv | 19.9081 | 26.5608 |
| ctg10 length | 3787805 | 3721571 |
| Merqury ctg10_qv | 19.0716 | 26.3825 |
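As a sanity check on these numbers: Merqury's QV is just the Phred-scaled per-base error rate, QV = -10 * log10(err), so the QV and err columns should agree. A quick check with a small helper function (the function name is my own):

```shell
# Phred-scaled QV from a per-base error rate: QV = -10 * log10(err)
qv_from_err() { awk -v e="$1" 'BEGIN { printf "%.2f\n", -10 * log(e) / log(10) }'; }

qv_from_err 0.0142   # prints 18.48, matching assembly_qv 18.4795 above
qv_from_err 0.0053   # prints 22.76, close to 22.7936 (err is rounded to 4 digits)
```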
All things considered, I tend toward the opinion that a couple of factors are involved here: the abPOA method (its design, and perhaps some bugs?) as well as the high repetitive content.
Best,
I am somewhat confused. What bugs of abPOA? That is another, unrelated program.
Good morning!
At the beginning of this issue, you mentioned "it implemented an adaptive-banded SIMD POA". I looked that up and found an algorithm description (abbreviated as abPOA) here. So I was referring to that algorithm description, rather than to the software itself.
Did I miss something?
Best,
Ok, I see. Let's call it wtpoa to distinguish the two. If you are studying the consensus step, please have a look at the usage.
```
WTPOA-CNS: Consensuser for wtdbg using PO-MSA
Author: Jue Ruan <ruanjue@gmail.com>
Version: 2.5 (20190621)
Usage: wtpoa-cns [options]
Options:
 -t <int>    Number of threads, [4]
 -d <string> Reference sequences for SAM input, will invoke sorted-SAM input mode
 -u <int>    XORed flags to handle SAM input. [0]
             0x1: Only process reference regions present in/between SAM alignments
             0x2: Don't fileter secondary/supplementary SAM records with flag (0x100 | 0x800)
 -r          Force to use reference mode
 -p <string> Similar with -d, but translate SAM into wtdbg layout file
 -i <string> Input file(s) *.ctg.lay from wtdbg, +, [STDIN]
             Or sorted SAM files when having -d/-p
 -o <string> Output files, [STDOUT]
 -e <string> Output the coordinates of orignal edges in consensus sequences, [NULL]
 -f          Force overwrite
 -j <int>    Expected max length of node, or say the overlap length of two adjacent units in layout file, [1500] bp
             overlap will be default to 1500(or 150 for sam-sr) when having -d/-p, block size will be 2.5 * overlap
 -b <int>    Bonus for tri-bases match, [0]
 -M <int>    Match score, [2]
 -X <int>    Mismatch score, [-5]
 -I <int>    Insertion score, [-2]
 -D <int>    Deletion score, [-4]
 -H <float>  Homopolymer merge score used in dp-call-cns mode, [-3]
 -B <expr>   Bandwidth in POA, [Wmin[,Wmax[,mat_rate]]], mat_rate = matched_bases/total_bases [64,1024,0.92]
             Program will double bandwidth from Wmin to Wmax when mat_rate is lower than setting
 -W <int>    Window size in the middle of the first read for fast align remaining reads, [200]
             If $W is negative, will disable fast align, but use the abs($W) as Band align score cutoff
 -w <int>    Min size of aligned size in window, [$W * 0.5]
 -A          Abort TriPOA when any read cannot be fast aligned, then try POA
 -S <int>    Shuffle mode, 0: don't shuffle reads, 1: by shared kmers, 2: subsampling. [1]
 -R <int>    Realignment bandwidth, 0: disable, [16]
 -c <int>    Consensus mode: 0, run-length; 1, dp-call-cns, [0]
 -C <int>    Min count of bases to call a consensus base, [3]
 -F <float>  Min frequency of non-gap bases to call a consensus base, [0.5]
 -N <int>    Max number of reads in PO-MSA [20]
             Keep in mind that I am not going to generate high accurate consensus sequences here
 -x <string> Presets, []
             sam-sr: polishs contigs from short reads mapping, accepts sorted SAM files
                     shorted for '-j 50 -W 0 -R 0 -b 1 -c 1 -N 50 -rS 2'
 -q          Quiet
 -v          Verbose
 -V          Print version information and then exit
```
The major difference between wtpoa and other POA consensusers is the Tri-POA strategy, see `-W`. To get accurate results, please try to increase `-N` and `-R`. Also pay attention to `-B`; it can disable banded-POA.
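Following that note, a higher-accuracy (and slower) PacBio polishing invocation might look like the command below; the file names are placeholders and the specific values for `-N`, `-R`, and `-B` are illustrative choices, not recommendations from the author:

```shell
# Illustrative only: more reads per PO-MSA (-N), realignment band kept on (-R),
# and a wider POA band (-B) that is doubled up to 2048 when match rate < 0.95
samtools view -F 0x900 pb.srt.bam \
  | wtpoa-cns -t 16 -d dbg.raw.fa -N 40 -R 16 -B 128,2048,0.95 -i - -fo dbg.cns.fa
```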
Jue
Good morning!
I would like to ask if you can provide a more explicit explanation of what "wtpoa-cns" does. I've been looking around in the literature and on the internet, and I've only been able to find that it is a "Consensuser for wtdbg using PO-MSA".
Thanks in advance