ruanjue / wtdbg2

Redbean: A fuzzy Bruijn graph approach to long noisy reads assembly
GNU General Public License v3.0

Consensus step: genome size lower than pre-consensus stage #231

Open shri1984 opened 3 years ago

shri1984 commented 3 years ago

Hi, I am getting 12% fewer bases post-consensus for my genome (complex and big, 100X coverage). I have checked whether there are contigs missing between the .lay and cns.raw.fa files, and I see none. I just wonder what is driving this, or is it normal to lose that many bases in the consensus stage? I also wonder whether there are any parameters in wtpoa-cns I can tweak? Thank you.
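(For anyone running the same check: a minimal sketch of the contig-name comparison described above, assuming contig records in the layout file start with '>'; the file names asm.ctg.lay.gz and asm.cns.raw.fa are placeholders.)

```sh
# List contig names in the layout vs. the consensus FASTA.
zcat asm.ctg.lay.gz | grep '^>' | cut -d' ' -f1 | sed 's/^>//' | sort > lay.ids
grep '^>' asm.cns.raw.fa | cut -d' ' -f1 | sed 's/^>//' | sort > cns.ids
comm -23 lay.ids cns.ids   # contigs present in the layout but absent from the consensus
```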

ruanjue commented 3 years ago

The first step is to check whether the polished contigs are more accurate using WGS short reads.

shri1984 commented 3 years ago

I did that. I used polca.sh for polishing with 1 billion Illumina PE reads (150 bp paired-end). This is the report:

Substitution Errors: 991295
Insertion/Deletion Errors: 663392
Assembly Size: 5498989353
Consensus Quality: 99.9699
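(For reference, a POLCA run of the kind described here is typically invoked along these lines; a sketch only, with placeholder paths, so check the flags against the polca.sh in your MaSuRCA installation.)

```sh
# Polish the consensus with paired-end Illumina reads via POLCA.
# asm.cns.raw.fa and the read paths are placeholders.
polca.sh -a asm.cns.raw.fa \
         -r 'reads_R1.fastq.gz reads_R2.fastq.gz' \
         -t 16
```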

ruanjue commented 3 years ago

So the reason should be that many repeats were collapsed during assembly, not a problem with wtpoa-cns. One option is to add -R to wtdbg2, which will be about 2X slower. Another option: try flye or another assembler on this dataset and pick the best assembly.
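(A sketch of the suggested rerun with realignment enabled; the preset, genome size, thread count, and file names are placeholders to adapt to your data.)

```sh
# -R enables realignment mode: roughly 2x slower, but less aggressive
# about collapsing similar repeat copies into one contig.
wtdbg2 -x sq -g 5.5g -t 32 -R -i clr_reads.fa.gz -fo asm_R
wtpoa-cns -t 32 -i asm_R.ctg.lay.gz -fo asm_R.cns.raw.fa
```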

shri1984 commented 3 years ago

Thanks, I see. I used the options you suggested for repetitive genomes in issue #230 (-R, --aln-dovetail -1 or 1024, -l 500, -K 2000, etc.). It worked beautifully, but things go wrong at the cns stage. Is there any other consensus-calling tool like wtpoa-cns that I can try and compare against?
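(For concreteness, the option combination described above might look like this; the values are copied from the comment, while the preset, -g, -t, and file names are placeholders.)

```sh
# Repeat-aware settings from issue #230, as listed above.
wtdbg2 -x rs -g 5.5g -t 32 -R --aln-dovetail 1024 -l 500 -K 2000 \
       -i clr_reads.fa.gz -fo asm_rep
```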

ruanjue commented 3 years ago

In the wtdbg2 step, the reported assembly size is based on uncorrected sequence length, so it usually becomes smaller after wtpoa-cns.
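(One way to quantify this shrinkage is to compare total bases before and after consensus. A sketch, assuming contig headers in the layout carry a len= field, which is an assumption to verify against your wtdbg2 version, and placeholder file names.)

```sh
# Pre-consensus size: sum the len= field on contig header lines of the
# layout (field name assumed; inspect the headers first).
zcat asm.ctg.lay.gz | \
  awk '/^>/{for(i=1;i<=NF;i++) if($i~/^len=/){sub(/^len=/,"",$i); n+=$i}} END{print n}'
# Post-consensus size: sum sequence lengths in the consensus FASTA.
awk '!/^>/{n+=length($0)} END{print n}' asm.cns.raw.fa
```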

shri1984 commented 3 years ago

Do you know what an acceptable limit for this reduction is? In my case it is 12%. The data come from 7 SMRT cells of Sequel CLR, and I am using the RS preset; results only started to look good with that preset, which I again picked up from other issues you addressed here. So you think I have no way out of this problem?

ruanjue commented 3 years ago

If the genome size was correctly estimated and the genome is complicated, maybe there is no way. However, please find some contigs whose sizes differ greatly before and after polishing, then align their CLR long reads to the consensus sequences to see whether there are big insertions/deletions. If you find many such cases, there may be errors when wtpoa-cns concatenates consensus sequence pieces.
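(A sketch of that check with minimap2 and samtools; the tools are standard, but all file names are placeholders and the 1 kb indel threshold is an arbitrary choice.)

```sh
# Map the CLR reads back to the consensus and sort the alignments.
minimap2 -t 16 -ax map-pb asm.cns.raw.fa clr_reads.fa.gz | \
  samtools sort -@ 4 -o aln.bam
samtools index aln.bam
# Print alignments whose CIGAR contains an insertion or deletion of
# 1000 bp or more (a 4+ digit length before an I or D operation).
samtools view aln.bam | awk '$6 ~ /[0-9][0-9][0-9][0-9]+[ID]/ {print $3, $4, $6}' | head
```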

goblin290272908 commented 2 years ago

Hi, thank you for providing such excellent tools. We rely on wtdbg2 to assemble a genome from CCS data, and for our data its result is clearly better than hifiasm's. Using the default parameters (plus -g 1.3g), the direct output reaches 1892 contigs with an N50 of 3 Mb, and the BUSCO evaluation is above 95%. However, the genome size is still too small compared with the estimated size; we obtain only an 880 Mb assembly. How can we adjust the parameters so that our result is close to the estimated genome size? Thank you!

ruanjue commented 2 years ago

wtdbg2 tends to collapse similar regions. For your case, please try raising '-s 0.5' to '-s 0.8' or another value.
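(A sketch of the suggested rerun, assuming the CCS preset; the thread count and file names are placeholders.)

```sh
# Raising -s (minimum alignment similarity) makes wtdbg2 less likely to
# merge diverged repeat copies, at the cost of a more fragmented graph.
wtdbg2 -x ccs -g 1.3g -t 32 -s 0.8 -i ccs_reads.fq.gz -fo asm_s08
wtpoa-cns -t 32 -i asm_s08.ctg.lay.gz -fo asm_s08.ctg.fa
```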

goblin290272908 commented 2 years ago

Thank you very much! I will try it.