ruanjue / wtdbg2

Redbean: A fuzzy Bruijn graph approach to long noisy reads assembly
GNU General Public License v3.0

a bug for --ctg-min-length? #41

Closed bitcometz closed 6 years ago

bitcometz commented 6 years ago

Hello, I found that the assembly for each project always contains one sequence shorter than the minimum contig length (5k), such as 1.7k, 4k, ...

--ctg-min-length Min length of contigs to be output, 5000

Thanks

ruanjue commented 6 years ago

It is caused by wtpoa-cns failing to produce a consensus sequence in low-coverage regions. I checked one contig shorter than 5k: its original layout in prefix.ctg.lay is fine, but consensus calling failed in wtpoa-cns.

I am revising wtpoa-cns so that it generates a 'consensus' even when the read coverage is less than 3.

Jue

bitcometz commented 6 years ago

Thanks !!!

DexinBo commented 4 years ago

Hello, I am using Wtdbg2.5 to assemble a worm genome of about 130 Mb with heterozygosity of ~1%. I tried several times and hit the same problem with --ctg-min-length 5000: I can still find some short contigs (about 400-1000 bp) in prefix.raw.fa and prefix.cns.fa. I also checked prefix.ctg.lay.gz, and no contig there is shorter than 5000 bp, so I think the main cause is still the wtpoa-cns step. Can you help me solve it? Thank you!

Best, Bo

ruanjue commented 4 years ago

Yes, a contig whose layout length is >= 5000 bp may still yield a sequence shorter than 5k after wtpoa-cns. Reasons: (1) variation in the estimate of contig length; (2) some layouts may fail to call consensus over their full length. Please filter out the shorter contigs after wtpoa-cns.
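For readers who want to apply the suggested post-filtering, here is a minimal sketch in plain Python. It is not part of wtdbg2: the function names `read_fasta` and `filter_fasta`, the 60-column line width, and the 5000 bp default are assumptions chosen to mirror the --ctg-min-length value discussed in this thread.

```python
def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def filter_fasta(in_handle, out_handle, min_length=5000, width=60):
    """Copy records whose sequence length is >= min_length.

    Returns the number of records written. Sequences are re-wrapped
    to `width` columns (hypothetical default, not a wtdbg2 setting).
    """
    kept = 0
    for header, seq in read_fasta(in_handle):
        if len(seq) < min_length:
            continue  # drop contigs shorter than the threshold
        out_handle.write(f">{header}\n")
        for start in range(0, len(seq), width):
            out_handle.write(seq[start:start + width] + "\n")
        kept += 1
    return kept
```

To use it, open the consensus file (for example prefix.cns.fa) for reading and an output file of your choice for writing, then call `filter_fasta(src, dst, min_length=5000)`. Existing tools that filter FASTA by length would serve equally well.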

DexinBo commented 4 years ago

Thank you, I will try to do it the way you said.

Another question: when I use SMARTdenovo to assemble different read coverages (30x, 50x, 80x, 100x and all reads), I get assemblies of 123, 139, 158, 170 and 181 Mb respectively. With Wtdbg2 on the same 30x, 50x, 80x, 100x and all-reads datasets, I get 133, 136, 136, 136 and 141 Mb. Taking the heterozygosity of the genome into account, I think this genome might be bigger than the expected size, so which dataset and which software would you prefer?

PS: I tried Canu, Falcon, Flye, Mecat2, ..., and only your software gives the best result, at least in terms of genome size and N50.

BTW, SMARTdenovo usually assembles 400-800 contigs, but Wtdbg2 assembles 1700-2200 contigs, of which about 1000 fall in the N90-N100 range. These small contigs cause some difficulty in Hi-C scaffolding, so can I simply trust SMARTdenovo to get longer contigs, or should I use Wtdbg2 and filter out the small contigs?

Sorry for bothering :)

Best , Bo
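The N50 and N90-N100 figures mentioned in the comment above can be checked for any assembly with a few lines of Python. This is a minimal sketch under the usual definition of Nx (the contig length at which the cumulative length of contigs, sorted longest first, reaches x% of the total); the function name `nx_stats` is hypothetical and not part of Wtdbg2 or SMARTdenovo.

```python
def nx_stats(lengths, fraction):
    """Return (Nx length, number of contigs needed to reach it).

    `fraction` is 0.5 for N50, 0.9 for N90, and so on.
    Returns (0, 0) for an empty list of lengths.
    """
    ordered = sorted(lengths, reverse=True)
    target = sum(ordered) * fraction
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= target:
            return length, count
    return 0, 0


# Toy contig lengths, for illustration only (not data from this thread).
toy_lengths = [10, 8, 6, 4, 2]
n50, n50_count = nx_stats(toy_lengths, 0.5)
n90, n90_count = nx_stats(toy_lengths, 0.9)
# Contigs beyond the N90 point, i.e. the N90-N100 tail.
tail_contigs = len(toy_lengths) - n90_count
```

Comparing `tail_contigs` before and after length filtering shows how much of the contig count sits in the short tail, which is the part that complicates scaffolding in the case described.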