ruanjue / wtdbg2

Redbean: A fuzzy Bruijn graph approach to long noisy reads assembly
GNU General Public License v3.0

about the problem of wtdbg2 assembly #108

Closed wangzhongkai1 closed 5 years ago

wangzhongkai1 commented 5 years ago

Dear jue,

Thank you for the WTDBG genome assembly tool. Unfortunately, I have been stuck on a problem for a long time. At first I thought it was caused by limited computing resources, but when I switched to a larger supercomputer with 104 cores and 1.4 TB of memory, the problem appeared again no matter how I changed the parameters.

The error is always the same: "wtdbg2: kbm.h:569: split_FIXP_kmers_kbm: Assertion `rs[0]->n_head > 0' failed." The parameters were "-p 17 -k 3 -AS 4 -K 0.05 -s 0.5 -t 100". I have also tried other parameter sets such as "-p 19 -k 3", "-p 17 -k 3", "-p 15 -k 3", and "-p 10 -k 10".

Could you please help me debug this problem?

Best wishes, Zhongkai Wang, NWPU

wangzhongkai1 commented 5 years ago

By the way, I have already tried the default parameters, but I would like better results. @ruanjue

ruanjue commented 5 years ago

I found two possible causes of this problem; see https://github.com/ruanjue/wtdbg2/commit/71ab37f9774f9142b5023519b0ffe64c3536a959 . Please try the latest commit and tell me whether the error occurs again.

Thanks, Jue

wangzhongkai1 commented 5 years ago

Thanks for your reply. I am glad to tell you that the problem has vanished! However, I found that v2.4 is much slower than v2.1 (I chose v2.1 because it is the only version of wtdbg that the server can run when I set K to a nonzero value). Could you please tell me the main differences between these two versions? Also, the results of v2.1 seem better than those of v2.4, so which one is more reliable?

Best wishes, Kai

ruanjue commented 5 years ago

Thanks for the good news.

From v2.1 to v2.4, the major improvement was support for huge genomes: there is no limit on genome size or data size, and the output is compressed to save disk space. A realignment module was also introduced, among other changes. So I suggest using v2.4. The slowdown had been reported in another issue and has been fixed.

Jue

wangzhongkai1 commented 5 years ago

Thank you. Honestly, I have always followed the updates to your software, because my group has been using wtdbg since version 1.2. It has helped our projects a great deal, and I thank you again for your software and your insight. But as I said in my last email, v2.4 is almost 15 times slower than v2.1. For example, for a genome estimated at 1.4 Gb, v2.1 took only 1 hour to finish the whole wtdbg run, whereas v2.4 had generated only about 1/4 of the alignments.gz file after nearly 4 hours.
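
The slowdown estimate in this comment can be checked with a rough extrapolation. The sketch below assumes that alignment output grows roughly linearly with elapsed time and that the "1/4" figure is accurate; both are assumptions, and the function and variable names are hypothetical.

```python
# Figures taken from the comment above.
V21_TOTAL_HOURS = 1.0      # whole wtdbg run with v2.1
V24_ELAPSED_HOURS = 4.0    # time spent so far with v2.4
V24_FRACTION_DONE = 0.25   # share of alignments.gz written so far


def projected_hours(elapsed_hours, fraction_done):
    """Linear extrapolation of total time from partial progress (assumption)."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed_hours / fraction_done


v24_alignment_hours = projected_hours(V24_ELAPSED_HOURS, V24_FRACTION_DONE)
slowdown = v24_alignment_hours / V21_TOTAL_HOURS

print(f"projected v2.4 alignment time: {v24_alignment_hours:.0f} h")
print(f"slowdown relative to the full v2.1 run: about {slowdown:.0f}x")
```

Under these assumptions the alignment step alone would need about 16 hours, which is consistent with the "almost 15 times slower" estimate.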


ruanjue commented 5 years ago

Ok, I will compare v2.1 and v2.4 on a small genome, and give feedback.

wangzhongkai1 commented 5 years ago

Looking forward to your reply!

thank you~


ruanjue commented 5 years ago

On the C. elegans dataset, v2.4 took a bit more runtime, but that can be explained by the compressed output. Were you using the released v2.4 or the latest commit? Please have a look at https://github.com/ruanjue/wtdbg2/issues/107 .

wangzhongkai1 commented 5 years ago

Dear Jue,

I checked #107 and saw that you changed the step of _timeout.tv_nsec from 10000 to 1000000. I then checked my copy of thread.h and found that it already has the new value, 1000000. Since you do not find the runtime unusual, I plan to continue using v2.1. Could you please tell me whether I can trust the genome assembly produced by v2.1? After all, we must consider both accuracy and time cost.

Best wishes, Zhongkai Wang, NWPU
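
To put the two tv_nsec values in perspective: tv_nsec is the nanosecond field of a POSIX timespec, so the change is from 10 microseconds to 1 millisecond. The sketch below only converts units. The "wakeups per second" figure assumes the value is used as the interval at which a waiting thread wakes to re-check its state, which is an assumption about how thread.h uses it, not something confirmed in this thread. The helper name is hypothetical.

```python
NS_PER_SECOND = 1_000_000_000


def describe_step(step_ns):
    """Convert a tv_nsec step into friendlier units.

    max_wakeups_per_second assumes one wakeup per step (an assumption
    about how the timeout is used, not taken from the wtdbg2 source).
    """
    return {
        "microseconds": step_ns / 1_000,
        "milliseconds": step_ns / 1_000_000,
        "max_wakeups_per_second": NS_PER_SECOND // step_ns,
    }


old_step = describe_step(10_000)      # value before the change in #107
new_step = describe_step(1_000_000)   # value after the change

for label, step in (("old", old_step), ("new", new_step)):
    print(label, step)
```

Under that assumption, the larger step would cut the number of idle wakeups per waiting thread by a factor of 100.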


ruanjue commented 5 years ago

On human CHM1 PacBio dataset:

V2.1
wtdbg2.1 -t 96 -i ../rawdata/rawdata.fa -p 21 -S 4 -L 5000 -fo dbg
** PROC_STAT(TOTAL) **: real 19116.111 sec, user 739810.620 sec, sys 65993.690 sec, maxrss 301417272.0 kB, maxvsize 336782628.0 kB
V2.4
wtdbg2.4 -t 96 -i rawdata.fa -fo dbg -x rs -g 3g
** PROC_STAT(TOTAL) **: real 10496.099 sec, user 384599.160 sec, sys 98576.170 sec, maxrss 231329464.0 kB, maxvsize 254211140.0 kB

In addition, V2.4 got a better N50 value than V2.1.

Best, Jue
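
The two PROC_STAT lines above can be compared directly. The following is a minimal sketch that parses them and derives wall-clock speedup, total CPU time, and peak memory. The regular expression and helper names are hypothetical, and kB is assumed to mean 1024 bytes. Note that the two runs used different options (-p 21 -S 4 -L 5000 versus -x rs -g 3g), so this is not a like-for-like comparison.

```python
import re

# PROC_STAT lines copied verbatim from the comment above.
LOGS = {
    "v2.1": "** PROC_STAT(TOTAL) **: real 19116.111 sec, user 739810.620 sec, "
            "sys 65993.690 sec, maxrss 301417272.0 kB, maxvsize 336782628.0 kB",
    "v2.4": "** PROC_STAT(TOTAL) **: real 10496.099 sec, user 384599.160 sec, "
            "sys 98576.170 sec, maxrss 231329464.0 kB, maxvsize 254211140.0 kB",
}

# Matches "<name> <number> <unit>" for the five reported fields.
FIELD = re.compile(r"(real|user|sys|maxrss|maxvsize) ([0-9.]+) (sec|kB)")


def parse_proc_stat(line):
    """Return {field_name: value} for one PROC_STAT(TOTAL) line."""
    return {name: float(value) for name, value, _unit in FIELD.findall(line)}


def summarize(stat):
    """Derive a few comparable figures from the parsed fields."""
    cpu_seconds = stat["user"] + stat["sys"]
    return {
        "wall_hours": stat["real"] / 3600,
        "cpu_hours": cpu_seconds / 3600,
        "avg_busy_threads": cpu_seconds / stat["real"],
        "peak_rss_gib": stat["maxrss"] / 1024**2,  # assumes kB = 1024 bytes
    }


stats = {version: parse_proc_stat(line) for version, line in LOGS.items()}
summary = {version: summarize(stat) for version, stat in stats.items()}
wall_speedup = stats["v2.1"]["real"] / stats["v2.4"]["real"]

for version, figures in summary.items():
    print(version, {key: round(value, 2) for key, value in figures.items()})
print(f"v2.4 wall-clock speedup over v2.1: {wall_speedup:.2f}x")
```

On these numbers v2.4 finished in roughly 2.9 hours against 5.3 hours for v2.1, about 1.8 times faster in wall-clock time, with a lower peak resident memory.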