ruanjue / wtdbg2

Redbean: A fuzzy Bruijn graph approach to long noisy reads assembly
GNU General Public License v3.0

Get different number of contigs when using single vs. multiple cores (local assembly) #181

Closed (shunhuahan closed this issue 4 years ago)

shunhuahan commented 4 years ago

Hi,

Thanks for making this great tool! I am using wtdbg2 for some local assemblies (rs2 data), with an estimated genome size of ~30 kb for each assembly job. Each assembly should also contain a transposable element (<10 kb) in the middle. wtdbg2 seems to produce empty contig output in quite a few cases. I might need to tweak some parameters for this small genome size, but what surprised me is that wtdbg2 outputs a different number of contigs when using a single core versus multiple cores. Examples are shown below:

My questions are: how does wtdbg2 use multithreading for assembly? Do you have any guess as to why wtdbg2 behaves differently with different numbers of threads for local assembly? And do you have any suggestions on how to avoid this assembly instability? I read the related issue at https://github.com/ruanjue/wtdbg2/issues/61, but it doesn't seem to help.

Thank you for your time! Shunhua

ruanjue commented 4 years ago

wtdbg2 uses n cores to perform read alignment. If the -A option is provided, the result will be the same whatever n is. Without -A, wtdbg2 skips contained reads. Suppose there are two reads A and B, where A contains B. With multiple cores, whether B is skipped depends on whether another core has already finished A's alignments, so the result is nondeterministic.
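To make the ordering effect concrete, here is a toy model in Python. It is only a sketch of the behaviour described above, not wtdbg2's actual code: the `Read` class, the coordinates, and the `align_all` function are all hypothetical, and `skip_contained=False` merely stands in for what -A is described as doing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Read:
    """Hypothetical read, placed on a shared coordinate axis for illustration."""
    name: str
    start: int
    end: int


def contains(outer: Read, inner: Read) -> bool:
    """True if `inner` lies entirely within `outer` (and is a different read)."""
    return outer != inner and outer.start <= inner.start and inner.end <= outer.end


def align_all(reads, completion_order, skip_contained=True):
    """Toy model: reads are handled in `completion_order`.

    A read is skipped if a read that has already finished contains it.
    With skip_contained=False every read is aligned, so order is irrelevant.
    """
    finished = []
    aligned = []
    for name in completion_order:
        read = reads[name]
        if skip_contained and any(contains(done, read) for done in finished):
            continue  # contained in an already finished read: skipped
        aligned.append(name)
        finished.append(read)
    return sorted(aligned)


if __name__ == "__main__":
    # Assumed example: read A spans read B completely.
    reads = {"A": Read("A", 0, 10_000), "B": Read("B", 2_000, 7_000)}
    print(align_all(reads, ["A", "B"]))  # A finishes first, so B is skipped
    print(align_all(reads, ["B", "A"]))  # B is handled before A finishes
    print(align_all(reads, ["A", "B"], skip_contained=False))
```

In this model the set of aligned reads changes with the completion order only when contained reads are skipped. With several threads the completion order is not fixed, which is the kind of run-to-run variation described above.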

shunhuahan commented 4 years ago

Thanks for your explanation!

total 1.2M
388K -rw-r--r--. 1 sh60271 cmblab 387K Mar 23 21:16 chr2L_7202325_7204081.reads.fa
632K -rw-r--r--. 1 sh60271 cmblab 630K Mar 23 21:16 chr2L_7202325_7204081.kmerdep
4.0K -rw-r--r--. 1 sh60271 cmblab 93 Mar 23 21:16 chr2L_7202325_7204081.frg.nodes
4.0K -rw-r--r--. 1 sh60271 cmblab 198 Mar 23 21:16 chr2L_7202325_7204081.frg.dot.gz
4.0K -rw-r--r--. 1 sh60271 cmblab 274 Mar 23 21:16 chr2L_7202325_7204081.events
52K -rw-r--r--. 1 sh60271 cmblab 51K Mar 23 21:16 chr2L_7202325_7204081.ctg.lay.gz
4.0K -rw-r--r--. 1 sh60271 cmblab 154 Mar 23 21:16 chr2L_7202325_7204081.ctg.dot.gz
4.0K -rw-r--r--. 1 sh60271 cmblab 702 Mar 23 21:16 chr2L_7202325_7204081.clps
4.0K -rw-r--r--. 1 sh60271 cmblab 537 Mar 23 21:16 chr2L_7202325_7204081.binkmer
8.0K -rw-r--r--. 1 sh60271 cmblab 4.2K Mar 23 21:16 chr2L_7202325_7204081.alignments.gz
4.0K -rw-r--r--. 1 sh60271 cmblab 369 Mar 23 21:16 chr2L_7202325_7204081.3.dot.gz
4.0K -rw-r--r--. 1 sh60271 cmblab 391 Mar 23 21:16 chr2L_7202325_7204081.2.dot.gz
4.0K -rw-r--r--. 1 sh60271 cmblab 2.4K Mar 23 21:16 chr2L_7202325_7204081.1.reads
8.0K -rw-r--r--. 1 sh60271 cmblab 4.7K Mar 23 21:16 chr2L_7202325_7204081.1.nodes
4.0K -rw-r--r--. 1 sh60271 cmblab 1.1K Mar 23 21:16 chr2L_7202325_7204081.1.dot.gz


- And here are the stdout and output files using 1 core.

test2$ wtdbg2 -x rs -g 30k -A -t 1 -i chr2L_7202325_7204081.reads.fa -fo chr2L_7202325_7204081

-- total memory 263724796.0 kB
-- available 246640160.0 kB
-- 28 cores
-- Starting program: wtdbg2 -x rs -g 30k -A -t 1 -i chr2L_7202325_7204081.reads.fa -fo chr2L_7202325_7204081
-- pid 189362
-- date Mon Mar 23 21:18:38 2020

[Mon Mar 23 21:18:38 2020] loading reads
21 reads
[Mon Mar 23 21:18:38 2020] Done, 21 reads (>=5000 bp), 395344 bp, 1535 bins
PROC_STAT(0) : real 0.001 sec, user 0.000 sec, sys 0.000 sec, maxrss 1048.0 kB, maxvsize 80068.0 kB
[Mon Mar 23 21:18:38 2020] Set --edge-cov to 2
KEY PARAMETERS: -k 0 -p 21 -K 1000.049988 -A -S 4.000000 -s 0.050000 -g 30000 -X 50.000000 -e 2 -L 5000
[Mon Mar 23 21:18:38 2020] generating nodes, 1 threads
[Mon Mar 23 21:18:38 2020] indexing bins[(0,1535)/1535] (392960/392960 bp), 1 threads
[Mon Mar 23 21:18:38 2020] - scanning kmers (K0P21S4.00) from 1535 bins
1535 bins
** Kmer Frequency **
** 1 - 201 **
Quatiles:
10% 20% 30% 40% 50% 60% 70% 80% 90% 95%
1 1 1 1 1 1 1 1 2 4
PROC_STAT(0) : real 0.001 sec, user 0.000 sec, sys 0.000 sec, maxrss 1048.0 kB, maxvsize 80068.0 kB
[Mon Mar 23 21:18:38 2020] - high frequency kmer depth is set to 1000
[Mon Mar 23 21:18:38 2020] - Total kmers = 56675
[Mon Mar 23 21:18:38 2020] - average kmer depth = 3
[Mon Mar 23 21:18:38 2020] - 53969 low frequency kmers (<2)
[Mon Mar 23 21:18:38 2020] - 0 high frequency kmers (>1000)
[Mon Mar 23 21:18:38 2020] - indexing 2706 kmers, 8610 instances (at most)
1535 bins
[Mon Mar 23 21:18:38 2020] - indexed 2706 kmers, 8603 instances
[Mon Mar 23 21:18:38 2020] - masked 725 bins as closed
[Mon Mar 23 21:18:38 2020] - sorting
PROC_STAT(0) : real 0.101 sec, user 0.070 sec, sys 0.020 sec, maxrss 50096.0 kB, maxvsize 197248.0 kB
[Mon Mar 23 21:18:38 2020] Done 20 reads|total hits 136
PROC_STAT(0) : real 0.101 sec, user 0.070 sec, sys 0.020 sec, maxrss 50096.0 kB, maxvsize 197248.0 kB
[Mon Mar 23 21:18:38 2020] sorting rdhits ... Done
[Mon Mar 23 21:18:38 2020] clipping ... 0.00% bases
[Mon Mar 23 21:18:38 2020] generating regs ... 1508
[Mon Mar 23 21:18:38 2020] sorting regs ... Done
[Mon Mar 23 21:18:38 2020] generating intervals ... 206 intervals
[Mon Mar 23 21:18:38 2020] selecting important intervals from 206 intervals
[Mon Mar 23 21:18:38 2020] Intervals: kept 0, discarded 206
PROC_STAT(0) : real 0.101 sec, user 0.070 sec, sys 0.020 sec, maxrss 50096.0 kB, maxvsize 197248.0 kB
[Mon Mar 23 21:18:38 2020] Done, 0 nodes
[Mon Mar 23 21:18:38 2020] output "chr2L_7202325_7204081.1.nodes". Done.
[Mon Mar 23 21:18:38 2020] median node depth = 0
[Mon Mar 23 21:18:38 2020] masked 0 high coverage nodes (>200 or <2)
[Mon Mar 23 21:18:38 2020] masked 0 repeat-like nodes by local subgraph analysis
[Mon Mar 23 21:18:38 2020] generating edges
[Mon Mar 23 21:18:38 2020] Done, 1 edges
[Mon Mar 23 21:18:38 2020] output "chr2L_7202325_7204081.1.reads". Done.
[Mon Mar 23 21:18:38 2020] output "chr2L_7202325_7204081.1.dot.gz". Done.
[Mon Mar 23 21:18:38 2020] graph clean
[Mon Mar 23 21:18:38 2020] rescued 0 low cov edges
[Mon Mar 23 21:18:38 2020] deleted 0 binary edges
[Mon Mar 23 21:18:38 2020] deleted 0 isolated nodes
[Mon Mar 23 21:18:38 2020] cut 0 transitive edges
[Mon Mar 23 21:18:38 2020] output "chr2L_7202325_7204081.2.dot.gz". Done.
[Mon Mar 23 21:18:38 2020] deleted 0 isolated nodes
[Mon Mar 23 21:18:38 2020] output "chr2L_7202325_7204081.3.dot.gz". Done.
[Mon Mar 23 21:18:38 2020] cut 0 branching nodes
[Mon Mar 23 21:18:38 2020] deleted 0 isolated nodes
[Mon Mar 23 21:18:38 2020] building unitigs
[Mon Mar 23 21:18:38 2020]
[Mon Mar 23 21:18:38 2020] output "chr2L_7202325_7204081.frg.nodes". Done.
[Mon Mar 23 21:18:38 2020] generating links
[Mon Mar 23 21:18:38 2020] generated 1 links
[Mon Mar 23 21:18:38 2020] output "chr2L_7202325_7204081.frg.dot.gz". Done.
[Mon Mar 23 21:18:38 2020] rescue 0 weak links
[Mon Mar 23 21:18:38 2020] deleted 0 binary links
[Mon Mar 23 21:18:38 2020] cut 0 transitive links
[Mon Mar 23 21:18:38 2020] remove 0 boomerangs
[Mon Mar 23 21:18:38 2020] remove 0 weak branches
[Mon Mar 23 21:18:38 2020] cut 0 tips
[Mon Mar 23 21:18:38 2020] pop 0 bubbles
[Mon Mar 23 21:18:38 2020] detached 0 repeat-associated paths
[Mon Mar 23 21:18:38 2020] cut 0 tips
[Mon Mar 23 21:18:38 2020] output "chr2L_7202325_7204081.ctg.dot.gz". Done.
[Mon Mar 23 21:18:38 2020] building contigs
[Mon Mar 23 21:18:38 2020] searched 0 contigs
[Mon Mar 23 21:18:38 2020] Estimated:
[Mon Mar 23 21:18:39 2020] output 0 contigs
[Mon Mar 23 21:18:39 2020] Program Done
PROC_STAT(TOTAL) : real 1.002 sec, user 0.140 sec, sys 0.050 sec, maxrss 50096.0 kB, maxvsize 205956.0 kB

total 1.1M
388K -rw-r--r--. 1 sh60271 cmblab 387K Mar 23 21:16 chr2L_7202325_7204081.reads.fa
632K -rw-r--r--. 1 sh60271 cmblab 630K Mar 23 21:18 chr2L_7202325_7204081.kmerdep
0 -rw-r--r--. 1 sh60271 cmblab 0 Mar 23 21:18 chr2L_7202325_7204081.frg.nodes
4.0K -rw-r--r--. 1 sh60271 cmblab 72 Mar 23 21:18 chr2L_7202325_7204081.frg.dot.gz
0 -rw-r--r--. 1 sh60271 cmblab 0 Mar 23 21:18 chr2L_7202325_7204081.events
4.0K -rw-r--r--. 1 sh60271 cmblab 72 Mar 23 21:18 chr2L_7202325_7204081.ctg.dot.gz
4.0K -rw-r--r--. 1 sh60271 cmblab 690 Mar 23 21:18 chr2L_7202325_7204081.clps
4.0K -rw-r--r--. 1 sh60271 cmblab 537 Mar 23 21:18 chr2L_7202325_7204081.binkmer
8.0K -rw-r--r--. 1 sh60271 cmblab 4.2K Mar 23 21:18 chr2L_7202325_7204081.alignments.gz
4.0K -rw-r--r--. 1 sh60271 cmblab 72 Mar 23 21:18 chr2L_7202325_7204081.3.dot.gz
4.0K -rw-r--r--. 1 sh60271 cmblab 72 Mar 23 21:18 chr2L_7202325_7204081.2.dot.gz
4.0K -rw-r--r--. 1 sh60271 cmblab 567 Mar 23 21:18 chr2L_7202325_7204081.1.reads
0 -rw-r--r--. 1 sh60271 cmblab 0 Mar 23 21:18 chr2L_7202325_7204081.1.nodes
4.0K -rw-r--r--. 1 sh60271 cmblab 72 Mar 23 21:18 chr2L_7202325_7204081.1.dot.gz
4.0K -rw-r--r--. 1 sh60271 cmblab 30 Mar 23 21:18 chr2L_7202325_7204081.ctg.lay.gz



- Based on your explanation, I assumed that adding `-A` would give the same number of contigs regardless of the number of cores. Is there something else that I missed?

Thanks!

ruanjue commented 4 years ago

Thanks for the report. There seems to be a bug in read clipping. Could you send me the test data?

ruanjue.big@qq.com

shunhuahan commented 4 years ago

Just sent, thanks a lot!

ruanjue commented 4 years ago

Fixed in https://github.com/ruanjue/wtdbg2/commit/32ec6b54ec4dd7f7c458174b3987a9709deff9b1.

shunhuahan commented 4 years ago

I tested this version, and it now produces identical results no matter how many cores I use. Thank you so much!
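For anyone who wants to repeat this kind of check, a minimal Python sketch follows. It assumes the single-core and multi-core runs were written to two separate directories and compares the output files by content. The function names are hypothetical. `.gz` files are decompressed before hashing because a gzip header can embed a timestamp, so two archives with identical content may still differ byte for byte.

```python
import gzip
import hashlib
from pathlib import Path


def content_digest(path: Path) -> str:
    """SHA-256 of a file's content; .gz files are decompressed first."""
    opener = gzip.open if path.suffix == ".gz" else open
    digest = hashlib.sha256()
    with opener(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def differing_outputs(dir_a, dir_b):
    """Sorted names of files that are missing from one directory or differ in content."""
    files_a = {p.name: p for p in Path(dir_a).iterdir() if p.is_file()}
    files_b = {p.name: p for p in Path(dir_b).iterdir() if p.is_file()}
    differing = []
    for name in sorted(files_a.keys() | files_b.keys()):
        if name not in files_a or name not in files_b:
            differing.append(name)  # present in only one run
        elif content_digest(files_a[name]) != content_digest(files_b[name]):
            differing.append(name)  # present in both runs, content differs
    return differing
```

For example, `differing_outputs("run_t1", "run_t4")` (hypothetical directory names) returns an empty list when both runs produced the same files with the same content.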

shunhuahan commented 4 years ago

Btw, I'm wondering if you would consider making a new release for wtdbg2. I can help update the package in conda. Thank you!

ruanjue commented 4 years ago

Thanks for the suggestion. I still need to fix more bugs before releasing a new version.

shunhuahan commented 4 years ago

Got it, looking forward to the future release!