ruanjue / wtdbg2

Redbean: A fuzzy Bruijn graph approach to long noisy reads assembly
GNU General Public License v3.0

Alignment step running for four days #134

Closed: pi3rrr3 closed this issue 4 years ago

pi3rrr3 commented 5 years ago

Hi,

I am trying to use the latest version of wtdbg2 to assemble a ~300 Mb insect genome from 70X PacBio Sequel data. Installation was flawless and just produced the following output:

gcc -g3 -W -Wall -Wno-unused-but-set-variable -O4 -DVERSION="2.5" -DRELEASE="20190621" -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE -mpopcnt -msse4.2 -o kbm2 kbm.c ksw.c -lm -lrt -lpthread -lz
gcc -g3 -W -Wall -Wno-unused-but-set-variable -O4 -DVERSION="2.5" -DRELEASE="20190621" -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE -mpopcnt -msse4.2 -o wtdbg2 wtdbg.c ksw.c -lm -lrt -lpthread -lz
gcc -g3 -W -Wall -Wno-unused-but-set-variable -O4 -DVERSION="2.5" -DRELEASE="20190621" -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE -mpopcnt -msse4.2 -o wtdbg-cns wtdbg-cns.c ksw.c -lm -lrt -lpthread -lz
gcc -g3 -W -Wall -Wno-unused-but-set-variable -O4 -DVERSION="2.5" -DRELEASE="20190621" -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE -mpopcnt -msse4.2 -o wtpoa-cns wtpoa-cns.c ksw.c -lm -lrt -lpthread -lz
gcc -g3 -W -Wall -Wno-unused-but-set-variable -O4 -DVERSION="2.5" -DRELEASE="20190621" -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE -mpopcnt -msse4.2 -o pgzf pgzf.c -lm -lrt -lpthread -lz

I am using the following command, on 40 cores and 256 GB of RAM:

~/tools/wtdbg2/wtdbg2 -x sq -g 300m -L 5000 -i raw/fasta/combined_reads.fa.gz -t 40 -fo assembly/dbg_l5k_par

The k-mer indexing step went fine (see output below), but the alignment step has been running for almost four days now. The .alignments file keeps growing (1.3 GB at the moment).

Any idea what might be going wrong? Thanks for your help, and for the fine piece of software!

total memory 1588379172.0 kB
available 1495883572.0 kB
40 cores
Starting program: /homeappl/home/myhome/tools/wtdbg2/wtdbg2 -x sq -g 300m -L 5000 -i raw/fasta/combined_reads.fa.gz -t 40 -fo assembly/dbg_l5k
pid 82684
date Thu Jul 18 10:05:00 2019
[Thu Jul 18 10:05:00 2019] loading reads
1050470 reads
[Thu Jul 18 10:07:33 2019] Done, 1050470 reads (>=5000 bp), 12085888006 bp, 46685793 bins
PROC_STAT(0) : real 152.820 sec, user 114.300 sec, sys 38.300 sec, maxrss 3857904.0 kB, maxvsize 9859352.0 kB
[Thu Jul 18 10:07:33 2019] Set --edge-cov to 3
KEY PARAMETERS: -k 15 -p 0 -K 1000.049988 -A -S 2.000000 -s 0.050000 -g 300000000 -X 50.000000 -e 3 -L 5000
[Thu Jul 18 10:07:33 2019] generating nodes, 40 threads
[Thu Jul 18 10:07:33 2019] indexing bins[(0,46685793)/46685793] (11951563008/11951563008 bp), 40 threads
[Thu Jul 18 10:07:34 2019] - scanning kmers (K15P0S2.00) from 46685793 bins
10800000 20500000 30100000 39800000 46685793 bins
** Kmer Frequency **
** 1 - 201 **
Quatiles:
 10%  20%  30%  40%  50%  60%  70%  80%  90%  95%
   8   13   21   32   51   92  201  617 3958 17387
PROC_STAT(0) : real 2794.663 sec, user 2401.010 sec, sys 385.250 sec, maxrss 9458180.0 kB, maxvsize 98053948.0 kB
[Thu Jul 18 10:51:35 2019] - high frequency kmer depth is set to 18358
[Thu Jul 18 10:51:37 2019] - Total kmers = 263355106
[Thu Jul 18 10:51:37 2019] - average kmer depth = 20
[Thu Jul 18 10:51:37 2019] - 12751646 low frequency kmers (<2)
[Thu Jul 18 10:51:37 2019] - 5410 high frequency kmers (>18358)
[Thu Jul 18 10:51:37 2019] - indexing 250598050 kmers, 5089892297 instances (at most)
10800000 20500000 30100000 39800000 46685793 bins
[Thu Jul 18 12:34:47 2019] - indexed 250598050 kmers, 5086892127 instances
[Thu Jul 18 12:34:47 2019] - masked 102918 bins as closed
[Thu Jul 18 12:34:47 2019] - sorting
PROC_STAT(0) : real 9119.663 sec, user 7793.620 sec, sys 1228.030 sec, maxrss 39623296.0 kB, maxvsize 128966668.0 kB
[Thu Jul 18 12:37:00 2019] Done
118000|29766001 124000|30640654

ruanjue commented 5 years ago

It looks like a very highly repetitive genome. The first step is to check the CPU usage with top and see whether wtdbg2 is really using nearly all 40 cores. The next step is to re-run wtdbg2 with a larger k-mer size so it finishes faster; I suggest -x sq -k 0 -p 19.
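A minimal sketch of those two steps, reusing the paths and options from the original command; the top/pgrep check and the dbg_l5k_p19 output prefix are only illustrative, not commands given in this thread:

# check whether the running wtdbg2 process is actually using the 40 requested cores
top -p "$(pgrep -d, wtdbg2)"

# re-run with a larger homopolymer-compressed k-mer as suggested (-k 0 -p 19);
# the output prefix is just an example to avoid overwriting the first run
~/tools/wtdbg2/wtdbg2 -x sq -k 0 -p 19 -g 300m -L 5000 \
    -i raw/fasta/combined_reads.fa.gz -t 40 -fo assembly/dbg_l5k_p19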

pi3rrr3 commented 5 years ago

Thanks for the quick answer; I am trying these parameters now and will let you know. Also, would increasing -L help? I have 30X coverage in reads over 12 kb.

ruanjue commented 5 years ago

First, try increasing the k-mer size; wtdbg2 automatically selects 50X of data from the input reads.
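For illustration of the -L question above, a hedged sketch of loading only the longest reads; the 12000 cutoff simply reflects the "reads over 12 kb" figure mentioned earlier, and the dbg_l12k prefix is hypothetical, not a recommendation from this thread:

# keep only reads >= 12 kb; wtdbg2 still subsamples to 50X (-X 50) from whatever is loaded
~/tools/wtdbg2/wtdbg2 -x sq -k 0 -p 19 -g 300m -L 12000 \
    -i raw/fasta/combined_reads.fa.gz -t 40 -fo assembly/dbg_l12k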

chunlinxiao commented 4 years ago

Do we need to set -X if we have more than 50X of data?

ruanjue commented 4 years ago

The default parameter is -X 50. Please have a look at the usage:

 -X <float>, --rdcov-cutoff <float>
   Default: 50.0. Retaining 50.0 folds of genome coverage ...
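As a hedged example of overriding that default, assuming one wanted wtdbg2 to retain more than 50X of the input; the 60.0 value and the dbg_x60 prefix are placeholders, not values used in this thread:

# explicitly raise the read-coverage cutoff from the default 50X to 60X
~/tools/wtdbg2/wtdbg2 -x sq -g 300m -X 60.0 -L 5000 \
    -i raw/fasta/combined_reads.fa.gz -t 40 -fo assembly/dbg_x60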