Long runtime - choosing correct parameters

ctxchris commented 3 years ago

Hi,

I'm using wtdbg2 v2.5 on a 3 Gb genome with about 60x PacBio Sequel CLR data, and I chose -x preset3 for large genomes. Kmer counting finished in 90 minutes with 60 threads. The overlap stage has now been running for eight days and is still not finished, which seems quite long compared to the runtimes others get with similar genome sizes and data. Would using -x sq speed things up, or do you have recommendations on which parameter set to use?

Thanks, Chris

ruanjue commented 3 years ago

Have a look at the "Quatiles" log message, which looks like this:

Quatiles:
   10%   20%   30%   40%   50%   60%   70%   80%   90%   95%
    16    21    29    72   268   972  3329 11697 42939 65535

If you find that the kmers are highly repetitive, please set -K 2000 to speed up the alignment.

Otherwise, please paste the log message.
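
For anyone who wants to check this on their own log, here is a minimal sketch that pulls the "Quatiles" table out of a saved log and compares one percentile against a cap. The function names, the choice of the 95% column, and the cutoff of 2000 (mirroring the suggested -K value) are assumptions for illustration, not rules from wtdbg2 itself.

```python
import re


def parse_quatiles(log_text):
    """Return {percent: value} from the first 'Quatiles:' block in a wtdbg2 log.

    The block is assumed to be three consecutive lines: the 'Quatiles:' label,
    a header of percentages, and a row of integer values.
    """
    lines = log_text.splitlines()
    for i, line in enumerate(lines):
        if line.strip() == "Quatiles:" and i + 2 < len(lines):
            percents = [int(p) for p in re.findall(r"(\d+)%", lines[i + 1])]
            values = [int(v) for v in lines[i + 2].split()]
            if percents and len(percents) == len(values):
                return dict(zip(percents, values))
    return {}


def exceeds_cap(quatiles, percent=95, cap=2000):
    """Hypothetical heuristic: True if the chosen percentile is above the cap."""
    return quatiles.get(percent, 0) > cap


example_log = (
    "Quatiles:\n"
    "   10%   20%   30%   40%   50%   60%   70%   80%   90%   95%\n"
    "    16    21    29    72   268   972  3329 11697 42939 65535\n"
)
quatiles = parse_quatiles(example_log)
print(quatiles[50], quatiles[95], exceeds_cap(quatiles))
```

With the example table above, the 95% value is far beyond 2000, which is the kind of distribution where lowering -K is suggested.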

Jue

ctxchris commented 3 years ago

The kmer distribution looks ok to me.

[Fri Feb  5 16:21:16 2021] loading reads

[Fri Feb  5 16:40:16 2021] Done, 6180880 reads (>=2000 bp), 146787898735 bp, 570312591 bins
** PROC_STAT(0) **: real 1139.711 sec, user 1011.110 sec, sys 308.470 sec, maxrss 43128776.0 kB, maxvsize 49290412.0 kB
[Fri Feb  5 16:40:16 2021] Set --edge-cov to 3
KEY PARAMETERS: -k 0 -p 19 -K 1000.049988 -A -S 2.000000 -s 0.050000 -g 3000000000 -X 50.000000 -e 3 -L 2000
[Fri Feb  5 16:40:16 2021] generating nodes, 60 threads
[Fri Feb  5 16:40:16 2021] indexing bins[(0,570312591)/570312591] (146000023296/146000023296 bp), 60 threads
[Fri Feb  5 16:40:17 2021] - scanning kmers (K0P19S2.00) from 570312591 bins
********************** Kmer Frequency **********************

                   |||||||||||                                                                      
                 ||||||||||||||||                                                                   
               ||||||||||||||||||||                                                                 
              ||||||||||||||||||||||||                                                              
           |||||||||||||||||||||||||||||                                                            
     ||||||||||||||||||||||||||||||||||||||                                                         
     |||||||||||||||||||||||||||||||||||||||||                                                      
    |||||||||||||||||||||||||||||||||||||||||||||                                                   
    ||||||||||||||||||||||||||||||||||||||||||||||||||                                              
   |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||                        ||||||||||||||
   |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
   |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
  ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
  ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
  ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
  ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
**********************     1 - 201    **********************
Quatiles:
   10%   20%   30%   40%   50%   60%   70%   80%   90%   95%
    52   101   179   253   331   433   597   997  3791 16039
** PROC_STAT(0) **: real 2187.698 sec, user 51935.190 sec, sys 1418.630 sec, maxrss 52014648.0 kB, maxvsize 62463332.0 kB
[Fri Feb  5 16:57:44 2021] - high frequency kmer depth is set to 16534
[Fri Feb  5 16:57:45 2021] - Total kmers = 386006344
[Fri Feb  5 16:57:45 2021] - average kmer depth = 112
[Fri Feb  5 16:57:45 2021] - 3606933 low frequency kmers (<2)
[Fri Feb  5 16:57:45 2021] - 50726 high frequency kmers (>16534)
[Fri Feb  5 16:57:45 2021] - indexing 382348685 kmers, 42871080592 instances (at most)
[Fri Feb  5 17:29:47 2021] - indexed  382348685 kmers, 42867244619 instances
[Fri Feb  5 17:29:53 2021] - masked 655153 bins as closed
[Fri Feb  5 17:29:53 2021] - sorting
** PROC_STAT(0) **: real 4191.975 sec, user 152004.400 sec, sys 5049.500 sec, maxrss 303213028.0 kB, maxvsize 313438404.0 kB
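
A few numbers in this log can be cross-checked with simple arithmetic. The sketch below uses only values copied from the log lines above; the variable names and the reading of each ratio are my own interpretation, so treat the conclusions as approximate.

```python
# Values copied from the log above.
total_bp = 146_787_898_735    # bases in reads >= 2000 bp ("loading reads" summary)
genome_size = 3_000_000_000   # -g 3000000000 from KEY PARAMETERS
n_bins = 570_312_591          # bins reported after loading
indexed_bp = 146_000_023_296  # bp reported on the "indexing bins" line
instances = 42_867_244_619    # kmer instances actually indexed
indexed_kmers = 382_348_685   # distinct kmers indexed


def log_ratios():
    """Return a few derived quantities as a dict."""
    return {
        # Depth of the reads that passed the length filter.
        "coverage": total_bp / genome_size,
        # Bases per bin, if indexed_bp is simply n_bins * bin size.
        "bin_size": indexed_bp // n_bins,
        "bin_remainder": indexed_bp % n_bins,
        # Mean instances per indexed kmer.
        "mean_kmer_depth": instances / indexed_kmers,
    }


ratios = log_ratios()
print(f"coverage ~ {ratios['coverage']:.1f}x")
print(f"bin size = {ratios['bin_size']} bp (remainder {ratios['bin_remainder']})")
print(f"mean kmer depth ~ {ratios['mean_kmer_depth']:.1f}")
```

Read this way, the reads that passed the 2000 bp filter amount to roughly 49x of a 3 Gb genome, the bins work out to exactly 256 bp each, and the mean depth per indexed kmer is consistent with the reported "average kmer depth = 112".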

ruanjue commented 3 years ago

[Fri Feb  5 16:57:44 2021] - high frequency kmer depth is set to 16534

Try to set -K 2000 to speed it up.
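
A minimal sketch of how the re-run command could be assembled, assuming the same preset, genome size, and thread count as the original run. The read file name and output prefix are hypothetical placeholders, and the helper function is only for illustration. -K is placed after -x on the assumption that later options override the preset.

```python
import shlex


def build_wtdbg2_cmd(reads, prefix, preset="preset3", genome_size="3g",
                     threads=60, kmer_depth_max=2000):
    """Build a wtdbg2 argument list with an explicit -K cap.

    Only the command line is constructed here; nothing is executed.
    """
    return [
        "wtdbg2",
        "-x", preset,               # preset used in the original run
        "-g", genome_size,          # approximate genome size
        "-t", str(threads),         # number of threads
        "-K", str(kmer_depth_max),  # cap on high-frequency kmer depth
        "-i", reads,                # input reads (hypothetical file name)
        "-fo", prefix,              # output prefix (hypothetical)
    ]


cmd = build_wtdbg2_cmd("reads.fa.gz", "asm_K2000")
print(shlex.join(cmd))
```

In the log above, the default -K 1000.049988 resulted in a cutoff of 16534, so an explicit -K 2000 should exclude far more of the repetitive kmers from the index.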

shri1984 commented 3 years ago

Hi, I am assembling a large genome. I tried -K 2000 after I saw the high frequency kmer depth being set to about 16K. I had 6 cells of sq CLR data, and my run went extremely fast. I am wondering whether setting -K 2000 affects the quality of the final assembly in terms of total length or number of contigs?

ruanjue commented 3 years ago

In this case, if too many high-frequency kmers are discarded, it may lead to fragmented and truncated contigs.
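
One way to see whether a lower -K has fragmented the assembly is to compare contig count, total length, and N50 between two runs. Below is a minimal sketch, assuming the contigs are available as plain FASTA text; the function name and the tiny demo sequences are made up for illustration.

```python
def contig_stats(fasta_text):
    """Return contig count, total length, and N50 for FASTA-formatted text."""
    lengths = []
    current = 0
    seen_header = False
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            # A new record starts; store the length of the previous one.
            if seen_header:
                lengths.append(current)
            seen_header = True
            current = 0
        elif seen_header:
            current += len(line)
    if seen_header:
        lengths.append(current)

    lengths.sort(reverse=True)
    total = sum(lengths)

    # N50: length of the contig at which the running sum of sorted
    # lengths first reaches half of the total assembly length.
    n50 = 0
    running = 0
    for length in lengths:
        running += length
        if running * 2 >= total:
            n50 = length
            break
    return {"contigs": len(lengths), "total_length": total, "n50": n50}


demo = ">ctg1\nACGTACGT\nAC\n>ctg2\nACGT\n>ctg3\nAC\n"
print(contig_stats(demo))
```

Running this on the contig FASTA from a default run and from a -K 2000 run gives the two sets of numbers to compare. More contigs with a lower N50 or a shorter total length in the -K 2000 run would point to the fragmentation described above.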

ruanjue / wtdbg2

Long runtime - choosing correct parameters #229