ruanjue / wtdbg2

Redbean: A fuzzy Bruijn graph approach to long noisy reads assembly
GNU General Public License v3.0

suggestions on settings #86

Closed xiaoyezao closed 5 years ago

xiaoyezao commented 5 years ago

Hi, I am running wtdbg2 on a 1.4 Gb plant genome. We have about 100x PacBio Sequel data. I used the default settings, wtdbg2 -x sq -t 0 -i *.gz -fo rheum, and got the following output:

[Image: Screen Shot 2019-03-19 at 9 18 29 PM]

Seems that I need to adjust the settings. Can you please give some suggestions?

Thank you !

sunnycqcn commented 5 years ago

Hello, I ran into the same issue. Could you give us some suggestions on how to adjust the settings? Thanks, Fuyou

ruanjue commented 5 years ago

The k-mer distribution from xiaoyezao looks OK. If the assembly is bad, please shift the k-mers left by increasing -p to 21.

sunnycqcn commented 5 years ago

Hello, I really appreciate your developing this software. It is much faster than Canu and FALCON. However, I find it difficult to set suitable parameters. I used the following command:

~/DIRECTORY/wtdbg2/wtdbg2 -i pb.fasta -t 0 -o str1 -x sq -p 0 -k 15 -AS 2 -s 0.05 -L 1000 -e 1 --edge-min 2 --rescue-low-cov-edges 2>str1.assembly.log

My genome is about 1.8 Gb with a heterozygosity rate of about 2.4%. I used about 40x PacBio reads. I then got the following k-mer frequency:

** Kmer Frequency **

          |||||||                                                                               
         ||||||||||                                                                             
        |||||||||||||                                                                           
       ||||||||||||||||                                                                         
       ||||||||||||||||||                                                                       
      ||||||||||||||||||||||                                                                    
      ||||||||||||||||||||||||                                                                  
     |||||||||||||||||||||||||||                                                                
     ||||||||||||||||||||||||||||||                                                             
    |||||||||||||||||||||||||||||||||||                                                         
    ||||||||||||||||||||||||||||||||||||||                                                      
   ||||||||||||||||||||||||||||||||||||||||||||                                                 
   |||||||||||||||||||||||||||||||||||||||||||||||||                                            
   ||||||||||||||||||||||||||||||||||||||||||||||||||||||||                                     
  |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||                           
  |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||             
 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
** 1 - 201 **
Quatiles:
   10%   20%   30%   40%   50%   60%   70%   80%   90%   95%
    31    46    65    91   131   198   328   684  2421  7832

If the kmer distribution is not good, please kill me and adjust -k, -p, and -K

Cannot get a good distribution anyway, should adjust -S -s, also -A -e in assembly

PROC_STAT(0) : real 2172.201 sec, user 7537.710 sec, sys 177.170 sec, maxrss 19098860.0 kB, maxvsize 22911596.0 kB
[Tue Mar 26 14:10:41 2019] - high frequency kmer depth is set to 7915
[Tue Mar 26 14:10:42 2019] - Total kmers = 268432957
[Tue Mar 26 14:10:42 2019] - average kmer depth = 74
[Tue Mar 26 14:10:42 2019] - 3554 low frequency kmers (<2)
[Tue Mar 26 14:10:42 2019] - 52000 high frequency kmers (>7915)
[Tue Mar 26 14:10:42 2019] - indexing 268377403 kmers, 19960619941 instances (at most)

The resulting assembly is much worse than the one from smartdenovo with Canu-corrected reads. In addition, I find that increasing -p reduces the run time but gives a worse assembly. Given this, could you give me some suggestions on how to adjust -p and -k? I find them very tricky to tune. I also assembled another, smaller genome of about 400 Mb. I got a good assembly with -p 0 -k 15, but when I changed -k 15 to -k 17, the assembly was much worse. Thanks, Fuyou
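As a quick sanity check on the log above, the reported counts can be cross-checked against each other. The sketch below parses the relevant lines and tests two relationships that the numbers happen to satisfy: the indexed k-mer count equals the total minus the low- and high-frequency k-mers, and instances divided by indexed k-mers is close to the reported average depth. The function name `parse_kmer_stats` is hypothetical, and whether wtdbg2 derives its figures exactly this way is an assumption, not something confirmed in this thread.

```python
import re

# Log lines copied from the comment above.
LOG = """\
[Tue Mar 26 14:10:42 2019] - Total kmers = 268432957
[Tue Mar 26 14:10:42 2019] - average kmer depth = 74
[Tue Mar 26 14:10:42 2019] - 3554 low frequency kmers (<2)
[Tue Mar 26 14:10:42 2019] - 52000 high frequency kmers (>7915)
[Tue Mar 26 14:10:42 2019] - indexing 268377403 kmers, 19960619941 instances (at most)
"""


def parse_kmer_stats(log: str) -> dict:
    """Extract the integer counts from the k-mer summary lines (hypothetical helper)."""
    patterns = {
        "total": r"Total kmers = (\d+)",
        "avg_depth": r"average kmer depth = (\d+)",
        "low": r"(\d+) low frequency kmers",
        "high": r"(\d+) high frequency kmers",
        "indexed": r"indexing (\d+) kmers",
        "instances": r"kmers, (\d+) instances",
    }
    return {name: int(re.search(rx, log).group(1)) for name, rx in patterns.items()}


stats = parse_kmer_stats(LOG)

# K-mers left after dropping the low- and high-frequency ones.
kept = stats["total"] - stats["low"] - stats["high"]

# Instances per indexed k-mer; the log marks the instance count as "(at most)",
# so this is an upper bound rather than an exact depth.
ratio = stats["instances"] / stats["indexed"]

print("kept == indexed:", kept == stats["indexed"])
print("instances / indexed:", round(ratio, 1), "reported:", stats["avg_depth"])
```

If the two checks hold, the filtering thresholds removed only a tiny fraction of distinct k-mers, so the shape of the distribution (the long right tail in the quantiles) is more likely the thing to tune than the cutoffs themselves.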

ruanjue commented 5 years ago

wtdbg2 provides presets for setting parameters. In your case, please first try -x sq -g 1.8g. I am not sure 40x sq data can produce a good assembly of a genome with a 2.4% heterozygosity rate.
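For scripted runs, the preset-based invocation suggested here can be assembled programmatically. The following is a minimal sketch, not part of wtdbg2: `build_wtdbg2_cmd` is a hypothetical helper that only builds the argument list from flags already mentioned in this thread (-x, -g, -t, -i, -fo) and does not execute anything. Appending extra flags such as -p 21 after the preset assumes that later options override preset values, which should be checked against the wtdbg2 help output.

```python
import shlex


def build_wtdbg2_cmd(reads, prefix, preset="sq", genome_size=None, threads=0, extra=()):
    """Build a wtdbg2 argument list (hypothetical helper; does not run the tool)."""
    cmd = ["wtdbg2", "-x", preset]
    if genome_size is not None:
        cmd += ["-g", genome_size]
    cmd += ["-t", str(threads)]
    # One -i per input file (assumption: -i may be given more than once).
    for path in reads:
        cmd += ["-i", path]
    cmd += ["-fo", prefix]
    # Extra flags go last (assumption: later options override the preset).
    cmd += list(extra)
    return cmd


# The suggestion from this comment: Sequel preset with a 1.8 Gb genome size.
cmd = build_wtdbg2_cmd(["pb.fasta"], "str1", preset="sq", genome_size="1.8g")
print(shlex.join(cmd))

# The same run with the earlier -p 21 suggestion appended as an override.
cmd_p21 = build_wtdbg2_cmd(["pb.fasta"], "str1", genome_size="1.8g", extra=["-p", "21"])
print(shlex.join(cmd_p21))
```

The list can be passed to `subprocess.run(cmd, check=True)` once the paths are real. Keeping the preset and the overrides separate makes it easy to record exactly which parameters differed between assembly attempts.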

sunnycqcn commented 5 years ago

Hello, I really appreciate your suggestions. I assembled six plant genomes using PacBio raw reads. One of them is the genome I mentioned, with about 40x data; the other five have about 80x data. However, the wtdbg2 results are no better than those from smartdenovo with corrected data. I do not know the reason. Thanks, Fuyou

ruanjue commented 5 years ago

For corrected reads, please use -x ccs. wtdbg2 is faster, but sometimes gives less contiguity than smartdenovo. However, I won't update smartdenovo anymore.