kmer distribution - GitHub issues

leleory commented 5 years ago

Hi Jue, when I run wtdbg it plots the k-mer distribution and suggests that, if the distribution is "not good", I should adjust the -k, -p and -K parameters. The k-mer distribution depends on many factors (e.g. repeat content, ploidy, CNVs), but could you show me plots of distributions you would consider good, with some explanation? It would be much appreciated. Thanks, Lel

ruanjue commented 5 years ago

Yes, the k-mer distribution varies a lot between genomes and between sequencing types. I introduced this plot for users who want to tune parameters, but I cannot say what a good k-mer distribution looks like for long noisy reads. I also hope experienced users will tell me.

I paste one k-mer distribution from human nanopore data; it looks good to me.

Best, Jue

maximilianpress commented 5 years ago

I have a similar question. I am running wtdbg2 on Sequel PacBio subreads (previously corrected with Canu). Here is my command:

wtdbg2 -i correctedReads.fasta -o tetra_wtdbg2_1.3low -g 1.3g -t 0 -f -x corrected --edge-min 2 --rescue-low-cov-edges

I attach below a sample from my logfile.

In another issue you looked at a similar distribution with no peaks and said it was ok for RSII data.

Is it ok for Sequel data? If not, do you have any parameterization suggestions to try to make it better?

Many thanks, Max

[Fri Jan 25 21:40:18 2019] generating nodes, 64 threads
[Fri Jan 25 21:40:18 2019] indexing bins[0,98172916] (25132266496 bp), 64 threads
[Fri Jan 25 21:40:18 2019] - scanning kmers (K19P0S4.00) from 98172916 bins
98172916 bins
********************** Kmer Frequency **********************
|                                                                                                   
|                                                                                                   
|                                                                                                   
|                                                                                                   
|                                                                                                   
|                                                                                                   
|                                                                                                   
|                                                                                                   
|                                                                                                   
|                                                                                                   
|                                                                                                   
|                                                                                                   
|                                                                                                   
|                                                                                                   
|                                                                                                   
|                                                                                                   
|                                                                                                   
|                                                                                                   
|                                                                                                   
|                                                                                                   
|                                                                                                   
|                                                                                                   
||                                                                                                  
||||                                                                                                
|||||                                                                                               
||||||                                                                                              
|||||||                                                                                             
||||||||                                                                                            
||||||||||                                                                                          
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
**********************     1 - 201    **********************
Quatiles:
   10%   20%   30%   40%   50%   60%   70%   80%   90%   95%
     2     6    11    21    73   297  1074  5025 17704 30574
# If the kmer distribution is not good, please kill me and adjust -k, -p, and -K
# Cannot get a good distribution anyway, should adjust -S -s, also -A -e in assembly
** PROC_STAT(0) **: real 5966.204 sec, user 4112.020 sec, sys 371.420 sec, maxrss 26459848.0 kB, maxvsize 36099220.0 kB
[Fri Jan 25 21:42:09 2019] - high frequency kmer depth is set to 32633
[Fri Jan 25 21:42:22 2019] - Total kmers = 839566821
[Fri Jan 25 21:42:22 2019] - average kmer depth = 14
[Fri Jan 25 21:42:22 2019] - 500740618 low frequency kmers (<2)
[Fri Jan 25 21:42:22 2019] - 4278 high frequency kmers (>32633)
[Fri Jan 25 21:42:22 2019] - indexing 338821925 kmers, 4885158132 instances (at most)
98172916 bins
[Fri Jan 25 21:44:05 2019] - indexed  338821925 kmers, 4872988802 instances
[Fri Jan 25 21:44:05 2019] - masked 1009496 bins as closed

ruanjue commented 5 years ago

Singletons make up most of the k-mers (about 501M of 840M), and the high-frequency cutoff of 32633 is too high. In my view, your data come from a highly repetitive genome, and the correction still left many errors.

If your assembly is not ideal, please try -x rs or tune detailed parameters.
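The singleton share mentioned above can be read straight from the log. The following is a minimal sketch, not part of wtdbg2: the function name `kmer_summary` and its regular expressions are assumptions based only on the log lines quoted in this thread, and other wtdbg2 versions may print different wording.

```python
import re

def kmer_summary(log_text):
    """Pull k-mer counts out of wtdbg2-style log text (hypothetical helper)."""
    # Patterns mirror the log lines quoted in this thread; they are assumptions
    # about the format, not a documented interface.
    patterns = {
        "total": r"Total kmers = (\d+)",
        "avg_depth": r"average kmer depth = (\d+)",
        "low_freq": r"(\d+) low frequency kmers",
        "high_freq": r"(\d+) high frequency kmers",
        "high_cutoff": r"high frequency kmers \(>(\d+)\)",
    }
    stats = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, log_text)
        if match:
            stats[name] = int(match.group(1))
    # Share of k-mers seen only once (the "<2" line in the log).
    if "total" in stats and "low_freq" in stats:
        stats["low_freq_fraction"] = stats["low_freq"] / stats["total"]
    return stats

log = """\
[Fri Jan 25 21:42:22 2019] - Total kmers = 839566821
[Fri Jan 25 21:42:22 2019] - average kmer depth = 14
[Fri Jan 25 21:42:22 2019] - 500740618 low frequency kmers (<2)
[Fri Jan 25 21:42:22 2019] - 4278 high frequency kmers (>32633)
"""
stats = kmer_summary(log)
print(f"{stats['low_freq_fraction']:.1%} of k-mers are singletons")
```

For the log excerpt above this gives a singleton share of roughly 60%, which matches the "most of the k-mers" reading.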

maximilianpress commented 5 years ago

Thanks so much, I will try that.

I think that the coverage is somewhat low for the application, so maybe it is not surprising that correction was not very successful. The organism in question is tetraploid, which may also contribute to difficulties.

maximilianpress commented 5 years ago

The assembly was not great (low contiguity, N50 ~90 kbp), so I did as you suggested and started another job with -x rs.

In this case I am seeing a definite non-singleton mode of the k-mer distribution, but the singleton number is still quite high (150M/345M) and the high frequency cutoff is not much lower (see log below).

Therefore I may investigate the detailed parameters. Is there a disciplined way to tune the detailed parameters, or is it a more heuristic process of trying many combinations and seeing what gives a reasonable k-mer distribution?

Many thanks, Max

[Sun Jan 27 19:54:37 2019] generating nodes, 64 threads
[Sun Jan 27 19:54:37 2019] indexing bins[0,98172916] (25132266496 bp), 64 threads
[Sun Jan 27 19:54:37 2019] - scanning kmers (K0P21S4.00) from 98172916 bins
********************** Kmer Frequency **********************
|                                                                                                   
|                                                                                                   
|                                                                                                   
|                                                                                                   
|                                                                                                   
|                                                                                                   
|                                                                                                   
|                                                                                                   
|                                                                                                   
|                                                                                                   
|                                                                                                   
|                                                                                                   
|                                                                                                   
|                                                                                                   
|  ||                                                                                               
|  ||                                                                                               
| ||||                                                                                              
| ||||                                                                                              
||||||                                                                                              
|||||||                                                                                             
|||||||                                                                                             
||||||||                                                                                            
||||||||                                                                                            
|||||||||                                                                                           
||||||||||                                                                                          
|||||||||||                                                                                         
||||||||||||                                                                                        
|||||||||||||||                                                                                     
|||||||||||||||||||                                                                                 
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
**********************     1 - 201    **********************
Quatiles:
   10%   20%   30%   40%   50%   60%   70%   80%   90%   95%
     5     9    15    30    86   308  1045  5112 17826 29569
# If the kmer distribution is not good, please kill me and adjust -k, -p, and -K
# Cannot get a good distribution anyway, should adjust -S -s, also -A -e in assembly
** PROC_STAT(0) **: real 157.042 sec, user 2445.210 sec, sys 87.760 sec, maxrss 14267096.0 kB, maxvsize 24320904.0 kB
[Sun Jan 27 19:55:39 2019] - high frequency kmer depth is set to 30923
[Sun Jan 27 19:55:46 2019] - Total kmers = 345252028
[Sun Jan 27 19:55:46 2019] - average kmer depth = 17
[Sun Jan 27 19:55:46 2019] - 150306460 low frequency kmers (<2)
[Sun Jan 27 19:55:46 2019] - 3388 high frequency kmers (>30923)
[Sun Jan 27 19:55:46 2019] - indexing 194942180 kmers, 3401041442 instances (at most)

ruanjue commented 5 years ago

In my experience, an ideal assembly shows 1) most k-mers with a frequency within 2~1000, 2) an average k-mer depth of 20~100, 3) 10~20% of bases clipped after alignment.

As you are assembling a very complicated genome, add -S 2 to get better alignments. As for -s 0.5, I think you need to change it to a smaller value, as the error rate is still high.
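One way to try this suggestion systematically is to generate one command line per candidate -s value and compare the resulting logs. This is a minimal sketch under stated assumptions: the helper `build_commands`, the output prefix pattern and the listed -s values are hypothetical and only illustrate "smaller than 0.5"; they are not recommended settings. The preset is placed first on the cautious assumption that later options take precedence over it.

```python
import shlex

def build_commands(reads, genome_size, preset, subsample, similarities, threads=0):
    """Build wtdbg2 command lines for a small -s sweep (hypothetical helper)."""
    commands = []
    for sim in similarities:
        # Hypothetical output prefix so that trials do not overwrite each other.
        prefix = f"trial_S{subsample}_s{sim}"
        args = [
            "wtdbg2", "-x", preset,
            "-i", reads, "-o", prefix,
            "-g", genome_size, "-t", str(threads),
            "-S", str(subsample), "-s", str(sim),
        ]
        commands.append(shlex.join(args))
    return commands

# Illustrative values only: -S 2 as suggested above, and a few -s values below 0.5.
commands = build_commands(
    reads="correctedReads.fasta",
    genome_size="1.3g",
    preset="rs",
    subsample=2,
    similarities=[0.3, 0.2, 0.1],
)
for command in commands:
    print(command)
```

The commands are only printed, not executed, so each one can be reviewed or submitted to a scheduler separately.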

Here, I paste the k-mer distribution when assembling Human PacBio CCS reads (err=1%).

[Fri Nov  2 14:49:14 2018] - scanning kmers (K21P0S8.00) from 344661446 bins
344661446 bins
********************** Kmer Frequency **********************
             |
             |
            |||
            |||
            |||
            |||
            |||
           |||||
           |||||
           |||||
           |||||
           |||||
           |||||
           ||||||
          |||||||
          |||||||
|         |||||||
|         |||||||
|         ||||||||
|         ||||||||
|        |||||||||
|        |||||||||
|        |||||||||
|        ||||||||||
|       |||||||||||
|       |||||||||||
|      |||||||||||||
|      |||||||||||||
|     |||||||||||||||
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
**********************     1 - 201    **********************
Quatiles:
   10%   20%   30%   40%   50%   60%   70%   80%   90%   95%
    19    23    25    27    30    32    37    93  2628 24228
# If the kmer distribution is not good, please kill me and adjust -k, -p
# Cannot get a good distribution anyway, should adjust -S -s, also -A -e in assembly
** PROC_STAT(0) **: real 727.009 sec, user 7371.840 sec, sys 717.690 sec, maxrss 38534860.0 kB, maxvsize 53316136.0 kB
[Fri Nov  2 14:51:40 2018] - Total kmers = 547546853
[Fri Nov  2 14:51:40 2018] - average kmer depth = 25
[Fri Nov  2 14:51:40 2018] - 220527186 low frequency kmers (<2)
[Fri Nov  2 14:51:40 2018] - 230001 high frequency kmers (>1000)
[Fri Nov  2 14:51:40 2018] - indexing 326789666 kmers, 8360317014 instances (at most)
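The first two rules of thumb given earlier in this comment (most k-mers within 2~1000, average k-mer depth of 20~100) can be checked against the quantile table and the average depth line of a log. This is a rough sketch: the helper names are hypothetical, and reading "most k-mers" as "the reported quantile levels whose value lies within 2~1000" is an assumption about the intended meaning. The third rule (clipped bases) is not visible in this part of the log and is left out.

```python
def parse_quantiles(header_line, value_line):
    """Map quantile levels (in percent) to the k-mer frequencies printed below them."""
    levels = [int(token.rstrip("%")) for token in header_line.split()]
    values = [int(token) for token in value_line.split()]
    return dict(zip(levels, values))

def check_rules_of_thumb(quantiles, avg_depth, low=2, high=1000, depth_range=(20, 100)):
    """Compare a k-mer summary with the rule-of-thumb ranges quoted in the thread."""
    # Quantile levels whose k-mer frequency falls inside [low, high].
    in_range = [level for level, freq in quantiles.items() if low <= freq <= high]
    return {
        "quantile_levels_in_range": in_range,
        "avg_depth_ok": depth_range[0] <= avg_depth <= depth_range[1],
    }

header = "   10%   20%   30%   40%   50%   60%   70%   80%   90%   95%"
# Values copied from the -x rs run and the human CCS run quoted in this thread.
sequel_rs = parse_quantiles(header, "     5     9    15    30    86   308  1045  5112 17826 29569")
human_ccs = parse_quantiles(header, "    19    23    25    27    30    32    37    93  2628 24228")

sequel_report = check_rules_of_thumb(sequel_rs, avg_depth=17)
ccs_report = check_rules_of_thumb(human_ccs, avg_depth=25)
print("Sequel (-x rs):", sequel_report)
print("Human CCS:", ccs_report)
```

Under this reading, the CCS run keeps its quantiles within range up to the 80% level and has an average depth inside 20~100, while the Sequel run leaves the range after the 60% level and falls short on average depth.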

maximilianpress commented 5 years ago

Thanks, I think this is enough information to proceed.

I think I can attribute my difficulties to shortcomings of the dataset. I will explore alternatives; I think the issue can be closed.

ruanjue / wtdbg2

kmer distribution #51