Kmer distribution differences between raw reads and corrected reads

zy041225 commented 5 years ago

Hi Jue,

I'm trying to assemble a genome using wtdbg2 with default parameters. I've run two assemblies: one using only wtdbg2 with raw Sequel reads, and one using Canu first to get corrected reads, which were then assembled with wtdbg2. I found that the k-mer distributions differ between them (the first one below was generated from the raw Sequel reads, the second from the Canu-corrected reads), and the corrected-reads version seems to look better. However, my Canu-wtdbg2 assembly is 200 Mb smaller than the genome size I estimated from Illumina reads, while the wtdbg2-only assembly size is OK. So I would like to ask whether the k-mer distribution of the raw reads is good enough to proceed and whether the result is reliable. Also, do you have any suggestions about parameter tuning for the corrected version?

Here are the parameters I used with wtdbg2:

# raw reads
wtdbg2 -x sq -g 2.4g -i raw_reads.fa.gz -fo out
# corrected reads
wtdbg2 -x corrected -g 2.4g -i correctedReads.fasta.gz --rescue-low-cov-edges -S 1 --tidy-reads 1000 -fo out

Many thanks

Best, Yang

********************** Kmer Frequency **********************
                         |
                    |||||||||||
                  ||||||||||||||||
               |||||||||||||||||||||
           ||||||||||||||||||||||||||||
        |||||||||||||||||||||||||||||||||
       |||||||||||||||||||||||||||||||||||||
       ||||||||||||||||||||||||||||||||||||||||
      |||||||||||||||||||||||||||||||||||||||||||||
      |||||||||||||||||||||||||||||||||||||||||||||||||||
     |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
     |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
    ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
    ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
   |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
   |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
   |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
  ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
  ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
**********************     1 - 201    **********************
Quatiles:
   10%   20%   30%   40%   50%   60%   70%   80%   90%   95%
    41    68   106   156   206   261   332   459   996  3541
# If the kmer distribution is not good, please kill me and adjust -k, -p, and -K
# Cannot get a good distribution anyway, should adjust -S -s, also -A -e in assembly
** PROC_STAT(0) **: real 2425.820 sec, user 21968.650 sec, sys 306.470 sec, maxrss 45574272.0 kB, maxvsize 48586804.0 kB
[Mon Feb 18 11:46:00 2019] - high frequency kmer depth is set to 3576
[Mon Feb 18 11:46:00 2019] - Total kmers = 387154390
[Mon Feb 18 11:46:00 2019] - average kmer depth = 88
[Mon Feb 18 11:46:00 2019] - 891029 low frequency kmers (<2)
[Mon Feb 18 11:46:00 2019] - 178148 high frequency kmers (>3576)

********************** Kmer Frequency **********************
                |
|              |||
|             ||||
|             |||||
|            |||||||
|            |||||||
|           |||||||||
|           |||||||||
|          |||||||||||
|        |||||||||||||
|      ||||||||||||||||
|      |||||||||||||||||
|     |||||||||||||||||||||||
|    |||||||||||||||||||||||||||||
||   ||||||||||||||||||||||||||||||||
||   ||||||||||||||||||||||||||||||||||
||  |||||||||||||||||||||||||||||||||||||||
||||||||||||||||||||||||||||||||||||||||||||||||||
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
**********************     1 - 201    **********************
Quatiles:
   10%   20%   30%   40%   50%   60%   70%   80%   90%   95%
    17    27    34    42    55    73   105   260  3440 17220
# If the kmer distribution is not good, please kill me and adjust -k, -p, and -K
# Cannot get a good distribution anyway, should adjust -S -s, also -A -e in assembly
** PROC_STAT(0) **: real 2233.819 sec, user 19576.280 sec, sys 300.290 sec, maxrss 59664132.0 kB, maxvsize 64697116.0 kB
[Mon Mar  4 15:14:04 2019] - high frequency kmer depth is set to 18150
[Mon Mar  4 15:14:05 2019] - Total kmers = 1861927320
[Mon Mar  4 15:14:05 2019] - average kmer depth = 32
[Mon Mar  4 15:14:05 2019] - 632837599 low frequency kmers (<2)
[Mon Mar  4 15:14:05 2019] - 45164 high frequency kmers (>18150)
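To compare the two logs above on the same scale, the raw counts can be turned into fractions of the total k-mers. The snippet below is only a minimal sketch: the function name `kmer_fractions` is made up for illustration, and it assumes the summary lines keep the exact wording shown in the logs above.

```python
import re

def kmer_fractions(log_text):
    """Return (low_fraction, high_fraction) of the total k-mers.

    Hypothetical helper: it assumes the log contains the
    'Total kmers = N', 'N low frequency kmers' and
    'N high frequency kmers' lines exactly as pasted above.
    """
    total = int(re.search(r"Total kmers = (\d+)", log_text).group(1))
    low = int(re.search(r"(\d+) low frequency kmers", log_text).group(1))
    high = int(re.search(r"(\d+) high frequency kmers", log_text).group(1))
    return low / total, high / total

# Summary lines copied from the first and second logs in this thread.
first_log = """
[Mon Feb 18 11:46:00 2019] - Total kmers = 387154390
[Mon Feb 18 11:46:00 2019] - 891029 low frequency kmers (<2)
[Mon Feb 18 11:46:00 2019] - 178148 high frequency kmers (>3576)
"""
second_log = """
[Mon Mar  4 15:14:05 2019] - Total kmers = 1861927320
[Mon Mar  4 15:14:05 2019] - 632837599 low frequency kmers (<2)
[Mon Mar  4 15:14:05 2019] - 45164 high frequency kmers (>18150)
"""

first_low, first_high = kmer_fractions(first_log)
second_low, second_high = kmer_fractions(second_log)

print(f"first log:  low={first_low:.4%}  high={first_high:.4%}")
print(f"second log: low={second_low:.4%}  high={second_high:.4%}")
```

Expressed this way, the share of k-mers seen fewer than twice is well under 1% in the first log and roughly a third in the second.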

ruanjue commented 5 years ago

Hi,

Both k-mer distributions look good. The k-mer distribution is used in indexing; if your assembly is good enough, ignore it. Otherwise, please make sure most of the k-mers lie within 2~<-K> (between 2 and the -K cutoff), and especially that the 95% quantile is not too high.
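As a rough illustration of that check, the quantile table printed in the log can be compared against a chosen upper cutoff. This is only a sketch: the helper names are hypothetical, and the cutoff of 1000 is an assumed value for illustration, so substitute whatever matches your own -K setting.

```python
def parse_quantiles(header_line, value_line):
    """Pair each percentage label with its k-mer depth."""
    labels = header_line.split()
    depths = [int(v) for v in value_line.split()]
    return dict(zip(labels, depths))

def quantiles_within(quantiles, low, high):
    """Return the labels whose depth lies in the closed range [low, high]."""
    return [label for label, depth in quantiles.items() if low <= depth <= high]

# Quantile tables copied from the first and second logs in this thread.
header = "   10%   20%   30%   40%   50%   60%   70%   80%   90%   95%"
first_log = parse_quantiles(
    header, "    41    68   106   156   206   261   332   459   996  3541")
second_log = parse_quantiles(
    header, "    17    27    34    42    55    73   105   260  3440 17220")

LOWER = 2     # lower bound from the "2~<-K>" range mentioned above
UPPER = 1000  # assumed cutoff for illustration only; use your own -K value

for name, table in (("first log", first_log), ("second log", second_log)):
    inside = quantiles_within(table, LOWER, UPPER)
    print(f"{name}: {len(inside)} of {len(table)} quantiles within "
          f"[{LOWER}, {UPPER}]; 95% quantile = {table['95%']}")
```

With this assumed cutoff, only the top quantiles fall outside the range in either log, which is the part of the table worth watching.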

I have no further suggestions on tuning for corrected reads; I have little experience with them.

Best, Jue

zy041225 commented 5 years ago

Thanks for the answer.

ruanjue / wtdbg2

Kmer distribution differences between raw reads and corrected reads #90