ruanjue / wtdbg2

Redbean: A fuzzy Bruijn graph approach to long noisy reads assembly
GNU General Public License v3.0
513 stars 94 forks

Help with metrics #164

Closed. cement-head closed this issue 4 years ago

cement-head commented 4 years ago

I have several questions regarding wtdbg2 output (metrics) that I don't understand.

  1. What is the difference between -p (Kmer psize) and -k (Kmer fsize)?
  2. wtdbg2 generates two kmer files: a <*.binkmer> and a <*.kmerdep>. When evaluating the quality of an assembly, which one should I look at, and what should it look like? Why are there two columns of numbers in the <*.kmerdep> file? What do the columns represent?
  3. I don't quite understand the output from the seqN50.pl script. There are TWO numbers next to each metric, for example: N50: 369759 3921. Does that mean that there are 3921 contigs at the N50 level and the N50 bp length is 369,759?

ruanjue commented 4 years ago

1. -p is counted in homopolymer-compressed bases; -k is counted in original bases.
2. Just ignore them.
3. The two numbers are N50 and L50.
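To make answers 1 and 3 concrete, here is a minimal Python sketch of both ideas. It is an illustration of the concepts only, not wtdbg2's or seqN50.pl's actual code; the function names `homopolymer_compress` and `n50_l50` are hypothetical. It uses the common definitions: homopolymer compression collapses each run of identical bases to one base, N50 is the length of the contig at which the running total (longest first) reaches half of the assembly size, and L50 is the number of contigs needed to get there.

```python
from itertools import groupby


def homopolymer_compress(seq):
    """Collapse every run of identical bases into a single base."""
    return "".join(base for base, _run in groupby(seq))


def n50_l50(lengths, fraction=0.5):
    """Return (N50, L50) for a list of contig lengths.

    N50: length of the contig at which the cumulative length,
         taken from longest to shortest, reaches `fraction` of the total.
    L50: how many contigs were needed to reach that point.
    """
    ordered = sorted(lengths, reverse=True)
    target = sum(ordered) * fraction
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= target:
            return length, count
    raise ValueError("no contig lengths given")


if __name__ == "__main__":
    # Toy data, made up for illustration.
    print(homopolymer_compress("AAACCGTTTT"))
    print(n50_l50([80, 70, 50, 40, 30, 20]))
```

Read this way, a line such as `N50: 369759 3921` would mean an N50 of 369,759 bp reached with the 3,921 longest contigs, which matches the interpretation given in the question.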

cement-head commented 4 years ago

Okay, thanks. With regard to Q2, I assume the kmer metrics I should be monitoring are in the terminal output?

ruanjue commented 4 years ago

About Q2, I don't think users should monitor it; the relation between kmer metrics and assembly quality is not so direct.

cement-head commented 4 years ago

In that case, how does one validate an assembly with no known reference? BUSCO only? I was hoping that a kmer analysis would help me understand the validity of the assembly.

ruanjue commented 4 years ago

You can choose BUSCO, paired-end short-read mapping, long-read mapping, Bionano mapping, and others. Kmer analysis can be done with assembled contigs and short reads, but it does not work very well with long noisy reads.
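As a rough illustration of what a kmer analysis with contigs and short reads could look like, the sketch below computes the fraction of "solid" short-read kmers that also occur in the assembly. This is a simplified, in-memory toy under stated assumptions, not what any particular tool does: the function names, `k=21`, and the `min_count=2` cutoff for filtering likely sequencing errors are all assumptions chosen for the example.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_kmers(seq, k):
    """Yield the canonical form (min of kmer and its reverse complement)
    of every kmer in seq, skipping kmers with non-ACGT characters."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            revcomp = kmer.translate(_COMPLEMENT)[::-1]
            yield min(kmer, revcomp)


def kmer_completeness(contigs, reads, k=21, min_count=2):
    """Fraction of distinct read kmers seen at least min_count times
    that are also present somewhere in the contigs."""
    read_counts = Counter(
        kmer for read in reads for kmer in canonical_kmers(read, k)
    )
    solid = {kmer for kmer, n in read_counts.items() if n >= min_count}
    if not solid:
        return 0.0
    assembly = {
        kmer for contig in contigs for kmer in canonical_kmers(contig, k)
    }
    return len(solid & assembly) / len(solid)


if __name__ == "__main__":
    # Toy sequences, made up for illustration.
    contigs = ["ACGTACGGA"]
    reads = ["ACGTACG", "ACGTACG", "TTTTTTT"]
    print(kmer_completeness(contigs, reads, k=5, min_count=2))
```

A value close to 1 would suggest that most of the sequence supported by the short reads is represented in the contigs. For real genomes a dedicated kmer counter is needed, since holding every kmer in a Python set does not scale.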