Help with metrics - Githubissues

cement-head commented 4 years ago

I have several questions regarding wtdbg2 output (metrics) that I don't understand.

1. What is the difference between -k (Kmer fsize) and -p (Kmer psize)?
2. wtdbg2 generates two kmer files, <*.binkmer> and <*.kmerdep>. When evaluating the quality of an assembly, which one should I look at, and what should it look like? Why are there two columns of numbers in the <*.kmerdep> file, and what do the columns represent?
3. I don't quite understand the output from the seqN50.pl script: there are TWO numbers next to each metric, for example "N50: 369759 3921". Does that mean that there are 3921 contigs at the N50 level and the N50 length is 369,759 bp?

ruanjue commented 4 years ago

1. -p counts homopolymer-compressed bases; -k counts original bases.
2. Just ignore them.
3. The two numbers are N50 and L50.
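
To make answers 1 and 3 concrete, here is a minimal Python sketch. It is not code from wtdbg2 or seqN50.pl; the function names `homopolymer_compress` and `n50_l50` are made up for illustration, and the "cumulative length reaches at least half of the total" convention is an assumption that may differ slightly from what a given script does.

```python
from itertools import groupby


def homopolymer_compress(seq):
    """Collapse every run of identical bases into a single base."""
    return "".join(base for base, _run in groupby(seq))


def n50_l50(lengths):
    """Return (N50, L50) for a list of contig lengths.

    N50 is the length of the contig at which the cumulative length of
    the longest contigs first reaches half of the total assembly size.
    L50 is how many contigs were needed to get there.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    raise ValueError("no contig lengths given")


if __name__ == "__main__":
    print(homopolymer_compress("AAACCGTTTT"))  # ACGT
    print(n50_l50([10, 8, 6, 4, 2]))           # (8, 2)
```

Read this way, a line such as "N50: 369759 3921" would mean an N50 of 369,759 bp and an L50 of 3921, i.e. the 3921 longest contigs together cover at least half of the assembly.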

cement-head commented 4 years ago

Okay, thanks. With regard to Q2: I assume the kmer metrics I should be monitoring are the ones in the terminal output?

ruanjue commented 4 years ago

About Q2, I don't think users should monitor them; the relation between kmer metrics and assembly quality is not that direct.

cement-head commented 4 years ago

In that case, how does one validate an assembly with no known reference? BUSCO only? I was hoping that a kmer analysis would help me understand the validity of the assembly.

ruanjue commented 4 years ago

You can choose BUSCO, paired-end short-read mapping, long-read mapping, Bionano mapping, and others. Kmer analysis can be done with assembled contigs and short reads, but it does not work well with noisy long reads.
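
As a rough illustration of the contigs-versus-short-reads kmer comparison mentioned above, the toy Python sketch below computes the fraction of "solid" read kmers that also occur in the contigs. Everything in it is an assumption made for illustration: the function names, the default `k=21`, and the `min_count=2` threshold for treating a kmer as solid. It holds all kmers in memory, so it only suits tiny inputs; a real analysis would use a dedicated kmer counter.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a kmer and its reverse complement."""
    rc = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, rc)


def canonical_kmers(seq, k):
    """Yield the canonical form of every kmer in seq."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        yield canonical(seq[i:i + k])


def kmer_completeness(reads, contigs, k=21, min_count=2):
    """Fraction of solid read kmers (seen >= min_count times) found in the contigs."""
    read_counts = Counter(km for r in reads for km in canonical_kmers(r, k))
    solid = {km for km, n in read_counts.items() if n >= min_count}
    if not solid:
        return 0.0
    assembled = {km for c in contigs for km in canonical_kmers(c, k)}
    return len(solid & assembled) / len(solid)


if __name__ == "__main__":
    reads = ["ACGTAC", "ACGTAC", "GGGGGG"]
    print(kmer_completeness(reads, ["ACGTAC"], k=3))
```

A value well below 1 would suggest that sequence supported by the short reads is missing from the contigs. The count threshold exists to discard kmers created by sequencing errors, which is also why the same approach is much less informative when the kmers come from noisy long reads.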

ruanjue / wtdbg2

Help with metrics #164