marbl / merqury

k-mer based assembly evaluation
Other
272 stars 19 forks source link

Interpreting the new wig files for QV and hapmer density #44

Closed ASLeonard closed 3 years ago

ASLeonard commented 3 years ago

Hi Arang, The latest changes are great, it is significantly faster and easier to work with! My question was around some of the new .wig files, which replaced the .tdf files I never got around to exploring. In this figure, the rows are QV, paternal hapmers, and maternal hapmers respectively.

image

Is the interpretation that the numbers are respectively the number of

Also when zooming in significantly (100bp), they all tend to take on triangular shapes, which I wouldn't naively expect if the above definitions are near correct. It may be related to some windowed overlap (like the convolution of two boxcar functions), but that doesn't particularly seem realistic given the genomic context and presumed variability in coverage.

image

Thanks, Alex

ASLeonard commented 3 years ago

Just noticed the description in the meryl-lookup -wig-depth option

Generate a WIGGLE format file showing the number of kmers in any input database that cover each position in the sequence

I guess this does then explain the triangle shapes, as kmers are effectively a sliding window across a sequence. I guess a possible issue with (over?)interpreting these results is height/count of these triangle may scale more rapidly with coverage (quadratic?) compared than the number of underlying errors (linear?).

arangrhie commented 3 years ago

Hello @ASLeonard,

What you are seeing is the kmer coverage of the error / pat / mat mers. The triangle shape max height depends on the kmer size. The width of the triangle will become longer when there are multiple of the quereying kmers seen along the genome, adjacent to each other. The QV estimation is taking account for the kmer size and the number of kmers.

In your above second plot, it looks like there were paternal hapmers found right before maternal hapmers, with assembly-only (error) k-mers around. This is a typical signature we see at or around chimeric / haplotype sequence junctions.

ASLeonard commented 3 years ago

So having a nonzero value doesn't directly mean there is an error with that base, rather that base is covered by a kmer which contains a mismatch somewhere within its width?

I guess more simply, is the above plot indicating there are errors with all the bases in that region, or only a few bases are errors (approximately one per triangle peak)?

arangrhie commented 3 years ago

Only a few bases are errors per triangle peak. Roughly, a single base pair error affects k kmers, covering the error base + 2 (k-1) bases. So you could estimate the num. base errors = bases covered by the peak - 2(k-1), with the caveat that there might be kmers made by the error base that is found in a real genome by coincidence.

This usually happens in repetitive regions where the kmer space is limited.

ASLeonard commented 3 years ago

Great, thanks for clarifying!