Closed ASLeonard closed 3 years ago
Just noticed the description in the meryl-lookup -wig-depth
option
Generate a WIGGLE format file showing the number of kmers in any input database that cover each position in the sequence
I guess this does then explain the triangle shapes, as kmers are effectively a sliding window across a sequence. I guess a possible issue with (over?)interpreting these results is height/count of these triangle may scale more rapidly with coverage (quadratic?) compared than the number of underlying errors (linear?).
Hello @ASLeonard,
What you are seeing is the kmer coverage of the error / pat / mat mers. The triangle shape max height depends on the kmer size. The width of the triangle will become longer when there are multiple of the quereying kmers seen along the genome, adjacent to each other. The QV estimation is taking account for the kmer size and the number of kmers.
In your above second plot, it looks like there were paternal hapmers found right before maternal hapmers, with assembly-only (error) k-mers around. This is a typical signature we see at or around chimeric / haplotype sequence junctions.
So having a nonzero value doesn't directly mean there is an error with that base, rather that base is covered by a kmer which contains a mismatch somewhere within its width?
I guess more simply, is the above plot indicating there are errors with all the bases in that region, or only a few bases are errors (approximately one per triangle peak)?
Only a few bases are errors per triangle peak. Roughly, a single base pair error affects k kmers, covering the error base + 2 (k-1) bases. So you could estimate the num. base errors = bases covered by the peak - 2(k-1), with the caveat that there might be kmers made by the error base that is found in a real genome by coincidence.
This usually happens in repetitive regions where the kmer space is limited.
Great, thanks for clarifying!
Hi Arang, The latest changes are great, it is significantly faster and easier to work with! My question was around some of the new .wig files, which replaced the .tdf files I never got around to exploring. In this figure, the rows are QV, paternal hapmers, and maternal hapmers respectively.
Is the interpretation that the numbers are respectively the number of
Also when zooming in significantly (100bp), they all tend to take on triangular shapes, which I wouldn't naively expect if the above definitions are near correct. It may be related to some windowed overlap (like the convolution of two boxcar functions), but that doesn't particularly seem realistic given the genomic context and presumed variability in coverage.
Thanks, Alex