Closed priyanka-surana closed 1 year ago
Hello Priyanka,
The Total in QV is the number of kmers 'present' in the assembly, with every occurrence counted. So if one specific kmer is found 3 times in the assembly but never in the reads, it is counted as 3 error kmers (no support). Those 3 error kmers are part of the Total.
The Total in completeness is the number of distinct solid kmers in the reads. In other words, a kmer that is present above a certain frequency in the reads is counted as one kmer, no matter how often it occurs. I forget exactly how the Total is computed in MerquryFK completeness. It is likely only filtering out kmers with a frequency of 1, which is the default in FastK? Might be a good question for Gene.
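To make the two definitions concrete, here is a toy sketch in Python. It is not MerquryFK's code: the helper names (`kmers`, `qv_totals`, `completeness_totals`), the `min_count=2` cutoff (mirroring the guess above that only frequency-1 kmers are dropped), and the tiny sequences are all assumptions for illustration, and canonical (reverse-complement) kmers are ignored.

```python
from collections import Counter


def kmers(seq, k):
    """Return every k-mer occurrence in seq, in order (no canonicalisation)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def qv_totals(assembly, read_counts, k):
    """QV-style totals: each k-mer occurrence in the assembly is counted."""
    asm = kmers(assembly, k)
    total = len(asm)
    # An occurrence is an "error" if the k-mer is never seen in the reads.
    errors = sum(1 for km in asm if km not in read_counts)
    return total, errors


def completeness_totals(assembly, read_counts, k, min_count=2):
    """Completeness-style totals: distinct solid read k-mers, each counted once.

    min_count is an assumed threshold for "solid", not a confirmed default.
    """
    solid = {km for km, c in read_counts.items() if c >= min_count}
    found = solid & set(kmers(assembly, k))
    return len(solid), len(found)


k = 3
reads = ["ACGTAC", "ACGTAC", "GTACC"]
read_counts = Counter(km for r in reads for km in kmers(r, k))
assembly = "ACGTTTTT"  # TTT occurs 3 times and is absent from the reads

print(qv_totals(assembly, read_counts, k))            # (total, error k-mers)
print(completeness_totals(assembly, read_counts, k))  # (total solid, found)
```

In this toy case the QV total is driven by the assembly length, while the completeness total is driven by how many distinct kmers the read set contains, so the two numbers need not be similar.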
Best, Arang
@arangrhie I posted the question with FastK (Understanding how kmers are counted #24). There may be a difference in the way kmers are counted between merqury and merquryFK, because the total for QC is ~8M whereas the total for Completeness is ~2.2B, which if I understood you correctly is the opposite of what would be expected.
You mean total for QV? That's counting the kmers in the assembly, so it should match the assembly size, which is ~8M. The completeness Total in Merqury usually still contains erroneous kmers from the reads, even after filtering out low-frequency kmers, so it's nearly impossible to reach 100%. Also, there are always edge cases in HiFi assemblies where HiFi or Illumina sequencing biases are involved (e.g. homopolymer / 2-mer microsatellite indel errors, GC biases), so having 100% completeness seems very suspicious...
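For reference, a minimal sketch of how a QV follows from those two assembly-side counts, using the formula from the Merqury paper: the per-base error rate is E = 1 - (1 - K_asm_only / K_total)^(1/k) and QV = -10 log10(E). That MerquryFK computes it the same way is an assumption here, and the function name `merqury_qv` and the example numbers are made up.

```python
import math


def merqury_qv(asm_only, total, k):
    """Consensus QV from assembly-only k-mers and total assembly k-mers.

    asm_only: k-mer occurrences in the assembly with no support in the reads
    total:    all k-mer occurrences in the assembly (roughly assembly size)
    """
    if asm_only == 0:
        return math.inf  # no unsupported k-mers: error rate estimated as 0
    # Probability that a single base is correct, assuming independent errors.
    p_base_correct = (1 - asm_only / total) ** (1 / k)
    return -10 * math.log10(1 - p_base_correct)


# Hypothetical counts for an ~8M k-mer assembly with k = 21.
print(round(merqury_qv(asm_only=800, total=8_000_000, k=21), 2))
```

Note that the read-side completeness Total does not enter this calculation at all, which is why the two totals can differ by orders of magnitude.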
Closing this for now. @priyanka-surana let me know if you need more from me!
Hello,
Why do the total values in the completeness.stats and qv files differ so much? What do they represent and how do they relate to each other? I ran merquryfk on a single genome assembled using PacBio HiFi and Hi-C data, against an Illumina kmer dataset.
Thanks 😊 Priyanka