marbl / merqury

k-mer based assembly evaluation

Understanding what total represents in different output files #84

Closed priyanka-surana closed 1 year ago

priyanka-surana commented 1 year ago

Hello,

Why do the Total values in the completeness.stats and qv files differ so much? What do they represent, and how do they relate to each other? I ran MerquryFK on a single genome assembled from PacBio HiFi and Hi-C data, evaluated against an Illumina kmer dataset.

# mMelMel1_T1.qv
Assembly                No Support  Total    Error %  QV
GCA_922984935.2.subset  6005        7999890  0.0024   46.2

# mMelMel1_T1.completeness.stats
Assembly                Region  Found       Total       % Covered
GCA_922984935.2.subset  all     2268391877  2268397787  100.00

Thanks 😊 Priyanka

arangrhie commented 1 year ago

Hello Priyanka,

The Total in QV is the number of kmers 'present' in the assembly, counted with multiplicity. So if one specific kmer is found 3 times in the assembly but never in the reads, it is counted as 3 error kmers (no support). Those 3 error kmers are part of the Total.

The Total in completeness is the number of distinct solid kmers in the reads. In other words, a kmer that is present above a certain frequency in the reads is counted as one kmer, however many times it occurs. I don't remember exactly how the Total is computed in MerquryFK completeness. It's likely that it only filters out kmers with a frequency of 1, which I think is the default in FastK? Might be a good question for Gene.

Best, Arang
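To make the two counting schemes concrete, here is a minimal toy sketch in Python. It is not how Merqury or MerquryFK is implemented. The sequences, `K = 3`, and the `min_count = 2` "solid" threshold are made up for illustration, and canonical (strand-collapsed) kmers are ignored for simplicity.

```python
from collections import Counter

def kmers(seq, k):
    """Return every k-mer occurrence in seq, in order (duplicates kept)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

K = 3            # toy k-mer size (hypothetical)
min_count = 2    # hypothetical "solid" threshold for read k-mers

# Toy data: the assembly ends in a run of T that no read supports.
assembly = "ACGTACGTTTTT"
reads = ["ACGTACG", "ACGTACG", "CGTACGA", "TACGA"]

read_counts = Counter(km for r in reads for km in kmers(r, K))
asm_kmers = kmers(assembly, K)

# QV-style counting: every k-mer occurrence in the assembly counts.
qv_total = len(asm_kmers)
no_support = sum(1 for km in asm_kmers if km not in read_counts)

# Completeness-style counting: distinct solid read k-mers, one each.
solid = {km for km, c in read_counts.items() if c >= min_count}
completeness_total = len(solid)
found = len(solid & set(asm_kmers))

print("QV:           no_support =", no_support, " total =", qv_total)
print("Completeness: found =", found, " total =", completeness_total)
```

In this toy, `TTT` occurs three times in the assembly and never in the reads, so it adds 3 to both No Support and the QV Total, while `CGA` is a solid read kmer missing from the assembly and therefore lowers Found without touching the QV numbers.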

priyanka-surana commented 1 year ago

@arangrhie I posted the question with FastK (Understanding how kmers are counted #24). There may be a difference in the way kmers are counted between merqury and merquryFK, because the total for QC is ~8M whereas the total for Completeness is ~2.2B, which if I understood you correctly is the opposite of what would be expected.

arangrhie commented 1 year ago

You mean the Total for QV? That counts the kmers in the assembly, so it should match the assembly size, which is ~8M. The completeness Total in Merqury usually still contains erroneous kmers from the reads, even after filtering out low-frequency kmers, so it's nearly impossible to reach 100%. There are also always edge cases in HiFi assemblies where HiFi or Illumina sequencing biases are involved (e.g. homopolymer / 2-mer microsatellite indel errors, GC biases), so having 100% completeness seems very suspicious...
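As a rough check of the numbers posted above, the sketch below recomputes both files' last columns. It assumes the QV formula from the Merqury paper (per-base error rate E = 1 - (1 - NoSupport/Total)^(1/k), QV = -10 log10 E) also applies to MerquryFK, and it assumes `k = 31`. The kmer size is not stated in this thread; 31 is only the value that appears to reproduce the reported figures.

```python
import math

k = 31  # assumption: k-mer size is not given in the thread

# Values copied from the .qv and .completeness.stats files above
no_support, qv_total = 6005, 7_999_890
found, comp_total = 2_268_391_877, 2_268_397_787

# Fraction of assembly k-mers supported by the reads
p_kmer_ok = 1 - no_support / qv_total

# Per-base error rate and Phred-scaled QV (Merqury paper formula)
error_rate = 1 - p_kmer_ok ** (1 / k)
qv = -10 * math.log10(error_rate)

# Completeness as a percentage, before rounding to two decimals
completeness = 100 * found / comp_total

print(f"Error % = {100 * error_rate:.4f}, QV = {qv:.1f}")
print(f"Completeness = {completeness:.5f} %")
```

Under these assumptions the reported Error % and QV are reproduced, and the completeness comes out just below 100: 5910 solid read kmers are not found, and the value only shows as 100.00 after rounding to two decimals.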

arangrhie commented 1 year ago

Closing this for now. @priyanka-surana let me know if you need more from me!