Closed by ASLeonard 2 years ago
Hello Alex!
Solid k-mer completeness is measured against 'all' solid k-mers in the genome: it tells you how much of the expected genome, as inferred from the k-mers, is present in the assembly.
Hapmer completeness is based on haplotype-specific k-mers and measures how completely those k-mers are captured (phased) in the assembly. The hapmers largely overlap with a subset of the solid k-mers, but I wouldn't use hapmer completeness to replace solid k-mer completeness, as it only measures part of the genome.
I think it's more accurate to say of your plot that the haplotype-specific sequences are well captured, given the near-complete hapmer completeness, while the solid k-mer completeness indicates that some k-mers shared by both parental genomes are missing from the assemblies.
Arang
Ah yeah, that makes more sense. I was starting to view hapmers as essentially the solid k-mers overlapping with that parent, rather than the k-mers unique to each parent.
On a related note, here is an example from a HiFi-based trio with a ~3 Gb genome. The combined completeness is extremely high, but the individual haplotypes are a fair bit lower.
hap1 all 1874413966 2124335604 88.2353
hap2 all 1951003169 2124335604 91.8406
both all 2114875456 2124335604 99.5547
This "problem" seems to scale with heterozygosity: a lower-diversity trio has hap1/hap2 ~ both, while a higher-heterozygosity trio has hap values of ~84 while both is 99.4. I believe this comes from, e.g., hap2-specific k-mers in the reads not being seen in the hap1 assembly, so this completeness value is negatively affected by having, e.g., crossbred parents?
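As a quick sanity check on the hap1/hap2/both rows above, here is a minimal Python sketch that recomputes the percentages. It assumes the columns are: assembly, k-mer set, solid k-mers found in the assembly, total solid k-mers in the read set, and completeness in percent. That column reading is my assumption and is not spelled out in this thread.

```python
# Rows copied from the comment above. Column meanings are ASSUMED:
# assembly, k-mer set, k-mers found in assembly, total solid k-mers, completeness (%).
rows = """\
hap1 all 1874413966 2124335604 88.2353
hap2 all 1951003169 2124335604 91.8406
both all 2114875456 2124335604 99.5547
"""


def completeness(found: int, total: int) -> float:
    """Percentage of the read-set solid k-mers that are present in the assembly."""
    return 100.0 * found / total


results = {}
for line in rows.splitlines():
    name, kmer_set, found, total, reported = line.split()
    pct = completeness(int(found), int(total))
    results[name] = (pct, float(reported))
    print(f"{name} ({kmer_set}): recomputed {pct:.4f}%, reported {reported}%")
```

All three rows share the same denominator (2124335604), so each haplotype is scored against the full read-set k-mers, including those specific to the other haplotype.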
Hi @ASLeonard, hope it's not too late to get back!
Yes, completeness does not account for heterozygosity. As in your example, hap2-specific k-mers that are not seen in the hap1 assembly do negatively affect the completeness %.
For trio assemblies, in my opinion it makes most sense to report the 'both' completeness, as it reflects how well the whole assembly represents the genome.
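To illustrate the effect, here is a toy Python sketch using plain sets. It is a deliberate simplification and not how Merqury computes anything: k-mers are just labels, multiplicity is ignored, each haplotype assembly is assumed to be perfect, and the function names (`solid_completeness`, `toy_trio`) are hypothetical.

```python
def solid_completeness(assembly: set, read_kmers: set) -> float:
    """Percentage of read-set k-mers found in the assembly (toy version)."""
    return 100.0 * len(assembly & read_kmers) / len(read_kmers)


def toy_trio(n_shared: int, n_specific: int) -> dict:
    """Build a toy diploid genome with shared and haplotype-specific k-mers."""
    shared = {f"s{i}" for i in range(n_shared)}
    hap1_only = {f"a{i}" for i in range(n_specific)}
    hap2_only = {f"b{i}" for i in range(n_specific)}

    reads = shared | hap1_only | hap2_only  # k-mers seen in the child's reads
    hap1_asm = shared | hap1_only           # a "perfect" hap1 assembly
    hap2_asm = shared | hap2_only           # a "perfect" hap2 assembly

    return {
        "hap1": solid_completeness(hap1_asm, reads),
        "hap2": solid_completeness(hap2_asm, reads),
        "both": solid_completeness(hap1_asm | hap2_asm, reads),
        # Hapmer-style completeness: only hap1-specific k-mers are scored.
        "hap1_hapmer": 100.0 * len(hap1_asm & hap1_only) / len(hap1_only),
    }


low = toy_trio(n_shared=980, n_specific=10)    # low heterozygosity
high = toy_trio(n_shared=800, n_specific=100)  # high heterozygosity
print("low :", low)
print("high:", high)
```

In this toy model, even perfect haplotype assemblies score below 100% on the 'all' k-mer set, and the gap grows with the number of haplotype-specific k-mers, while 'both' and the hapmer-style value stay at 100%.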
Arang
Great, thanks for your insight and the discussion.
Hi again Arang!
I was wondering if you have any interpretation for the difference in k-mer completeness when using the solid k-mers which come from spectra-cn, compared to the hapmer version from spectra-cn. Loosely speaking, the hapmer version always seems higher, but they do correlate very strongly. Particularly for trio-binned assemblies, it seems intuitive that the hapmer version captures that specific haplotype's completeness?
Now that standards seem to adopt something like
Thanks, Alex