marbl / merqury

k-mer based assembly evaluation

Differences between reporting solid kmer or hapmer based completeness for trios #57

Closed ASLeonard closed 2 years ago

ASLeonard commented 2 years ago

Hi again Arang!

I was wondering if you have any interpretation for the difference in k-mer completeness when using the solid k-mers which come from spectra-cn, compared to the hapmer version from spectra-cn. Loosely speaking, the hapmer version always seems higher, but they do correlate very strongly. Particularly for trio-binned assemblies, it seems intuitive that the hapmer version captures that specific haplotype's completeness?

image

Now that standards seem to adopt something like >90% kmer completeness, it seems important, as sometimes a solid k-mer value will be below that while the hapmer value is way above it. It looks like the VGP uses solid k-mers (Table S19), but the trio-based zebra finch also has the lowest completeness listed, again maybe something hapmer completeness is better for?

Thanks, Alex

arangrhie commented 2 years ago

Hello Alex!

Solid k-mer completeness is measured against 'all' k-mers in the genome. It measures how much of the genome is assembled relative to the expected genome inferred from k-mers.

Hapmer completeness is measured on haplotype-specific k-mers: how completely the haplotype-specific k-mers are captured (phased) in the assembly. These hapmers largely overlap with a subset of the solid k-mers, but I wouldn't use hapmer completeness to replace solid k-mer completeness, as it only measures part of the genome.

I think it's more accurate to say of your plot that the haplotype-specific sequences are captured almost completely, while the solid k-mer completeness indicates that some k-mers shared by both parental genomes are missing from the assemblies.
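To make the distinction concrete, here is a minimal toy sketch in Python. It is not how merqury computes anything internally: integers stand in for k-mers, and the set names and sizes are made up for illustration. It only shows how an assembly can score 100% on hapmers while missing shared k-mers.

```python
# Toy illustration only: integers stand in for k-mers, all sets are hypothetical.

def completeness(assembly_kmers, reference_kmers):
    """Percent of the reference k-mers that are also found in the assembly."""
    found = len(assembly_kmers & reference_kmers)
    return 100.0 * found / len(reference_kmers)

# Hypothetical solid k-mers from the child's read set.
shared = set(range(0, 80))          # present in both parental haplotypes
mat_hapmers = set(range(80, 90))    # maternal-specific k-mers
pat_hapmers = set(range(90, 100))   # paternal-specific k-mers
solid = shared | mat_hapmers | pat_hapmers

# Hypothetical maternal assembly: every maternal hapmer is present,
# but 8 of the shared k-mers were not assembled.
mat_assembly = set(range(8, 80)) | mat_hapmers

solid_pct = completeness(mat_assembly, solid)         # measured on all solid k-mers
hapmer_pct = completeness(mat_assembly, mat_hapmers)  # measured on hapmers only

print(f"solid k-mer completeness: {solid_pct:.1f}%")
print(f"hapmer completeness:      {hapmer_pct:.1f}%")
```

In this toy case the hapmer value is perfect because it only looks at the maternal-specific subset, while the solid value is pulled down both by the missing shared k-mers and by the paternal-specific k-mers that a maternal assembly is not expected to contain.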

Arang

ASLeonard commented 2 years ago

Ah yeah makes more sense. I was starting to view hapmers as essentially the solid k-mers overlapping with that parent, rather than unique parental k-mers.

On a related note, here is an example from a HiFi-based trio, ~3 Gb in size. The combined completeness is extremely high, but the individual haplotypes are a fair bit lower.

hap1    all     1874413966      2124335604      88.2353
hap2    all     1951003169      2124335604      91.8406
both    all     2114875456      2124335604      99.5547

This "problem" seems to scale with heterozygosity, where a lower diversity trio has hap1/hap2 ~ both, while a higher heterozygosity trio has hap values of ~84 while both is 99.4. I believe this is caused by, e.g., hap2-specific k-mers in the reads that are not seen in the hap1 assembly, so the per-haplotype completeness value is negatively affected by having, e.g., crossbred parents?
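As a quick sanity check on the three rows above, the last column can be reproduced as the third column divided by the fourth, times 100. The sketch below assumes the third column is the number of solid read k-mers found in the assembly and the fourth is the total number of solid k-mers in the read set; the function name is made up for illustration.

```python
# The three rows quoted above, whitespace-separated.
ROWS = """\
hap1 all 1874413966 2124335604 88.2353
hap2 all 1951003169 2124335604 91.8406
both all 2114875456 2124335604 99.5547
"""

def recompute_completeness(text):
    """Return {assembly: (recomputed %, reported %, k-mers not found)}."""
    result = {}
    for line in text.splitlines():
        assembly, kmer_set, found, total, reported = line.split()
        found, total = int(found), int(total)
        pct = 100.0 * found / total   # assumed meaning: found / total * 100
        result[assembly] = (pct, float(reported), total - found)
    return result

stats = recompute_completeness(ROWS)
for assembly, (pct, reported, missing) in stats.items():
    print(f"{assembly}: recomputed {pct:.4f}, reported {reported}, "
          f"{missing} read k-mers not found")
```

The "not found" count makes the gap easier to see: each single haplotype is missing far more read k-mers than the combined assembly is.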

arangrhie commented 2 years ago

Hi @ASLeonard , hope it's not too late to get back!

Yes, completeness is not accounting for heterozygosity. Your example of hap2-specific kmers not seen in hap1 assembly does negatively affect the completeness %.

For trio assemblies, reporting the 'both' completeness makes the most sense in my opinion, as it reflects how well the whole assembly represents the genome.
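A simplified model of why this happens, as a sketch only: assume every solid k-mer in the reads is either shared by both haplotypes or specific to one of them, and that each haplotype assembly is perfect for its own haplotype. The function name and the counts are hypothetical, and sequencing error, copy number, and assembly gaps are ignored.

```python
def expected_completeness(shared, hap1_specific, hap2_specific):
    """Best-case completeness (%) for hap1, hap2, and both assemblies combined.

    Each haplotype assembly is assumed to contain all shared k-mers plus all
    of its own haplotype-specific k-mers, and nothing from the other haplotype.
    """
    total = shared + hap1_specific + hap2_specific
    hap1 = 100.0 * (shared + hap1_specific) / total
    hap2 = 100.0 * (shared + hap2_specific) / total
    both = 100.0 * total / total   # the union covers every read k-mer
    return hap1, hap2, both

# Hypothetical k-mer counts for a low and a high heterozygosity trio.
low_het = expected_completeness(shared=98, hap1_specific=1, hap2_specific=1)
high_het = expected_completeness(shared=70, hap1_specific=15, hap2_specific=15)

print("low heterozygosity  (hap1, hap2, both):", low_het)
print("high heterozygosity (hap1, hap2, both):", high_het)
```

Even with perfect assemblies in this model, the per-haplotype values drop as the haplotype-specific fraction grows, while 'both' stays at the maximum.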

Arang

ASLeonard commented 2 years ago

Great, thanks for your insight and the discussion.