Merqury over-estimates genome completeness

bredeson commented 4 years ago

Hey, Great tool, it's very useful and easy to use!

I ran Merqury on an Illumina-based assembly I had and obtained a completeness estimate much larger than I expected, given the genome's known size. I think I found a semantic bug in it's completeness calculation:

When calculating the size of a genome with k-mers, the calculation must take into account the multiplicity of the k-mer as well as its frequency in order to normalize for genomic copy number. As Merqury is currently written, it only sums up the frequency and doesn't take copy number into account (it is basically assuming all k-mers have a copy number of 1 in the genome).

The equation for calculating the fraction of genome assembled is:

G = k_d * sum [ k_i * f_a(k_i) ] / sum [ k_i * f_r(k_i) ] ,  for i in 1 -> n

where: G is the fraction of genome assembled (i.e., the completeness) k_i is the value at multiplicity i k_d is the value of the diploid peak multiplicity f_a(k_i) is the frequency at multiplicity k_i in the assembly f_r(k_i) is the frequency at multiplicity k_i in the reads

See spectra-cn.sh, lines 132-134:

TOTAL=`meryl statistics $read_solid | head -n3 | tail -n1 | awk '{print $2}'`
ASM=`meryl statistics $asm.solid.meryl | head -n3 | tail -n1 | awk '{print $2}'`
echo -e "${asm}\tall\t${ASM}\t${TOTAL}" | awk '{print $0"\t"((100*$3)/$4)}' >> completeness.stats

If we let $dmult to be the full-depth diploid peak multiplicity, then the above lines could be modified to instead be:

TOTAL=`meryl statistics $read_solid | tail -n +11 | awk 'BEGIN{s=0} {s+=($1*$2)} END{print s}'`
ASM=`meryl statistics $asm.solid.meryl | tail -n +11 | awk 'BEGIN{s=0} {s+=($1*$2)} END{print s}'
echo -e "${asm}\tall\t${ASM}\t${TOTAL}" | awk -v d=$dmult '{print $0"\t"((100*d*$3)/$4)}' >> completeness.stats

I hope this is useful.

Best, Jessen

arangrhie commented 4 years ago

Hi Jessen,

Thanks for your thoughtful feedbacks.

i) Larger genome size than expected Currently, as you might have noticed, the completeness score is computed based on 'distinct' k-mers, with no weights on the ploidy. The TOTAL k-mers you find will therefore be larger than your expected genome size if the heterozygous k-mers take a larger fraction than the k-mers from the repeats. This typically is seen on genomes with high level of heterozygosity.

ii) Including ploidy for weighting I agree the proper way is to take the ploidy into account when estimating completeness. We thought about this for a while, but decided to leave it as a future work as it is difficult to extend the weighting properly for repeats. Distinguishing true k-mers from errors in low frequency is another issue.

I'll wrap my head around for this, marking your suggestion as 'enhancement'. For some genomes, it is not so easy to automatically identify the 'diploid' peak and requires manual inspection. I will probably implement this as a separate script, taking the 'diploid' peak as an option.

Thanks again! Arang

bredeson commented 4 years ago

Hi @arangrhie, Just a quick note regarding point (i): The plant genome I'm working with is three generations in-bred and was confirmed to be ~93% homozygous by multiple methods, so heterozygosity can only be a minor contributor in this case. There was a whole-genome duplication ~40Mya and gives rise to a 4x peak, but is relatively minor at a 21-mer level. The Illumina library is a PCR-free prep with plastid and mitochondrial sequences removed. I'm attaching my spectra plot (see below).

Regarding (ii): My past experience trying to do this in an automated way leads me to agree with you. The methods implemented in GenomeScope2 are the best I've seen in this regard.

cassava-v6.merqury.cassava-v6.spectra-cn.fl.pdf

arangrhie commented 4 years ago

Hi @bredeson,

Thanks for sharing your spectra-cn plot! This helped me understand your assembly. Regarding (i) - it seems like your completeness got affected by the little missing k-mer hump (black) at ~37x (het peak) and 76x (2-copy peak). In overall, it looks like your k-mer copies show a nice haploid representation of your plant genome. My point regarding heterozygosity is that in (pseudo) haploid assemblies, we expect the missing het peak to be shown, which will affect the completeness. Even in highly inbred genomes, there are always some part of the genomes that are still remaining heterozygous, such like the MHC cluster. But, this all comes from my limited experiences in (relatively easy) diploid vertebrate genomes. I'm sure there will be interesting genomes behaving differently.

As a side note: If you are interested to investigate further on the missing k-mers, you could pull out the reads having those missing k-mers and do a local assembly.

meryl difference read.meryl asm.meryl output read.0.meryl
meryl greater-than X read.0.meryl output missing.meryl
meryl-lookup -include -sequence <sequence.fasta> -mers missing.meryl > missing.fasta

missing.fasta will contain reads having the missing k-mers with >X multiplicity from your spectra-cn plot. The missing sequences could probably coming from the other haplotype, but who knows.

And (ii) - I agree GenomeScope2 is a very nice tool in this regard. Combining it with Merqury might be a nice future work. Thanks for your hint.

Arang

marbl / merqury

Merqury over-estimates genome completeness #2