Whether high ploidy affects the assessment of completeness

jinxin112233 commented 3 years ago

Hi

We evaluated a diploid genome: the QV is ~60 and the k-mer completeness is ~95%. However, when we evaluated a high-ploidy genome (same species, different ploidy), the QV was also ~60, but the k-mer completeness was only 70-80%.

On one hand, we think that assembling a high-ploidy genome is more difficult than assembling a diploid genome, so the resulting assembly is of lower quality and the k-mer completeness suffers. On the other hand, Merqury is mainly used to evaluate diploid genomes. Can Merqury perform well when evaluating high-ploidy genomes?

Best JX

arangrhie commented 3 years ago

Hello JX,

The spectra-cn plot will show in black the missing kmers that affected your completeness. It might be better to check that first. Also, there will be a file named *.filt. It holds the cutoff used to disregard low-frequency kmers. This cutoff sometimes does not work for polyploid genomes, so it is best to double-check both the spectra-cn plot and this cutoff.

Thanks, Arang
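
To make the role of the cutoff concrete, here is a minimal sketch of how k-mer completeness can be computed: the fraction of "solid" read k-mers that are also found in the assembly. The function name, the toy counts, and the rule that a k-mer is solid when its count is strictly greater than the cutoff are assumptions for illustration, not Merqury's actual code.

```python
def kmer_completeness(read_counts, assembly_kmers, cutoff):
    """Fraction of solid read k-mers that also occur in the assembly."""
    # Assumed rule: a k-mer is "solid" if it is seen more than `cutoff`
    # times in the reads; lower-count k-mers are treated as likely errors.
    solid = {kmer for kmer, count in read_counts.items() if count > cutoff}
    if not solid:
        raise ValueError("no k-mers pass the cutoff")
    found = solid & set(assembly_kmers)
    return len(found) / len(solid)


# Toy data, invented for illustration only.
read_counts = {"AAA": 30, "AAC": 28, "ACG": 15, "CGT": 14, "GTT": 2, "TTT": 1}
assembly_kmers = {"AAA", "AAC", "ACG"}

low = kmer_completeness(read_counts, assembly_kmers, cutoff=1)
high = kmer_completeness(read_counts, assembly_kmers, cutoff=7)
print(f"cutoff=1: {low:.2f}, cutoff=7: {high:.2f}")
```

A cutoff that is too low lets erroneous k-mers into the denominator and pulls completeness down; a cutoff that is too high can discard real low-coverage k-mers, which is why the cutoff is worth checking against the plot.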

jinxin112233 commented 3 years ago

Hi Arang, thank you for your help. Here is the spectra-cn plot.

[Image: out_prefix spectra-asm fl]

And the cutoff value is 7. Do we need to adjust the cutoff value?

Best JX

arangrhie commented 3 years ago

Hi JX,

The cn plot looks good, and 7 seems to be a reasonable cutoff. It seems like a highly heterozygous genome?

If you'd like to adjust it, you could check where the hap2 curve starts to exceed the read-only counts in spectra-asm.hist and use that as the cutoff. It is the low sequence coverage that makes it hard to distinguish the erroneous kmers from the peaks.

Looks like a nice assembly.

Best, Arang
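
The cutoff check described above could be scripted roughly as follows. This is a sketch under stated assumptions: it assumes spectra-asm.hist is a tab-separated table with a label column, a k-mer multiplicity column, and a count column, and that the labels of interest are `read-only` and `hap2`. The function name `suggest_cutoff` and the numbers are hypothetical; check the column layout and labels in your own file before relying on it.

```python
import csv
import io
from collections import defaultdict


def suggest_cutoff(hist_text, asm_label, read_only_label="read-only"):
    """Return the lowest multiplicity where the assembly count exceeds
    the read-only count, or None if that never happens."""
    counts = defaultdict(dict)  # multiplicity -> {label: count}
    reader = csv.reader(io.StringIO(hist_text), delimiter="\t")
    for row in reader:
        # Skip the header line and anything that is not label/int/int.
        if len(row) != 3 or not row[1].isdigit() or not row[2].isdigit():
            continue
        counts[int(row[1])][row[0]] = int(row[2])
    for mult in sorted(counts):
        asm = counts[mult].get(asm_label, 0)
        read_only = counts[mult].get(read_only_label, 0)
        if asm > read_only:
            return mult
    return None


# Toy histogram, invented for illustration (assumed column layout).
rows = [
    ("Assembly", "kmer_multiplicity", "Count"),
    ("read-only", 1, 900), ("read-only", 2, 400),
    ("read-only", 3, 120), ("read-only", 4, 40),
    ("hap2", 1, 5), ("hap2", 2, 20),
    ("hap2", 3, 90), ("hap2", 4, 300),
]
hist_text = "\n".join("\t".join(map(str, r)) for r in rows)

cutoff = suggest_cutoff(hist_text, asm_label="hap2")
print("suggested cutoff:", cutoff)
```

With real data you would read the file contents into `hist_text` and compare the suggested value with the one stored in the *.filt file and with the shape of the plot.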

jinxin112233 commented 3 years ago

Hi Arang, thank you for your quick reply and suggestion. Yes, it is a highly heterozygous genome. We estimated its heterozygosity at ~3% with GenomeScope. The NG50, LAI value, BUSCO completeness, and collinearity look good, but the k-mer completeness alone does not, so we want to explain why. Is it because the sequencing depth for this heterozygous genome is not enough, so that the low coverage makes it hard to distinguish the erroneous k-mers from the peaks? Or is there another explanation? (HiFi ~30x, Illumina ~100x; similar results were obtained with both types of data.) Any other suggestions?

Best, JX

arangrhie commented 3 years ago

Just catching up now. Are the cn plots you shared from the Illumina 100x data? Do you have the plots from HiFi? Is there a chance that the kmer set contains contaminants or organelle sequences not included in (or removed from) the assembly? If you could share the spectra-cn.hist files, I could take a brief look to see where the majority of the missing kmers are coming from. I am speculating that the high-copy region may have had some sequences not present in the assembly.

jinxin112233 commented 3 years ago

Hi, thank you for your help. Here is another genome that has the same problem. Here is the file: file.zip

Best JX

arangrhie commented 3 years ago

I see the problem: you would need to report the completeness for both (98.3 ~ 98.8%). The kmers used for evaluation contain all kmers from the diploid genome. The missing 22~24% likely reflects the portion belonging to the other haplotype, given the high overall completeness for 'both' haplotypes.

Congrats again, seems like a very decent assembly.

Cheers, Arang
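
The reasoning above can be illustrated with a small set-based sketch. The read k-mer set covers both haplotypes, so each haplotype assembly on its own misses the k-mers specific to the other haplotype, while the union of the two recovers almost everything. All numbers below are invented to roughly resemble the situation in this thread; they are not taken from the shared files.

```python
# Integers stand in for distinct solid k-mers from the read set.
read_kmers = set(range(100))

shared = set(range(0, 55))       # homozygous k-mers, present in both haplotypes
hap1_only = set(range(55, 77))   # heterozygous k-mers unique to haplotype 1
hap2_only = set(range(77, 99))   # heterozygous k-mers unique to haplotype 2
# k-mer 99 is deliberately missing from both assemblies.

hap1 = shared | hap1_only
hap2 = shared | hap2_only


def completeness(assembly_kmers):
    """Fraction of read k-mers that are present in the assembly."""
    return len(assembly_kmers & read_kmers) / len(read_kmers)


hap1_pct = 100 * completeness(hap1)
hap2_pct = 100 * completeness(hap2)
both_pct = 100 * completeness(hap1 | hap2)
print(f"hap1: {hap1_pct:.1f}%  hap2: {hap2_pct:.1f}%  both: {both_pct:.1f}%")
```

In this toy case each haplotype scores 77% while the combined set scores 99%, which mirrors why the per-haplotype completeness looks low for a highly heterozygous genome even when the assembly as a whole is nearly complete.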

jinxin112233 commented 3 years ago

Hi Arang

Wow, great! It seems like a very good assembly now. Your patient answers helped me understand where my problem was.

Thank you for taking the time to solve my problem. Best, JX

marbl/merqury, issue #46