marbl / merqury

k-mer based assembly evaluation

Help with interpreting spectra-cn and spectra-asm plots with Illumina WGS reads #128

Closed kaede0e closed 3 weeks ago

kaede0e commented 3 weeks ago

Hi authors, I am currently working on a haplotype-resolved de novo genome assembly for a diploid plant, using PacBio HiFi and Hi-C reads. I wanted to check the assembly quality with Merqury because my BUSCO scores were lower than I had hoped (C: 92.4-92.6% for each haplotype, with no improvement for Hap1+Hap2 at C: 92.6%), so we acquired some WGS Illumina data from our collaborators to run the program. I was able to generate the k-mer plots; however, they don't look normal.

[Figures: spectra-asm stacked (st) plot and spectra-cn stacked (st) plot for Nettle_female_Round_2_asm_trimmomatic_meryl_output_axis_adjusted]

My interpretation of the spectra-cn stacked (-cn.st) plot is that the average homozygous coverage is ~22X, so I should expect a heterozygous peak around ~11X. But I don't understand why there is another peak, roughly overlapping the heterozygous coverage peak, for the k-mers that appear twice in the assembly (blue region). Do you have an explanation for why I am observing two peaks in the 2-copy k-mers? In the -asm.st plot there also appear to be two peaks in the shared k-mers (green), which is also concerning.

At first I thought a high level of heterozygosity might explain the lower-coverage shoulder on the k-mer plot, but the shoulder is quite pronounced (it is actually a higher peak than the homozygous peak), so I wanted to hear your advice on this interpretation.

Thank you for your help, Kaede

arangrhie commented 3 weeks ago

Hello @kaede0e ,

Can you post the spectra-asm and spectra-cn line (.ln) plots? The line plots are better for looking at the distribution.

arangrhie commented 3 weeks ago

Anyway, to me it looks like completeness is not much to worry about. If something were missing, you'd see a peak in the read-only (black) area. If the BUSCO database does not closely match your species of interest, the completeness metric can be inaccurate because of the gene sets it uses. Alternatively, you could use the completeness % produced by Merqury.

As for false duplication: if you still see a green (shared) peak at ~11x in the spectra-asm line plot and a blue peak at ~11x in the spectra-cn line plot, that would be concerning. It might just be bleed-in from the 2nd peak that looks exaggerated in the stacked (st) plot because of the high level of heterozygosity (composed of true 1-copy k-mers). This is why I prefer not to look at these stacked plots.

Best, Arang
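
Editorial sketch (not part of the original reply): the stacking effect described above can be reproduced with a toy model. The sketch below assumes Poisson-distributed k-mer multiplicities and uses made-up k-mer counts (`N_ONE_COPY`, `N_TWO_COPY`); only the ~22x and ~11x coverages come from the thread. It does not read Merqury output, and real spectra are usually broader than a Poisson model suggests.

```python
import math

def poisson_pmf(k, lam):
    """Poisson probability of observing multiplicity k, computed in log space."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

HOM_COV = 22            # homozygous (2-copy) coverage reported in the thread
HET_COV = HOM_COV / 2   # expected heterozygous (1-copy) coverage, ~11x

# Hypothetical numbers of distinct k-mers per copy-number class (assumption:
# a highly heterozygous genome, so 1-copy k-mers outnumber 2-copy k-mers).
N_ONE_COPY = 3_000_000
N_TWO_COPY = 1_000_000

multiplicities = list(range(1, 61))
one_copy = [N_ONE_COPY * poisson_pmf(k, HET_COV) for k in multiplicities]
two_copy = [N_TWO_COPY * poisson_pmf(k, HOM_COV) for k in multiplicities]

# In a stacked plot the upper edge of the 2-copy band is the SUM of the
# curves beneath it, not the 2-copy curve itself.
stacked_top = [a + b for a, b in zip(one_copy, two_copy)]

def peak_multiplicity(curve):
    """Return the k-mer multiplicity at which the curve is highest."""
    best_index = max(range(len(curve)), key=lambda i: curve[i])
    return multiplicities[best_index]

print("2-copy curve alone peaks near:", peak_multiplicity(two_copy))
print("upper edge of stacked 2-copy band peaks near:", peak_multiplicity(stacked_top))
```

Under these assumptions the 2-copy curve on its own peaks near the homozygous coverage, while the upper edge of the stacked 2-copy band peaks near the heterozygous coverage, because the large 1-copy peak underneath pushes it up. A line plot draws each curve from zero, so this apparent peak disappears.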

kaede0e commented 3 weeks ago

Hello Arang,

Thank you so much for the quick reply. Ah, okay, now that I am looking at the line plots, it looks like the 11X peak was inflated by the stacking. I didn't realize the stacked plot was actually stacking the curves; this looks a lot better now.

[Figures: spectra-asm line (ln) plot and spectra-cn line (ln) plot for Nettle_female_Round_2_asm_trimmomatic_meryl_output_axis_adjusted]

The completeness score I got from Merqury for Hap1+Hap2 was 98%, so I now feel comfortable saying that the assembly is reasonably complete.

Thank you so much, Kaede
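
Editorial sketch (not part of the thread): k-mer completeness of the kind mentioned above is generally described as the fraction of reliable ("solid") read k-mers that are also present in the assembly. The toy example below illustrates that idea on short strings. The function names, the `min_read_count` threshold, and the sequences are all hypothetical, and it ignores canonical (reverse-complement) k-mers; it is not how Merqury computes the value on real data.

```python
from collections import Counter

def count_kmers(sequences, k):
    """Count every k-mer (forward strand only) across a list of sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

def kmer_completeness(read_counts, assembly_counts, min_read_count=2):
    """Fraction of solid read k-mers that also occur in the assembly.

    A k-mer is treated as solid if it is seen at least `min_read_count`
    times in the reads (an assumed, simplified error filter).
    """
    solid = {kmer for kmer, c in read_counts.items() if c >= min_read_count}
    if not solid:
        raise ValueError("no solid k-mers found in the reads")
    found = sum(1 for kmer in solid if kmer in assembly_counts)
    return found / len(solid)

# Toy data: the third read contributes only singleton k-mers, which are
# filtered out as likely errors.
reads = ["ACGTAC", "ACGTAC", "TTGCA"]
assembly = ["ACGTA"]

read_counts = count_kmers(reads, k=4)
assembly_counts = count_kmers(assembly, k=4)
completeness = kmer_completeness(read_counts, assembly_counts)
print(f"k-mer completeness: {completeness:.3f}")
```

Here three k-mers are solid (ACGT, CGTA, GTAC) and the assembly contains two of them, so the toy completeness is 2/3. Unlike BUSCO, a measure like this depends only on the reads and the assembly, not on a reference gene set.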