marbl / merqury

k-mer based assembly evaluation

NGS reads or HiFi reads: which is better for running Merqury? #31

Open Simon-Huang1 opened 3 years ago

Simon-Huang1 commented 3 years ago

Hi, I assembled a genome using HiFi reads, but I am unsure whether NGS reads or HiFi reads are better for evaluating its quality. I think NGS reads are biased in high-GC and repetitive regions, which may make them less than ideal for evaluating a genome, but the Merqury cookbook only explains how to run Merqury with NGS reads. So my question is: which are better for running Merqury, NGS reads or HiFi reads?

arangrhie commented 3 years ago

Hi @Simon-Huang1, although it is known that Illumina reads have GC biases, HiFi reads are still somewhat error-prone in homopolymers, which affects a larger number of k-mers. You could build a hybrid k-mer db, combining k-mers from both Illumina and HiFi, for measuring QV. I would still recommend using Illumina for the other metrics, such as completeness and spectrum analysis.

ASLeonard commented 3 years ago

Hi Arang, to follow up on this: do you recommend the HiFi+Illumina hybrid db only for QV, while using an Illumina-only db for the other metrics? After a quick check, Illumina only has QV50, while hybrid is closer to QV65. Maybe this is due to the higher filtering threshold (gt3 for Illumina but gt16 for hybrid), or perhaps HiFi biases in the assembly are now also present in the HiFi-biased db?

hybrid   29689  3067099605  63.3635 4.60946e-07
illum_only    1788932   3067099605  45.5623 2.77822e-05

This also relates to the new Merfin best practices, where Illumina reads are suggested for the db. I guess in this context, too, Illumina only is preferred over hybrid?
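
Editorial aside: the two QV rows in the comment above can be reproduced with a few lines of arithmetic. This is a minimal sketch, not Merqury's own code. It assumes the second column is the number of k-mers found only in the assembly, the third column is the total number of k-mers in the assembly, and k = 21; the thread does not state k, but 21 reproduces the posted values. The function name `merqury_style_qv` is hypothetical. The formula used is the one from the Merqury paper: the per-base error rate is 1 - (1 - asm_only / total)^(1/k), and QV is -10 * log10(error).

```python
import math

def merqury_style_qv(asm_only_kmers, total_asm_kmers, k):
    """Return (qv, error_rate) from k-mer counts (hypothetical helper)."""
    # Fraction of assembly k-mers that have no support in the read db.
    frac = asm_only_kmers / total_asm_kmers
    # Per-base error: 1 - (1 - frac)^(1/k), written with log1p/expm1
    # so that very small fractions do not lose precision.
    error = -math.expm1(math.log1p(-frac) / k)
    qv = -10 * math.log10(error)
    return qv, error

# Counts copied from the two rows posted in the thread; k=21 is an assumption.
for name, asm_only in [("hybrid", 29689), ("illum_only", 1788932)]:
    qv, err = merqury_style_qv(asm_only, 3067099605, k=21)
    print(f"{name}\t{qv:.4f}\t{err:.5e}")
```

Under these assumptions the gap between the two databases comes entirely from the asm-only k-mer count (29689 versus 1788932), since the total is identical in both rows.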

arangrhie commented 3 years ago

Hi @ASLeonard ,

I assume the illum_only also comes from a filtered version? It is not possible to entirely remove biases; I'd say provide QVs measured from each platform and from the hybrid set. Ultimately it is an estimate from a given sequencing platform, and all the data at least support that the assembly is of high quality.

For Merfin, I appreciate your very fast access :) I just posted it and you are asking about it the next day! Merfin relies on k-mer multiplicity, so it is difficult to measure this reliably from a hybrid set. Although Illumina is known for its GC biases, homopolymer / microsatellite (simple tandem) errors in HiFi affected the k-mer spectrum more broadly across the genome, making it harder to obtain accurate copy number estimates than with Illumina. So yes, I would say use Illumina for Merfin.

Merqury's QV is less affected as it does not account for the expected multiplicity.
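
Editorial aside: a toy illustration of that last point, assuming the QV is driven only by whether each assembly k-mer is present in the (already filtered) read db. This is not Merqury's or meryl's implementation; the sequences, counts, and helper names (`kmers`, `asm_only`) are made up for illustration, and the hybrid db is modeled simply as the union of the two k-mer sets.

```python
def kmers(seq, k):
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def asm_only(assembly_kmers, read_db):
    """Count assembly k-mers that are absent from the read k-mer db."""
    return sum(1 for km in assembly_kmers if km not in read_db)

k = 3
assembly = kmers("ACGTTGCA", k)

# Toy read dbs: k-mer -> multiplicity (invented numbers).
illumina = {"ACG": 30, "CGT": 28, "GTT": 31}
hifi = {"CGT": 12, "GTT": 15, "TTG": 14, "TGC": 13}

# Hybrid db modeled as the union of the k-mers seen on either platform.
hybrid = set(illumina) | set(hifi)

# Doubling every multiplicity leaves the presence/absence answer unchanged.
illumina_doubled = {km: 2 * c for km, c in illumina.items()}

print(asm_only(assembly, illumina))          # unsupported by Illumina
print(asm_only(assembly, illumina_doubled))  # same count as the line above
print(asm_only(assembly, hifi))              # unsupported by HiFi
print(asm_only(assembly, hybrid))            # unsupported by either
```

In this sketch the union can only add supporting k-mers, so the hybrid db never reports more asm-only k-mers than either platform alone. A multiplicity-based tool such as Merfin, by contrast, would see different counts for the same k-mer in each db, which is consistent with the advice above to use a single platform there.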