marbl / merqury

k-mer based assembly evaluation

Low completeness, high QV #90

Closed mpalmada closed 1 year ago

mpalmada commented 1 year ago

Hi Arang,

I have a lot of Illumina data, which leads to a large number of read-only k-mers. This lowers my completeness even though I have a high QV. Is there a way to standardize the completeness by the sequencing depth?

Thanks a lot!

Marc

arangrhie commented 1 year ago

Hello Marc,

High QV is not necessarily correlated with high completeness. QV is computed from the k-mers present only in the **assembly** plus the k-mers present in both assembly and reads, while completeness is computed from the k-mers present only in the **reads** plus the k-mers present in both assembly and reads.
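As a toy illustration of why the two metrics can diverge, here is a minimal Python sketch. It is not Merqury's code: the function names, the tiny sequences, and the use of distinct, non-canonical k-mers are all assumptions made for the example. The QV formula follows the k-mer survival model as I understand it from the Merqury paper, so check the paper for the exact definition.

```python
import math

def kmer_set(seq, k):
    """Distinct k-mers of seq (toy: no canonical k-mers, no multiplicity)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def toy_qv(asm_kmers, read_kmers, k):
    """Phred-scaled QV from the share of assembly k-mers absent from the reads."""
    asm_only = len(asm_kmers - read_kmers)
    if asm_only == 0:
        return math.inf  # every assembly k-mer is supported by the reads
    p_base_correct = (1 - asm_only / len(asm_kmers)) ** (1 / k)
    return -10 * math.log10(1 - p_base_correct)

def toy_completeness(asm_kmers, read_kmers):
    """Fraction of read k-mers that are also found in the assembly."""
    return len(asm_kmers & read_kmers) / len(read_kmers)

K = 5
genome = "ACGTTGCATGGCTAAGCTTACGGATCCA"  # hypothetical "true" sequence
assembly = genome[:20]                    # error-free, but the end is missing

asm_kmers = kmer_set(assembly, K)
clean_reads = kmer_set(genome, K)
# Two made-up read-only k-mers standing in for sequencing errors.
noisy_reads = clean_reads | {"AAAAA", "CCCCC"}

qv_value = toy_qv(asm_kmers, noisy_reads, K)
comp_clean = toy_completeness(asm_kmers, clean_reads)
comp_noisy = toy_completeness(asm_kmers, noisy_reads)

print("QV:", qv_value)
print("completeness, clean reads:", round(comp_clean, 3))
print("completeness, noisy reads:", round(comp_noisy, 3))
```

The assembly contains no unsupported k-mers, so the toy QV is unbounded, yet completeness is below 1 and drops further as read-only k-mers are added.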

In most cases, it is reasonable to ignore k-mers with frequency=1. That is, filter the read set with

meryl greater-than 1 reads.meryl output reads_filt.meryl

and re-run Merqury with reads_filt.meryl.
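To see the effect of this filter on completeness, the following Python sketch mimics it on a toy read set. Everything here is hypothetical (the sequences, the helper names, the non-canonical k-mer counting), and it only imitates what a "count greater than 1" filter does, not how meryl implements it.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer occurrence across all reads (toy: no canonical k-mers)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def greater_than(counts, threshold):
    """Keep only k-mers whose count is strictly greater than threshold."""
    return {kmer for kmer, n in counts.items() if n > threshold}

def completeness(asm_kmers, read_kmers):
    """Fraction of read k-mers that are also found in the assembly."""
    return len(asm_kmers & read_kmers) / len(read_kmers)

K = 5
genome = "ACGTTGCATGGCTAAGCTTACGGATCCA"            # hypothetical sequence
error_read = genome[:10] + "A" + genome[11:]       # one substitution error
reads = [genome, genome, genome, error_read]       # 3 clean reads + 1 with an error

counts = count_kmers(reads, K)
asm_kmers = set(count_kmers([genome], K))          # assembly == genome in this toy

before = completeness(asm_kmers, set(counts))             # all read k-mers
after = completeness(asm_kmers, greater_than(counts, 1))  # singletons removed

print("completeness before filtering:", round(before, 3))
print("completeness after filtering: ", round(after, 3))
```

The single substitution creates a handful of k-mers seen only once. They count against completeness until the singleton filter removes them.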

If the coverage is high enough, the k-mer spectrum (spectra-cn) usually shows a clear distinction between the low-coverage erroneous region and the 1-copy region. You may increase the cutoff for filtering out low-frequency k-mers; however, this is not generalizable, as sequencing depth and error profile vary among sequencing runs. Also keep in mind that the chance of discarding a true k-mer increases as the cutoff increases.
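One common heuristic for picking such a cutoff is to look for the first valley in the read k-mer histogram, between the error peak and the 1-copy peak. The Python sketch below assumes a histogram given as a frequency-to-count mapping. The function name and the numbers are made up, and this is not something Merqury does for you, so always check the result against the spectra-cn plot.

```python
def first_valley(hist):
    """Return the first frequency after which the k-mer count starts rising again.

    hist maps k-mer frequency -> number of distinct k-mers with that frequency.
    Returns None if the counts never rise (no visible valley).
    """
    freqs = sorted(hist)
    for prev, cur in zip(freqs, freqs[1:]):
        if hist[cur] > hist[prev]:
            return prev
    return None

# Hypothetical histogram: error k-mers pile up at low frequency,
# and the 1-copy peak sits around frequency 8.
hist = {1: 1000, 2: 300, 3: 80, 4: 40, 5: 60, 6: 120, 7: 200, 8: 260, 9: 210, 10: 130}

valley = first_valley(hist)
print("first valley at frequency:", valley)
```

If no valley is visible, coverage is probably too low for a cutoff above 1 to be safe.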

Best, Arang

arangrhie commented 1 year ago

Closing this for now. Feel free to re-open if you need more help!