marbl / merqury

k-mer based assembly evaluation

How to get the best k for generating read-db.meryl? #101

Closed cjchen5 closed 1 year ago

cjchen5 commented 1 year ago

Hi, to generate read-db.meryl, I found that the wiki (https://github.com/marbl/merqury/wiki/1.-Prepare-meryl-dbs) mentions best_k.sh, but this seems to consider only genome size. However, the paper "Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies" says:

When differentiating the histogram, we compute the slopes and the first k-mer multiplicity with a positive slope defines the reliable k-mer threshold. Examples of these cutoffs are shown as dashed lines in Fig. 3c, d.

Does this need sequencing reads? If so, how should I apply sequencing reads with 'merqury' when I calculate best k?

Thanks!

arangrhie commented 1 year ago

Hello,

The "best_k.sh" script helps find the minimum k size for generating read-db.meryl, given the genome size.
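As a rough illustration of why only the genome size is needed here: the Merqury paper gives the k-mer size as k = log4(G(1 - p) / p), where G is the genome size and p is a tolerable k-mer collision rate. The sketch below assumes that formula and a default p of 0.001; the function name `best_k` is hypothetical, and this is not the script itself.

```python
import math


def best_k(genome_size: int, collision_rate: float = 0.001) -> float:
    """Estimate the minimum k for a genome of `genome_size` bases.

    Assumes k = log4(G * (1 - p) / p), with p the tolerable collision rate.
    The default p = 0.001 is an assumption, not read from best_k.sh.
    """
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    if not 0.0 < collision_rate < 1.0:
        raise ValueError("collision_rate must be between 0 and 1")
    return math.log(genome_size * (1.0 - collision_rate) / collision_rate, 4)


if __name__ == "__main__":
    k = best_k(3_100_000_000)  # a roughly human-sized genome
    # Round up to the next integer k; this gives 21 for ~3.1 Gb.
    print(f"estimated k = {k:.2f}, use k = {math.ceil(k)}")
```

No reads are involved at this step; the reads only come in when counting k-mers with the chosen k.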

The "k-mer threshold" is a cutoff for extracting a reliable k-mer subset from read-db.meryl. Merqury determines this threshold automatically from the k-mer histogram of read-db.meryl.
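To make the quoted description concrete, here is a minimal sketch of picking the first multiplicity with a positive slope from a k-mer histogram. It is not Merqury's actual code: the function name `reliable_kmer_threshold`, the input layout (a sorted list of multiplicity and count pairs), the choice to report the lower multiplicity of the rising pair, and the toy numbers are all assumptions for illustration.

```python
from typing import List, Optional, Tuple


def reliable_kmer_threshold(hist: List[Tuple[int, int]]) -> Optional[int]:
    """Return the first multiplicity at which the histogram starts rising.

    `hist` holds (multiplicity, count) pairs sorted by multiplicity.
    The slope at multiplicity m is taken as count(next) - count(m);
    the first m with a positive slope is returned, or None if the
    histogram never rises.
    """
    for (mult, count), (_, next_count) in zip(hist, hist[1:]):
        if next_count - count > 0:
            return mult
    return None


if __name__ == "__main__":
    # Toy histogram: an error peak at low multiplicity, then a rise
    # toward the main coverage peak.
    toy_hist = [(1, 1000), (2, 300), (3, 80), (4, 50), (5, 90), (6, 200), (7, 150)]
    print(reliable_kmer_threshold(toy_hist))  # slopes: -700, -220, -30, +40 -> 4
```

A histogram without a clear dip between the error k-mers and the main peak would return None here, which loosely corresponds to the "unexpected histogram" case where the automatic cutoff may need a manual check.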

So yes, you need sequencing reads to generate read-db.meryl. Once you know the "best k", prepare the meryl dbs with that k using meryl count, as described in the document. Merqury will do the rest in most cases, unless the histogram is somewhat unexpected.

Thanks, Arang