gmarcais / Jellyfish

A fast multi-threaded k-mer counter

If counter-len is one, can jellyfish tell apart k-mers with one or more counts? #57

Closed endrebak closed 8 years ago

endrebak commented 8 years ago

It sounds like it should not be able to, since a one-bit counter can only distinguish 0 from 1 (I'd test it myself, but our server is down...).

gmarcais commented 8 years ago

It is a little more complicated than that. In case of a counter overflow, more than one entry in the hash table will be used for a k-mer. So there is no real upper bound on the count associated with a given k-mer, regardless of counter-len (as long as counter-len is > 0). counter-len == 0 can be used internally to represent a set.

In practice, you should use a counter-len large enough to accommodate the count of most of your k-mers. For example, if 2^counter-len > 2 * coverage, few k-mers will need 2 entries in the hash.
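As a rough illustration of that rule of thumb, here is a minimal Python sketch that finds the smallest counter-len satisfying 2^counter-len > 2 * coverage. The function name `min_counter_len` is made up for this example and is not part of Jellyfish; it only does the arithmetic and assumes you already have an estimate of your sequencing coverage.

```python
def min_counter_len(coverage):
    """Smallest counter length (in bits) with 2**bits > 2 * coverage.

    Hypothetical helper for illustration only; not a Jellyfish function.
    """
    if coverage <= 0:
        raise ValueError("coverage must be positive")
    bits = 1  # counter-len must be > 0 for counting
    while 2 ** bits <= 2 * coverage:
        bits += 1
    return bits


if __name__ == "__main__":
    for cov in (1, 30, 32, 100):
        print(cov, min_counter_len(cov))
```

For 30x coverage this gives 6 bits, since 2^6 = 64 > 60. Smaller values still work, as noted, but more k-mers would then need a second hash entry.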

It will work whatever the setting used for counter-len, but you could pay a price in speed and memory consumption.

HTH.

endrebak commented 8 years ago

That seems to fit with the testing I originally did (counter-len 1 took twice as long as the default settings).

By the way, are Jellyfish's k-mer counts exact? I compute the effective genome fraction as the number of unique k-mers in the genome divided by the genome length, and I get somewhat different results from the published ones (perhaps the published results from 2009/10 are inexact).
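To make that calculation concrete, here is a small pure-Python sketch of it on a toy sequence. It is only an assumption about what the computation looks like: "unique" can mean either distinct k-mers or k-mers occurring exactly once, so the hypothetical `once_only` flag covers both readings. It does not use Jellyfish, does not treat a k-mer and its reverse complement as the same, and does not skip k-mers containing N, any of which could explain a difference from published numbers.

```python
from collections import Counter


def effective_genome_fraction(seq, k, once_only=False):
    """Unique k-mers divided by sequence length (illustrative sketch).

    once_only=False counts distinct k-mers;
    once_only=True counts only k-mers that occur exactly once.
    """
    seq = seq.upper()
    # Exact count of every k-mer on the forward strand.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    if once_only:
        unique = sum(1 for c in counts.values() if c == 1)
    else:
        unique = len(counts)
    return unique / len(seq)


if __name__ == "__main__":
    toy = "ACGTACGA"  # 3-mers: ACG, CGT, GTA, TAC, ACG, CGA
    print(effective_genome_fraction(toy, 3))
    print(effective_genome_fraction(toy, 3, once_only=True))
```

On the toy sequence the two readings already differ (5 distinct 3-mers versus 4 that occur once), so it is worth checking which definition the published figures use.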

gmarcais commented 8 years ago

The counts are supposed to be exact. I do hope I don't have a bug that lasted all these years.

endrebak commented 8 years ago

I have no reason to think (any potential) mistakes are on your side. Thanks for the software!