refresh-bio / KMC

Fast and frugal disk based k-mer counter
266 stars 73 forks source link

Extending `MAX_K` #156

Closed jamshed closed 3 years ago

jamshed commented 3 years ago

Hello @marekkokot, thanks again for the awesome library! So @rob-p and I are back with another issue: it's stated in the README that

If needed, you can also redefine maximal length of k-mer, which is 256 in the current version.

So, we were looking to use some larger k-values, by setting the MAX_K macro to 512, here. But, KMC doesn't compile with the altered MAX_K, with some array subscript out-of-bound error for the BUFFER_WIDTHS array (defined here). The errors make sense as with a larger MAX_K, some indexings into the BUFFER_WIDTHS array definitely go out-of-bound, here.

At a glance, it seems that defining this BUFFER_WIDTH array with a larger size and appropriately set values should be able to solve the issue. We'd like to have the option to use k-values higher to 256, and it will be great if you may have a look into this!

Regards.

marekkokot commented 3 years ago

Hello, you are right, I must fix this issue. The problem is related to the sorting algorithm, in fact, I have a newer version of it, which works for any buffer size, but it is not quite well suited to work inside kmc. I must transfer the technique used there to the version used in KMC to fix this issue. It is rather straightforward and simple, but I am afraid for now I have some other responsibilities. Nevertheless, I would appreciate if you could try to use the following values:

constexpr int32_t BUFFER_WIDTHS[] = { -1, 32, 16, 16, 8, 8, 4, 8, 4, 8, 4, 8, 2, 8, 4, 8, 1 };

Let me know if it worked. If you need even greater values of k append the last 8 values of the array to its end.

Thanks for using KMC and reporting this issue. Regards.