refresh-bio / KMC

Fast and frugal disk based k-mer counter

Dumping minimizers/superKmers/signatures #188

Open mr-eyes opened 2 years ago

mr-eyes commented 2 years ago

Hi,

Is there a way to dump the canonical minimizers or superKmers without having to reach the final stage? I have read the API docs but didn't find an exposed class to achieve it. I would appreciate any leads on that!

Thank you,

marekkokot commented 2 years ago

Hi,

I am not sure I understand exactly what you want. Do you want, for example, to run on 10% of the input data and get all minimizers found with their counts, or something else? Some clarification would be great. Currently there is no simple way to do this, I mean neither through the C++ API nor through the CLI, although you may try to modify the code (which may not be very easy and could be a little time-consuming). Anyway, we are planning to refactor some parts of KMC, and maybe also extend its API, so we are looking for suggestions.

Best, Marek

mr-eyes commented 2 years ago

Hi @marekkokot, thanks for the prompt reply!

Well, I can dig into the code, but yes, an extended API together with a refactoring of the current code would make things much easier. What I want to do is simply use KMC as a canonical minimizer extractor (that is, without getting into the k-mer counting step). I am only interested in the minimizers/superKmers, to use them in other processing. I know that KMC isn't being developed for that purpose, but I believe the engineering effort put into KMC would make it a very fast k-mer/minimizer extraction tool.

I hope I made it clearer this time :)

marekkokot commented 2 years ago

Ok,

extracting only super-k-mers should be quite easy, as they are stored in temporary files, which are all available after running stage 1. As far as I remember, they are deleted after being read in stage 2 (but maybe some additional parameter must be passed; I'm not sure without looking into the code. If you need help with that, let me know).

They are in a binary format (currently quite simple, though it may change in the future). The format is (roughly): [a_1][super-k-mer_1][a_2][super-k-mer_2]...[a_n][super-k-mer_n] for a file containing n super-k-mers. Each a_i is a number stored in 1 byte, and k + a_i is the length of the i-th super-k-mer in the file (so a_i is the number of additional symbols beyond k). Internally we also keep some additional info in memory for accessing the file in parallel, but for now maybe just read the file sequentially.
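For anyone landing here later, a minimal Python sketch of reading such a file sequentially might look like the following. Only the [a_i][super-k-mer_i] layout and the "length = k + a_i" rule come from the description above. The symbol encoding is not specified there, so the 2-bit packing (4 symbols per byte, most significant bits first, A=0, C=1, G=2, T=3) is an assumption, and `read_super_kmers` is a hypothetical helper, not part of KMC. Check it against the current KMC sources before relying on it.

```python
import io
from typing import BinaryIO, Iterator

# ASSUMPTION: symbols are packed 2 bits each, 4 per byte, most significant
# bits first, with A=0, C=1, G=2, T=3. The thread only gives the
# [a_i][super-k-mer_i] layout, not the symbol encoding.
SYMBOLS = "ACGT"


def read_super_kmers(stream: BinaryIO, k: int) -> Iterator[str]:
    """Yield super-k-mers from a stream laid out as [a_1][skm_1]...[a_n][skm_n].

    Hypothetical helper, not part of the KMC API.
    """
    while True:
        header = stream.read(1)
        if not header:
            return  # clean end of file
        length = k + header[0]  # a_i = number of symbols beyond k
        n_bytes = (length + 3) // 4  # assumed: 4 symbols per byte, last byte padded
        payload = stream.read(n_bytes)
        if len(payload) != n_bytes:
            raise ValueError("truncated super-k-mer record")
        symbols = []
        for i in range(length):
            shift = 6 - 2 * (i % 4)  # first symbol sits in the top two bits
            symbols.append(SYMBOLS[(payload[i // 4] >> shift) & 3])
        yield "".join(symbols)


# Two hand-built records for k = 3:
#   a=2 -> "ACGTA" packed as 0x1B 0x00, and a=0 -> "TTT" packed as 0xFC
example = bytes([2, 0x1B, 0x00, 0, 0xFC])
decoded = list(read_super_kmers(io.BytesIO(example), k=3))
```

For a real temporary file you would pass `open(path, "rb")` instead of the `io.BytesIO` buffer, with `k` set to the value used in the KMC run.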

Important details:

Out of curiosity, why do you need super-k-mers/minimizers?