Open mr-eyes opened 2 years ago
Hi,
I am not sure I understand exactly what you want. Do you want, for example, to run on 10% of the input data and get all minimizers found with their counts, or something else? Some clarification would be great. Currently there is no simple way to do this, at least not through the C++ API or the CLI, although you may try to modify the code (which may not be very easy and may be a little time-consuming). Anyway, we are planning to refactor some parts of KMC, and maybe also extend its API, so we are looking for suggestions.
Best, Marek
Hi @marekkokot, thanks for the prompt reply!
Well, I can dig into the code, but yes, an extended API along with a refactoring of the current code would make things much easier. What I want is simply to use KMC as a canonical minimizer extractor, without going through the k-mer counting step. I am only interested in the minimizers/super-k-mers, to use them in other processing. I know that KMC isn't being developed for that purpose, but I believe the engineering effort put into KMC would make it a very fast k-mer/minimizer extraction tool.
I hope I made it clearer this time :)
Ok,
extracting only super-k-mers should be quite easy, as they are stored in the temporary files, which are all available after running stage 1. As far as I remember, they are deleted after being read in stage 2 (but maybe some additional parameter must be passed; I'm not sure without looking into the code. If you need help with that, let me know).
They are in a binary format (this format is currently quite simple, though it may change in the future).
The format is (roughly):
[a_1][super-k-mer_1][a_2][super-k-mer_2]...[a_n][super-k-mer_n]
for a file containing n super-k-mers.
a_i is a number stored on 1 byte. k + a_i is the length of the i-th super-k-mer in the file (so a_i is the number of additional symbols above k).
Internally we also keep some additional info in memory for accessing the file in parallel, but for now it is probably simplest to just read the file sequentially.
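To make the layout above concrete, here is a minimal sketch of a sequential reader in Python. It is not KMC code. The thread only specifies the 1-byte a_i prefix and the super-k-mer length k + a_i; everything about the payload is an assumption made for illustration: that symbols are packed 2 bits each (4 per byte), that each record is padded to a whole byte, and that the codes are A=0, C=1, G=2, T=3 with the first symbol in the most significant bits. The names `read_super_kmers` and `decode_2bit` are hypothetical. Check the KMC source before relying on any of this.

```python
import io
from typing import BinaryIO, Iterator, Tuple


def read_super_kmers(stream: BinaryIO, k: int,
                     symbols_per_byte: int = 4) -> Iterator[Tuple[int, bytes]]:
    """Yield (length_in_symbols, raw_payload) for each [a_i][super-k-mer_i] record.

    ASSUMPTION: the payload occupies ceil((k + a_i) / symbols_per_byte) bytes.
    Only the 1-byte a_i prefix is taken from the description in this thread.
    """
    while True:
        header = stream.read(1)
        if not header:          # clean end of file
            return
        extra = header[0]       # a_i: number of symbols above k
        length = k + extra      # total symbols in this super-k-mer
        n_bytes = (length + symbols_per_byte - 1) // symbols_per_byte
        payload = stream.read(n_bytes)
        if len(payload) != n_bytes:
            raise ValueError("truncated super-k-mer record")
        yield length, payload


def decode_2bit(payload: bytes, length: int, alphabet: str = "ACGT") -> str:
    """Decode a 2-bit packed payload into a string.

    ASSUMPTION: first symbol sits in the two most significant bits of the
    first byte, and codes 0..3 map to A, C, G, T.
    """
    symbols = []
    for i in range(length):
        byte = payload[i // 4]
        shift = 6 - 2 * (i % 4)
        symbols.append(alphabet[(byte >> shift) & 0b11])
    return "".join(symbols)


if __name__ == "__main__":
    # Synthetic data for k = 3: "ACGT" (a = 1) followed by "TTA" (a = 0).
    data = bytes([1, 0b00011011, 0, 0b11110000])
    for length, payload in read_super_kmers(io.BytesIO(data), k=3):
        print(length, decode_2bit(payload, length))
```

For a real temporary file, the same generator could be fed with `open(path, "rb")`, using the k that was passed to KMC. If the actual packing differs from the assumption above, only the `n_bytes` computation and `decode_2bit` would need to change.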
Important details:
Out of curiosity, why do you need super-k-mers/minimizers?
Hi,
Is there a way to dump the canonical minimizers or superKmers without having to reach the final stage? I have read the API docs but didn't find an exposed class to achieve it. I would appreciate any leads on that!
Thank you,