How to hack KMC source to implement variant of k-mer counting

rcedgar commented 2 years ago

I have an application where I need to count pairs of k-mers separated by a fixed distance. This is almost identical to K-mer counting where K=2k, the "only" change should be in the scanning step where k-mers are enumerated in an input sequence. It would be a huge help if I could hack KMC to do this while preserving the database format etc. as your code has many features I'll probably need. Can you give me some hints about where & how to make the necessary changes? Thanks!

marekkokot commented 2 years ago

Hello,

thanks for using KMC and posting this idea! There are many internal dependencies in KMC code, so it may be hard/impossible to point just one place in a code to alter to get what you need. I understand the general idea, but I would need to ask some questions about the details.

How big can the fixed distance be? As I understand for you example K=2k the gap is 0.
What about canonicality - do you want each k-mer of a pair to be in canonical form?
What do you want to do with these pairs? Do you need random access, if yes in what form, i.e. would you like to query for a whole specific pair or for one k-mer of a pair? Do you want to perform some operations with kmc_tools?

It would be great if you could also provide some examples, i.e. some simple sequence, the k, the gap size(what if there is an N in the gap?), and the expected outcome. If you would also show by example what operations you want to perform on the output it would be great!

KMC is highly optimized, which makes some changes rather time-consuming to implement.

rcedgar commented 2 years ago

I need a general-purpose tool which would be the first step in a longer pipeline set up by an end-user so a lot of flexibility is needed. This tool would be similar to KMC with additional options -k1, -k2 and -R where k1=first kmer length, R=spacing between the k-mers, and k2=second kmer length. Input could be any type of sequence file. Options such as canonical on/off and operations like kmer_tools would be needed. I was hoping that after an initial scanning step then the k1+[R]+k2-mer could be processed identically to a contiguous k1+k2-mer by the rest of the KMC code, but of course I realize this might not be so simple. Very likely, I'm going to need other tweaks, so it might be easier for me to re-implement something similar KMC rather than fork your code. The high level of optimization in your code is an advantage, but I can probably write something good enough in a reasonable time.

marekkokot commented 2 years ago

Hello,

I see. Thanks for being a little more detailed! I think the approach you described probably makes sense in terms of applicability in KMC, although there are some details that may influence some design choices. If you want to try to alter KMC code I think you will need to modify these parts: https://github.com/refresh-bio/KMC/blob/master/kmc_core/splitter.cpp#L555 - this scans the sequences and splits them into super-k-mers that are stored on disk https://github.com/refresh-bio/KMC/blob/master/kmc_core/kb_sorter.h#L562 - there super-k-mers are split into (k,x)-mers that are further sorted Probably these are not the only parts where modifications will be required, but it seems these are the most crucial ones. If you have any questions don't hesitate to ask. Good luck! If you decide on your own implementation from scratch it would be great if you could show us the performance of your tool. If for some reason you will change your mind and want us to get involved in the project let us know. In such a case we will for sure need more details and probably it would be better to move the discussion to e-mail.

rcedgar commented 2 years ago

Could be ideal if you guys were interested in adding this feature to KMC, by all means email me https://drive5.com/about.html or even better might be invite me to a slack channel.

refresh-bio / KMC

How to hack KMC source to implement variant of k-mer counting #193