refresh-bio / KMC

Fast and frugal disk based k-mer counter
253 stars 73 forks source link

Question on CKMCFiles #147

Open GaryGohYP opened 4 years ago

GaryGohYP commented 4 years ago

Hi, I am trying to open a CKMCFile in listing mode and have it duplicated so that another CKMCFile object can continue reading from where the first one left off. Any advice on how I should do this? Thank you very much!

marekkokot commented 4 years ago

Hi,

thanks for using KMC! Unfortunately, such functionality is currently not supported. We may consider implementing it, but it must be designed properly (the first option that comes to me seems to be providing copy ctor and assignment operator, but there are some issues, like what in case of random access mode, it would require allocating potentially big memory that would just contain a copy of another big memory chunk, so maybe the better option would be creating some kind of view class). Currently, it may be hard to find time to implement this functionality. Could you provide more information of your use case, it would help to design it properly or maybe I could give you some advice without the need to change the current implementation. One simple solution that I think could be implemented quickly is to add function Seek(uint32_t kmer_index) that would change the internal k-mer pointer to a specified index (only for listing mode), but before I will implement it please give me more details about your use case. Thanks again for using KMC.

GaryGohYP commented 4 years ago

Hi, thanks for the help!

I actually want to specify prefixes, say "AA", "CA", "GA", and "TA" and scan all kmers sequentially from each given prefix. This is because I am trying to construct a de Bruijn Graph but it requires merge sorting kmers with the same prefix to get the edge BWT. Originally I had intended to this by scanning sequentially once through to get all the pointers needed, and then run the merge sort. So actually, if there were anyway to list kmers from a given prefix, that would be extremely useful.

Thank you very much for your help!

marekkokot commented 4 years ago

Ok, thank you for explanation, it clarifies things. I wonder if you use kmc_tools to sort k-mers first (if not, please read in the documentation of kmc_tools about sort operation in transform group). The documentation is available in the repository as a pdf file. If you didn't use this operation, maybe it will help you, at least a little.

Are you able to predict the order of prefixes in which you will need to list k-mers? If it is possible in your application to browse k-mers in increasing order of prefixes, it should be easily doable if KMC database would be sorted, without the need to implement new features in KMC API (which would probably take a significant amount of time, although algorithmically adding functionality to list k-mer with a given prefix should be rather simple). Let me know if using sort from kmc_tools would be enough.

GaryGohYP commented 4 years ago

Hi thanks for the quick reply! Yes I first run sort from kmc_tools first on the database. So the database will be sorted. However, lexicographical ordering is not what I want. For a kmer c1c2..ck I need them to be sorted with c1 being the least significant character, followed by ck, c(k-1), .. c2 with c2 being the most significant character. So an example of kmers sorted in this order would be: AAA CAA GAA TAA AAC CAC GAC TAC This ordering is can be obtained by merging kmers with prefixes 'A','C','G', and 'T' based on their suffix.

Also, the prefixes are known before hand. Like in this case since I chose the prefix to be of length 1. But we can achieve the same list by merge sorting prefixes of length 2 in parallel. If it is easy to add the list from given prefix functionality to CKMCFile, maybe you could advice me on how I should go about to do it? I intend to read from the same database from different prefixes in parallel.

Thanks again for the quick reply!

marekkokot commented 4 years ago

Ok, now I think I do understand your idea. Since you sort your kmc database first, it is as a result in KMC1 format (which is described here: https://static-content.springer.com/esm/art%3A10.1186%2F1471-2105-14-160/MediaObjects/12859_2012_5911_MOESM1_ESM.pdf) Te number of files opened simultaneously is limited by the ohis format is much easier than KMC2 (which is also a result of KMC in version 3:)). You may try to write your own program to read the database directly, but it will for sure require some work. There is also possibility that there are some values used now in the database that was first reserved for future use, but it should not matter in your case.

The other (simpler, but much less performant) option is to create as many KMCFile Object as there are prefixes and manually skip all k-mers with wrong prefixes and than you should be able to read from these objects. This is rather bad suggestion, especially for long prefix lengths, as many records will be read many times, and for each database there is opened FILE*, and thperating system.

Here (sorry for .txt but github does not allow .cpp files placed in issues) is a simple, fast written, poorly tested implementation of that idea. There are many possible improvements and it may work probably much faster. assuming kmc_api folder is next to this source file you may compile it using:

g++ -std=c++14 -O3 main.cpp kmc_api/*.cpp -o program

Unfortunatelly, I am afraid that for now it is the only thing I can give you, because of other duties. If we decide to add this functionality directly to kmcAPI we would consider its design carefully and test it well (both for correctnes and performance).

Let me know if the solution (rather not very smart) I gave you will help you.

If you decide to write your own software to access kmc database and spot any problems don't hesitate to ask.

GaryGohYP commented 4 years ago

I see thanks for the suggestion! However, I think that this is similar to the temporary fix I have been using which is also to open many CKMCFiles and scan them sequentially from the start till the prefixes. I'll try to read them directly from the database instead. Thanks for the help!