refresh-bio / KMC

Fast and frugal disk based k-mer counter

How to get the bit representation from KMC-API #118

Closed voichek closed 5 years ago

voichek commented 5 years ago

Hi,

I am using the KMC API in my code to load k-mer databases created by KMC. My k-mers have length <= 31, and I want to get them in their bit representation.

I can use get_num_symbol, but for efficiency reasons I would like to get the k-mer in the minimum number of operations.

I came up with the workaround below, and I would like feedback on whether it is correct and whether there is a more natural way to do the same thing:

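#include <stdexcept>  // std::invalid_argument

class CKmerAPI_upto31bp : public CKmerAPI {
public:
    // Caches the right shift that moves the k-mer into the low bits of kmer_data[0].
    CKmerAPI_upto31bp(uint32 length = 0) : CKmerAPI(length),
        m_shift(64 - (((kmer_length - 1 + byte_alignment) % 32) * 2) - 2) {
        if (length > 31)
            throw std::invalid_argument("k-mer length should be <=31");
    }
    // Returns the k-mer packed two bits per symbol in a single 64-bit word.
    uint64 to_uint() const { return (uint64)kmer_data[0] >> m_shift; }
private:
    uint32 m_shift;
};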

Thanks for the help, Yoav Voichek
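The shift arithmetic in the class above can be checked without KMC at all. The sketch below is a standalone model, not KMC code: it assumes that the first 64-bit word stores the padding symbols in the most significant bits followed by the k-mer symbols at two bits each, that the padding brings the symbol count up to a multiple of 4, and that symbols are encoded as A=0, C=1, G=2, T=3. The helper names (padding_symbols, shift_for, layout_word, pack_reference) are hypothetical and introduced only for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Assumed 2-bit encoding: A=0, C=1, G=2, T=3 (anything else maps to 0 here).
inline std::uint64_t code(char c) {
    switch (c) {
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
        default:  return 0;
    }
}

// Assumption: padding symbols bring (k + padding) up to a multiple of 4.
inline std::uint32_t padding_symbols(std::uint32_t k) {
    return (4 - k % 4) % 4;
}

// Same expression as m_shift in the class above, written as 62 - 2 * (...).
inline std::uint32_t shift_for(std::uint32_t k) {
    assert(k >= 1 && k <= 31);
    return 62 - ((k - 1 + padding_symbols(k)) % 32) * 2;
}

// Model of the assumed first word: padding on top, then symbols, MSB first.
inline std::uint64_t layout_word(const std::string& kmer) {
    const std::uint32_t k = static_cast<std::uint32_t>(kmer.size());
    const std::uint32_t pad = padding_symbols(k);
    std::uint64_t word = 0;
    for (std::uint32_t i = 0; i < k; ++i)
        word |= code(kmer[i]) << (62 - 2 * (i + pad));
    return word;
}

// Reference value built symbol by symbol, last symbol in the lowest bits.
inline std::uint64_t pack_reference(const std::string& kmer) {
    std::uint64_t value = 0;
    for (char c : kmer)
        value = (value << 2) | code(c);
    return value;
}
```

Under these assumptions, layout_word(kmer) >> shift_for(k) should equal pack_reference(kmer) for every k from 1 to 31. For k = 28, for example, there is no padding and the shift is 8; for k = 29, 30 and 31 the padded length is 32 symbols and the shift is 0.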

marekkokot commented 5 years ago

Hi,

This looks like a special case of the inline void to_long(std::vector<uint64>& kmer) method with some edge cases eliminated, so in my opinion it is quite a nice workaround :). If its performance is still not enough for your purposes, you could define your own class to represent k-mers and read the database yourself. That is probably quite a time-consuming task, but if you want to try, read the KMC database format description in the docs. Since your case (k < 32) is simpler, you may be able to eliminate a few more edge cases.

The priority of the KMC API is flexibility and ease of use; performance is the second criterion. In general, I don't think you will gain a huge performance boost by writing your own implementation to access a KMC database, but some boost should be possible.

If you describe your use case in more detail, I may be able to help. Do you access the database in random access mode or in listing mode? In the first case you could, for example, try sorting the database using kmc_tools, because in some cases querying k-mers may be faster when the database is sorted.

Best, Marek
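As a rough illustration of the "own class" idea for k < 32, here is a minimal sketch of an in-memory table keyed by packed 64-bit k-mers and queried by binary search. It is a hypothetical structure for illustration only: it does not read or mirror the KMC database format, and the name PackedKmerTable and its methods are assumptions, not part of the KMC API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical in-memory table of (packed k-mer, count) pairs.
// This is NOT the KMC on-disk format; it only shows sorted lookup.
class PackedKmerTable {
    using Entry = std::pair<std::uint64_t, std::uint32_t>;

public:
    // Appends an entry, e.g. while iterating over a database in listing mode.
    void add(std::uint64_t kmer, std::uint32_t count) {
        entries_.emplace_back(kmer, count);
        sorted_ = false;
    }

    // Sorts entries by packed k-mer value; call once after loading.
    void sort() {
        std::sort(entries_.begin(), entries_.end());
        sorted_ = true;
    }

    // Returns the stored count, or 0 if the k-mer is absent.
    std::uint32_t count(std::uint64_t kmer) const {
        assert(sorted_);  // binary search requires sorted entries
        auto it = std::lower_bound(
            entries_.begin(), entries_.end(), kmer,
            [](const Entry& e, std::uint64_t key) { return e.first < key; });
        return (it != entries_.end() && it->first == kmer) ? it->second : 0;
    }

private:
    std::vector<Entry> entries_;
    bool sorted_ = true;  // an empty table is trivially sorted
};
```

A table like this would be filled once and then queried many times. Whether it actually beats the KMC API for a given workload would need to be measured.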

voichek commented 5 years ago

Dear Marek,

Thank you for your response.

The current implementation (CKmerAPI_upto31bp) gives me the performance I need. I was worried that the workaround might be incorrect in some cases and wanted to make sure I was not missing anything.

Thanks again, Yoav