refresh-bio / KMC

Fast and frugal disk based k-mer counter

API for KMC database construction #77

Open karasikov opened 6 years ago

karasikov commented 6 years ago

Dear developers of KMC,

It would be great to have an independent interface for building a KMC database, similar to the reading interface implemented in kmc_api/kmc_file.

marekkokot commented 6 years ago

Hi,

Interesting idea. Unfortunately, I probably will not be able to implement it in the near future, but I hope I will someday :)

Could you confirm if my understanding is correct? You want something like:

CKMCFile kmc_file;
kmc_file.ConstructDatabase(/*bunch of parameters that are equivalent to KMC command line parameters*/);

As a result, it would create the KMC database in memory, without storing it on disk. If there is not enough memory, it would simply fail; on success, the database would be opened in random access mode. Or should such a function create a regular KMC database on disk and open it in listing mode?

karasikov commented 6 years ago

Thanks for your reply.

You want something like:

Exactly, probably with a second version where, instead of input files, we pass a function that generates reads. This does not necessarily have to happen in memory: we could also provide an optional parameter with a path for temporary storage on disk:

kmc_file.ConstructDatabase(const std::vector<std::string> &input_filenames,
                           const std::string &database_basename,
                           const std::string &cache_dump_path,
                           size_t k, size_t max_RAM,
                           /*other parameters*/);

kmc_file.ConstructDatabase(std::function<std::string()> reads_generator,
                           const std::string &database_basename,
                           const std::string &cache_dump_path,
                           size_t k, size_t max_RAM,
                           /*other parameters*/);
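
To make the generator-based variant concrete, here is a minimal in-memory sketch of how such a callback could be consumed. It is not KMC code and not an existing KMC API: `count_kmers` and `make_generator` are hypothetical names, the convention that the generator returns an empty string when it has no more reads is an assumption, and the counting is a naive hash map with no canonical (reverse-complement) handling, no disk spilling, and no memory limit.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for the proposed generator-based overload.
// Assumption: reads_generator returns an empty string once it is exhausted.
std::unordered_map<std::string, std::uint64_t>
count_kmers(const std::function<std::string()> &reads_generator, std::size_t k) {
    assert(k > 0);
    std::unordered_map<std::string, std::uint64_t> counts;
    for (std::string read = reads_generator(); !read.empty();
         read = reads_generator()) {
        // Reads shorter than k contain no k-mers; the loop below skips them.
        for (std::size_t i = 0; i + k <= read.size(); ++i) {
            ++counts[read.substr(i, k)];
        }
    }
    return counts;
}

// Hypothetical helper: wraps a vector of reads in a generator that yields
// them one by one and then returns empty strings.
std::function<std::string()> make_generator(std::vector<std::string> reads) {
    std::size_t next = 0;
    return [reads = std::move(reads), next]() mutable -> std::string {
        return next < reads.size() ? reads[next++] : std::string();
    };
}
```

For example, `count_kmers(make_generator({"ACGTAC", "CGT"}), 3)` visits both reads and tallies every 3-mer. A real implementation would presumably stream the reads into KMC's own pipeline instead of a hash map.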

I could potentially adapt the code from kmer_counter.cpp, but unfortunately that seems too complicated at the moment. I think adding the two functions above would open up a new use case for the KMC API and significantly improve the applicability of KMC as an external library for counting k-mers.

My current solution: 1) run kmc externally to count k-mers and generate the database, 2) then my program uses the KMC API for filtering reads and other processing.
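
As a rough illustration of step 1 of this workaround, the sketch below only assembles the command line for an external kmc run. `build_kmc_command` is a hypothetical helper, not part of KMC. It assumes the usual `kmc -k<len> -m<GB> <input> <output> <working_dir>` invocation, a single input file, and paths without whitespace (no shell quoting is done).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: builds the command line for an external kmc run.
// Assumptions: -m is given in gigabytes, and none of the paths contain
// whitespace or shell metacharacters (no quoting is applied here).
std::string build_kmc_command(const std::string &input_file,
                              const std::string &database_basename,
                              const std::string &working_dir,
                              std::size_t k, std::size_t max_ram_gb) {
    assert(k > 0 && max_ram_gb > 0);
    return "kmc -k" + std::to_string(k) +
           " -m" + std::to_string(max_ram_gb) +
           " " + input_file +
           " " + database_basename +
           " " + working_dir;
}
```

The resulting string could be passed to `std::system`. Once kmc exits successfully, step 2 would open the database named by `database_basename` through the reading interface in kmc_api/kmc_file, for instance with `CKMCFile::OpenForListing` or `CKMCFile::OpenForRA`.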

Obviously, this is not perfect; ideally, everything should happen in one run. Although it works for now, I will have to reconsider this solution for the release version.