refresh-bio / KMC

Fast and frugal disk based k-mer counter
253 stars 73 forks source link

Creating histogram of kmer presence in genomes #182

Open lucaparmigiani opened 2 years ago

lucaparmigiani commented 2 years ago

Hi, first of all, thanks for the tool, it is indeed pretty fast and efficient!

I wanted to ask you how to achieve the following:

I have N genomes (assembly). I want to get an histogram of how many kmers appear in 1 Genome, 2 genomes...N genomes. So far I cannot do this, even with kmc_tools, since producing an output with -cs1 will make it not readable for kmc_tools.
If I use -cs > 1 then I cannot apply the union with -ocsum on the two genomes (If I have a kmer appearing twice in the first genome but 0 times in the second genome this will result as if it is present in both).

marekkokot commented 2 years ago

Hello,

thank you for using KMC. I need to implement handling non-counters in kmc_tools... I'm afraid it may take some time due to other responsibilities.

For now, I have some tricky workaround. You may run kmc without -cs1 and having k-mer database set all counts to 1 with kmc_tools (I will not cause not storing counters in the database, so kmc_tools will not refuse it). Here are example command lines:

kmc -fm -ci1 1.fa o1 .
kmc -fm -ci1 2.fa o2 .

kmc_tools transform o1 set_counts 1 o1_1
kmc_tools transform o2 set_counts 1 o2_1

kmc_tools simple o1_1 o2_1 union o3

Let me know if it works for you. I wonder how it will hurt the performance of your pipeline. If it will be too much let me know and maybe I will be able to prioritize this issue. Another possible (although not recommended) walkaround is to use some older version of KMC).

Please do not close this issue even if it works. I will remind me that it needs to be implemented correctly.

Best, Marek

lucaparmigiani commented 2 years ago

Thanks for the quick response!

Since I needed this for my experiments I modified a version of yak-count.c and use that for the moment. Thanks for the solution you suggested, I double checked my results with your method so that I was certain I was getting the right numbers.

This "histogram of genome occurence" is function that I definitely would love to see in KMC at some point.

Thank you, Luca