refresh-bio / KMC

Fast and frugal disk based k-mer counter

Computing Jaccard distances #113

Open TransGirlCodes opened 5 years ago

TransGirlCodes commented 5 years ago

Hi, I have a set of individuals I want to compute Jaccard distances for.

I've produced a KMC database for each individual. One way I thought of to compute these distances is, for each pair of individuals, to use kmc_tools to make a union database and an intersection database.

Then use the number of distinct k-mers in each of the two databases (this could be done with wc -l on the text dump of each) to compute the Jaccard distance.
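For reference, the arithmetic in that last step can be sketched as below. This is only a minimal illustration, assuming the two distinct k-mer counts have already been obtained somehow; the function name and the example numbers are made up for illustration and are not part of KMC.

```python
def jaccard_distance(n_intersection: int, n_union: int) -> float:
    """Return the Jaccard distance 1 - |A ∩ B| / |A ∪ B|.

    Both arguments are counts of distinct k-mers, e.g. the line counts
    of the text dumps of the intersection and union databases.
    """
    if n_union <= 0:
        raise ValueError("the union must contain at least one k-mer")
    if not 0 <= n_intersection <= n_union:
        raise ValueError("the intersection cannot be larger than the union")
    return 1.0 - n_intersection / n_union


# Hypothetical counts: 250 shared k-mers out of 1000 distinct k-mers overall.
print(jaccard_distance(250, 1000))  # 0.75
```

The Jaccard similarity (score) is simply 1 minus this value.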

Is there an easier way to compute the Jaccard score for two KMC databases?

I've also noticed that with some of my union and intersection databases, kmc_tools hits a segmentation fault 11 when I try to use transform to dump them as text. Why might this be?

Thanks!

marekkokot commented 5 years ago

Hi,

To get the number of k-mers in a specific KMC database you may use:

kmc_tools info <db_path>

It prints some information about the specified database. It should be much faster than dump and wc -l. It could certainly be made even more efficient because, in fact, you do not need to store the kmc_tools output database in a file, only the number of k-mers. Unfortunately, kmc_tools does not currently support that; maybe we will add it in the future.
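Since only the counts matter, one possible shortcut is to build just the intersection database for each pair and derive the union size by inclusion-exclusion, |A ∪ B| = |A| + |B| - |A ∩ B|. The sketch below assumes the three counts are distinct k-mer counts read by hand from the info output, and that all databases use the same k and the same count cutoffs; the function name and numbers are hypothetical.

```python
def jaccard_distance_from_sizes(n_a: int, n_b: int, n_intersection: int) -> float:
    """Jaccard distance from the sizes of two sets and of their intersection.

    n_a, n_b: distinct k-mer counts of the two per-individual databases.
    n_intersection: distinct k-mer count of their intersection database.
    The union size is derived as |A| + |B| - |A ∩ B|, so no union
    database has to be built.
    """
    if min(n_a, n_b, n_intersection) < 0:
        raise ValueError("counts must be non-negative")
    if n_intersection > min(n_a, n_b):
        raise ValueError("the intersection cannot exceed either database")
    n_union = n_a + n_b - n_intersection
    if n_union == 0:
        raise ValueError("both databases are empty")
    return 1.0 - n_intersection / n_union


# Hypothetical counts: 600 and 650 distinct k-mers, 250 of them shared,
# which gives a union of 1000 distinct k-mers.
print(jaccard_distance_from_sizes(600, 650, 250))  # 0.75
```

The per-individual counts only need to be looked up once each, so for n individuals this means n lookups plus one intersection per pair, rather than a union and an intersection per pair.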

Could you please send me a small example input file and the commands you use that cause the segfault?