refresh-bio / KMC

Fast and frugal disk based k-mer counter

kff to jf (or vice versa) #228

Closed: JosephLalli closed this issue 5 months ago

JosephLalli commented 5 months ago

Hello,

For every sample in my dataset, I need to run two tools. One tool accepts k-mer counts in kff format, while the other accepts counts in jellyfish's .jf format.

Especially for the second tool, kmer counting takes up the majority of runtime.

To avoid counting kmers twice, I'd like to use one tool to count kmers and then convert the counts to the other file format. Is there a method to convert kff-formatted counts to jf? Or from jf to kff?

I know jellyfish can dump jf files into a tsv file, but jf -> tsv -> kff seems computationally expensive.

Best, Joe Lalli

marekkokot commented 5 months ago

Hi,

Out of curiosity, what is the second tool (and what is the first one)? If the k-mer counting for the second tool takes the majority of the time (I guess you are running Jellyfish?), how would converting from kff to jf help?

Anyway, I am afraid such a conversion is currently not supported. jf files are in a totally different format. Writing a tool more efficient than jf -> tsv -> kff is probably possible, but it is not in our current plans. Maybe KFF output should be added to Jellyfish?

Best, Marek

JosephLalli commented 5 months ago

The first tool is vg giraffe, which works with kff-formatted files, and the second tool is pangenie, which uses jellyfish internally to count kmers.

While one can run kmer counting separately from pangenie, it was not designed with that use case in mind. So, when providing kmer counts instead of raw fasta files, it only accepts .jf formatted counts.

I also agree that jellyfish should export kff format. I'll raise this as an issue with both packages.

Thanks, Marek, for the wonderful tool and your prompt reply!

Best, Joe
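
For anyone landing here with the same problem: the jf -> tsv -> kff route discussed above does not have to write the intermediate TSV to disk. Below is a minimal Python sketch, not something provided by KMC or Jellyfish, that reads the text output of `jellyfish dump -c -t` through a pipe and yields (k-mer, count) pairs. The function names `parse_dump_lines` and `stream_jf_counts` are hypothetical, and the step that actually writes KFF is deliberately left out, since no converter is confirmed in this thread.

```python
import subprocess
from typing import Iterable, Iterator, Tuple


def parse_dump_lines(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Yield (kmer, count) pairs from 'KMER<whitespace>COUNT' text lines."""
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 2:
            raise ValueError(f"unexpected dump line: {line!r}")
        kmer, count = fields
        yield kmer, int(count)


def stream_jf_counts(jf_path: str) -> Iterator[Tuple[str, int]]:
    """Stream counts from `jellyfish dump` without an intermediate TSV file.

    Assumes the `jellyfish` executable is on PATH. `-c` requests column
    output and `-t` a tab separator.
    """
    cmd = ["jellyfish", "dump", "-c", "-t", jf_path]
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc:
        yield from parse_dump_lines(proc.stdout)
    if proc.returncode != 0:
        raise RuntimeError(f"jellyfish dump exited with code {proc.returncode}")
```

The pairs from `stream_jf_counts("sample.jf")` could then be fed to whatever KFF writer is available. This only removes the disk round trip; the text encoding and parsing cost that Joe was worried about is still there.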