marbl / meryl

A genomic k-mer counter (and sequence utility) with nice features.
117 stars 14 forks source link

How to get kmer count in query fastq files with kmer database #50

Open leon945945 opened 3 months ago

leon945945 commented 3 months ago

Hi, I generated a kmer database with meryl count command, and I want to count the kmers of the database in different query fastq files, how to do it with meryl. I saw that meryl-lookup did the similar job, but the output file of meryl-lookup is not what I want, I want to compare the number of each kmer in different queries, my expected output is like this:

input1 input2 kmer1 count count kmer2 count count .......

Hope for your reply.

arangrhie commented 3 months ago

Hi,

I would do the kmer counting first, and do an intersection with the kmer database of interest (let's say original.meryl). For example for input1:

meryl count k=$k input1.fastq.gz output input1.meryl
meryl intersect input1.meryl original.meryl
meryl print input1.meryl

The output will give you the counts in input1.meryl, for the kmers that exists in both input1.meryl and original.meryl. Then repeat this for input2, and merge the results.

meryl-lookup adds counts of the input meryl kmer database. It would not do the counting of the input sequence (your fastq reads in your case). If you'd like to use meryl-lookup, and by-pass the intersect / print / manual merging, you could do it the other way around by using input1.meryl and input2.meryl on your original sequence. You could potentially make fake sequences, making each kmer its own sequence from your original.meryl with meryl print result.

leon945945 commented 3 months ago

Thanks for your quick reply. I get the idea of comparing kmer count in multiple fastq files. The kmers whose count number equals to 0 in input1.meryl could be extracted with meryl difference original.meryl input1.meryl command. Due to the large amounts of kmer sequences, maybe the manual merging is a toughh work to do.