gmarcais / Jellyfish

A fast multi-threaded k-mer counter
Other
462 stars 136 forks source link

Counting only a limited k-mer set #124

Closed MaximilianStammnitz closed 6 years ago

MaximilianStammnitz commented 6 years ago

Hi Guillaume,

Glad to have come across Jellyfish today - it looks like this does exactly the k-mer counting job on sequencing data that we are needing in our project!

Before going ahead with installing, I was wondering: is it possible to further speed up your tool by only feeding in a list of k-mers of interest (all same length), rather than looking at all possible ones at the same time? In particular, I am only interested in ~ 3k different keys and their reverse-complements that can give us information; anything else could be dumped on the fly. I imagine this will also be quite useful to other members of the community.

Please let me know if there's a way to easily feed in a pre-designed K array, I'm happy to test and report back.

Cheers, Max

gmarcais commented 6 years ago

Yes, this is doable with the --if switch (which stands for "in file"). For example, suppose I want to count the number of occurrences of the 20-mers of chromosome 20 in chromosome 1 of human, I would do the following:

jellyfish count -m 20 -s 100M -C -t 10 -o 20and1.jf --if chr20.fa chr1.fa

So this reads all the canonical 20-mers from the file chr20.fa (but don't count them) and loads them in memory. Then it reads chr1.fa and counts the number of occurrences of the 20-mers only of the k-mers already loaded in memory.

If you have a list of k-mers instead of a sequence, you can create a fake fasta file with 1 k-mer per entry.

Now, the histogram will report k-mers with a 0 count (the 20-mers of chr20 which do not exists in chr1):

$ jellyfish histo 20and1.jf | head 
0 49401674
1 1116213
2 425471
3 250702
4 170160
5 124391
6 102038
7 74168
8 57631
9 47173
MaximilianStammnitz commented 6 years ago

Sounds great, thanks so much for your quick reply, I will test it in this way and get back to you soon.

Would recommend adding this functionality to the User's Guide. It's definitely quite helpful for anyone with not so well-assembled genomes or for researchers interested in genomic regions of poor alignment. Unix rookies like me may intuitively grep fastq files with multiple k-mer iterations and waste lots of time...

gmarcais commented 6 years ago

Glad it helped.

I added basically the instructions above in the Readme.md of the doc directory.