gmarcais / Jellyfish

A fast multi-threaded k-mer counter

count in file command #14

Closed kes1smmn closed 10 years ago

kes1smmn commented 10 years ago

Dear Guillaume,

I wish I had a better handle on the bug, but I noticed something wonky with the count_in_file command. I found it is related to the initial hash size used when calling jf count. I illustrate it below.

I start with a small FASTA file, jf_test.fa:

    >test
    AGTGAAGCCAATTGATTTTTTAGACCCC

I build 4 equivalent jf databases with the following commands. Notice -s is the only option that changes.

    $ jellyfish count -C -s 10M -m 21 -o jf_test_10M.jf jf_test.fa
    $ jellyfish count -C -s 20M -m 21 -o jf_test_20M.jf jf_test.fa
    $ jellyfish count -C -s 30M -m 21 -o jf_test_30M.jf jf_test.fa
    $ jellyfish count -C -s 300M -m 21 -o jf_test_300M.jf jf_test.fa

Then I run count_in_file. I should get all kmers represented once, with a count of 1 in every column [notice that some of the kmers are repeated].

    $ ./count_in_file/count_in_file jf_test_10M.jf jf_test_20M.jf jf_test_30M.jf jf_test_300M.jf
    GCCAATTGATTTTTTAGACCC 1 1 1 0
    CTAAAAAATCAATTGGCTTCA 1 0 0 0
    GAAGCCAATTGATTTTTTAGA 1 0 0 0
    AAAAAATCAATTGGCTTCACT 1 1 1 0
    AGCCAATTGATTTTTTAGACC 1 1 1 0
    CTAAAAAATCAATTGGCTTCA 0 1 1 0
    AAGCCAATTGATTTTTTAGAC 1 0 0 0
    GTGAAGCCAATTGATTTTTTA 1 0 0 0
    CCAATTGATTTTTTAGACCCC 1 0 0 0
    GAAGCCAATTGATTTTTTAGA 0 1 1 0
    GTGAAGCCAATTGATTTTTTA 0 1 1 0
    AAGCCAATTGATTTTTTAGAC 0 1 1 0
    CCAATTGATTTTTTAGACCCC 0 1 1 1
    GAAGCCAATTGATTTTTTAGA 0 0 0 1
    GCCAATTGATTTTTTAGACCC 0 0 0 1
    CTAAAAAATCAATTGGCTTCA 0 0 0 1
    AGCCAATTGATTTTTTAGACC 0 0 0 1
    AAGCCAATTGATTTTTTAGAC 0 0 0 1
    AAAAAATCAATTGGCTTCACT 0 0 0 1
    GTGAAGCCAATTGATTTTTTA 0 0 0 1

Thanks,
Keith
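For reference, the expected result can be checked independently of Jellyfish. The following is a minimal Python sketch, not Jellyfish code: it assumes that canonical counting (-C) keeps the lexicographically smaller of each k-mer and its reverse complement, and the helper names are made up for illustration. Under that assumption the 28-base test sequence yields 8 distinct canonical 21-mers, each seen once, so a correct count_in_file run should print 8 rows.

```python
from collections import Counter

# Translation table for complementing a DNA string (illustrative helper).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical_kmer_counts(seq, k):
    """Count canonical k-mers: for each window, keep the smaller of the
    k-mer and its reverse complement (assumed to mirror `-C`)."""
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        counts[min(kmer, reverse_complement(kmer))] += 1
    return counts


# The sequence and k from the report above.
counts = canonical_kmer_counts("AGTGAAGCCAATTGATTTTTTAGACCCC", 21)
for kmer, n in sorted(counts.items()):
    print(kmer, n)
```

The 8 k-mers this prints are the same 8 distinct k-mers that appear, duplicated and with inconsistent columns, in the count_in_file output above.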
gmarcais commented 10 years ago

Hi Keith,

Yes, count_in_file cannot work with files created with different size parameters (-s). The compatibility check on the input files was buggy; you should have gotten an error.

Check out the latest develop branch for a fixed version. The README file has also been updated with some information.

kes1smmn commented 10 years ago

Thanks, I worked around it by using a large hash size for even the smallest files. This seemed to work fine.
Will the new version catch whether jellyfish2 resized the hash if it was initially set too small? I noticed that the header does not seem to record whether the hash was automatically resized. If a resize occurs, the count will not work correctly.

Thanks keith.
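The workaround described above (one shared, generous -s for every input) could be scripted along these lines. This is only a sketch: the helper name, the input file names, and the output naming scheme are hypothetical, and the only Jellyfish options used are the ones that appear in the commands earlier in this thread. It builds the command lines without running them.

```python
import shlex


def jellyfish_count_commands(fasta_files, size="300M", k=21):
    """Build one `jellyfish count` command per FASTA file, all sharing the
    same -s value so the resulting databases use the same initial size.
    (Hypothetical helper; flags are copied from the thread.)"""
    commands = []
    for fasta in fasta_files:
        stem = fasta.rsplit(".", 1)[0]   # drop the extension, if any
        out = f"{stem}_{size}.jf"        # illustrative output name
        commands.append(
            ["jellyfish", "count", "-C", "-s", size,
             "-m", str(k), "-o", out, fasta]
        )
    return commands


# Hypothetical input files; print the commands for inspection.
commands = jellyfish_count_commands(["sample_a.fa", "sample_b.fa"])
for cmd in commands:
    print(shlex.join(cmd))
```

Note that this only keeps the requested size identical across files. It does not detect the automatic resizing asked about above, so the shared size still has to be large enough for the biggest input.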