uni-halle / gerbil

A fast and memory-efficient k-mer counter with GPU-support
MIT License
34 stars 14 forks source link

Internal functioning of Gerbil #26

Closed waltercostamb closed 11 months ago

waltercostamb commented 11 months ago

Hello developers!

I have a question about the functioning of Gerbil. Does it skip some subsequences of a FASTA genome when counting kmers? I have a genome which contains subsequence GGTACGCCC. If I grep it in the genome, it shows 17 hits:

$grep GGTACGCCC GCA_001580015.1.fasta | wc -l
17

However, if I run Gerbil as below I do not find this kmer or its reverse complement CCAUGCGGG. Could you please help me to understand this result?

$gerbil -e 30GB -t 10 -k 9 -l 1 -o fasta GCA_001580015.1.fasta . test_9.out
$grep CCAUGCGGG test_9.out | wc -l
0
$grep GGTACGCCC test_9.out | wc -l
0

Thank you in advance! Maria

merbert commented 11 months ago

Hello Maria,

Gerbil will not skip any k-mer if the option -l is set to 1. However, the reverse complement of the DNA sequence GGTACGCCC is GGGCGTACC (reverse the order of the original sequence and exchange T<->A and C<->G). When normalizing, Gerbil prefers GGGCGTACC over GGTACGCCC (corresponds to the lexicographical order). If required, you can disable the normalization with the option -d (e.g. for single-stranded DNA sequence).

Best regards, Marius

waltercostamb commented 11 months ago

Hello Marius,

thank you for the prompt answer. I noticed the reverse complement was incorrect. I checked again, and GGGCGTACC shows up 17 times (for GGTACGCCC) plus 12 for GGGCGTAC.

Best regards, Maria