Exact kmer match - Githubissues

chklopp commented 2 years ago

I'm interested in finding the location of kmer which are present twice in a genome fasta file.

I collect these kmers using kmc and transform the kmc dump to fasta before running klocate.

When I count the number of alignments in the bed file per kmer I find some of them with more than the two expect alignments. And when I check their presence (forward and reverse complement) in the genome I find them only twice.

example kmer 16 : found twice in the genome (using grep) both reversed AAAAACACAAGATAAATATTATA 0 TATAATATTTATCTTGTGTTTTT 2

but aligned three times by klocate ptg000002l 1589205 1589226 16 0 - ptg000017l 3750727 3750748 16 0 - ptg000021l 6372951 6372972 16 0 -

Are the alignments stored in the bed file complete (all the kmer sequence) and exact (no difference)?

ldenti commented 2 years ago

Well, the tool is supposed to find all exact occurrences of input kmers: so if the kmer occurs only twice, klocate should report 2 occurrences (hence there may be a bug). Can you please share the reference you are using so I can troubleshot the issue?

Moreover, have you run kmc using the -b flag or not? Not sure if this may affect the results (probably not).

Best, Luca

chklopp commented 2 years ago

I can not share this genome but I've done the same analysis with a public genome https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_genomic.fna.gz

Here is the link to the corresponding kmc dump file http://genoweb.toulouse.inra.fr/~klopp/tmp/b2dec665f3e04333b064.dump

the corresponding fasta file http://genoweb.toulouse.inra.fr/~klopp/tmp/b2dec665f3e04333b064.dump.fasta

and the klocate resulting bed file http://genoweb.toulouse.inra.fr/~klopp/tmp/b2dec665f3e04333b064.dump.fasta.bed

Here is a list of cases were klocate produces more than 2 lines cut -f 4 b2dec665f3e04333b064.dump.fasta.bed | sort -n | uniq -c | awk '$1 > 3' | awk '{print $2"\t"$1}' | head 53 76 282 4 397 14 416 54 478 6 480 18 481 18 483 9 591 5 632 7

For example kmer 282 has 4 lines in the bed awk '$4 == 282' b2dec665f3e04333b064.dump.fasta.bed NC_001136.10 1163633 1163654 282 0 + NC_001140.6 291589 291610 282 0 + NC_001136.10 1154892 1154913 282 0 + NC_001136.10 1160284 1160305 282 0 +

NB. I did not use the -b kmc flag.

ldenti commented 2 years ago

The issue was that I wrote in the README (and in the -h) that the default value for k is 23 but actually it was 21 in the code... If you run klocate setting the right k (i.e., -k 23), it should work as expected. Indeed, if I run it on the data you shared, I get only 2 hits per kmer if ran with -k 23 while I get more occurrences with -k 21.

Last commit (https://github.com/ldenti/klocate/commit/60b93c8459042b0b22556cc7f6e5169f5284449c) fixed it: now default value for k is correctly set at 23.

Let me know if it works now.

Best, Luca

ldenti / klocate

Exact kmer match #2