Open chklopp opened 2 years ago
Well, the tool is supposed to find all exact occurrences of input kmers: so if the kmer occurs only twice, klocate should report 2 occurrences (hence there may be a bug). Can you please share the reference you are using so I can troubleshot the issue?
Moreover, have you run kmc
using the -b
flag or not? Not sure if this may affect the results (probably not).
Best, Luca
I can not share this genome but I've done the same analysis with a public genome https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_genomic.fna.gz
Here is the link to the corresponding kmc dump file http://genoweb.toulouse.inra.fr/~klopp/tmp/b2dec665f3e04333b064.dump
the corresponding fasta file http://genoweb.toulouse.inra.fr/~klopp/tmp/b2dec665f3e04333b064.dump.fasta
and the klocate resulting bed file http://genoweb.toulouse.inra.fr/~klopp/tmp/b2dec665f3e04333b064.dump.fasta.bed
Here is a list of cases were klocate produces more than 2 lines cut -f 4 b2dec665f3e04333b064.dump.fasta.bed | sort -n | uniq -c | awk '$1 > 3' | awk '{print $2"\t"$1}' | head 53 76 282 4 397 14 416 54 478 6 480 18 481 18 483 9 591 5 632 7
For example kmer 282 has 4 lines in the bed awk '$4 == 282' b2dec665f3e04333b064.dump.fasta.bed NC_001136.10 1163633 1163654 282 0 + NC_001140.6 291589 291610 282 0 + NC_001136.10 1154892 1154913 282 0 + NC_001136.10 1160284 1160305 282 0 +
NB. I did not use the -b kmc flag.
The issue was that I wrote in the README (and in the -h
) that the default value for k
is 23 but actually it was 21 in the code... If you run klocate
setting the right k
(i.e., -k 23
), it should work as expected. Indeed, if I run it on the data you shared, I get only 2 hits per kmer if ran with -k 23
while I get more occurrences with -k 21
.
Last commit (https://github.com/ldenti/klocate/commit/60b93c8459042b0b22556cc7f6e5169f5284449c) fixed it: now default value for k
is correctly set at 23.
Let me know if it works now.
Best, Luca
I'm interested in finding the location of kmer which are present twice in a genome fasta file.
I collect these kmers using kmc and transform the kmc dump to fasta before running klocate.
When I count the number of alignments in the bed file per kmer I find some of them with more than the two expect alignments. And when I check their presence (forward and reverse complement) in the genome I find them only twice.
example kmer 16 : found twice in the genome (using grep) both reversed AAAAACACAAGATAAATATTATA 0 TATAATATTTATCTTGTGTTTTT 2
but aligned three times by klocate ptg000002l 1589205 1589226 16 0 - ptg000017l 3750727 3750748 16 0 - ptg000021l 6372951 6372972 16 0 -
Are the alignments stored in the bed file complete (all the kmer sequence) and exact (no difference)?