refresh-bio / KMC

Fast and frugal disk based k-mer counter
256 stars 73 forks source link

A needle in haystack #116

Closed Sheikhizadeh closed 5 years ago

Sheikhizadeh commented 5 years ago

In a KMC3 database of ~5B kmers, I found that the first following kmer is replaced by the second one!!! prefix: GCGGATGGTTTGG suffix: TTCCCCGA prefix: GCGGATGGTTTGG suffix: TTGCCCGA

But, KMC2 did the right job.

Best, Siavash

marekkokot commented 5 years ago

Thanks for reporting that. Could you please give me a link to input dataset that you used to produce database and exact command that you used? I would really like to be able to reproduce this bug.

Sheikhizadeh commented 5 years ago

Hi Marek,

Here is the data: https://www.hmpdacc.org/HMREFG/ I separated the genomes in 6267 FASTA files which you may not have to do, and the command was: "kmc -cs127 -r -k21 -t16 -m " + (Runtime.getRuntime().maxMemory() / 1073741824L) + " -ci1 -fm @genomes.txt index/kmers index. The wrong kmer is the first kmer of this genome BACT_461_305.fasta I'm still not sure what went wrong.

Best,

Siavash

marekkokot commented 5 years ago

Hi,

could you please show me how you split data into multiple files because the number of sequences in it is 188039.

What is the amount of RAM on your machine, because (Runtime.getRuntime().maxMemory() / 1073741824L) only tells me that you call kmc from JAVA :) BTW setting whole avaiable RAM of a machine for KMC and using -r not be good idea, because specified value does not include additional memory needed for -r parameter, although for such small input KMC will probably not use the whole RAM specified with -m (I have also noticed a space between -m and specified value. Is it really in this command? KMC, in general, should not run in such a case.

For now, from what I have observed, it is possible that the bug is related to using many input files, so it would be really appreciated if you could show me how you split your file. If you could also try to run KMC on the original file and check the result. BTW, you may check if two KMC databases are equal (exactly the same k-mers and counters) using kmc_tools: ./kmc_tools compare <db1> <db2> (this is not documented in the kmc_tools docs, as it is for development purposes only).

Best, Marek

Sheikhizadeh commented 5 years ago

Thank you for your helpful comments on the command. I made and compared two databases with -fm @genomes.txt and it said "DBs differ" , But with -fm all_seqs.fa "DBs equal" . So, I will email you the script I use to break down the original file, to find the bug.

marekkokot commented 5 years ago

Hi,

on my experiments, kmc3 produces wrong results even for all_seqs.fa. In general, it seems this issue is caused by the fact that one of the sequences is empty (>BACT_3013|gi|308222630|gb|AEHJ00000000.1|AEHJ01000000 Bifidobacterium dentium JCVIHMP022 whole genome shotgun sequencing project) i.e. there is no symbol. Here is this file part:

GATATGTCCCCAAATGATAAACTTCCTACAGAAAAAGAAATCTGCGAAGAATACTCAGTTAGTCGGACTA
CTGTTCGTTTAACCATGAATGAGTTGGAAAATAGAGGGTATATTTATCGTATTCAGGGAAAAGGATCTTT
TGTTTCTTCCATTAAAAAAAATACAATTAATTCTTTTTTTGACCTTGATTTCAGAGCGCATTATGAAGGT
ATGAATTCAGAAGAATTAACTAGTGAACTAATTTTTTTCAAGAAAGAGGCTGCACAATTAAGCCTAAGAC
AGAAAATGGGTTTACAACAAGAACACAATATTATTAAAATTCAAGTATTGCGTAAGTTGTTAGATATACC
TGTAGCATTAGAGACCATTATTTTGAAAAACCAGTATTTTAATTTCATTAATGAAGCTAAGTTAAAAGAA
GAAGGGGTAGATAAGTTAATTAATACGATTGATATCCCTTTAAAAATTGTTGAAGAAAAATACAAAGCTC
GTAGATTAAATCCTGATGAAATTGATCTTTTGCAAAGCAAAGATGAGGCAGCTTTAGTTGTAACTAAGTC
TCTTTACAATACAAATAACGAATTAATTATTATTTCTGAAAGAAAAATTTTAACGAGTAAACTTTCTTAC
CAAAATTTTATTCAAAAAGAAAACTAGCTTTTTTATAACGCGAATTGTAAAATTAAGTTAGA
>BACT_3013|gi|308222630|gb|AEHJ00000000.1|AEHJ01000000 Bifidobacterium dentium JCVIHMP022 whole genome shotgun sequencing project
>BACT_2222|gi|15805042|ref|NC_001263.1| Deinococcus radiodurans R1 chromosome 1, complete sequence
TCACGCGAACTCTGGCCTCGGTTCAAGCGGCGGTGAACTTTCCGGGTGATGGGCCAGGGCAGACATGCCC
CGTAACGGCCTCAGAAGGCCCCTTAAACGCGCAGCGTGACCATGACCGCCATATACCCGCCTCCCCCACC
TACGGGCGCGGAAAATGGCGGGGCTGGTGGACCCGGAGAACAGCAGCTCGGCGTCTCCGTCAATCGGGCC
CAGCGCATCGAGCACATGGCGAGCGTTGAAGGCGAGGCTCATCGCCTGCTCGGTGCCGCCCTGGGTGACG
CTGAGCGTGTCCTGAGCGCGGCCATAGTCGCCCTCCGCAGCGAGGCGCAGAGTGCCTTCGGACACCAGAA
ACTCGACGCGGTTGTTGGCGTTTTTGTCGGCCAGCACGGCCACACGGTTGACCGCTTCCTTGAGGGCGGT
GGCGGGCAGTGTCACCTGAAGTTTGATGTCCTTGGGAATGACCCGCTCGTAGTCGGGAAAATCACCGTCG
AGCAGCTTGAGGTTCATCTTCACGCGGTCGGTGGTCACGGTGAGCATGCCGTCGCCGTAGGTGAACCGCG
CCTCGCCGTCCTTGAGCACGCGAATCAGTTCGTCCACGCTGCGGGCGGGAATAATCAGGTTTTTGCCGTC

Honestly, I am not sure if this is valid multi-line fasta file, nevertheless, I hope the last commit will fix this. I would appreciate if you could verify if it helps and let me know.

Thanks again for reporting that bug, and for using our software.

Best, Marek

marekkokot commented 5 years ago

I am assuming it helped, if not please reopen this issue and supply some details.