refresh-bio / KMC

Fast and frugal disk based k-mer counter

wrong kmer counts on fastq file with empty reads #88

Closed nwespe closed 5 years ago

nwespe commented 6 years ago

Hello,

I'm using KMC to count kmers in fastq files of Illumina paired-end reads. I am using Cutadapt to first trim adapters and low-quality bases, and depending on the read-filtering options I use, the resulting fastq files can have empty reads or not. If I run KMC on a file containing empty reads, it doesn't appear to count the kmers in all of the reads. The command I run is: kmc -k21 -cs1000 -t32 trimmed_R1.fastq kmc_R1 ./kmc_temp/. Below are some output stats.

KMC run on a file containing 28114420 reads, including 6222 empty reads:

Stats:
No. of k-mers below min. threshold : 86711467
No. of k-mers above max. threshold : 0
No. of unique k-mers : 122153453
No. of unique counted k-mers : 35441986
Total no. of k-mers : 1505089500
Total no. of reads : 11822644
Total no. of super-k-mers : 222084259

KMC run on a file containing 28099409 reads (no empty reads):

Stats:
No. of k-mers below min. threshold : 199380233
No. of k-mers above max. threshold : 0
No. of unique k-mers : 243868805
No. of unique counted k-mers : 44488572
Total no. of k-mers : 3575041753
Total no. of reads : 28099409
Total no. of super-k-mers : 527520008

Occasionally, running KMC on a file with empty reads gives me a "Wrong input file" error, which is corrected if I remove the empty reads.
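For anyone hitting the same problem on a KMC build without empty-read support, the workaround mentioned above (removing the empty reads before counting) can be sketched roughly as follows. This is only a minimal illustration, not part of KMC or Cutadapt: the function names `read_fastq` and `drop_empty_reads` are hypothetical, and the sketch assumes plain, uncompressed FASTQ with exactly four lines per record (no line-wrapped sequences).

```python
from typing import Iterator, TextIO, Tuple

Record = Tuple[str, str, str, str]


def read_fastq(handle: TextIO) -> Iterator[Record]:
    """Yield (header, sequence, separator, quality) for each 4-line record.

    Assumption: every record occupies exactly four lines.
    """
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        seq = handle.readline()
        plus = handle.readline()
        qual = handle.readline()
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        yield (
            header.rstrip("\r\n"),
            seq.rstrip("\r\n"),
            plus.rstrip("\r\n"),
            qual.rstrip("\r\n"),
        )


def drop_empty_reads(src: TextIO, dst: TextIO) -> Tuple[int, int]:
    """Copy records with a non-empty sequence from src to dst.

    Returns (kept, dropped) so the counts can be compared with the
    number of reads the k-mer counter reports.
    """
    kept = dropped = 0
    for header, seq, plus, qual in read_fastq(src):
        if seq:
            dst.write(f"{header}\n{seq}\n{plus}\n{qual}\n")
            kept += 1
        else:
            dropped += 1
    return kept, dropped


# Example use (file names are placeholders):
# with open("trimmed_R1.fastq") as src, open("trimmed_R1.nonempty.fastq", "w") as dst:
#     kept, dropped = drop_empty_reads(src, dst)
```

The returned `kept` count should match the "Total no. of reads" line in the KMC stats when the filtered file is counted.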

What is KMC's behavior when it encounters an empty read? Why does it only count ~40% of the reads, despite only a very small number being empty?
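For reference, the "~40%" figure can be checked directly from the stats quoted above. The snippet below is just that arithmetic; the variable names are made up for illustration, and the two runs used slightly different input files, so the k-mer ratio is only a rough comparison.

```python
# Numbers copied from the two KMC runs quoted above.
total_reads_in_file = 28_114_420      # reads in the file with empty reads
empty_reads = 6_222                   # empty reads in that file
reads_reported = 11_822_644           # "Total no. of reads" for that run

kmers_with_empty = 1_505_089_500      # "Total no. of k-mers", run with empty reads
kmers_without_empty = 3_575_041_753   # "Total no. of k-mers", run without empty reads

fraction_reads_counted = reads_reported / total_reads_in_file
fraction_reads_empty = empty_reads / total_reads_in_file
kmer_ratio = kmers_with_empty / kmers_without_empty

print(f"reads counted: {fraction_reads_counted:.1%}")
print(f"reads empty:   {fraction_reads_empty:.3%}")
print(f"k-mer ratio:   {kmer_ratio:.1%}")
```

Both the read fraction and the k-mer ratio come out at roughly 42%, while empty reads make up only a few hundredths of a percent of the input.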

Thank you, and please let me know if you need more information to investigate this.

marekkokot commented 6 years ago

Hi, thanks for using KMC. KMC does not support empty reads (are empty reads even allowed in the fastq format?). The behaviour you describe is a result of KMC's assumption that each read contains at least one character. Nevertheless, since it seems useful to support them, I added this functionality in 3284302251236d83153581f87b1cb752ed73c622. Could you please check whether it works for your input? Thanks again for using our software.

marekkokot commented 5 years ago

I am assuming this fix works properly; if not, please reopen this issue.