refresh-bio / KMC

Fast and frugal disk based k-mer counter
Not processing all reads when limiting number of threads #235

Closed (nikostr closed this issue 5 months ago)

nikostr commented 6 months ago

Providing the -t parameter leads to not all reads being processed. I'm running the following script:

# Run KMC once per individual; expansions are quoted so paths with spaces survive.
for f in */input_files.txt; do
    outdir_prefix="$PWD/test/$(basename "$(dirname "$f")")"
    mkdir -p "$outdir_prefix"
    log="test/$(basename "$(dirname "$f")")/log.log"
    kmc -v -t2 \
        "@${f}" output_kmc_all "${outdir_prefix}" \
        2> "${outdir_prefix}/kmc_all.2" \
        > "${log}"
done

There is one input_files.txt per individual, each pointing to a pair of reads available here: https://github.com/akcorut/kGWASflow/tree/main/.test/data/test_reads

When I specify -t2 the total number of reads reported in the log file is sometimes lower than when I do not specify a value for -t. Is this expected behavior?

marekkokot commented 6 months ago

Hi, thanks! I was able to reproduce this (which is good). I took a look at individual_100, and I think the reason is that the input FASTQ files are broken. KMC detects some issues with FASTQ files, but definitely not all of them. For example, the individual_100_R1.fastq file is broken: [image] There is no read for the header @individual_100.426381 426381 length=151 at line 899 (it is also weird that the same header occurs multiple times, but that shouldn't matter). Let me know if this helps.
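
A header with no read after it, as described above, can be caught before running KMC with a simple positional check of the four-line FASTQ layout. The following is only a minimal sketch and is not part of KMC: the function name check_fastq is hypothetical, and it assumes uncompressed FASTQ files in which sequence and quality strings are not wrapped across lines.

```shell
# check_fastq FILE
# Hypothetical helper (not part of KMC). Assumes uncompressed FASTQ with
# exactly four lines per record: @header, sequence, + separator, quality.
# Prints the first problem found and returns non-zero; returns 0 if none.
check_fastq() {
    awk '
        function fail(msg) {
            printf "%s:%d: %s\n", FILENAME, NR, msg
            failed = 1
            exit 1
        }
        # Position inside the current record: 1, 2, 3, or 0 for the 4th line.
        { pos = NR % 4 }
        pos == 1 && $0 !~ /^@/ { fail("expected a header line starting with @") }
        pos == 2 {
            # A sequence cannot be empty or start with @ or +; seeing one here
            # usually means the read for the previous header is missing.
            if ($0 == "" || $0 ~ /^[@+]/) fail("expected a sequence line")
            seqlen = length($0)
        }
        pos == 3 && $0 !~ /^\+/ { fail("expected a separator line starting with +") }
        pos == 0 && length($0) != seqlen { fail("quality length differs from sequence length") }
        END {
            if (!failed && NR % 4 != 0) {
                printf "%s: truncated record at end of file\n", FILENAME
                failed = 1
            }
            exit failed
        }
    ' "$1"
}

# Example use (paths are illustrative):
#   for fq in */*.fastq; do check_fastq "$fq" || echo "broken: $fq"; done
```

The check is positional rather than pattern-based because a quality line may legitimately start with @, so searching for @ at the start of a line is not enough to find headers.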

nikostr commented 5 months ago

Thank you for looking closer at this, and sorry for not verifying the fastq-files prior to submitting this issue! This helped a ton!

marekkokot commented 5 months ago

That's no problem; I'm just glad it's not a KMC bug; that's a relief. Thanks again for using KMC :)