splatlab / squeakr

Squeakr: An Exact and Approximate k -mer Counting System
BSD 3-Clause "New" or "Revised" License

lognumslots.sh sometimes underestimates the required number of slots #32

Open hmusta opened 6 years ago

hmusta commented 6 years ago

I've noticed that on a small number of read sets (e.g. SRR522088), lognumslots.sh underestimates the number of slots needed in the CQF for squeakr-exact.

Here's my current workflow for gzipped fastq files:

ntcard -k 20 -c 2 -t 10 -p $OUTPREFIX $INPUT
NUMSLOTS=$(lognumslots.sh "${OUTPREFIX}_k20.hist")
squeakr-count -g -k 20 -s $NUMSLOTS -t 10 -o $OUTDIR/ $INPUT

In the case of SRR522088, the script computed 26 as the required number of slots, resulting in a segfault. When I set it to 27, it runs smoothly.
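Until the estimate itself is fixed, one possible workaround is to retry with a larger value whenever the count step fails. The sketch below is a hypothetical bash helper, not part of Squeakr: `retry_with_more_slots` and `count_kmers` are names made up for illustration, and it assumes that any non-zero exit status (including the segfault) means the slot count was too small, which may not always be true.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Squeakr): call a user-supplied function
# with increasing slot values until it exits successfully.
# Usage: retry_with_more_slots START MAX FUNC
#   FUNC receives the current slot value as its only argument.
# Prints the first slot value that worked; returns 1 if none did.
retry_with_more_slots() {
    local s=$1 max=$2 func=$3
    while [ "$s" -le "$max" ]; do
        # Send the command's own output to stderr so that stdout
        # carries only the slot value that finally succeeded.
        if "$func" "$s" >&2; then
            echo "$s"
            return 0
        fi
        s=$((s + 1))
    done
    return 1
}

# Example wiring for the workflow above (left commented out so the
# helper can be sourced without ntcard or squeakr-count installed):
#
#   count_kmers() {
#       squeakr-count -g -k 20 -s "$1" -t 10 -o "$OUTDIR/" "$INPUT"
#   }
#   NUMSLOTS=$(lognumslots.sh "${OUTPREFIX}_k20.hist")
#   USED=$(retry_with_more_slots "$NUMSLOTS" $((NUMSLOTS + 4)) count_kmers)
```

A failed attempt may leave partial output in `$OUTDIR`, so it is probably worth cleaning that up between attempts. The upper bound of four extra steps is arbitrary.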

Since this script is only in the master branch, I was wondering if there's perhaps a version tuned for the exact branch that I may not be finding in the repo.

prashantpandey commented 5 years ago

Hi @hmusta, in the current version of Squeakr, we have auto-resizing when running with a single thread. So even if you underestimate the size, there won't be a segfault. Please try it and let me know if you still have any issues.

Thanks, Prashant

t-kranz commented 4 years ago

Hello,

I observed segfaults when using the value from lognumslots.sh as well, with the Squeakr version from Oct 2019 (should be commit 5ad2ad6674c06a0fe7495d38bc467c2f854be72f).

This seems to happen frequently for me on very small test datasets.

Reproducing this should be quite simple:

Create a file (called 1.fastq) containing:

@1_1/1 TATGCACCAGAGTATGGAAGCATAAGCTCTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACCAGTCAACAAAGCCGAGTGGGCGCAACGA + IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

Then run ntcard followed by lognumslots.sh:

ntcard -k32 1.fastq -p ntcard.out
lognumslots.sh ntcard.out_k32.hist

lognumslots returns 7, but the smallest value for which squeakr count doesn’t crash is 10.

squeakr count -n -e -k 32 -s 7 -o 1.squeakr 1.fastq

results in a segfault, while

squeakr count -n -e -k 32 -s 10 -o 1.squeakr 1.fastq

works fine.
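Taken together with the SRR522088 report above (26 estimated, 27 needed), the estimate was one step too low on a large read set and three steps too low on this tiny one. A stopgap would be to pad the estimate and enforce a minimum. The sketch below is a hypothetical bash helper: `pad_log_slots` is a made-up name, and the default margin of 1 and floor of 10 come only from the two data points in this thread, not from any Squeakr documentation.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Squeakr): pad a slot estimate.
# Usage: pad_log_slots ESTIMATE [MARGIN] [FLOOR]
#   MARGIN defaults to 1  (26 -> 27 in the SRR522088 report).
#   FLOOR  defaults to 10 (smallest working value for the one-read test file).
# Both defaults are assumptions drawn from this thread only.
pad_log_slots() {
    local estimate=$1 margin=${2:-1} floor=${3:-10}
    local padded=$((estimate + margin))
    # Never go below the floor, even when the padded estimate is smaller.
    if [ "$padded" -lt "$floor" ]; then
        padded=$floor
    fi
    echo "$padded"
}

pad_log_slots 26   # large read set: 26 + 1
pad_log_slots 7    # tiny test file: raised to the floor
```

Assuming `-s` is the base-2 logarithm of the slot count, as the script name suggests, each extra step doubles the size of the table, so the margin should stay small.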