lh3 / seqtk

Toolkit for processing sequences in FASTA/Q formats
MIT License

Seqtk sample ignores random seed #158

Closed MatthewRalston closed 4 years ago

MatthewRalston commented 4 years ago

Hello, I'm having trouble reproducing your example, and I am certain this is not user error. I tried to subsample 10k reads as follows:

>wc -l single_out.fq
2026312
>seqtk sample [-s $RANDOM] single_out.fq 10000 | wc -l
252
>seqtk sample [-s $RANDOM] single_out.fq 10000 | sha256sum
e279e6251a911ee24 ...
>seqtk sample [-s $RANDOM] single_out.fq 10000 | wc -l
252
>seqtk sample [-s $RANDOM] single_out.fq 10000 | sha256sum
e279e6251a911ee24 ...

#Also

>seqtk sample [-s $RANDOM] single_out.fq 1000 | wc -l
760
>seqtk sample [-s $RANDOM] single_out.fq 100000 | wc -l
760 # huh?

In contrast, setting my selection to 10000 in the one-liners from the following document works fine. Not only is the correct number of reads produced (10000), but the output also differs from run to run, judging by the checksums.

http://userweb.eng.gla.ac.uk/umer.ijaz/bioinformatics/subsampling_reads.pdf

I've tried running make clean and rebuilding, with no difference. Checking out the latest release commit did not change the subsampling behavior either.
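
One thing that may help when comparing the numbers above: `wc -l` counts lines, not reads. Here is a minimal sketch of a read counter, assuming the common FASTQ layout of exactly four lines per record. The helper name `count_fastq_reads` is made up for illustration and is not part of seqtk.

```shell
# Hypothetical helper (not part of seqtk): count FASTQ records.
# Assumes every record is exactly 4 lines; files with multi-line
# sequences would be miscounted.
count_fastq_reads() {
    # With no argument awk reads stdin; otherwise it reads the given file(s).
    awk 'END { print int(NR / 4) }' "$@"
}

# Self-contained demo on two tiny records.
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' | count_fastq_reads
```

Under that four-line assumption, the 2026312 lines reported for single_out.fq would correspond to 506578 reads, and outputs of 252 and 760 lines would be 63 and 190 reads respectively.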