lh3 / seqtk

Toolkit for processing sequences in FASTA/Q formats
MIT License

Sampling single reads failed #111

Closed suzuki-shm closed 6 years ago

suzuki-shm commented 6 years ago

Hi,

When I ran the seqtk sample command with 1 as the number of sequences, I did not get a file containing a single read; instead I got all of the reads (i.e. the whole input sequence file). This was reproducible with multiple test sequence files and seeds. The command I used is

seqtk sample test_reads.fq 1

I ran it both with a CWL-wrapped seqtk and with the raw software inside a Docker container.

shenwei356 commented 6 years ago

Try

 seqtk sample test_reads.fq 1.1

It works

tseemann commented 6 years ago

@TaskeHAMANO The number can be a fraction OR an exact number. You have discovered an ambiguity with the string 1: it could mean 1 read, or ALL the reads (fraction 1.0). @shenwei356 has a nice workaround: because a fraction must be between 0 and 1, if you give it 1.1 it must be treated as a number, which is rounded down to 1.
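
To make the ambiguity concrete, here is a minimal Python sketch of the rule as described in this thread. It is not seqtk's actual source: the function name `interpret_sample_arg` is hypothetical, and truncating values above 1 is an assumption taken from the "rounding it down" description, so the real tool may round differently.

```python
def interpret_sample_arg(arg: str):
    """Classify a sampling argument the way this thread describes it.

    Hypothetical helper for illustration only, not seqtk code.
    """
    value = float(arg)
    if value > 1.0:
        # Anything above 1 cannot be a fraction, so it is read as a count.
        # Assumption: the fractional part is simply dropped.
        return ("count", int(value))
    # Values up to and including 1 are read as a fraction of all reads,
    # which is why a plain "1" selects everything.
    return ("fraction", value)


for arg in ["0.5", "1", "1.1", "10"]:
    print(arg, "->", interpret_sample_arg(arg))
```

Under these assumptions, "1" is classified as a fraction (all reads), while "1.1" is classified as a count of 1, which matches the workaround above.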

eboyden commented 2 years ago

I would actually prefer that 1.0 be treated as a float (returning all reads) rather than an int, whereas 1 would return a single read. I'd recommend enforcing this behavior, so that integers are treated as numbers and floats are treated as fractions. This could also be used to allow oversampling, e.g. 10 would return 10 reads but 10.0 would oversample 10X.
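
A rough Python sketch of this proposal, for illustration only: the function name `parse_sample_arg` is hypothetical and nothing here reflects how seqtk is implemented. The idea is to decide by how the argument is written rather than by its value, so that "1" and "1.0" no longer collide.

```python
import math


def parse_sample_arg(arg: str):
    """Proposed rule: plain integers are counts, anything else is a fraction.

    Hypothetical helper sketching the suggestion above, not seqtk code.
    """
    text = arg.strip()
    if text.isascii() and text.isdigit():
        # Written without a decimal point: an exact number of reads.
        return ("count", int(text))
    # Written as a float: a fraction of the input, which may exceed 1.0
    # to express oversampling (e.g. "10.0" for 10X).
    value = float(text)  # raises ValueError for non-numeric input
    if not math.isfinite(value) or value < 0:
        raise ValueError(f"invalid sampling fraction: {arg!r}")
    return ("fraction", value)


for arg in ["1", "1.0", "10", "10.0"]:
    print(arg, "->", parse_sample_arg(arg))
```

One trade-off of this design is that it would change the meaning of existing commands that pass an integer-looking fraction such as 1, so it would not be backward compatible with the behavior reported in this issue.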