lh3 / seqtk

Toolkit for processing sequences in FASTA/Q formats
MIT License

Add docs for seqtk sample requesting more reads than total #86

Open tshtatland opened 7 years ago

tshtatland commented 7 years ago

When seqtk sample is asked for more reads (integer argument) than there are in the input, the whole input is returned and no warnings are written to STDERR. Exit status is 0 (success).

This behavior may not be obvious to all users and I could not find it quickly in user docs. Please add this to the docs if this is in fact the intended behavior. Thank you!

tseemann commented 6 years ago

Why are you requesting more reads than are in the FASTQ? It sounds like you should be checking first, maybe with seqtk stats ?
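A read-count check of the kind suggested here could be sketched in Python roughly as follows. This is only a minimal sketch: it assumes well-formed FASTQ with exactly four lines per record (no wrapped sequence lines) and treats any path ending in `.gz` as gzip-compressed. The function name `count_fastq_reads` is hypothetical and not part of seqtk.

```python
import gzip


def count_fastq_reads(path):
    """Count records in a FASTQ file, assuming exactly 4 lines per record."""
    # Assumption: names ending in .gz are gzip-compressed; everything else is plain text.
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    # A line count that is not a multiple of 4 means the 4-line assumption does not hold.
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4
```

Comparing the returned count with the requested sample size before sampling would reveal the case described in this issue.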

tshtatland commented 6 years ago

Thank you, I will use seqtk stats when I use the tools from the command line. Unfortunately, most of our users cannot do this easily, because we are using the seqtk Galaxy tool wrapper as the first step in a multi-step pipeline (Galaxy workflow). A common use case is to downsample 10-50 FASTQ files (each typically 2-4 million reads) to the same number of reads (1 million). Occasionally, there is a FASTQ file with substantially fewer reads (0.5 million). In that case, the user would want to examine warnings after downsampling and maybe restart all processing with downsampling to 0.5 million. Or maybe not, depending on the design of the experiment. So warnings would be very nice, perhaps only when a verbose mode is turned on. But documenting this behavior would be a great first step. Thank you!
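The batch scenario above can be illustrated with a small Python sketch. It assumes the per-file read counts are already known (for example, from a separate counting step) and only does the bookkeeping: it lists the inputs that fall short of the target and reports the largest sample size that every input could satisfy. The names `find_short_inputs` and `report_short_inputs` are hypothetical, and none of this reflects how seqtk or the Galaxy wrapper behaves.

```python
def find_short_inputs(read_counts, target):
    """Return {filename: count} for inputs with fewer reads than target."""
    return {name: n for name, n in read_counts.items() if n < target}


def report_short_inputs(read_counts, target):
    """Return (warning lines, largest sample size every input can satisfy)."""
    short = find_short_inputs(read_counts, target)
    warnings = [
        f"WARNING: {name} has {n} reads, fewer than the requested {target}"
        for name, n in sorted(short.items())
    ]
    # The common target never exceeds the requested target or the smallest input.
    common_target = min([target, *read_counts.values()])
    return warnings, common_target
```

With counts of 2 million and 0.5 million reads and a target of 1 million, the sketch yields one warning and a common target of 0.5 million, leaving the decision to rerun with the user.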

peterjc commented 5 years ago

I stumbled on this behaviour in similar circumstances.

Documenting this would be an improvement.

Adding a warning to stderr would go further.

However, I would actually prefer a strict mode where seqtk sample input.fq N fails if the input has fewer than N reads (i.e. a message to stderr and a non-zero return code). This could be enabled via a command line switch, but personally I would make it the default.
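Until such an option exists, the strict behaviour could be approximated outside seqtk with a thin wrapper. The Python sketch below is an assumption-laden illustration, not a seqtk feature: `TooFewReadsError`, `check_strict`, and `strict_sample` are hypothetical names, the read count is assumed to be computed by the caller beforehand, and the command line follows the `seqtk sample -s<seed> <input> <N>` form shown in the seqtk README.

```python
import subprocess


class TooFewReadsError(Exception):
    """Raised when the input holds fewer reads than were requested."""


def check_strict(n_available, n_requested):
    """Raise TooFewReadsError if fewer than n_requested reads are available."""
    if n_available < n_requested:
        raise TooFewReadsError(
            f"requested {n_requested} reads but input has only {n_available}"
        )


def strict_sample(in_path, out_path, n_requested, n_available, seed=11):
    """Run `seqtk sample` only after the strict check passes.

    The seed value is an arbitrary choice for this sketch; seqtk must be on PATH.
    """
    check_strict(n_available, n_requested)
    cmd = ["seqtk", "sample", f"-s{seed}", str(in_path), str(n_requested)]
    with open(out_path, "w") as out:
        # check=True turns a non-zero seqtk exit status into CalledProcessError.
        subprocess.run(cmd, stdout=out, check=True)
```

A calling script could catch `TooFewReadsError`, print the message to stderr, and exit with a non-zero status, which matches the behaviour proposed here.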

tshtatland commented 5 years ago

@peterjc's suggestion is the best of all listed above. I prefer it as the most consistent with expectations of our users, and as the most robust overall for our use.

tseemann commented 5 years ago

Wow - this conversation has happened over 3 years!

nh13 commented 3 years ago

My 2c is that if I request more reads than exist, it should output the input. I have pipelines where I downsample after adapter trimming, so I want to ensure that I have at most 5M reads (for example). Perhaps an option to control this behavior is warranted, but the default should remain as is.

peterjc commented 1 year ago

Given the time that has passed, I agree that making my proposed strict mode the default would be problematic. It would have to be an option to avoid breaking historic usage.