ncbi / sra-tools

SRA Tools

fasterq-dump not respecting --threads? #494

Open HenrikBengtsson opened 3 years ago

HenrikBengtsson commented 3 years ago

Hi, I'm trying to force fasterq-dump to run in single-threaded mode. However, when trying:

$ fasterq-dump --threads 1 SRR000001

I still see multiple threads running:

$ pstree -p 31123 -a
fasterq-dump,31123 --threads 1 SRR000001
  └─fasterq-dump-or,31124 -e 1 SRR000001
      ├─{fasterq-dump-or},31129
      ├─{fasterq-dump-or},31130
      ├─{fasterq-dump-or},31157
      ├─{fasterq-dump-or},31158
      └─{fasterq-dump-or},31159

When benchmarking --threads 1 and --threads 6, both clock in at ~50 seconds:

$ time fasterq-dump --threads 1 SRR000001
$ time fasterq-dump --threads 6 SRR000001

It's as if the --threads option is ignored. What am I missing? (Disclaimer: I'm a rookie at fasterq-dump.)

This is with:

$ fasterq-dump --version
"fasterq-dump" version 2.11.0

wraetz commented 3 years ago

fasterq-dump partitions its work across multiple threads, with each thread handling a slice of the tables. Running fasterq-dump with just one thread makes no sense, since it gives you no speedup. That is why it ignores '--threads 1'. If you absolutely have to run on just one thread, use fastq-dump instead (but it will be slower). If you run fastq-dump and inspect it with 'pstree', you will still see at least 2 threads, because a background-cache process is running that downloads data ahead of time. If you do not want that either, 'prefetch' the accession first and then run fastq-dump on it.
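
The prefetch-then-convert route described above could be scripted roughly as follows. This is only a sketch under assumptions: the helper names `run` and `dump_single` and the `DRY_RUN` switch are made up for illustration and are not part of sra-tools, and it assumes `prefetch` and `fastq-dump` are on the PATH and that `fastq-dump` picks up the prefetched copy of the accession.

```shell
#!/bin/sh
# Hypothetical helpers (not part of sra-tools).

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Download the accession first, then convert it, so the conversion
# step does not have to fetch data while it is running.
dump_single() {
    acc="$1"
    [ -n "$acc" ] || { echo "usage: dump_single ACCESSION" >&2; return 2; }
    run prefetch "$acc" && run fastq-dump "$acc"
}
```

Calling `dump_single SRR000001` would run the two steps in order; with `DRY_RUN=1` set, it only prints the commands it would have run.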

HenrikBengtsson commented 3 years ago

Thanks for the prompt reply.

> Running fasterq-dump with just one thread makes no sense - it does not give you any speedup. That is why it ignores '--threads 1' If you absolutely have to run on just one thread use fastq-dump instead ( but it will be slower ).

See https://github.com/ncbi/sra-tools/issues/463#issuecomment-824321890 and https://github.com/ncbi/sra-tools/issues/161#issuecomment-808294889 for the background on why I'm trying to "force" single-threaded processing and why this is not my choice: I'm trying to find a way to keep fasterq-dump from crashing machines when users run it themselves. (FWIW, right now we're going with a --temp "$(mktemp -d)" workaround and hoping that'll do.)
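
For reference, the `--temp "$(mktemp -d)"` workaround mentioned above might be wrapped along these lines. It is a minimal sketch, not a tested recipe: the function name `dump_with_private_temp` and the `DRY_RUN` switch are hypothetical, and the cleanup step is an addition that the comment above does not describe.

```shell
#!/bin/sh
# Hypothetical wrapper (not part of sra-tools): give each run its own
# scratch directory and remove it once fasterq-dump has finished.
# With DRY_RUN=1 the command is printed instead of executed.
dump_with_private_temp() {
    acc="$1"
    [ -n "$acc" ] || { echo "usage: dump_with_private_temp ACCESSION" >&2; return 2; }

    tmpdir=$(mktemp -d) || return 1
    rc=0
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "fasterq-dump --temp $tmpdir $acc"
    else
        fasterq-dump --temp "$tmpdir" "$acc" || rc=$?
    fi

    # Clean up the scratch directory whether or not the dump succeeded.
    rm -rf "$tmpdir"
    return "$rc"
}
```

Invoked as `dump_with_private_temp SRR000001`, this keeps each run's temporary files in a directory of its own rather than in the default location.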

Regarding:

> That is why it ignores '--threads 1'

I understand that the code is trying to be helpful here, but I think this is an unfortunate "feature". I'd like to argue for the tool to respect the resources that are requested or, at a minimum, to give an informative warning about it, or possibly an error.

klymenko commented 3 years ago

Use fastq-dump if you want to limit the number of threads to 1!

HenrikBengtsson commented 3 years ago

@klymenko, please do read my comment and then the discussions I link to in the two issues. As you will see there, fasterq-dump is fully capable of completely crashing machines (a reboot is required).