rvalieris / parallel-fastq-dump

parallel fastq-dump wrapper
MIT License

downloads are slower if maxSpotId is set higher than n_spots #21

Closed hgbrian closed 5 years ago

hgbrian commented 5 years ago

This is an issue when using maxSpotId (-X) to make sure no more than N spots are downloaded (e.g., to skip some very large RNA-Seq experiments).

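For example, the run below has only 5,487,730 spots, but the blocks are computed from -X 10000000 rather than from the actual spot count. The third block (6666667 to 10000000) lies entirely past the last spot, so the third thread does nothing and the whole download is handled by two threads. This makes the download slower than not using -X 10000000.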

$ parallel-fastq-dump -X 10000000 -t 3 -s SRR868679
SRR ids: ['SRR868679']
extra args: []
tempdir: /tmp/pfd_k2htn18j
SRR868679 spots: 5487730
blocks: [[1, 3333333], [3333334, 6666666], [6666667, 10000000]]
Read 2154397 spots for SRR868679
Written 2154397 spots for SRR868679
Read 3333333 spots for SRR868679
Written 3333333 spots for SRR868679

I believe the fix is just:

end = min(n_spots, args.maxSpotId) if args.maxSpotId is not None else n_spots
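To illustrate the effect, here is a minimal sketch that assumes the blocks come from splitting the range 1..end evenly across threads. `clamp_end` and `split_blocks` are hypothetical helpers written for this example, not the tool's actual code; only the `min(...)` expression is taken from the line above.

```python
def clamp_end(n_spots, max_spot_id=None):
    # Proposed fix: never let the end of the range exceed the real spot count.
    return min(n_spots, max_spot_id) if max_spot_id is not None else n_spots


def split_blocks(start, end, n_pieces):
    # Hypothetical even split of [start, end] into n_pieces contiguous blocks;
    # the last block absorbs the remainder.
    total = end - start + 1
    avg = total // n_pieces
    blocks = []
    last = start
    for i in range(n_pieces):
        blocks.append([last, last + avg - 1])
        last += avg
    blocks[-1][1] += total % n_pieces
    return blocks


n_spots = 5487730        # spot count reported for SRR868679 in the log above
max_spot_id = 10000000   # value passed with -X

# Without the clamp, the split follows -X and the third block is past the data.
print(split_blocks(1, max_spot_id, 3))

# With the clamp, all three blocks fall inside the real spot range.
print(split_blocks(1, clamp_end(n_spots, max_spot_id), 3))
```

Under this assumption the unclamped split reproduces the blocks shown in the log, while the clamped split gives each of the three threads roughly 1.8 million spots.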

Thanks for the useful tool!

rvalieris commented 5 years ago

Good catch! This should be fixed in 0.6.5.