fasterq-dump hangs when reading SRA from BeeGFS, but only when running with 4 or more threads

ncbi / sra-tools

fasterq-dump hangs when reading SRA from BeeGFS, but only when running with 4 or more threads #383

Open jlbl opened 4 years ago

jlbl commented 4 years ago

Environment: CentOS-7 x86_64 based HPC cluster, BeeGFS 7.1.3 parallel filesystem

SRA toolkit version: multiple, up to and including 2.10.8

Issue: When reading an SRA file from the BeeGFS filesystem with '-e 4' or higher, fasterq-dump hangs indefinitely. Any attempt to 'cat /proc/$PID/cmdline' of the fasterq-dump process also hangs indefinitely, with side effects such as basic commands like 'ps' and 'w' hanging indefinitely as well. The only solution is to hard-reboot the host. Reading the SRA from the network works just fine, even if the temp directory is on BeeGFS.

I've put an strace of the hang at https://pastebin.com/b0NvxZ5z. If we can further help debug this or test fixes, please let us know. Thanks.

durbrow commented 4 years ago

Sounds like https://en.wikipedia.org/wiki/Sleep_(system_call)#Uninterruptible_sleep

jlbl commented 4 years ago

If I cat /proc/$PID/status of a hung fasterq process, it says it's in "State: S (sleeping)". It's cat /proc/$PID/cmdline of the fasterq-dump process that goes into "D" state (and thus so do ps, w, and friends).
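To narrow down which thread is actually stuck, it may help to look at per-thread state as well as the process-level status file. The sketch below reads the `State:` field from each `/proc/<pid>/task/<tid>/status` file. It assumes a Linux host and assumes that the per-thread status files stay readable on a hung process the way `/proc/$PID/status` did in the report above; it deliberately never touches `cmdline`, since reading that file is what hung. The function names are illustrative only.

```python
from pathlib import Path


def parse_state(status_text):
    """Return the one-letter state code from the text of a /proc status file."""
    for line in status_text.splitlines():
        if line.startswith("State:"):
            # e.g. "State:\tS (sleeping)" -> "S"
            return line.split()[1]
    return None


def thread_states(pid):
    """Map each thread ID of `pid` to its state letter (Linux only)."""
    states = {}
    for task in Path(f"/proc/{pid}/task").iterdir():
        try:
            states[task.name] = parse_state((task / "status").read_text())
        except OSError:
            # The thread exited between listing and reading; skip it.
            continue
    return states
```

Calling `thread_states(pid)` on the hung fasterq-dump process and looking for entries equal to `"D"` would show whether any worker thread is in uninterruptible sleep while the process as a whole reports `S`.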

jlbl commented 4 years ago

Also, after a bit more testing (with different SRA files), the number of threads that causes the issue is variable. Sometimes I can successfully use more than 4, and sometimes fewer than 4 causes the hang. I have yet to see the issue when using '-e 1'.

HenrikBengtsson commented 4 years ago

(Colleague of @jlbl here) Independently of this issue, but escalated because of it, I'd like to suggest making it possible to configure the default number of threads via an environment variable (or by other means). The default for fasterq-dump is currently six (6) threads and, AFAIK, this can only be overridden by the end-user via CLI option -e <nthreads>. If this default could respect, say, export FASTERQDUMP_THREADS=2, sysadmins could limit the default number of threads. In this particular case, we could then set export FASTERQDUMP_THREADS=1 to work around the reported issue until resolved.

BTW, where in the code is the default of six threads set?

PS. It's not unusual for software tools that default to using more than one core/thread to cause issues in multi-tenant environments, e.g. CPU overloading. Often the end-user is not aware of this. I'm aware we're talking threads here, so this is slightly less relevant.
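Until something like this exists in the tool itself, the environment-variable default suggested above could be approximated with a site-local wrapper. The following is a minimal sketch, not part of sra-tools: `FASTERQDUMP_THREADS` is the hypothetical variable name from this comment, `build_command` and `has_thread_option` are made-up helpers, and `SRR000001` in the usage note is a placeholder accession. It only adds `-e <n>` when the user has not already passed `-e` or `--threads`.

```python
import os
import subprocess
import sys

# Hypothetical variable name taken from the suggestion above;
# fasterq-dump itself does not read it.
ENV_VAR = "FASTERQDUMP_THREADS"


def has_thread_option(args):
    """Return True if the user already chose a thread count on the command line."""
    for arg in args:
        if arg in ("-e", "--threads") or arg.startswith("--threads="):
            return True
        # Also catch an attached short form such as "-e4", in case it is used.
        if arg.startswith("-e") and arg[2:].isdigit():
            return True
    return False


def build_command(args, env, tool="fasterq-dump"):
    """Prepend '-e <n>' from the environment unless the user set threads explicitly."""
    value = env.get(ENV_VAR, "")
    if value.isdigit() and int(value) > 0 and not has_thread_option(args):
        return [tool, "-e", value] + list(args)
    return [tool] + list(args)


if __name__ == "__main__":
    # If this wrapper is installed under the name "fasterq-dump", `tool`
    # must be the absolute path of the real binary to avoid calling itself.
    sys.exit(subprocess.call(build_command(sys.argv[1:], os.environ)))
```

With `export FASTERQDUMP_THREADS=1` set site-wide, running the wrapper as `wrapper.py SRR000001` would execute `fasterq-dump -e 1 SRR000001`, while an explicit `-e 4` from the user is passed through unchanged.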

kleigeb commented 1 year ago

Just ran into this issue with our BeeGFS filesystem, using sra-tools v3.0.1. I see this hasn't been updated in quite some time. Are people still just forging on with -e 1 and user education?

HenrikBengtsson commented 1 year ago

Just learned that this issue has been addressed in BeeGFS 7.3.2 (2022-10-14). From the release notes:

Fixed a deadlock between mmap and read when both were operating on the same memory area. This should fix issues with multithreaded runs of fasterq-dump that have been reported by some users.
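For sites that run a mix of BeeGFS versions, one option is to gate the thread count on the client version. The sketch below is only an illustration under assumptions: the version string has to be obtained by the caller (how is not covered here), the helper names are made up, and falling back to a single thread reflects only the observation in this thread that `-e 1` had not been seen to hang, not a guarantee.

```python
# BeeGFS release whose notes mention the mmap/read deadlock fix.
FIXED_IN = (7, 3, 2)


def parse_version(text):
    """Turn a dotted version string such as '7.1.3' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def safe_thread_count(beegfs_version, requested):
    """Fall back to one thread on BeeGFS versions older than the fixed release."""
    if parse_version(beegfs_version) < FIXED_IN:
        return 1
    return requested
```

For example, `safe_thread_count("7.1.3", 6)` falls back to a single thread, while `safe_thread_count("7.3.2", 6)` keeps the requested six.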