Open jlbl opened 4 years ago
If I 'cat /proc/$PID/status' of a hung fasterq process, it says it's in "State: S (sleeping)". It's the 'cat /proc/$PID/cmdline' of the fasterq-dump process that goes into "D" state (and thus so do 'ps', 'w', and friends).
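For anyone else trying to inspect a hung process without getting stuck themselves, here is a minimal sketch that reads only the status files under /proc and never touches cmdline. The function names are made up for illustration, and it assumes the standard Linux layout where each thread has its own /proc/$PID/task/$TID/status file. Whether the per-thread states reveal anything beyond the "S (sleeping)" reported above is untested.

```shell
# Print the State field (e.g. "S (sleeping)") from a /proc status file.
# Fields in these files are separated from their labels by a tab.
state_from_status() {
    awk -F'\t' '$1 == "State:" { print $2; exit }' "$1"
}

# Print "<tid><TAB><state>" for every thread of the given PID.
# Only status files are read; cmdline is deliberately avoided because
# reading it is what hangs in the report above.
thread_states() {
    pid=$1
    for f in /proc/"$pid"/task/*/status; do
        [ -r "$f" ] || continue
        tid=${f#/proc/"$pid"/task/}
        tid=${tid%/status}
        printf '%s\t%s\n' "$tid" "$(state_from_status "$f")"
    done
}
```

Usage would be something like 'thread_states $PID', with $PID taken from a source that does not read cmdline.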
Also, after a bit more testing (with different SRA files), the number of threads that causes the issue is variable. Sometimes I can successfully use more than 4, and sometimes fewer than 4 causes the hang. I have yet to see the issue when using '-e 1'.
(Colleague of @jlbl here) Independently of this issue, but escalated because of it, I'd like to suggest making it possible to configure the default number of threads via an environment variable (or by other means). The default for fasterq-dump is currently six (6) threads and, AFAIK, this can only be overridden by the end-user via the CLI option '-e <nthreads>'. If this default could respect, say, 'export FASTERQDUMP_THREADS=2', sysadmins could limit the default number of threads. In this particular case, we could then set 'export FASTERQDUMP_THREADS=1' to work around the reported issue until it is resolved.
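Until something like that exists in the tool itself, a site could approximate it with a small wrapper. The following is only a sketch under assumptions: FASTERQDUMP_THREADS is the variable proposed above and is not read by fasterq-dump itself, the function names are hypothetical, and the check for an explicit '-e' is deliberately crude (any argument starting with '-e' counts).

```shell
# Print the "-e" and "<n>" arguments to prepend, one per line, or nothing.
# FASTERQDUMP_THREADS is the proposed (hypothetical) variable, not a
# setting that fasterq-dump reads on its own.
default_thread_args() {
    for arg in "$@"; do
        case $arg in
            -e*) return 0 ;;   # the user already chose a thread count
        esac
    done
    case ${FASTERQDUMP_THREADS:-} in
        ''|*[!0-9]*|0) return 0 ;;   # unset, not a plain number, or zero
    esac
    printf '%s\n' "-e" "$FASTERQDUMP_THREADS"
}

# Hypothetical wrapper: runs the real fasterq-dump with the site default
# applied only when the user did not pass -e themselves.
fasterq_dump_wrapped() {
    # Word splitting of the command substitution is intended here.
    command fasterq-dump $(default_thread_args "$@") "$@"
}
```

With 'export FASTERQDUMP_THREADS=1' in a site-wide profile, calling fasterq_dump_wrapped without '-e' would run single-threaded, while an explicit '-e 4' from the user would still be passed through unchanged.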
BTW, where in the code is the default of six threads set?
PS. It's not unusual for software tools that default to running on more than one core/thread to cause issues in multi-tenant environments, e.g. CPU overloading. Often the end-user is not aware of this. I'm aware we're talking about threads here, so this is slightly less relevant.
Just ran into this issue with our BeeGFS filesystem, using sra-tools v3.0.1. I see this hasn't been updated in quite some time; are people still just forging on with '-e 1' and user education?
Just learned that this issue has been addressed in BeeGFS 7.3.2 (2022-10-14). From the release notes:
fasterq-dump
that have been reported by some users.
Environment: CentOS-7 x86_64 based HPC cluster, BeeGFS 7.1.3 parallel filesystem
SRA toolkit version: multiple, up to and including 2.10.8
Issue: When reading an SRA file from the BeeGFS filesystem with '-e 4' or higher, fasterq-dump hangs indefinitely. Any attempt to 'cat /proc/$PID/cmdline' of the fasterq-dump process also hangs indefinitely, with side effects such as basic commands like 'ps' and 'w' hanging indefinitely as well. The only solution is to hard boot the host. Reading the SRA from the network works just fine, even if the temp directory is on BeeGFS.
I've put an strace of the hang at https://pastebin.com/b0NvxZ5z. If we can further help debug this or test fixes, please let us know. Thanks.