albertz opened 4 years ago
I just realized that the hang itself is in a sess.run, which is not related at all to Horovod.
However: searching for DoConvolve, or cuEventSynchronize, or so, yields nothing, which is a bit strange.
Another instance of the same problem (probably), this time without Horovod, single GPU, with TF 2.3 and CUDA 10.1. The gdb+python log can be found here.
So this problem seems to be disconnected from multi-GPU training.
I just looked at the training dirs and found several more instances where the training was hanging:
Log: /work/asr3/zeyer/merboldt/setups-data/2020-01-08--rnnt-rna/data-train/rna-transducer.fs_pretrain-d5-tf1-14/qdir/q.log/train.o1425449.1:
RETURNN starting up, version 20200630.143255--git-8fc19e7-dirty, date/time 2020-07-25-11-44-37 (UTC+0200), pid 6349, cwd /work/asr3/zeyer/merboldt/setups-data/2020-01-08--rnnt-rna/data-train/rna-transducer.fs_pretrain-d5-tf1-14, Python /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/bin/python3
Hostname: cluster-cn-254
TensorFlow: 1.14.0 (v1.14.0-0-g87989f6959) (<site-package> in /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow)
[...]
2020-07-25 11:44:41.348288: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): GeForce GTX 1080 Ti, Compute Capability 6.1
(killed after three days near the end of epoch 41, 93.81%)
Log: /work/asr3/zeyer/merboldt/setups-data/2020-01-08--rnnt-rna/data-train/rna-transducer.fs_pretrain-b/qdir/q.log/train.o1357539.1
RETURNN starting up, version 20200630.143255--git-8fc19e7, date/time 2020-07-06-16-24-31 (UTC+0200), pid 29917, cwd /work/asr3/zeyer/merboldt/setups-data/2020-01-08--rnnt-rna/data-train/rna-transducer.fs_pretrain-b, Python /usr/bin/python3
RETURNN command line options: ['config-train/rna-transducer.fs_pretrain-b.config']
Hostname: cluster-cn-273
TensorFlow: 1.15.2 (v1.15.0-92-g5d80e1e) (<site-package> in /u/merboldt/.local/lib/python3.5/site-packages/tensorflow)
[...]
2020-07-06 16:24:32.224258: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): GeForce GTX 1080 Ti, Compute Capability 6.1
(killed after three days in epoch 40, at 11.49%) In this case, the maximum RSS usage went up (watched with some script) from 23.2 GB to 75.7 GB, then it was killed. So this is maybe interesting.
Log: /work/asr3/zeyer/merboldt/setups-data/2020-01-08--rnnt-rna/data-train/rna-transducer.fs_pretrain-e/qdir/q.log/train.o1379626.1
RETURNN starting up, version 20200630.143255--git-8fc19e7, date/time 2020-07-12-00-01-23 (UTC+0200), pid 30716, cwd /work/asr3/zeyer/merboldt/setups-data/2020-01-08--rnnt-rna/data-train/rna-transducer.fs_pretrain-e, Python /usr/bin/python3
RETURNN command line options: ['config-train/rna-transducer.fs_pretrain-e.config']
Hostname: cluster-cn-269
TensorFlow: 1.15.2 (v1.15.0-92-g5d80e1e) (<site-package> in /u/merboldt/.local/lib/python3.5/site-packages/tensorflow)
[...]
2020-07-12 00:01:24.010922: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): GeForce GTX 1080 Ti, Compute Capability 6.1
Log: /work/asr3/zeyer/merboldt/setups-data/2020-01-08--rnnt-rna/data-train/rna-transducer.fs_pretrain-gather.w_retrain.pt24.tf2-3/qdir/q.log/train.o1541649.1
RETURNN starting up, version 20200728.184724--git-126ebac-dirty, date/time 2020-08-10-17-24-59 (UTC+0200), pid 9723, cwd /work/asr3/zeyer/merboldt/setups-data/2020-01-08--rnnt-rna/data-train/rna-transducer.fs_pretrain-gather.w_retrain.pt24.tf2-3, Python /work/asr3/zeyer/merboldt/py-envs/py3.8-tf2.3/bin/python3
RETURNN command line options: ['config-train/rna-transducer.fs_pretrain-gather.w_retrain.pt24.tf2-3.config']
Hostname: cluster-cn-257
TensorFlow: 2.3.0 (v2.3.0-rc2-23-gb36436b087) (<site-package> in /work/asr3/zeyer/merboldt/py-envs/py3.8-tf2.3/lib/python3.8/site-packages/tensorflow)
Use num_threads=4 (but min 2) via SGE num_proc.
Setup TF inter and intra global thread pools, num_threads 4, session opts {'log_device_placement': False, 'device_count': {'GPU': 0}, 'intra_op_parallelism_threads': 4, 'inter_op_parallelism_threads': 4}.
(killed after three days and 24/24 GB usage, at epoch 46, step 4206, 89.39% of the epoch)
So this issue happened to me with TF 1.14, 1.15 and 2.3.
Also, the following CUDA/nvidia libs were loaded:
/usr/lib/x86_64-linux-gnu/coreutils/libstdbuf.so
/usr/lib/x86_64-linux-gnu/libcuda.so.418.67
/usr/lib/x86_64-linux-gnu/libmpdec.so.2.4.2
/usr/lib/x86_64-linux-gnu/libnvidia-fatbinaryloader.so.418.67
/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.418.67
/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21
/usr/local/cuda-10.0/lib64/libcublas.so.10.0.130
/usr/local/cuda-10.0/lib64/libcudart.so.10.0.130
/usr/local/cuda-10.0/lib64/libcufft.so.10.0.145
/usr/local/cuda-10.0/lib64/libcurand.so.10.0.130
/usr/local/cuda-10.0/lib64/libcusolver.so.10.0.130
/usr/local/cuda-10.0/lib64/libcusparse.so.10.0.130
/usr/local/cudnn-10.1-v7.6/lib64/libcudnn.so.7.6.3
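(For reference, such a list can be pulled from a running process via /proc; the following is only a small sketch, not part of RETURNN, and the helper name is made up.)
```python
# Small sketch: list CUDA/cuDNN/NVIDIA shared objects mapped into this process.
# Linux-only, since it reads /proc/self/maps (use /proc/<pid>/maps for another process).
import re

def loaded_gpu_libs(maps_path="/proc/self/maps"):
    libs = set()
    with open(maps_path) as f:
        for line in f:
            parts = line.split()
            path = parts[-1] if len(parts) >= 6 else ""
            if ".so" in path and re.search(r"cuda|cudnn|cublas|nvidia", path, re.IGNORECASE):
                libs.add(path)
    return sorted(libs)

if __name__ == "__main__":
    print("\n".join(loaded_gpu_libs()))
```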
To add, dmesg shows these messages, which might be related:
[Thu Aug 27 20:06:05 2020] pcieport 0000:80:03.0: AER: Corrected error received: id=8018
[Thu Aug 27 20:06:05 2020] pcieport 0000:80:03.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=8018(Transmitter ID)
[Thu Aug 27 20:06:05 2020] pcieport 0000:80:03.0: device [8086:6f08] error status/mask=00001000/00002000
[Thu Aug 27 20:06:05 2020] pcieport 0000:80:03.0: [12] Replay Timer Timeout
@Spotlight0xff reported that this might be related to OMP_NUM_THREADS. Specifically, he observed the hangs with OMP_NUM_THREADS=6 but not anymore with OMP_NUM_THREADS=1.
RETURNN also might use this value for intra_op_parallelism_threads/inter_op_parallelism_threads (there is SGE num_proc: FileNotFoundError(2, 'No such file or directory') in the log, so it failed to use the SGE num_proc value, so this is usually the fallback).
It is not clear currently whether this is really due to OMP, or due to TF intra_op_parallelism_threads/inter_op_parallelism_threads.
@Spotlight0xff do you have more insights on this?
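(To separate the two candidate causes, one could pin both explicitly and check whether the hang still occurs. A minimal sketch, not the RETURNN code; note that OMP_NUM_THREADS must be set before Numpy/TF are imported to have any effect.)
```python
# Sketch: force single-threaded OMP/OpenBLAS and single-threaded TF thread pools,
# to check which of the two settings is actually relevant for the hang.
import os
os.environ.setdefault("OMP_NUM_THREADS", "1")  # affects OpenMP/OpenBLAS

import tensorflow as tf

config = tf.compat.v1.ConfigProto(
    intra_op_parallelism_threads=1,
    inter_op_parallelism_threads=1,
)
sess = tf.compat.v1.Session(config=config)
```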
Unfortunately I still sometimes get the stuck error, but only on Switchboard, so maybe this has something to do with ExternSprintDataset?
But setting OMP_NUM_THREADS helped to lower the frequency of these hangs.
@Spotlight0xff Is that still used for intra_op_parallelism_threads/inter_op_parallelism_threads as well, or is it now independent (because it correctly uses your SGE num_proc setting)? In the first case, it means this is somehow related to multithreading in TF, and this might still be a CUDA-related problem. In the latter case, it means it is related to something with OMP (but what exactly uses OMP?).
ExternSprintDataset should have almost no influence at all on the main proc, as Sprint runs in an independent background proc, and the main proc just reads the data (received via OS pipe, pickle + raw Numpy for serialization). But of course, with multithreading involved, timings will be different, and this could trigger some race conditions.
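(Roughly, the pattern is like the following sketch; this is simplified and not the actual ExternSprintDataset code, just to illustrate why the main proc itself does very little.)
```python
# Simplified sketch of the pipe-based data transfer: the child produces features
# and writes pickled metadata plus the raw Numpy buffer; the parent only reads.
import os
import pickle
import numpy as np

r, w = os.pipe()
if os.fork() == 0:  # child: stands in for the background Sprint proc
    os.close(r)
    with os.fdopen(w, "wb") as out:
        feat = np.random.rand(100, 40).astype("float32")
        pickle.dump({"shape": feat.shape, "dtype": str(feat.dtype)}, out)
        out.write(feat.tobytes())  # raw Numpy payload after the pickled header
    os._exit(0)
else:  # parent: stands in for the RETURNN main proc
    os.close(w)
    with os.fdopen(r, "rb") as inp:
        meta = pickle.load(inp)
        n_bytes = int(np.prod(meta["shape"])) * np.dtype(meta["dtype"]).itemsize
        feat = np.frombuffer(inp.read(n_bytes), dtype=meta["dtype"]).reshape(meta["shape"])
    print(feat.shape)
```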
I have encountered a similar issue during search (on LibriSpeech). TF 2.3.0, CUDA 10.1, cuDNN 7.6.3. Loaded libs (only CUDA/cuDNN):
/usr/lib/x86_64-linux-gnu/libcuda.so.418.67
/usr/local/cuda-10.1/lib64/libcublasLt.so.10
/usr/local/cuda-10.1/lib64/libcublas.so.10
/usr/local/cuda-10.1/lib64/libcudart.so.10.1
/usr/local/cuda-10.1/lib64/libcufft.so.10
/usr/local/cuda-10.1/lib64/libcurand.so.10
/usr/local/cuda-10.1/lib64/libcusolver.so.10
/usr/local/cuda-10.1/lib64/libcusparse.so.10
/usr/local/cudnn-10.1-v7.6/lib64/libcudnn.so.7.6.3
/work/asr3/zeyer/merboldt/py-envs/py3.8-tf2.3/lib/python3.8/site-packages/numpy.libs/libopenblasp-r0-34a18dc3.3.7.so
RETURNN options:
TensorFlow: 2.3.0 (v2.3.0-rc2-23-gb36436b087) (<site-package> in /work/asr3/zeyer/merboldt/py-envs/py3.8-tf2.3/lib/python3.8/site-packages/tensorflow)
Use num_threads=2 (but min 2) via SGE num_proc.
Setup TF inter and intra global thread pools, num_threads 2, session opts {'log_device_placement': False, 'device_count': {'GPU': 0}, 'intra_op_parallelism_threads': 2, 'inter_op_parallelism_threads': 2}.
The GDB stacktrace is here. The corresponding Python stacktrace is here.
Relevant bits from the stacktrace:
Thread 23 (Thread 0x150b5fe21700 (LWP 20749)):
#0 0x0000150f51f168e7 in sched_yield () at ../sysdeps/unix/syscall-template.S:84
#1 0x0000150f4bdf3abb in inner_thread () from /work/asr3/zeyer/merboldt/py-envs/py3.8-tf2.3/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblasp-r0-34a18dc3.3.7.so
#2 0x0000150f4bf1cc59 in exec_blas () from /work/asr3/zeyer/merboldt/py-envs/py3.8-tf2.3/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblasp-r0-34a18dc3.3.7.so
#3 0x0000150f4bdf2f58 in gemm_driver () from /work/asr3/zeyer/merboldt/py-envs/py3.8-tf2.3/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblasp-r0-34a18dc3.3.7.so
#4 0x0000150f4bdf3bc6 in dgemm_thread_tn () from /work/asr3/zeyer/merboldt/py-envs/py3.8-tf2.3/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblasp-r0-34a18dc3.3.7.so
#5 0x0000150f4bd05609 in cblas_dgemm () from /work/asr3/zeyer/merboldt/py-envs/py3.8-tf2.3/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblasp-r0-34a18dc3.3.7.so
#6 0x0000150f4dc5cb90 in gemm (order=CblasRowMajor, R=0x150b5f945080, ldb=<optimized out>, B=<optimized out>, lda=<optimized out>, A=0x150b5fa05b70, k=<optimized out>, n=<optimized out>, m=<optimized out>,
transB=<optimized out>, transA=<optimized out>, typenum=12) at numpy/core/src/common/cblasfuncs.c:36
#7 cblas_matrixproduct (typenum=typenum@entry=12, ap1=ap1@entry=0x150b5fa05b70, ap2=<optimized out>, ap2@entry=0x150b5fa05da0, out=out@entry=0x0) at numpy/core/src/common/cblasfuncs.c:678
#8 0x0000150f4dac2737 in PyArray_MatrixProduct2 (op1=<optimized out>, op2=0x150b5fa05da0, out=0x0) at numpy/core/src/multiarray/multiarraymodule.c:942
#9 0x0000150f4dac308a in array_matrixproduct (__NPY_UNUSED_TAGGEDdummy=<optimized out>, args=<optimized out>, kwds=<optimized out>) at numpy/core/src/multiarray/multiarraymodule.c:2226
#10 0x0000150f5253d350 in cfunction_call_varargs (kwargs=<optimized out>, args=<optimized out>, func=0x150f4df93f40) at ../Objects/call.c:742
#11 PyObject_Call () at ../Objects/call.c:232
#12 0x0000150f4da135d5 in array_implement_array_function (__NPY_UNUSED_TAGGEDdummy=<optimized out>, positional_args=<optimized out>) at numpy/core/src/multiarray/arrayfunction_override.c:265
#13 0x0000150f5253cd37 in cfunction_call_varargs (kwargs=<optimized out>, args=<optimized out>, func=0x150f4df9b8b0) at ../Objects/call.c:757
#14 PyCFunction_Call () at ../Objects/call.c:772
#15 0x0000150f5253b634 in _PyObject_MakeTpCall () at ../Objects/call.c:159
...
...
Thread 10 (Thread 0x150f32ef9700 (LWP 19802)):
#0 0x0000150f521fe98d in pthread_join (threadid=23152700049152, thread_return=0x0) at pthread_join.c:90
#1 0x0000150f4bf1cfaa in blas_thread_shutdown_ () from /work/asr3/zeyer/merboldt/py-envs/py3.8-tf2.3/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblasp-r0-34a18dc3.3.7.so
#2 0x0000150f51ef8425 in __libc_fork () at ../sysdeps/nptl/fork.c:96
#3 0x0000150ee3aec187 in tensorflow::SubProcess::Start() () from /work/asr3/zeyer/merboldt/py-envs/py3.8-tf2.3/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2
#4 0x0000150ef167dc2f in stream_executor::CompileGpuAsm(int, int, char const*, stream_executor::GpuAsmOpts) ()
from /work/asr3/zeyer/merboldt/py-envs/py3.8-tf2.3/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#5 0x0000150ef167f5ce in stream_executor::CompileGpuAsm(int, char const*, stream_executor::GpuAsmOpts) ()
from /work/asr3/zeyer/merboldt/py-envs/py3.8-tf2.3/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#6 0x0000150ef167fed4 in stream_executor::CompileGpuAsmOrGetCached(int, char const*, stream_executor::GpuAsmOpts) ()
from /work/asr3/zeyer/merboldt/py-envs/py3.8-tf2.3/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#7 0x0000150ef167a755 in stream_executor::RedzoneAllocator::CheckRedzones() const ()
from /work/asr3/zeyer/merboldt/py-envs/py3.8-tf2.3/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#8 0x0000150ef1676b7a in tensorflow::CheckRedzones(stream_executor::RedzoneAllocator const&, tensorflow::AutotuneResult*) ()
from /work/asr3/zeyer/merboldt/py-envs/py3.8-tf2.3/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#9 0x0000150eee5f9db0 in tensorflow::LaunchConv2DOp<Eigen::GpuDevice, float>::operator()(tensorflow::OpKernelContext*, bool, bool, tensorflow::Tensor const&, tensorflow::Tensor const&, int, int, int, int, tensorflow::Padding const&, std::vector<long long, std::allocator<long long> > const&, tensorflow::Tensor*, tensorflow::TensorFormat) ()
from /work/asr3/zeyer/merboldt/py-envs/py3.8-tf2.3/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#10 0x0000150eee5fa42c in tensorflow::Conv2DOp<Eigen::GpuDevice, float>::Compute(tensorflow::OpKernelContext*) ()
from /work/asr3/zeyer/merboldt/py-envs/py3.8-tf2.3/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#11 0x0000150ee3987923 in tensorflow::BaseGPUDevice::Compute(tensorflow::OpKernel*, tensorflow::OpKernelContext*) ()
from /work/asr3/zeyer/merboldt/py-envs/py3.8-tf2.3/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2
#12 0x0000150ee3a572f9 in tensorflow::(anonymous namespace)::ExecutorState<tensorflow::PropagatorState>::Process(tensorflow::PropagatorState::TaggedNode, long long) ()
from /work/asr3/zeyer/merboldt/py-envs/py3.8-tf2.3/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2
#13 0x0000150ee3a584cf in std::_Function_handler<void (), tensorflow::(anonymous namespace)::ExecutorState<tensorflow::PropagatorState>::ScheduleReady(absl::lts_2020_02_25::InlinedVector<tensorflow::PropagatorState::TaggedNode, 8ul, std::allocator<tensorflow::PropagatorState::TaggedNode> >*, tensorflow::PropagatorState::TaggedNodeReadyQueue*)::{lambda()#2}>::_M_invoke(std::_Any_data const&) ()
from /work/asr3/zeyer/merboldt/py-envs/py3.8-tf2.3/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2
#14 0x0000150ee3b0a591 in Eigen::ThreadPoolTempl<tensorflow::thread::EigenEnvironment>::WorkerLoop(int) ()
from /work/asr3/zeyer/merboldt/py-envs/py3.8-tf2.3/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2
#15 0x0000150ee3b072a3 in std::_Function_handler<void (), tensorflow::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
from /work/asr3/zeyer/merboldt/py-envs/py3.8-tf2.3/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2
#16 0x0000150ee3af63e5 in tensorflow::(anonymous namespace)::PThread::ThreadFn(void*) ()
from /work/asr3/zeyer/merboldt/py-envs/py3.8-tf2.3/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2
#17 0x0000150f521fd6ba in start_thread (arg=0x150f32ef9700) at pthread_create.c:333
#18 0x0000150f51f334dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
Thread 9 (Thread 0x150f32cf8700 (LWP 19801)):
#0 0x0000150f51f34988 in accept4 (fd=6, addr=..., addr_len=0x150f32cf7e58, flags=524288) at ../sysdeps/unix/sysv/linux/accept4.c:40
#1 0x0000150eb70a438a in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2 0x0000150eb709668d in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3 0x0000150eb70a5a58 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4 0x0000150f521fd6ba in start_thread (arg=0x150f32cf8700) at pthread_create.c:333
#5 0x0000150f51f334dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
...
Thread 1 (Thread 0x150f52dbc700 (LWP 19750)):
#0 syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1 0x0000150ef23b832b in nsync::nsync_mu_semaphore_p_with_deadline(nsync::nsync_semaphore_s_*, timespec) ()
from /work/asr3/zeyer/merboldt/py-envs/py3.8-tf2.3/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#2 0x0000150ef23b7949 in nsync::nsync_sem_wait_with_cancel_(nsync::waiter*, timespec, nsync::nsync_note_s_*) ()
from /work/asr3/zeyer/merboldt/py-envs/py3.8-tf2.3/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#3 0x0000150ef23b4f0b in nsync::nsync_cv_wait_with_deadline_generic(nsync::nsync_cv_s_*, void*, void (*)(void*), void (*)(void*), timespec, nsync::nsync_note_s_*) ()
from /work/asr3/zeyer/merboldt/py-envs/py3.8-tf2.3/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#4 0x0000150ef23b53e3 in nsync::nsync_cv_wait_with_deadline(nsync::nsync_cv_s_*, nsync::nsync_mu_s_*, timespec, nsync::nsync_note_s_*) ()
from /work/asr3/zeyer/merboldt/py-envs/py3.8-tf2.3/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#5 0x0000150ef18e540c in tensorflow::DirectSession::WaitForNotification(tensorflow::Notification*, long long) ()
from /work/asr3/zeyer/merboldt/py-envs/py3.8-tf2.3/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
...
#8 0x0000150ef18f7101 in tensorflow::DirectSession::Run(tensorflow::RunOptions const&, std::vector<std::pair<std::string, tensorflow::Tensor>, std::allocator<std::pair<std::string, tensorflow::Tensor> > > const&, std::vector<std::string, std::allocator<std::string> > const&, std::vector<std::string, std::allocator<std::string> > const&, std::vector<tensorflow::Tensor, std::allocator<tensorflow::Tensor> >*, tensorflow::RunMetadata*, tensorflow::thread::ThreadPoolOptions const&) () from /work/asr3/zeyer/merboldt/py-envs/py3.8-tf2.3/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
...
#12 0x0000150ee8161c2a in TF_SessionRun () from /work/asr3/zeyer/merboldt/py-envs/py3.8-tf2.3/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
...
#17 0x0000150f5253cca6 in cfunction_call_varargs (kwargs=<optimized out>, args=<optimized out>, func=0x150f3a349450) at ../Objects/call.c:742
#18 PyCFunction_Call () at ../Objects/call.c:772
#19 0x0000150f5253b634 in _PyObject_MakeTpCall () at ../Objects/call.c:159
...
Thread 23 in exec_blas is via Numpy via Librosa via our background MFCC feature extraction.
The first thread is just the canonical session.run(...).
@Spotlight0xff very interesting. I'm not sure if this is the same hang though. This TF issue (deadlock in fork() because of OpenBLAS) sounds very related. (I wonder now why I have not seen this problem myself anymore... Because I think I don't use the workaround described there.) You might want to try setting patch_atfork = True in your config.
In the other stacktraces reported here, there is never anything with fork() or BLAS. Although, this was on Switchboard, so I used ExternSprintDataset, i.e. there were/are also fork()s (before every epoch). Maybe they did not cause an immediate hang but somehow triggered this delayed hang?
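(Independent of RETURNN's patch_atfork, whose implementation I am not reproducing here, the generic workaround for that TF/OpenBLAS fork deadlock is to keep OpenBLAS single-threaded, so its pthread_atfork handler has no worker threads to join; a sketch, plus a fork logger that can help to correlate a hang with a fork:)
```python
# Generic sketch (assumption: the deadlock is OpenBLAS's atfork handler joining
# its worker threads while another thread is inside a BLAS call).
import os, sys, traceback

# Must be set before Numpy (and anything else that loads OpenBLAS) is imported.
os.environ.setdefault("OPENBLAS_NUM_THREADS", "1")

def _log_fork():
    # Print the Python stack of whoever is about to fork.
    # Note: this only catches forks going through Python's os module, not forks
    # done directly inside C extensions such as TF's SubProcess::Start.
    print("about to fork:", file=sys.stderr)
    traceback.print_stack(file=sys.stderr)

os.register_at_fork(before=_log_fork)  # Python >= 3.7

import numpy  # noqa: E402  -- imported only after the env var is set
```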
I am also getting this issue. Here is a gist link of the threads strace: https://gist.github.com/mmz33/dfe0eddf8ecab9777ac63715547e516e
This is with single GPU, no Horovod.
To debug this, it would be helpful if we could somehow investigate what the GPU is doing at that time (similarly to how we can see what the CPU is doing).
There should be tools for this.
E.g. there is Nvidia Nsight. Some examples here for Attaching to a Running CUDA Process, Inspect State (Visual Studio / Windows specific, but this should also be possible on Linux).
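(Short of a full Nsight setup, one thing to try from a second shell is to query NVML for the hanging process, just to see whether the GPU is idle or busy; a rough sketch using the pynvml bindings, untested in this setup:)
```python
# Rough sketch: print GPU utilization and the compute processes on GPU 0 via NVML,
# to check whether the hanging job still keeps the GPU busy.
import pynvml  # pip install nvidia-ml-py (or pynvml)

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
util = pynvml.nvmlDeviceGetUtilizationRates(handle)
print("GPU util: %d%%, mem util: %d%%" % (util.gpu, util.memory))
for proc in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
    print("pid %d uses %s bytes GPU mem" % (proc.pid, proc.usedGpuMemory))
pynvml.nvmlShutdown()
```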
@JackTemaki I heard you also got this? You did not comment here. @Marvin84 You as well? You also did not comment here. (It would have been useful to know that you also get this. Also for single GPU.)
@albertz yes correct I got this.
All of you always had this problem only when 2D convolution was involved, in the typical VGG-style prenet we have now in many models?
Also, you always get this only after min 1 day of runtime? Or even longer, more like 2.5 days? Is there some common pattern? Or is it more related to the absolute number of steps?
All the reports here are also for training, right?
"Or even longer, more like 2.5 days?"
Any point during training; it can even be the last day out of a week of training.
"All of you always had this problem only when 2D convolution was involved?"
Good point, TTS only has 1D convolutions and I do not remember a hang there...
"@Marvin84 You as well?"
No, I did not experience this during my training. I do single GPU, no conv layer. However, I was interested to know whether this was related to Horovod or not, since I plan on doing multi-GPU training.
@curufinwe pointed out that tf_session_opts = {'inter_op_parallelism_threads': 1} can potentially avoid the problem.
This is obviously not a real solution. Maybe it just hides the problem. But it is good to know whether this would hide it, to further understand it.
inter_op_parallelism_threads=1 could also make it much slower in certain situations. You would never want to use that.
I am getting frozen jobs again with single-GPU training. Here is the threads stacktrace log.
Again CudnnSupport::DoConvolve, GpuTimer::GetElapsedMilliseconds, and then it hangs in libcuda.so.1, in clock_gettime.
Environment: TF 2.3.4 and Cuda 10.1
source /work/tools/asr/python/3.8.0_tf_2.3.4-haswell+cuda10.1+mkl/bin/activate
export LD_LIBRARY_PATH="/usr/local/cuda-10.1/lib64:/usr/local/cuda-10.1/extras/CUPTI/lib64:/usr/local/cudnn-10.1-v7.6/lib64:/usr/lib/nvidia-450"
export CUDA_HOME=/usr/local/cuda-10.1/
export CUDNN=/usr/local/cudnn-10.1-v7.6/
export OMP_NUM_THREADS=4
"Environment: TF 2.3.4 and Cuda 10.1"
Did we compile this TF? What GCC did we use?
"Did we compile this TF? What GCC did we use?"
This question is the reason why I think it is not worth looking into this (the answer is yes, and GCC 5.4). If this problem persists with new TF and new GCC versions, then we should look at this again.
I updated my comment above since I had the path for the generic-compiled TF version, but I was using the one compiled for Haswell. However, I also had this hanging issue when using the generic-compiled TF version before.
Recent RETURNN:
Horovod via SGE -pe mpi 4 and then mpirun:
Horovod settings:
This started fine, in epoch 1, and continued for many epochs, many hours, up until the end of epoch 101:
The last message then repeats every 60 seconds, and nothing else happens anymore:
When I log in to the node of rank 3 and send a SIGUSR1, I get this traceback (via faulthandler; other threads omitted because irrelevant), i.e. we can see that it hangs in sess.run:
Note that this is the standard step sess.run, i.e. nothing specific about Horovod. Actually, with these settings, there should be no Horovod op involved at all in this call.
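(Aside: the SIGUSR1 traceback above comes from Python's faulthandler; a minimal stand-alone setup, not the RETURNN-specific code, would look roughly like this, after which `kill -USR1 <pid>` dumps all Python thread stacks:)
```python
# Minimal sketch: dump all Python thread stacks on SIGUSR1 (Unix only).
import faulthandler
import signal

faulthandler.register(signal.SIGUSR1, all_threads=True, chain=False)
```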
Then, the C traceback (via gdb -p 1594 -ex 'thread apply all bt' -ex="set confirm off" -ex quit > gdblog.p1494.txt) is here. Some of the (maybe interesting) threads (excluded are WaitForWork or Python threads):
- CudnnSupport::DoConvolve, or more specifically in GpuDriver::GetEventElapsedTime, or even more specifically in cuEventSynchronize, which looks like a CUDA issue? Or maybe I misunderstand how this GpuTimer works, or how it is used in DoConvolve. Maybe this issue, or this? Also, if this is really a CUDA-related issue, why has it never occurred so far without Horovod?
- opal_libevent2021_event_base_loop in two different threads?
- The sess.run of the main thread actually should not involve any Horovod op (with the given settings). So no MPI communication should be running currently. However, there is PMPI_Allreduce in thread 2. Why?
Note that this is not the first time I'm seeing this. I already saw it a couple of times. Originally I hoped that this was some temporary issue in our cluster, but there seems to be a real problem or bug. I don't really know whether it is on our side, in OpenMPI (we have a quite old version), or in Horovod.