casparvl opened 4 years ago
On the other node, the backtrace looks very similar, but slightly different:
(gdb) bt
#0 0x00002ab9ea8b6ae1 in poll_device ()
from /sw/arch/RedHatEnterpriseServer7/EB_production/2019/software/OpenMPI/3.1.1-gcccuda-2018b/lib/openmpi/mca_btl_openib.so
#1 0x00002ab9ea8b7ade in btl_openib_component_progress ()
from /sw/arch/RedHatEnterpriseServer7/EB_production/2019/software/OpenMPI/3.1.1-gcccuda-2018b/lib/openmpi/mca_btl_openib.so
#2 0x00002ab9e49a886c in opal_progress ()
from /sw/arch/RedHatEnterpriseServer7/EB_production/2019/software/OpenMPI/3.1.1-gcccuda-2018b/lib/libopen-pal.so.40
#3 0x00002ab9e49aefb5 in ompi_sync_wait_mt ()
from /sw/arch/RedHatEnterpriseServer7/EB_production/2019/software/OpenMPI/3.1.1-gcccuda-2018b/lib/libopen-pal.so.40
#4 0x00002ab9eb78e970 in mca_pml_ob1_send ()
from /sw/arch/RedHatEnterpriseServer7/EB_production/2019/software/OpenMPI/3.1.1-gcccuda-2018b/lib/openmpi/mca_pml_ob1.so
#5 0x00002ab9d79667db in ompi_coll_base_sendrecv_actual ()
from /sw/arch/RedHatEnterpriseServer7/EB_production/2019/software/OpenMPI/3.1.1-gcccuda-2018b/lib/libmpi.so.40
#6 0x00002ab9d7968507 in ompi_coll_base_allreduce_intra_ring_segmented ()
from /sw/arch/RedHatEnterpriseServer7/EB_production/2019/software/OpenMPI/3.1.1-gcccuda-2018b/lib/libmpi.so.40
#7 0x00002ab9eb7e33bc in ompi_coll_tuned_allreduce_intra_dec_fixed ()
from /sw/arch/RedHatEnterpriseServer7/EB_production/2019/software/OpenMPI/3.1.1-gcccuda-2018b/lib/openmpi/mca_coll_tuned.so
#8 0x00002ab9d7925dff in PMPI_Allreduce ()
from /sw/arch/RedHatEnterpriseServer7/EB_production/2019/software/OpenMPI/3.1.1-gcccuda-2018b/lib/libmpi.so.40
#9 0x00002ab9fc2e97d9 in horovod::common::MPIAllreduce::Execute (this=0x2aba14012b10,
entries=..., response=...) at horovod/common/ops/mpi_operations.cc:52
#10 0x00002ab9fc2c1cd1 in horovod::common::OperationManager::ExecuteAllreduce (
this=this@entry=0x2aba14016890, entries=..., response=...)
at horovod/common/ops/operation_manager.cc:41
#11 0x00002ab9fc2c2061 in horovod::common::OperationManager::ExecuteOperation (
this=0x2aba14016890, entries=..., response=...) at horovod/common/ops/operation_manager.cc:90
#12 0x00002ab9fc29d495 in PerformOperation (state=...,
response=<error reading variable: access outside bounds of object referenced via synthetic pointer>) at horovod/common/operations.cc:295
#13 RunLoopOnce (state=...) at horovod/common/operations.cc:585
#14 horovod::common::(anonymous namespace)::BackgroundThreadLoop (state=...)
at horovod/common/operations.cc:509
#15 0x00002ab9e47f3edf in std::execute_native_thread_routine (__p=0x595f860)
at ../../../../../libstdc++-v3/src/c++11/thread.cc:83
#16 0x00002ab9b781eea5 in start_thread () from /usr/lib64/libpthread.so.0
#17 0x00002ab9b7d348cd in clone () from /usr/lib64/libc.so.6
Just to give a different interface a try, I disabled the openib BTL and ran over TCP. I had to set btl_tcp_if_include; for now I set it to a /24, but I would need to set it to a larger address block for it to work 'anywhere' on Cartesius (right now I got the same 2 nodes every time, so that was easy).
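As a side note: btl_tcp_if_include should also accept a comma-separated list of interface names instead of a CIDR block, which might be a more portable way to restrict the TCP BTL (the interface names below are just placeholders, not verified on Cartesius):
# Sketch: restrict the TCP BTL by interface name instead of by subnet
# (ib0/eth0 are placeholder names; check `ip addr` on the nodes for the real ones)
mpirun --mca btl ^openib --mca btl_tcp_if_include ib0,eth0 ...
The command I actually ran: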
mpirun --mca btl ^openib --mca btl_tcp_if_include 10.200.202.0/24 --map-by node --bind-to none -x NCCL_DEBUG=INFO -x HOROVOD_MPI_THREADS_DISABLE=1 -x LD_LIBRARY_PATH -x PATH -x TF_USE_CUDNN -x OMP_NUM_THREADS \
python -u main.py pgan /projects/2/managed_datasets/LIDC-IDRI/npy/average/ '(1, 128, 512, 512)' --starting_phase 7 --ending_phase 8 --latent_dim 512 --horovod --scratch_path /scratch-shared/$USER --base_batch_size 32 --network_size m --starting_alpha 1 --loss_fn wgan --gp_weight 10 --d_lr 5e-5 --g_lr 5e-5 --continue_path $CONTINUE_PATH --num_inter_ops 1
Output:
Variables restored!
Broadcasting initial global variables...
Variables restored!
Begin mixing epochs in phase 7
Broadcasting initial global variables...
Broadcast completed
Batching...
Broadcast completed
Batching...
Got a batch!
Got a batch!
Completed step
Batching...
Got a batch!
Completed step
Step 000000002 img/s 0.02 d_loss -3069.9963 g_loss 2908.5449 alpha 1.00
Batching...
Got a batch!
Completed step
Batching...
Got a batch!
Completed step
Step 000000004 img/s 0.02 d_loss -2550.9722 g_loss 2382.4214 alpha 1.00
Batching...
Got a batch!
OK, so the speed doesn't seem great, since I also got 0.02 img/s on a single node before. Looking at the CPU utilization and the output of ifconfig, it doesn't seem like I'm spending too much time in communication though... Maybe performance will be better if I run 2 tasks per node and bind to socket.
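A minimal sketch of how one could keep an eye on the interface traffic to back that up, assuming the TCP BTL runs over eth0 (a placeholder; the actual interface name will differ):
# Sketch: sample the byte counters of the interface carrying the BTL traffic every 5 seconds
watch -n 5 'cat /sys/class/net/eth0/statistics/rx_bytes /sys/class/net/eth0/statistics/tx_bytes'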
Now changed to:
#SBATCH -N 2
#SBATCH --ntasks-per-node=2
...
mpirun --mca btl ^openib --mca btl_tcp_if_include 10.200.0.0/16 --map-by ppr:1:socket:PE=12 -x NCCL_DEBUG=INFO -x HOROVOD_MPI_THREADS_DISABLE=1 -x LD_LIBRARY_PATH -x PATH -x TF_USE_CUDNN -x OMP_NUM_THREADS \
python -u main.py pgan /projects/2/managed_datasets/LIDC-IDRI/npy/average/ '(1, 128, 512, 512)' --starting_phase 7 --ending_phase 8 --latent_dim 512 --horovod --scratch_path /scratch-shared/$USER --base_batch_size 32 --network_size m --starting_alpha 1 --loss_fn wgan --gp_weight 10 --d_lr 5e-5 --g_lr 5e-5 --continue_path $CONTINUE_PATH --num_inter_ops 1
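(Not something I have done here, but to double-check that ppr:1:socket:PE=12 places and binds the ranks the way I expect, Open MPI's --report-bindings option could be used; a minimal sketch with a trivial executable:)
# Sketch: print the actual rank-to-core bindings for this mapping policy
mpirun --report-bindings --map-by ppr:1:socket:PE=12 hostname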
With a proper 2 tasks per node, mapped by socket, we're getting much better performance:
Got a batch!
Got a batch!
Got a batch!
Got a batch!
Completed step
Completed step
Step 000000004 img/s 0.05 d_loss -5233.3760 g_loss 4992.5361 alpha 1.00
Completed step
Batching...
Batching...
Got a batch!
Got a batch!
Batching...
Completed step
Got a batch!
Batching...
Got a batch!
Completed step
Step 000000008 img/s 0.06 d_loss -3895.0845 g_loss 3727.0601 alpha 1.00
Completed step
Batching...
Batching...
Got a batch!
Got a batch!
Completed step
Batching...
Got a batch!
Completed step
Batching...
Got a batch!
Completed step
Step 000000012 img/s 0.06 d_loss -5496.6875 g_loss 5435.4736 alpha 1.00
It seems that using TCP instead of the OpenIB BTL does not impact performance too much.
Issue
The code hangs when running across multiple nodes.
Initially, I see 100% CPU usage on all cores. Then, after a while, it drops to 100% on 1 core per task (i.e. one core on each of the nodes). This is what you would expect for e.g. a communication deadlock, since an MPI thread waiting for communication shows 100% CPU.
In htop, I see that 26186 is the main process, while 26205 is one of its threads. The main process seems to be in state S, while the thread is in state R.
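For reference, a non-interactive way to see the same per-thread states (the PID is the one from this particular run):
# Sketch: list the threads of the Horovod process with their state (S = sleeping, R = running) and CPU usage
ps -L -o pid,tid,stat,pcpu,comm -p 26186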
Attaching GDB with gdb -p 26186 gives a huge stack trace, with some TensorFlow function, nsync::nsync_mu_semaphore_p_with_deadline, at the top.
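(For anyone reproducing this: the same information can also be grabbed non-interactively for all threads at once, e.g.:)
# Sketch: dump backtraces for every thread of the hung process in one go
gdb -p 26186 -batch -ex 'set pagination off' -ex 'thread apply all bt'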
If, however, I connect to the OTHER thread with gdb -p 26205, I get:
Now that is more what we would expect: I guess this thread is waiting for the MPIAllreduce to complete. It is completely unclear why this finishes correctly when running 2 tasks per node on a single node, but does not finish when running 2 tasks per node across multiple nodes.
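A possible next step to narrow this down would be to rerun with more verbose output from the Open MPI BTLs and from Horovod itself; a rough sketch (the verbosity level and the Horovod log level are assumptions on my part, not something I have tried yet):
# Sketch: enable verbose BTL output and Horovod debug logging on top of the existing command
mpirun --mca btl ^openib --mca btl_tcp_if_include 10.200.0.0/16 \
       --mca btl_base_verbose 100 \
       -x HOROVOD_LOG_LEVEL=debug \
       python -u main.py ...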