casparvl opened 4 years ago
On the other node, the backtrace looks very similar, but slightly different:
(gdb) bt
#0 0x00002ab9ea8b6ae1 in poll_device ()
from /sw/arch/RedHatEnterpriseServer7/EB_production/2019/software/OpenMPI/3.1.1-gcccuda-2018b/lib/openmpi/mca_btl_openib.so
#1 0x00002ab9ea8b7ade in btl_openib_component_progress ()
from /sw/arch/RedHatEnterpriseServer7/EB_production/2019/software/OpenMPI/3.1.1-gcccuda-2018b/lib/openmpi/mca_btl_openib.so
#2 0x00002ab9e49a886c in opal_progress ()
from /sw/arch/RedHatEnterpriseServer7/EB_production/2019/software/OpenMPI/3.1.1-gcccuda-2018b/lib/libopen-pal.so.40
#3 0x00002ab9e49aefb5 in ompi_sync_wait_mt ()
from /sw/arch/RedHatEnterpriseServer7/EB_production/2019/software/OpenMPI/3.1.1-gcccuda-2018b/lib/libopen-pal.so.40
#4 0x00002ab9eb78e970 in mca_pml_ob1_send ()
from /sw/arch/RedHatEnterpriseServer7/EB_production/2019/software/OpenMPI/3.1.1-gcccuda-2018b/lib/openmpi/mca_pml_ob1.so
#5 0x00002ab9d79667db in ompi_coll_base_sendrecv_actual ()
from /sw/arch/RedHatEnterpriseServer7/EB_production/2019/software/OpenMPI/3.1.1-gcccuda-2018b/lib/libmpi.so.40
#6 0x00002ab9d7968507 in ompi_coll_base_allreduce_intra_ring_segmented ()
from /sw/arch/RedHatEnterpriseServer7/EB_production/2019/software/OpenMPI/3.1.1-gcccuda-2018b/lib/libmpi.so.40
#7 0x00002ab9eb7e33bc in ompi_coll_tuned_allreduce_intra_dec_fixed ()
from /sw/arch/RedHatEnterpriseServer7/EB_production/2019/software/OpenMPI/3.1.1-gcccuda-2018b/lib/openmpi/mca_coll_tuned.so
#8 0x00002ab9d7925dff in PMPI_Allreduce ()
from /sw/arch/RedHatEnterpriseServer7/EB_production/2019/software/OpenMPI/3.1.1-gcccuda-2018b/lib/libmpi.so.40
#9 0x00002ab9fc2e97d9 in horovod::common::MPIAllreduce::Execute (this=0x2aba14012b10,
entries=..., response=...) at horovod/common/ops/mpi_operations.cc:52
#10 0x00002ab9fc2c1cd1 in horovod::common::OperationManager::ExecuteAllreduce (
this=this@entry=0x2aba14016890, entries=..., response=...)
at horovod/common/ops/operation_manager.cc:41
#11 0x00002ab9fc2c2061 in horovod::common::OperationManager::ExecuteOperation (
this=0x2aba14016890, entries=..., response=...) at horovod/common/ops/operation_manager.cc:90
#12 0x00002ab9fc29d495 in PerformOperation (state=...,
response=<error reading variable: access outside bounds of object referenced via synthetic pointer>) at horovod/common/operations.cc:295
#13 RunLoopOnce (state=...) at horovod/common/operations.cc:585
#14 horovod::common::(anonymous namespace)::BackgroundThreadLoop (state=...)
at horovod/common/operations.cc:509
#15 0x00002ab9e47f3edf in std::execute_native_thread_routine (__p=0x595f860)
at ../../../../../libstdc++-v3/src/c++11/thread.cc:83
#16 0x00002ab9b781eea5 in start_thread () from /usr/lib64/libpthread.so.0
#17 0x00002ab9b7d348cd in clone () from /usr/lib64/libc.so.6
Just to give a different interface a try, I disabled the openib BTL and ran over TCP. I had to set btl_tcp_if_include; for now I set it to a /24, but I would need to set it to a larger address block for it to work 'anywhere' on Cartesius (right now I got the same 2 nodes every time, so that was easy).
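As a side note: btl_tcp_if_include should also accept a comma-separated list of interface names instead of a CIDR block, which might be a more portable way to restrict the TCP BTL (the interface names below are just placeholders, not verified on Cartesius):
# Sketch: restrict the TCP BTL by interface name instead of by subnet
# (ib0/eth0 are placeholder names; check `ip addr` on the nodes for the real ones)
mpirun --mca btl ^openib --mca btl_tcp_if_include ib0,eth0 ...
The command I actually ran: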
mpirun --mca btl ^openib --mca btl_tcp_if_include 10.200.202.0/24 --map-by node --bind-to none -x NCCL_DEBUG=INFO -x HOROVOD_MPI_THREADS_DISABLE=1 -x LD_LIBRARY_PATH -x PATH -x TF_USE_CUDNN -x OMP_NUM_THREADS \
python -u main.py pgan /projects/2/managed_datasets/LIDC-IDRI/npy/average/ '(1, 128, 512, 512)' --starting_phase 7 --ending_phase 8 --latent_dim 512 --horovod --scratch_path /scratch-shared/$USER --base_batch_size 32 --network_size m --starting_alpha 1 --loss_fn wgan --gp_weight 10 --d_lr 5e-5 --g_lr 5e-5 --continue_path $CONTINUE_PATH --num_inter_ops 1
Output:
Variables restored!
Broadcasting initial global variables...
Variables restored!
Begin mixing epochs in phase 7
Broadcasting initial global variables...
Broadcast completed
Batching...
Broadcast completed
Batching...
Got a batch!
Got a batch!
Completed step
Batching...
Got a batch!
Completed step
Step 000000002 img/s 0.02 d_loss -3069.9963 g_loss 2908.5449 alpha 1.00
Batching...
Got a batch!
Completed step
Batching...
Got a batch!
Completed step
Step 000000004 img/s 0.02 d_loss -2550.9722 g_loss 2382.4214 alpha 1.00
Batching...
Got a batch!
OK, so the speed doesn't seem great, since I also got 0.02 img/s on a single node before. Looking at the CPU utilization and the output of ifconfig, it doesn't seem like I'm spending too much time in communication though... Maybe performance will be better if I run 2 tasks per node and bind to socket.
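A minimal sketch of how one could keep an eye on the interface traffic to back that up, assuming the TCP BTL runs over eth0 (a placeholder; the actual interface name will differ):
# Sketch: sample the byte counters of the interface carrying the BTL traffic every 5 seconds
watch -n 5 'cat /sys/class/net/eth0/statistics/rx_bytes /sys/class/net/eth0/statistics/tx_bytes'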
Now changed to:
#SBATCH -N 2
#SBATCH --ntasks-per-node=2
...
mpirun --mca btl ^openib --mca btl_tcp_if_include 10.200.0.0/16 --map-by ppr:1:socket:PE=12 -x NCCL_DEBUG=INFO -x HOROVOD_MPI_THREADS_DISABLE=1 -x LD_LIBRARY_PATH -x PATH -x TF_USE_CUDNN -x OMP_NUM_THREADS \
python -u main.py pgan /projects/2/managed_datasets/LIDC-IDRI/npy/average/ '(1, 128, 512, 512)' --starting_phase 7 --ending_phase 8 --latent_dim 512 --horovod --scratch_path /scratch-shared/$USER --base_batch_size 32 --network_size m --starting_alpha 1 --loss_fn wgan --gp_weight 10 --d_lr 5e-5 --g_lr 5e-5 --continue_path $CONTINUE_PATH --num_inter_ops 1
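(Not something I have done here, but to double-check that ppr:1:socket:PE=12 places and binds the ranks the way I expect, Open MPI's --report-bindings option could be used; a minimal sketch with a trivial executable:)
# Sketch: print the actual rank-to-core bindings for this mapping policy
mpirun --report-bindings --map-by ppr:1:socket:PE=12 hostname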
With a proper 2 tasks per node, mapped by socket, we're getting much better performance:
Got a batch!
Got a batch!
Got a batch!
Got a batch!
Completed step
Completed step
Step 000000004 img/s 0.05 d_loss -5233.3760 g_loss 4992.5361 alpha 1.00
Completed step
Batching...
Batching...
Got a batch!
Got a batch!
Batching...
Completed step
Got a batch!
Batching...
Got a batch!
Completed step
Step 000000008 img/s 0.06 d_loss -3895.0845 g_loss 3727.0601 alpha 1.00
Completed step
Batching...
Batching...
Got a batch!
Got a batch!
Completed step
Batching...
Got a batch!
Completed step
Batching...
Got a batch!
Completed step
Step 000000012 img/s 0.06 d_loss -5496.6875 g_loss 5435.4736 alpha 1.00
It seems that using TCP instead of the OpenIB BTL does not impact performance too much.
Issue
The code hangs when running across multiple nodes.
Initially, I see 100% CPU usage on all cores. Then, after a while, it drops to 100% on 1 core per task (i.e. one core on each of the nodes). This is what you would expect for e.g. a communication deadlock, since an MPI thread waiting for communication shows 100% CPU.
In htop, I see that 26186 is the main process, while 26205 is one of its threads. The main process seems to be in state S, while the thread is in state R.
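For reference, a non-interactive way to see the same per-thread states (the PID is the one from this particular run):
# Sketch: list the threads of the Horovod process with their state (S = sleeping, R = running) and CPU usage
ps -L -o pid,tid,stat,pcpu,comm -p 26186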
Attaching GDB with gdb -p 26186 gives a huge stack trace, with some TensorFlow function, nsync::nsync_mu_semaphore_p_with_deadline, at the top.
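(For anyone reproducing this: the same information can also be grabbed non-interactively for all threads at once, e.g.:)
# Sketch: dump backtraces for every thread of the hung process in one go
gdb -p 26186 -batch -ex 'set pagination off' -ex 'thread apply all bt'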
If, however, I connect to the OTHER thread with gdb -p 26205, I get:
Now that is more what we would expect: I guess this thread is waiting for the MPIAllreduce to complete. It is completely unclear why this finishes correctly when running 2 tasks per node on a single node, but does not finish when running 2 tasks per node across multiple nodes.
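A possible next step to narrow this down would be to rerun with more verbose output from the Open MPI BTLs and from Horovod itself; a rough sketch (the verbosity level and the Horovod log level are assumptions on my part, not something I have tried yet):
# Sketch: enable verbose BTL output and Horovod debug logging on top of the existing command
mpirun --mca btl ^openib --mca btl_tcp_if_include 10.200.0.0/16 \
       --mca btl_base_verbose 100 \
       -x HOROVOD_LOG_LEVEL=debug \
       python -u main.py ...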