Closed albertz closed 2 years ago
I use the SGE parallel environment (i.e. multi-node) with `-pe mpi 2`.
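For reference, a minimal SGE job-script sketch for such a 2-slot MPI job (the script contents, flags, and entry point are illustrative assumptions, not from my actual setup):

```shell
#!/bin/bash
# Hypothetical SGE job script: request 2 slots in the "mpi" parallel
# environment, mirroring the -pe mpi 2 setting above.
#$ -pe mpi 2
#$ -cwd
#$ -j y

# Launch one process per granted slot via MPI (the mpirun invocation and
# training entry point here are illustrative).
mpirun -np "$NSLOTS" python3 rnn.py returnn.config
```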
This is the stdout:
```
+------- PROLOGUE SCRIPT -----------------------------------------------
| Job ID ...........: 7138626
| Started at .......: Tue Nov 1 12:36:53 CET 2022
| Execution host ...: cluster-cn-286
| Cluster queue ....: 8-GPU-2080
| Script ...........: /var/spool/sge/cluster-cn-286/job_scripts/7138626
| > /u/zeyer/tools/bin/python3 /u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sis worker --engine long work/i6_core/returnn/training/ReturnnTrainingJob.Hqniupwf69PX run
| GPU allocation:
| {'gpus': {0: b'7138620.1',
|           1: b'7138626.1',
|           3: b'7138520.1',
|           5: b'7137686.1',
|           6: b'7138626.1',
|           7: b'7137191.1'},
|  'total_gpus': 8}
| Running jobs:
| HOSTNAME  ARCH      NCPU NSOC NCOR NTHR  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
| ----------------------------------------------------------------------------------------------
| global          -         -    -    -    -     -       -       -       -       -
| cluster-cn-286  lx-amd64  64   2    32   64    6.52    251.8G  17.4G   3.8G    0.0
| 8-GPU-2080 BIP 0/6/8
| 7137191 0.31343 qsubmit_fu schuemann r 10/31/2022 17:08:22 MASTER
| 7137686 0.25368 i6_core.re mann      r 10/31/2022 23:26:52 MASTER 1
| 7138520 0.37686 crnn.sprin jxu       r 11/01/2022 11:20:22 MASTER 1
| 7138620 0.33457 i6_nlu.fai njain     r 11/01/2022 12:34:02 MASTER 1
| 7138626 0.27720 i6_core.re zeyer     r 11/01/2022 12:36:52 MASTER 1
|                                                           SLAVE  1
| GPU_DEBUG_POST_ALLOC={'gpus':{0:b'7138620.1',1:b'7138626.1',3:b'7138520.1',5:b'7137686.1',6:b'7138626.1',7:b'7137191.1'},'total_gpus':8}
| GPU_DEBUG_JOB=7138626.1
| GPU_DEBUG_HOST=cluster-cn-286
| GPU_DEBUG_PREV_ALLOC={'gpus':{0:b'7138620.1',3:b'7138520.1',5:b'7137686.1',7:b'7137191.1'},'total_gpus':8}
+------- PROLOGUE SCRIPT -----------------------------------------------
...
```
```
Uname: uname_result(system='Linux', node='cluster-cn-286', release='4.15.0-46-generic', version='#49~16.04.1-Ubuntu SMP Tue Feb 12 17:45:24 UTC 2019', machine='x86_64', processor='x86_64')
Load: (6.31, 6.49, 6.53)
[2022-11-01 12:36:55,679] INFO: ------------------------------------------------------------
[2022-11-01 12:36:55,679] INFO: Starting subtask for arg id: 0 args: []
[2022-11-01 12:36:55,679] INFO: ------------------------------------------------------------
[2022-11-01 12:36:55,710] INFO: Run time: 0:00:00 CPU: 109.90% RSS: 64MB VMS: 386MB
2022-11-01 12:36:57.760608: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2022-11-01 12:36:57.780944: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
[2022-11-01 12:37:00,729] INFO: Run time: 0:00:05 CPU: 0.20% RSS: 555MB VMS: 2.42GB
WARNING:root:Settings file 'settings.py' does not exist, ignoring it ([Errno 2] No such file or directory: 'settings.py').
WARNING:root:Settings file 'settings.py' does not exist, ignoring it ([Errno 2] No such file or directory: 'settings.py').
Horovod: 0.19.5 /u/zeyer/.local/lib/python3.8/site-packages/horovod/__init__.py
Horovod: 0.19.5 /u/zeyer/.local/lib/python3.8/site-packages/horovod/__init__.py
[2022-11-01 12:37:05,743] INFO: Run time: 0:00:10 CPU: 0.40% RSS: 662MB VMS: 3.39GB
```
It hangs there.
Current Python stacktrace:
```
% py-spy dump -p 65023
Process 65023: /u/zeyer/tools/bin/python3 /u/zeyer/setups/combined/2021-05-31/tools/returnn/rnn.py /u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.Hqniupwf69PX/output/returnn.config
Python v3.8.0 (/work/tools/asr/python/3.8.0/bin/python3.8)
Thread 65023 (idle): "MainThread"
    init (horovod/common/basics.py:64)
    __init__ (returnn/returnn/tf/horovod.py:40)
    get_ctx (returnn/returnn/tf/horovod.py:174)
    init_by_config (returnn/returnn/log.py:189)
    init_log (returnn/returnn/__main__.py:128)
    init (returnn/returnn/__main__.py:342)
    main (returnn/returnn/__main__.py:563)
    <module> (returnn/rnn.py:11)
```
I.e. it hangs in the `init()` from Horovod:

```python
import horovod.tensorflow as hvd

hvd.init()
```
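One way to make such a hang fail loudly instead of blocking forever is to arm a stdlib `faulthandler` watchdog before the call. This is a diagnostic sketch (not part of the original report); `init_placeholder` is a stand-in for `hvd.init()` so the snippet runs without Horovod installed:

```python
import faulthandler
import time

# Arm a watchdog: if the next call blocks for more than 30s, dump the
# tracebacks of all threads to stderr and exit the process.
faulthandler.dump_traceback_later(30, exit=True)

# In the real setup this would be:
#   import horovod.tensorflow as hvd
#   hvd.init()          # <- the call that hung in this report
def init_placeholder():
    # Stand-in for hvd.init() so this sketch is self-contained.
    time.sleep(0.1)

init_placeholder()
faulthandler.cancel_dump_traceback_later()  # init returned; disarm watchdog
print("init completed")
```

This gives roughly the same information as the `py-spy dump` above, but without needing to attach to the process from outside.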
Ok, with Horovod 0.26.1 this problem seems to be gone. But maybe that is only because upgrading recompiled Horovod against the TensorFlow version I was actually using; the old Horovod may have been compiled against a different TensorFlow version.
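To check the version-mismatch hypothesis, one can at least list the installed Horovod and TensorFlow versions without importing either package (so it also works while `init()` is broken). A hedged sketch; the helper name and the pip command in the comment are suggestions, not from the original report:

```python
# Query installed package versions via package metadata, without importing
# the packages themselves.
from importlib.metadata import PackageNotFoundError, version


def installed_version(pkg: str):
    """Return the installed version string of pkg, or None if absent."""
    try:
        return version(pkg)
    except PackageNotFoundError:
        return None


for pkg in ("horovod", "tensorflow"):
    print(pkg, installed_version(pkg))

# If Horovod was built against a different TensorFlow, forcing a source
# rebuild (which the upgrade to 0.26.1 effectively did here) may help, e.g.:
#   pip install --no-cache-dir --force-reinstall --no-binary=horovod horovod
```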