rwth-i6 / returnn

The RWTH extensible training framework for universal recurrent neural networks
http://returnn.readthedocs.io/

Horovod hangs at init #1195

Closed albertz closed 2 years ago

albertz commented 2 years ago

I am using an SGE parallel environment (i.e. multi-node), requested with -pe mpi 2.

This is the stdout:

+------- PROLOGUE SCRIPT ----------------------------------------------- 
|
| Job ID ...........: 7138626
| Started at .......: Tue Nov  1 12:36:53 CET 2022
| Execution host ...: cluster-cn-286
| Cluster queue ....: 8-GPU-2080
| Script ...........: /var/spool/sge/cluster-cn-286/job_scripts/7138626 
| > /u/zeyer/tools/bin/python3 /u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sis worker --engine long work/i6_core/returnn/training/ReturnnTrainingJob.Hqniupwf69PX run
| GPU allocation:
| {'gpus': {0: b'7138620.1', 
|           1: b'7138626.1',
|           3: b'7138520.1', 
|           5: b'7137686.1',
|           6: b'7138626.1', 
|           7: b'7137191.1'},
|  'total_gpus': 8} 
| Running jobs:
| HOSTNAME                ARCH         NCPU NSOC NCOR NTHR  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS 
| ----------------------------------------------------------------------------------------------
| global                  -               -    -    -    -     -       -       -       -       -
| cluster-cn-286          lx-amd64       64    2   32   64  6.52  251.8G   17.4G    3.8G     0.0
|    8-GPU-2080           BIP   0/6/8          
|    7137191 0.31343 qsubmit_fu schuemann    r     10/31/2022 17:08:22 MASTER        
|    7137686 0.25368 i6_core.re mann         r     10/31/2022 23:26:52 MASTER 1 
|    7138520 0.37686 crnn.sprin jxu          r     11/01/2022 11:20:22 MASTER 1
|    7138620 0.33457 i6_nlu.fai njain        r     11/01/2022 12:34:02 MASTER 1
|    7138626 0.27720 i6_core.re zeyer        r     11/01/2022 12:36:52 MASTER 1
|                                                                      SLAVE  1 
| GPU_DEBUG_POST_ALLOC={'gpus':{0:b'7138620.1',1:b'7138626.1',3:b'7138520.1',5:b'7137686.1',6:b'7138626.1',7:b'7137191.1'},'total_gpus':8}
| GPU_DEBUG_JOB=7138626.1 
| GPU_DEBUG_HOST=cluster-cn-286
| GPU_DEBUG_PREV_ALLOC={'gpus':{0:b'7138620.1',3:b'7138520.1',5:b'7137686.1',7:b'7137191.1'},'total_gpus':8} 
|
+------- PROLOGUE SCRIPT -----------------------------------------------
...
Uname: uname_result(system='Linux', node='cluster-cn-286', release='4.15.0-46-generic', version='#49~16.04.1-Ubuntu SMP Tue Feb 12 17:45:24 UTC 2019', machine='x86_64', processor='x86_64')
Load: (6.31, 6.49, 6.53)
[2022-11-01 12:36:55,679] INFO: ------------------------------------------------------------
[2022-11-01 12:36:55,679] INFO: Starting subtask for arg id: 0 args: []
[2022-11-01 12:36:55,679] INFO: ------------------------------------------------------------
[2022-11-01 12:36:55,710] INFO: Run time: 0:00:00 CPU: 109.90% RSS: 64MB VMS: 386MB
2022-11-01 12:36:57.760608: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2022-11-01 12:36:57.780944: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
[2022-11-01 12:37:00,729] INFO: Run time: 0:00:05 CPU: 0.20% RSS: 555MB VMS: 2.42GB
WARNING:root:Settings file 'settings.py' does not exist, ignoring it ([Errno 2] No such file or directory: 'settings.py').
WARNING:root:Settings file 'settings.py' does not exist, ignoring it ([Errno 2] No such file or directory: 'settings.py').
Horovod: 0.19.5 /u/zeyer/.local/lib/python3.8/site-packages/horovod/__init__.py
Horovod: 0.19.5 /u/zeyer/.local/lib/python3.8/site-packages/horovod/__init__.py
[2022-11-01 12:37:05,743] INFO: Run time: 0:00:10 CPU: 0.40% RSS: 662MB VMS: 3.39GB

It hangs there.

Current Python stacktrace:

% py-spy dump -p 65023
Process 65023: /u/zeyer/tools/bin/python3 /u/zeyer/setups/combined/2021-05-31/tools/returnn/rnn.py /u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.Hqniupwf69PX/output/returnn.config
Python v3.8.0 (/work/tools/asr/python/3.8.0/bin/python3.8)

Thread 65023 (idle): "MainThread"
    init (horovod/common/basics.py:64)
    __init__ (returnn/returnn/tf/horovod.py:40)
    get_ctx (returnn/returnn/tf/horovod.py:174)
    init_by_config (returnn/returnn/log.py:189)
    init_log (returnn/returnn/__main__.py:128)
    init (returnn/returnn/__main__.py:342)
    main (returnn/returnn/__main__.py:563)
    <module> (returnn/rnn.py:11)

I.e. it hangs in the init() from Horovod:

    import horovod.tensorflow as hvd
    hvd.init()
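
For reference, a hang like this can also be made visible from inside the process, without attaching py-spy afterwards. The following is only a minimal sketch using the standard-library faulthandler module; the helper name call_with_hang_watchdog and the timeout value are hypothetical and not part of RETURNN or Horovod.

```python
import faulthandler
import sys


def call_with_hang_watchdog(fn, timeout_secs=60.0, file=None):
    """Call fn() and return its result.

    While fn() is running, the stacks of all threads are dumped to `file`
    every `timeout_secs` seconds, so a hanging call shows where it is stuck.
    `file` must be a real file with a file descriptor (default: sys.stderr).
    """
    if file is None:
        file = sys.stderr
    # Arm the watchdog; repeat=True keeps dumping until it is cancelled.
    faulthandler.dump_traceback_later(timeout_secs, repeat=True, file=file)
    try:
        return fn()
    finally:
        # Always disarm, whether fn() returned or raised.
        faulthandler.cancel_dump_traceback_later()


# Hypothetical usage for the case in this issue (requires Horovod):
#   import horovod.tensorflow as hvd
#   call_with_hang_watchdog(hvd.init, timeout_secs=60.0)
```

This only reports the Python-level stack, the same information as the py-spy dump above; it does not show where the native Horovod code is blocked.
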

albertz commented 2 years ago

OK, with Horovod 0.26.1 the problem seems to be gone. However, this may also be because the upgrade recompiled Horovod against the TensorFlow version I am currently using. The old Horovod 0.19.5 build may have been compiled against a different TensorFlow version.
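
To check for this kind of mismatch, one first step is to log which package versions are installed in the environment that runs the job. The sketch below uses only the standard-library importlib.metadata (Python 3.8+); the function name installed_versions and the default package list are assumptions for illustration. Note that it reports installed versions only. It cannot tell which TensorFlow version the Horovod extension was actually compiled against.

```python
from importlib import metadata


def installed_versions(packages=("horovod", "tensorflow", "tensorflow-gpu")):
    """Return a dict mapping each package name to its installed version.

    Packages that are not installed map to None.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version if version is not None else 'not installed'}")
```

If TensorFlow was upgraded after Horovod was installed, reinstalling Horovod with pip's --no-cache-dir option forces a fresh build instead of reusing a previously built wheel.
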