Please describe the system on which you are running
Operating system/version: Ubuntu 18.04
Network type: k8s
Details of the problem
Training terminated after a while with the following error:
# error message:
... (normal training process)
...
[TR] rank: 5, norm: 1646.42, matches: 115, utts: 256, avg loss: 1.1620, batches: 860
[TR] rank: 2, norm: 1646.42, matches: 119, utts: 256, avg loss: 1.0515, batches: 860
[TR] rank: 0, norm: 1646.42, matches: 130, utts: 256, avg loss: 1.0197, batches: 860
[TR] rank: 7, norm: 1646.42, matches: 123, utts: 256, avg loss: 1.0018, batches: 860
[TR] rank: 3, norm: 1646.42, matches: 120, utts: 256, avg loss: 1.0964, batches: 860
[TR] rank: 4, norm: 1646.42, matches: 131, utts: 256, avg loss: 1.0161, batches: 860
[TR] rank: 6, norm: 1637.52, matches: 133, utts: 256, avg loss: 0.9388, batches: 870
[TR] rank: 1, norm: 1637.52, matches: 140, utts: 256, avg loss: 0.9769, batches: 870
[TR] rank: 2, norm: 1637.52, matches: 130, utts: 256, avg loss: 1.0458, batches: 870
[TR] rank: 4, norm: 1637.52, matches: 138, utts: 256, avg loss: 0.9570, batches: 870
[TR] rank: 5, norm: 1637.52, matches: 118, utts: 256, avg loss: 1.0001, batches: 870
[TR] rank: 0, norm: 1637.52, matches: 109, utts: 256, avg loss: 1.0895, batches: 870
[TR] rank: 3, norm: 1637.52, matches: 128, utts: 256, avg loss: 1.0204, batches: 870
[TR] rank: 7, norm: 1637.52, matches: 134, utts: 256, avg loss: 0.9811, batches: 870
[job-170078988023782404088-yihua-zhou-worker-1:00000] *** An error occurred in Socket closed
[job-170078988023782404088-yihua-zhou-worker-1:00000] *** reported by process [1778253825,2]
[job-170078988023782404088-yihua-zhou-worker-1:00000] *** on a NULL communicator
[job-170078988023782404088-yihua-zhou-worker-1:00000] *** Unknown error
[job-170078988023782404088-yihua-zhou-worker-1:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[job-170078988023782404088-yihua-zhou-worker-1:00000] *** and MPI will try to terminate your MPI job as well)
--------------------------------------------------------------------------
An MPI communication peer process has unexpectedly disconnected. This
usually indicates a failure in the peer process (e.g., a crash or
otherwise exiting without calling MPI_FINALIZE first).
Although this local MPI process will likely now behave unpredictably
(it may even hang or crash), the root cause of this problem is the
failure of the peer -- that is what you need to investigate. For
example, there may be a core file that you can examine. More
generally: such peer hangups are frequently caused by application bugs
or other external events.
Local host: job-170078988023782404088-yihua-zhou-master-0
Local PID: 62
Peer host: job-170078988023782404088-yihua-zhou-worker-2
--------------------------------------------------------------------------
/opt/conda/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 3 leaked semaphores to clean up at shutdown
len(cache))
/opt/conda/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 3 leaked semaphores to clean up at shutdown
len(cache))
/opt/conda/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 3 leaked semaphores to clean up at shutdown
len(cache))
/opt/conda/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 3 leaked semaphores to clean up at shutdown
len(cache))
Hi @SimZhou, this could very well be outside of Open MPI. Do you have any indication that this is specifically ompi v5.0 related? Does it work with older ompi versions?
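Before comparing against older releases, it is worth confirming exactly which Open MPI build the training pods are actually using, since a from-source install can be shadowed by (or shadow) a distro or conda copy. A minimal diagnostic sketch, assuming `ompi_info` and `mpirun` from the intended build are on `PATH` inside each k8s pod:

```shell
# Report the Open MPI version of this installation
ompi_info --version

# Confirm that mpirun resolves to the same build
mpirun --version

# Check that both binaries come from the intended install prefix
which mpirun ompi_info
```

If the version checks out on every node, rebuilding against an older release line and rerunning the same job is the quickest way to confirm or rule out a v5.0-specific regression.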
Background information
OpenMPI: 5.0.0, built from source