open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

[5.0.0] MPI communication peer process has unexpectedly disconnected #12124

Open SimZhou opened 11 months ago

SimZhou commented 11 months ago

Background information

Open MPI: v5.0.0, built from source

Please describe the system on which you are running


Details of the problem

Training terminated after a while with the following error output:

# error message:
... (normal training process)
...
[TR] rank: 5, norm: 1646.42, matches: 115, utts: 256, avg loss: 1.1620, batches: 860
[TR] rank: 2, norm: 1646.42, matches: 119, utts: 256, avg loss: 1.0515, batches: 860
[TR] rank: 0, norm: 1646.42, matches: 130, utts: 256, avg loss: 1.0197, batches: 860
[TR] rank: 7, norm: 1646.42, matches: 123, utts: 256, avg loss: 1.0018, batches: 860
[TR] rank: 3, norm: 1646.42, matches: 120, utts: 256, avg loss: 1.0964, batches: 860
[TR] rank: 4, norm: 1646.42, matches: 131, utts: 256, avg loss: 1.0161, batches: 860
[TR] rank: 6, norm: 1637.52, matches: 133, utts: 256, avg loss: 0.9388, batches: 870
[TR] rank: 1, norm: 1637.52, matches: 140, utts: 256, avg loss: 0.9769, batches: 870
[TR] rank: 2, norm: 1637.52, matches: 130, utts: 256, avg loss: 1.0458, batches: 870
[TR] rank: 4, norm: 1637.52, matches: 138, utts: 256, avg loss: 0.9570, batches: 870
[TR] rank: 5, norm: 1637.52, matches: 118, utts: 256, avg loss: 1.0001, batches: 870
[TR] rank: 0, norm: 1637.52, matches: 109, utts: 256, avg loss: 1.0895, batches: 870
[TR] rank: 3, norm: 1637.52, matches: 128, utts: 256, avg loss: 1.0204, batches: 870
[TR] rank: 7, norm: 1637.52, matches: 134, utts: 256, avg loss: 0.9811, batches: 870
[job-170078988023782404088-yihua-zhou-worker-1:00000] *** An error occurred in Socket closed
[job-170078988023782404088-yihua-zhou-worker-1:00000] *** reported by process [1778253825,2]
[job-170078988023782404088-yihua-zhou-worker-1:00000] *** on a NULL communicator
[job-170078988023782404088-yihua-zhou-worker-1:00000] *** Unknown error
[job-170078988023782404088-yihua-zhou-worker-1:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[job-170078988023782404088-yihua-zhou-worker-1:00000] ***    and MPI will try to terminate your MPI job as well)
--------------------------------------------------------------------------
An MPI communication peer process has unexpectedly disconnected.  This
usually indicates a failure in the peer process (e.g., a crash or
otherwise exiting without calling MPI_FINALIZE first).

Although this local MPI process will likely now behave unpredictably
(it may even hang or crash), the root cause of this problem is the
failure of the peer -- that is what you need to investigate.  For
example, there may be a core file that you can examine.  More
generally: such peer hangups are frequently caused by application bugs
or other external events.

  Local host: job-170078988023782404088-yihua-zhou-master-0
  Local PID:  62
  Peer host:  job-170078988023782404088-yihua-zhou-worker-2
--------------------------------------------------------------------------
/opt/conda/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 3 leaked semaphores to clean up at shutdown
  len(cache))
/opt/conda/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 3 leaked semaphores to clean up at shutdown
  len(cache))
/opt/conda/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 3 leaked semaphores to clean up at shutdown
  len(cache))
/opt/conda/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 3 leaked semaphores to clean up at shutdown
  len(cache))
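
For reference, a minimal way to get more detail out of the surviving ranks (a sketch assuming mpi4py is importable in the same conda environment; the actual training code may use a different MPI binding) is to switch MPI_COMM_WORLD from the default MPI_ERRORS_ARE_FATAL handler to ERRORS_RETURN, so the failure surfaces as a per-rank exception that can be logged instead of an immediate abort:

```python
# Sketch only: assumes mpi4py is installed in the same environment as the
# training job; the real training loop may use a different MPI binding.
from mpi4py import MPI

comm = MPI.COMM_WORLD

# Default behaviour is MPI_ERRORS_ARE_FATAL (abort the whole job, as seen
# above). ERRORS_RETURN surfaces the failure as a Python exception instead,
# which shows which rank saw the peer disconnect and on which call.
comm.Set_errhandler(MPI.ERRORS_RETURN)

try:
    comm.Barrier()  # stand-in for whatever collective fails during training
except MPI.Exception as exc:
    print(f"rank {comm.Get_rank()}: MPI error {exc.Get_error_code()}: "
          f"{exc.Get_error_string()}", flush=True)
```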
janjust commented 11 months ago

Hi @SimZhou, this could very well be outside of Open MPI. Do you have any indication that this is specifically an ompi v5.0 issue? Does it work with older ompi versions?
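
To help narrow that down, a quick way to confirm which Open MPI build the Python ranks actually load at runtime (a sketch assuming mpi4py is importable in the training environment):

```python
# Sketch only: assumes mpi4py; prints the MPI library each rank links against.
from mpi4py import MPI

print(MPI.Get_library_version())  # e.g. an "Open MPI v5.0.0 ..." banner
print(MPI.Get_version())          # MPI standard version, e.g. (3, 1)
```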