I am training on a cluster with five workers, where the node 2.com is the master worker. The error output looks like this:
Fatal error in PMPI_Test: A process has failed, error stack:
PMPI_Test(166).............: MPI_Test(request=0x73f750, flag=0x7fbd36633840, status=0x7fbd36633850) failed
MPIR_Test_impl(65).........:
MPID_nem_tcp_connpoll(1826): Communication error with rank 1: Connection reset by peer
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= EXIT CODE: 9
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:2@3.com] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
[proxy:0:2@3.com] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:2@3.com] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[proxy:0:3@4.com] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
[proxy:0:3@4.com] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:3@4.com] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[proxy:0:0@0.com] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
[proxy:0:0@0.com] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0@0.com] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[proxy:0:4@5.com] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
[proxy:0:4@5.com] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:4@5.com] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec@2.com] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec@2.com] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@2.com] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:217): launcher returned error waiting for completion
[mpiexec@2.com] main (./ui/mpich/mpiexec.c:331): process manager error waiting for completion
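One thing I noticed while reading the log: every Hydra proxy tag has the form [proxy:&lt;pgid&gt;:&lt;proxy_id&gt;@&lt;host&gt;], and the assert failure is reported by proxies 0, 2, 3, and 4 but never by proxy 1. If I read the tags right, that is consistent with the node running proxy 1 (where rank 1 lived) dropping its connection first, which would match the "Communication error with rank 1: Connection reset by peer". A small helper sketch (not part of the training job; it just re-parses the lines pasted above) to summarize which proxies reported errors:

```python
import re

# A few of the Hydra proxy error lines from the mpiexec output above.
log = """\
[proxy:0:2@3.com] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
[proxy:0:3@4.com] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
[proxy:0:0@0.com] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
[proxy:0:4@5.com] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
"""

# Each Hydra proxy tag looks like [proxy:<pgid>:<proxy_id>@<host>].
pattern = re.compile(r"\[proxy:(\d+):(\d+)@([^\]]+)\]")

# Map proxy id -> host for every proxy that reported the assert failure.
failed = {int(m.group(2)): m.group(3) for m in pattern.finditer(log)}
print(sorted(failed.items()))
```

Running this shows proxies 0, 2, 3, and 4 with their hosts, and no entry for proxy 1; the missing proxy is the one whose node I would inspect first (system logs, memory pressure, network) for why its process died.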