I am training on a cluster with five workers, where the node 2.com is the master worker. The error output looks like this:
Fatal error in PMPI_Test: A process has failed, error stack:
PMPI_Test(166).............: MPI_Test(request=0x73f750, flag=0x7fbd36633840, status=0x7fbd36633850) failed
MPIR_Test_impl(65).........:
MPID_nem_tcp_connpoll(1826): Communication error with rank 1: Connection reset by peer
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= EXIT CODE: 9
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:2@3.com] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
[proxy:0:2@3.com] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:2@3.com] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[proxy:0:3@4.com] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
[proxy:0:3@4.com] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:3@4.com] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[proxy:0:0@0.com] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
[proxy:0:0@0.com] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0@0.com] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[proxy:0:4@5.com] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
[proxy:0:4@5.com] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:4@5.com] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec@2.com] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec@2.com] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@2.com] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:217): launcher returned error waiting for completion
[mpiexec@2.com] main (./ui/mpich/mpiexec.c:331): process manager error waiting for completion
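One thing I noticed while reading the log: every Hydra proxy tag has the form [proxy:&lt;pgid&gt;:&lt;proxy_id&gt;@&lt;host&gt;], and the assert failure is reported by proxies 0, 2, 3, and 4 but never by proxy 1. If I read the tags right, that is consistent with the node running proxy 1 (where rank 1 lived) dropping its connection first, which would match the "Communication error with rank 1: Connection reset by peer". A small helper sketch (not part of the training job; it just re-parses the lines pasted above) to summarize which proxies reported errors:

```python
import re

# A few of the Hydra proxy error lines from the mpiexec output above.
log = """\
[proxy:0:2@3.com] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
[proxy:0:3@4.com] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
[proxy:0:0@0.com] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
[proxy:0:4@5.com] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
"""

# Each Hydra proxy tag looks like [proxy:<pgid>:<proxy_id>@<host>].
pattern = re.compile(r"\[proxy:(\d+):(\d+)@([^\]]+)\]")

# Map proxy id -> host for every proxy that reported the assert failure.
failed = {int(m.group(2)): m.group(3) for m in pattern.finditer(log)}
print(sorted(failed.items()))
```

Running this shows proxies 0, 2, 3, and 4 with their hosts, and no entry for proxy 1; the missing proxy is the one whose node I would inspect first (system logs, memory pressure, network) for why its process died.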