open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

OMPI-1.7 MPI_Alltoallv hang #1620

Open beginZero opened 8 years ago

beginZero commented 8 years ago

I encounter a nondeterministic hang when running the OSU benchmark osu_alltoallv in the following way:

mpirun -mca coll_basic_priority 100 -mca btl tcp,self ./osu_alltoallv -m SIZE.
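
For reference, the full launch line for the 3-node / 96-rank run described below would look roughly like this (a sketch; the hostfile name and SIZE are placeholders):

mpirun -np 96 -hostfile hosts.txt -mca coll_basic_priority 100 -mca btl tcp,self ./osu_alltoallv -m SIZE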

I run it on 3 compute nodes (32 cores/node), with 32 processes per node. The hang happens around the 512KB message size. OMPI emits the following error messages when the hang happens:

Mon May  2 20:12:53 CDT 2016

# OSU MPI All-to-Allv Personalized Exchange Latency Test
Number of processes: 96
Message size: 512.000000KB
[node12][[17490,1],60][btl_tcp_endpoint.c:457:mca_btl_tcp_endpoint_recv_blocking] recv(31) failed: Connection reset by peer (104)
--------------------------------------------------------------------------
Sorry!  You were supposed to get help about:
    client handshake fail
from the file:
    help-mpi-btl-tcp.txt
But I couldn't find that topic in the file.  Sorry!
--------------------------------------------------------------------------
[node12][[17490,1],36][btl_tcp_endpoint.c:457:mca_btl_tcp_endpoint_recv_blocking] recv(92) failed: Connection reset by peer (104)
[node12][[17490,1],34][btl_tcp_endpoint.c:457:mca_btl_tcp_endpoint_recv_blocking] recv(103) failed: Connection reset by peer (104)
[node12][[17490,1],45][btl_tcp_endpoint.c:457:mca_btl_tcp_endpoint_recv_blocking] recv(95) failed: Connection reset by peer (104)
[node12][[17490,1],58][btl_tcp_endpoint.c:457:mca_btl_tcp_endpoint_recv_blocking] recv(82) failed: Connection reset by peer (104)
[node12][[17490,1],37][btl_tcp_endpoint.c:457:mca_btl_tcp_endpoint_recv_blocking] recv(83) failed: Connection reset by peer (104)
[node12][[17490,1],44][btl_tcp_endpoint.c:457:mca_btl_tcp_endpoint_recv_blocking] recv(83) failed: Connection reset by peer (104)
[node12][[17490,1],32][btl_tcp_endpoint.c:457:mca_btl_tcp_endpoint_recv_blocking] recv(83) failed: Connection reset by peer (104)
--------------------------------------------------------------------------
Sorry!  You were supposed to get help about:
    client handshake fail
from the file:
    help-mpi-btl-tcp.txt
But I couldn't find that topic in the file.  Sorry!
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Sorry!  You were supposed to get help about:
    client handshake fail
from the file:
    help-mpi-btl-tcp.txt
But I couldn't find that topic in the file.  Sorry!
--------------------------------------------------------------------------
[node12][[17490,1],34][btl_tcp_endpoint.c:457:mca_btl_tcp_endpoint_recv_blocking] recv(102) failed: Connection reset by peer (104)
... (and so on, as above)

The Open MPI website mentions that this connection-reset error may be caused by an abnormally terminated process, but we have verified that this is not the root cause in our case. We suspect it might instead be caused by the traffic burst produced by the selected basic alltoallv algorithm.

EDIT: Added verbatim block
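
One way to sanity-check the guess about the basic algorithm (a sketch, not something we have tested exhaustively) is to repeat the same run without forcing coll_basic_priority, so Open MPI falls back to its default collective selection (typically the tuned component) for alltoallv:

mpirun -mca btl tcp,self ./osu_alltoallv -m SIZE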

jsquyres commented 8 years ago

Can you try upgrading to the latest Open MPI v1.10.x (as of this writing, 1.10.2) and see if the problem fixes itself?

beginZero commented 8 years ago

Thanks for the reply:)

I tried OMPI-1.10.0 and ran it in the same way. It failed again, but this time as a crash rather than a hang. The error messages are given below:

Fri May  6 12:52:08 CDT 2016

# OSU MPI All-to-Allv Personalized Exchange Latency Test
Number of processes: 96
Message size: 256.000000KB
[node10][[27278,1],18][btl_tcp_endpoint.c:618:mca_btl_tcp_endpoint_recv_blocking] recv(107, 0/8) failed: Connection reset by peer (104)
[node10.cluster:26565] pml_ob1_sendreq.c:232 FATAL
[node10][[27278,1],16][btl_tcp_endpoint.c:618:mca_btl_tcp_endpoint_recv_blocking] recv(108, 0/8) failed: Connection reset by peer (104)
[node10.cluster:26563] pml_ob1_sendreq.c:232 FATAL
[node11][[27278,1],44][btl_tcp_endpoint.c:618:mca_btl_tcp_endpoint_recv_blocking] recv(93, 0/8) failed: Connection reset by peer (104)
[node11.cluster:02793] pml_ob1_sendreq.c:232 FATAL
[node12][[27278,1],78][btl_tcp_endpoint.c:618:mca_btl_tcp_endpoint_recv_blocking] recv(197, 0/8) failed: Connection reset by peer (104)
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
[node11][[27278,1],45][btl_tcp_frag.c:237:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[node10][[27278,1],25][btl_tcp_endpoint.c:818:mca_btl_tcp_endpoint_complete_connect] connect() to 10.1.3.11 failed: Connection refused (111)
[node10.cluster:26581] [[27278,1],25] ORTE_ERROR_LOG: Unreachable in file pml_ob1_sendreq.c at line 1130
[node10][[27278,1],28][btl_tcp_endpoint.c:818:mca_btl_tcp_endpoint_complete_connect] connect() to 10.1.3.11 failed: Connection refused (111)
[node10][[27278,1],28][btl_tcp_frag.c:237:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[node10][[27278,1],28][btl_tcp_frag.c:237:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[node10][[27278,1],30][btl_tcp_endpoint.c:618:mca_btl_tcp_endpoint_recv_blocking] recv(56, 0/8) failed: Connection reset by peer (104)
[node10.cluster:26591] pml_ob1_sendreq.c:232 FATAL
[node10.cluster:26592] [[27278,1],31] ORTE_ERROR_LOG: Unreachable in file pml_ob1_sendreq.c at line 1130
[node10][[27278,1],6][btl_tcp_endpoint.c:618:mca_btl_tcp_endpoint_recv_blocking] recv(57, 0/8) failed: Connection reset by peer (104)
[node10.cluster:26542] pml_ob1_sendreq.c:232 FATAL
[node10][[27278,1],8][btl_tcp_frag.c:237:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[node10][[27278,1],8][btl_tcp_frag.c:237:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[node10][[27278,1],13][btl_tcp_endpoint.c:818:mca_btl_tcp_endpoint_complete_connect] connect() to 10.1.3.12 failed: Connection refused (111)
[node10][[27278,1],14][btl_tcp_endpoint.c:818:mca_btl_tcp_endpoint_complete_connect] connect() to 10.1.3.12 failed: Connection refused (111)
[node10][[27278,1],14][btl_tcp_endpoint.c:818:mca_btl_tcp_endpoint_complete_connect] connect() to 10.1.3.11 failed: Connection refused (111)
[node10][[27278,1],22][btl_tcp_endpoint.c:818:mca_btl_tcp_endpoint_complete_connect] connect() to 10.1.3.11 failed: Connection refused (111)
[node10][[27278,1],22][btl_tcp_endpoint.c:818:mca_btl_tcp_endpoint_complete_connect] connect() to 10.1.3.11 failed: Connection refused (111)
[node10][[27278,1],22][btl_tcp_endpoint.c:818:mca_btl_tcp_endpoint_complete_connect] connect() to 10.1.3.11 failed: Connection refused (111)
[node10][[27278,1],23][btl_tcp_endpoint.c:818:mca_btl_tcp_endpoint_complete_connect] connect() to 10.1.3.11 failed: Connection refused (111)
[node12][[27278,1],67][btl_tcp_frag.c:237:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[node11][[27278,1],32][btl_tcp_frag.c:237:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[node12][[27278,1],70][btl_tcp_frag.c:237:mca_btl_tcp_frag_recv] [node11][[27278,1],40][btl_tcp_frag.c:237:mca_btl_tcp_frag_recv] [node12][[27278,1],74][btl_tcp_frag.c:237:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[node11][[27278,1],45][btl_tcp_frag.c:237:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
... (and so on)

EDIT: Added verbatim markdown block.

bosilca commented 8 years ago

Because the error is triggered in mca_btl_tcp_endpoint_recv_blocking, this indicates that some connection is being established. As your original post states that the alltoall had been running for quite some time (you reached 512k), there is clearly already a working connection, and it is the establishment of a second connection that fails. This raises several questions:

1. How many network interfaces does each node have, and which ones?
2. Are all of them in a working state?
3. Does the problem persist if you restrict Open MPI to a single interface?
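
One way to make the connection establishment visible while reproducing this (a debugging sketch; the amount and format of the output vary by release and verbosity level) is to raise the BTL verbosity, for example:

mpirun -mca btl tcp,self -mca btl_base_verbose 100 ./osu_alltoallv -m SIZE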

beginZero commented 8 years ago

Thanks for the suggestion. I will double check the above questions.

beginZero commented 8 years ago

We tried the suggested settings, but the bug persists. Maybe this is related to some environment setting?

  1. Each node has two network interfaces: one Ethernet interface (eth0) and one InfiniBand interface (ib0).
  2. Yes, they are all in a working state.
  3. We tried running with only one interface, i.e. either eth0 or ib0, but the bug is still there.

jsquyres commented 8 years ago

Is there any chance you can upgrade to the latest Open MPI v1.10.x series?

I ask because both the v1.7 and v1.8 series are frozen / done -- bug fixes are being applied to the v1.10 series.

beginZero commented 8 years ago

Thanks for the reply. We have actually already tested OMPI-1.10.0; it crashes, as mentioned earlier in this thread.

Anyway, we can try a newer version :)

beginZero commented 8 years ago

Sorry for the late reply. We just tested it with the latest version.

We have now tested the latest version, 1.10.2, as well as 1.10.0 and 1.7. The bug still occurs. Most importantly, we found that the triggering condition for this bug is interesting. We use osu_alltoallv from the OSU benchmark suite for testing. This benchmark iterates over message sizes of 1 byte, 2 bytes, 4 bytes, ..., 2^n bytes, ... up to the specified upper bound.

If we start testing directly at 512 KB, it fails almost deterministically. But if we start from 1 byte, it successfully passes the 512 KB test.

That is to say, the bug does not happen if we increase the message size gradually. It is very weird. Any ideas? Thanks in advance.

beginZero commented 8 years ago

Continue...

If we start testing directly at 512 KB, it fails almost deterministically. But if we start from 1 byte, it passes the 512 KB test and even goes up to several megabytes without any problems.
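
Concretely, the two scenarios correspond to invocations of roughly this form (a sketch; recent OSU builds accept a MIN:MAX range for -m, older ones may only take a single maximum size):

mpirun -mca coll_basic_priority 100 -mca btl tcp,self ./osu_alltoallv -m 524288:524288   # start directly at 512 KB: fails
mpirun -mca coll_basic_priority 100 -mca btl tcp,self ./osu_alltoallv -m 1:524288        # ramp up from 1 byte: passes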

bosilca commented 8 years ago

The only way I can replicate the deadlock is by adding a non-connected network interface. As you state that you have tried restricting Open MPI to a single interface (I assume using --mca btl_tcp_if_include), I am running out of ideas regarding the cause of such a deadlock.

beginZero commented 8 years ago

Wow... my bad. When I tried to restrict communication to a single interface, I wrote btl_tcp_if_include as btl_if_include. Thanks for reminding me of that.

I then tested it the correct way. It works great when we restrict it to ib0 (with both 1.7 and 1.10.0), but it fails when we restrict it to eth0.
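
For reference, the corrected runs were of this form (a sketch; SIZE is a placeholder for the osu_alltoallv message-size argument):

mpirun -mca btl tcp,self -mca btl_tcp_if_include ib0 ./osu_alltoallv -m SIZE    # works
mpirun -mca btl tcp,self -mca btl_tcp_if_include eth0 ./osu_alltoallv -m SIZE   # fails at large message sizes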

It is strange how the failure with eth0 relates to the message size: at 256 KB it fails with a lower probability; at 512 KB it fails with a higher probability.