beginZero opened this issue 8 years ago
Can you try upgrading to the latest Open MPI v1.10.x (as of this writing, 1.10.2) and see if the problem fixes itself?
Thanks for the reply :)
I tried OMPI-1.10.0 and ran it in the same way. It failed again, but this time the effect is a crash rather than a hang. The error messages are given below:
```
Fri May 6 12:52:08 CDT 2016

# OSU MPI All-to-Allv Personalized Exchange Latency Test
Number of processes: 96
Message size: 256.000000KB
[node10][[27278,1],18][btl_tcp_endpoint.c:618:mca_btl_tcp_endpoint_recv_blocking] recv(107, 0/8) failed: Connection reset by peer (104)
[node10.cluster:26565] pml_ob1_sendreq.c:232 FATAL
[node10][[27278,1],16][btl_tcp_endpoint.c:618:mca_btl_tcp_endpoint_recv_blocking] recv(108, 0/8) failed: Connection reset by peer (104)
[node10.cluster:26563] pml_ob1_sendreq.c:232 FATAL
[node11][[27278,1],44][btl_tcp_endpoint.c:618:mca_btl_tcp_endpoint_recv_blocking] recv(93, 0/8) failed: Connection reset by peer (104)
[node11.cluster:02793] pml_ob1_sendreq.c:232 FATAL
[node12][[27278,1],78][btl_tcp_endpoint.c:618:mca_btl_tcp_endpoint_recv_blocking] recv(197, 0/8) failed: Connection reset by peer (104)
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
[node11][[27278,1],45][btl_tcp_frag.c:237:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[node10][[27278,1],25][btl_tcp_endpoint.c:818:mca_btl_tcp_endpoint_complete_connect] connect() to 10.1.3.11 failed: Connection refused (111)
[node10.cluster:26581] [[27278,1],25] ORTE_ERROR_LOG: Unreachable in file pml_ob1_sendreq.c at line 1130
[node10][[27278,1],28][btl_tcp_endpoint.c:818:mca_btl_tcp_endpoint_complete_connect] connect() to 10.1.3.11 failed: Connection refused (111)
[node10][[27278,1],28][btl_tcp_frag.c:237:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[node10][[27278,1],28][btl_tcp_frag.c:237:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[node10][[27278,1],30][btl_tcp_endpoint.c:618:mca_btl_tcp_endpoint_recv_blocking] recv(56, 0/8) failed: Connection reset by peer (104)
[node10.cluster:26591] pml_ob1_sendreq.c:232 FATAL
[node10.cluster:26592] [[27278,1],31] ORTE_ERROR_LOG: Unreachable in file pml_ob1_sendreq.c at line 1130
[node10][[27278,1],6][btl_tcp_endpoint.c:618:mca_btl_tcp_endpoint_recv_blocking] recv(57, 0/8) failed: Connection reset by peer (104)
[node10.cluster:26542] pml_ob1_sendreq.c:232 FATAL
[node10][[27278,1],8][btl_tcp_frag.c:237:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[node10][[27278,1],8][btl_tcp_frag.c:237:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[node10][[27278,1],13][btl_tcp_endpoint.c:818:mca_btl_tcp_endpoint_complete_connect] connect() to 10.1.3.12 failed: Connection refused (111)
[node10][[27278,1],14][btl_tcp_endpoint.c:818:mca_btl_tcp_endpoint_complete_connect] connect() to 10.1.3.12 failed: Connection refused (111)
[node10][[27278,1],14][btl_tcp_endpoint.c:818:mca_btl_tcp_endpoint_complete_connect] connect() to 10.1.3.11 failed: Connection refused (111)
[node10][[27278,1],22][btl_tcp_endpoint.c:818:mca_btl_tcp_endpoint_complete_connect] connect() to 10.1.3.11 failed: Connection refused (111)
[node10][[27278,1],22][btl_tcp_endpoint.c:818:mca_btl_tcp_endpoint_complete_connect] connect() to 10.1.3.11 failed: Connection refused (111)
[node10][[27278,1],22][btl_tcp_endpoint.c:818:mca_btl_tcp_endpoint_complete_connect] connect() to 10.1.3.11 failed: Connection refused (111)
[node10][[27278,1],23][btl_tcp_endpoint.c:818:mca_btl_tcp_endpoint_complete_connect] connect() to 10.1.3.11 failed: Connection refused (111)
[node12][[27278,1],67][btl_tcp_frag.c:237:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[node11][[27278,1],32][btl_tcp_frag.c:237:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[node12][[27278,1],70][btl_tcp_frag.c:237:mca_btl_tcp_frag_recv] [node11][[27278,1],40][btl_tcp_frag.c:237:mca_btl_tcp_frag_recv] [node12][[27278,1],74][btl_tcp_frag.c:237:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[node11][[27278,1],45][btl_tcp_frag.c:237:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
```
...(and so on)...
**EDIT: Added verbatim markdown block.**
Because the error is triggered in mca_btl_tcp_endpoint_recv_blocking, some connection is being established. As your original post states that the alltoall had been running for quite some time (since you reached 512k), there is clearly a working connection already, and it is the establishment of the second connection that fails. This raises several questions:
Thanks for the suggestion. I will double-check the above questions.
We tried the suggested setting, but the bug persists. Maybe this is related to some environment setting?
Is there any chance you can upgrade to the latest Open MPI v1.10.x series?
I ask because both the v1.7 and v1.8 series are frozen / done -- bug fixes are being applied to the v1.10 series.
Thanks for the reply. We have actually already tested OMPI-1.10.0; it crashes, though, as mentioned earlier in this thread.
Anyway, we can try a newer version :)
Sorry for the late reply. We just tested it with the latest version.
We have now tested the latest version, 1.10.2, as well as 1.10.0 and 1.7. The bug still occurs. Most importantly, we happened to find that the triggering case for this kind of bug is interesting. We use osu_alltoallv from the OSU benchmark suite to perform the testing. This benchmark iteratively tests message sizes of 1 byte, 2 bytes, 4 bytes, ..., 2^n bytes, ... up to the specified upper bound.
If we start the test directly at 512 KB, it fails almost deterministically. But if we start from 1 byte, it successfully passes the test at 512 KB.
That is to say, the bug does not occur if we increase the message size gradually. It is very weird. Any ideas? Thanks in advance.
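To make the triggering condition concrete, here is a minimal reproducer sketch, not the actual OSU benchmark: every rank posts a 512 KB-per-peer MPI_Alltoallv as its very first collective, with no smaller warm-up rounds, which is what starting osu_alltoallv at 512 KB amounts to. The buffer contents and the absence of timing code are illustrative assumptions.

```c
/*
 * Minimal sketch (not the actual OSU benchmark): every rank goes
 * straight to a large MPI_Alltoallv with no smaller warm-up rounds.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int count = 512 * 1024;                  /* 512 KB per peer */
    char *sendbuf = malloc((size_t)count * nprocs);
    char *recvbuf = malloc((size_t)count * nprocs);
    int  *counts  = malloc(nprocs * sizeof(int));
    int  *displs  = malloc(nprocs * sizeof(int));
    memset(sendbuf, 'a', (size_t)count * nprocs);
    for (int i = 0; i < nprocs; i++) {
        counts[i] = count;
        displs[i] = i * count;
    }

    /* No gradual ramp-up: the very first collective is the big one. */
    MPI_Alltoallv(sendbuf, counts, displs, MPI_CHAR,
                  recvbuf, counts, displs, MPI_CHAR, MPI_COMM_WORLD);

    if (rank == 0)
        printf("alltoallv of %d bytes per peer completed\n", count);

    free(sendbuf); free(recvbuf); free(counts); free(displs);
    MPI_Finalize();
    return 0;
}
```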
Continue...
If we start the test directly at 512 KB, it fails almost deterministically. But if we start from 1 byte, it successfully passes the test at 512 KB and even goes on to several megabytes without any problems.
The only way to replicate the deadlock is to add a non-connected network interface. As you state that you have tried to restrict Open MPI to a single interface (I assume using --mca btl_tcp_if_include), I am running out of ideas regarding the cause of such a deadlock.
Wow... my bad. When I tried to restrict communication to a single interface, I wrote btl_tcp_if_include as btl_if_include. Thanks for catching that.
I then tested it the correct way. It works fine when we restrict it to ib0, but fails when we restrict it to eth0, and this holds for both 1.7 and 1.10.0.
It is strange how the failure on eth0 relates to the message size: at 256 KB it fails with a lower probability; at 512 KB it fails with a higher probability.
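For anyone else hitting the same typo: the corrected form of the restriction, reusing only the flags already quoted in this thread (substitute whichever interface your cluster actually uses), is `mpirun -mca btl tcp,self -mca btl_tcp_if_include ib0 ./osu_alltoallv -m SIZE`.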
I encounter a nondeterministic hang when running the OSU benchmark osu_alltoallv in the following way:
```
mpirun -mca coll_basic_priority 100 -mca btl tcp,self ./osu_alltoallv -m SIZE
```
I run it on 3 compute nodes (32 cores/node), with 32 processes per node. The hang happens around a 512 KB message size. Open MPI emits the following error messages when the hang happens:
The Open MPI homepage mentions that this RST error may be caused by an abnormally terminated process, but we have verified that this is not the root cause in our case. We guess that it might be brought about by the traffic burst caused by the selected basic alltoallv algorithm (a sketch of that pattern follows below).
**EDIT: Added verbatim block.**
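For what it is worth, the traffic-burst guess can be illustrated with a simplified sketch of a linear alltoallv. This is an assumption about the general shape of such an algorithm, not the actual Open MPI coll/basic source: every rank posts nonblocking receives and sends to all peers up front and then waits, so with 96 ranks over TCP all pairwise connections are attempted at essentially the same time.

```c
/*
 * Simplified sketch of a linear alltoallv (NOT the actual Open MPI
 * coll/basic source): each rank posts nonblocking receives and sends
 * to all peers up front, then waits. Over TCP this means all pairwise
 * connections are attempted at essentially the same time.
 */
#include <mpi.h>
#include <stdlib.h>

int linear_alltoallv(const char *sbuf, const int *scounts, const int *sdispls,
                     char *rbuf, const int *rcounts, const int *rdispls,
                     MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    MPI_Request *reqs = malloc(2 * size * sizeof(MPI_Request));
    int n = 0;

    /* Post all receives first... */
    for (int peer = 0; peer < size; peer++)
        MPI_Irecv(rbuf + rdispls[peer], rcounts[peer], MPI_CHAR,
                  peer, 0, comm, &reqs[n++]);

    /* ...then all sends: a burst of traffic to every peer at once. */
    for (int peer = 0; peer < size; peer++)
        MPI_Isend(sbuf + sdispls[peer], scounts[peer], MPI_CHAR,
                  peer, 0, comm, &reqs[n++]);

    int rc = MPI_Waitall(n, reqs, MPI_STATUSES_IGNORE);
    free(reqs);
    return rc;
}
```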