tjcw opened 5 years ago
I now have access to x86 servers. This test still fails with status -3 (I/O error), but IMB gets further before the error happens. The error occurs reproducibly when IMB attempts an Allreduce of 2048 bytes; this may be the point where the collectives library switches to a different allreduce algorithm, which could cause new pairs of ranks to start communicating (and therefore trigger ucx wireup processing). It would be impractical to turn on ucx trace for the whole IMB run, but I may be able to reduce the run and turn trace on. This is running 144 ranks on 9 nodes, using openucx 'master' at commit 4042503042ac9132bb505dd675d93b506eb868f7. Running a subset of IMB with ucx trace on gives 2 failing ranks (0 and 16, one on each of the first 2 nodes); the last few lines of trace on each failing rank are:
```
[1571140750.742561] [f4n03:14620:0] mpool.inl:81 UCX TRACE Returning obj=0x7fc0f97a9308 to mpool=0x361f708
[1571140750.742567] [f4n03:14620:0] mpool.inl:82 UCX TRACE Returning obj=0x7fc0f97a9308 to mpool name=rc_send_desc
[1571140750.742572] [f4n03:14620:0] mpool.inl:81 UCX TRACE Returning obj=0x7fc0f97a72c8 to mpool=0x361f708
[1571140750.742577] [f4n03:14620:0] mpool.inl:82 UCX TRACE Returning obj=0x7fc0f97a72c8 to mpool name=rc_send_desc
[1571140750.742583] [f4n03:14620:0] mpool.inl:81 UCX TRACE Returning obj=0x7fc0f97a5288 to mpool=0x361f708
[1571140750.742588] [f4n03:14620:0] mpool.inl:82 UCX TRACE Returning obj=0x7fc0f97a5288 to mpool name=rc_send_desc
[1571140750.749725] [f4n03:14620:0] ib_mlx5_log.c:139 UCX DEBUG Local protection on mlx5_0:1/IB (synd 0x4 vend 0x56 hw_synd 0/0)
[1571140750.749725] [f4n03:14620:0] ib_mlx5_log.c:139 UCX DEBUG DCI QP 0x2763f wqe[13]: opcode SEND
[1571140750.749760] [f4n03:14620:0] ucp_request.c:195 UCX REQ req 0x37cfac0: mem dereg buffer 0/2 md_map 0x20
[1571140750.749765] [f4n03:14620:0] ucp_mm.c:71 UCX TRACE de-registering memh[0]=0x358e158 from md[5]
[1571140750.749772] [f4n03:14620:0] ucp_request.c:195 UCX REQ req 0x37cfac0: mem dereg buffer 1/2 md_map 0x20
[1571140750.749776] [f4n03:14620:0] ucp_mm.c:71 UCX TRACE de-registering memh[0]=0x358e158 from md[5]
[1571140750.749787] [f4n03:14620:0] ucp_request.inl:96 UCX REQ completing send request 0x37cfac0 (0x37cfba8) ----c-- Input/output error
[1571140750.749791] [f4n03:14620:0] ucp_request.inl:98 UCX REQ Driving callback
[1571140750.749808] [f4n04:29729:0] mpool.inl:82 UCX TRACE Returning obj=0x7f04041f5c88 to mpool name=rc_send_desc
[1571140750.749814] [f4n04:29729:0] mpool.inl:81 UCX TRACE Returning obj=0x7f04041e5a88 to mpool=0x23e5448
[1571140750.749818] [f4n04:29729:0] mpool.inl:82 UCX TRACE Returning obj=0x7f04041e5a88 to mpool name=rc_send_desc
[1571140750.749838] [f4n04:29729:0] ucp_am.c:611 UCX REQ allocated request 0x252e5c0
[1571140750.749855] [f4n04:29729:0] ucp_mm.c:124 UCX TRACE registered address 0x25547f0 length 32 on md[5] memh[0]=0x234cb98
[1571140750.749859] [f4n04:29729:0] ucp_request.c:252 UCX REQ req 0x252e5c0: mem reg iov 0/2 md_map 0x20/0x20
[1571140750.749862] [f4n04:29729:0] ucp_mm.c:124 UCX TRACE registered address 0x254f900 length 2048 on md[5] memh[0]=0x234cb98
[1571140750.749865] [f4n04:29729:0] ucp_request.c:252 UCX REQ req 0x252e5c0: mem reg iov 1/2 md_map 0x20/0x20
[1571140750.758098] [f4n04:29729:0] ib_mlx5_log.c:139 UCX DEBUG Local protection on mlx5_0:1/IB (synd 0x4 vend 0x56 hw_synd 0/0)
[1571140750.758098] [f4n04:29729:0] ib_mlx5_log.c:139 UCX DEBUG DCI QP 0x9a95 wqe[11]: opcode SEND
[1571140750.758116] [f4n04:29729:0] ucp_request.c:195 UCX REQ req 0x252e5c0: mem dereg buffer 0/2 md_map 0x20
[1571140750.758120] [f4n04:29729:0] ucp_mm.c:71 UCX TRACE de-registering memh[0]=0x234cb98 from md[5]
[1571140750.758126] [f4n04:29729:0] ucp_request.c:195 UCX REQ req 0x252e5c0: mem dereg buffer 1/2 md_map 0x20
[1571140750.758129] [f4n04:29729:0] ucp_mm.c:71 UCX TRACE de-registering memh[0]=0x234cb98 from md[5]
[1571140750.758133] [f4n04:29729:0] ucp_request.inl:96 UCX REQ completing send request 0x252e5c0 (0x252e6a8) ----c-- Input/output error
[1571140750.758136] [f4n04:29729:0] ucp_request.inl:98 UCX REQ Driving callback
```
I am using ucx built from 'master' at commit 6110727f8de09703e0f4cde4977fa7e22f0349c5. I have a test case using a development version of IBM Spectrum MPI (based on Open MPI) and the Intel MPI Benchmarks (IMB). The development version of IBM Spectrum MPI communicates using ucp active messages. When I run this on 120 ranks (6 nodes at 20 ranks/node), the send completion callback of a ucp_am_send_nb call is sometimes driven with status=-3 (I/O error). This happens early in the IMB run, I think while ucx is wiring up endpoints on behalf of the application. Here are the last few lines from ucx trace in the failing rank.

The attached ucx_c712f6n06_37273.log is the whole ucx trace from the failing rank. I think the application was trying to send the same message buffer to several ranks, i.e. there were several ucp_am_send_nb calls whose iovecs referred to the same data in this rank. Some of the active messages succeed, but the application asserts on the first completion status that is not UCS_OK. The test case runs over IB on IBM POWER8 with Mellanox ConnectX-5 IB adapters.