openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org
Other
1.14k stars 424 forks source link

ucp active messages give I/O error under load #4268

Open tjcw opened 5 years ago

tjcw commented 5 years ago

I am using ucx built from 'master' 6110727f8de09703e0f4cde4977fa7e22f0349c5 . I have a test case using a development version of IBM Spectrum MPI (based on OpenMPI), and Intel MPI Benchmarks. The development version of IBM Spectrum MPI communicates using ucp active messages. When I run this on 120 ranks (6 nodes at 20 ranks/node), I sometimes get the send completion callback of a ucp_am_send_nb call driven with status=-3 (I/O Error). This happens early in the IMB run, I think while ucx is wiring up endpoints on behalf of the application. Here are the last few lines from ucx trace in the failing rank

[1570713047.800167] [c712f6n06:37273:0]         select.c:205  UCX  TRACE dc_mlx5/mlx5_0:1 : not suitable for auxiliary, no async callback
[1570713047.800171] [c712f6n06:37273:0]         wireup.c:254  UCX  TRACE ep 0x3fff9c090380: remote connected
[1570713047.800174] [c712f6n06:37273:0]      wireup_ep.c:658  UCX  TRACE ep 0x3fff9c090380: wireup ep 0x10041249730 is remote-connected
[1570713047.800178] [c712f6n06:37273:0]         wireup.c:464  UCX  TRACE ep 0x3fff9c090380: sending wireup reply
[1570713047.800183] [c712f6n06:37273:0]      wireup_ep.c:77   UCX  TRACE ep 0x3fff9c090380: switching wireup_ep 0x10041249730 to ready state
[1570713047.800186] [c712f6n06:37273:0]      wireup_ep.c:376  UCX  DEBUG ep 0x3fff9c090380: destroy wireup ep 0x10041249730
[1570713047.804700] [c712f6n06:37273:0]    ib_mlx5_log.c:139  UCX  DEBUG Local protection on mlx5_0:1/IB (synd 0x4 vend 0x56 hw_synd 0/0)
[1570713047.804700] [c712f6n06:37273:0]    ib_mlx5_log.c:139  UCX  DEBUG DCI QP 0x10ea2 wqe[7]: opcode SEND
[1570713047.804722] [c712f6n06:37273:0]    ucp_request.c:195  UCX  REQ   req 0x10041241e80: mem dereg buffer 0/2 md_map 0x40
[1570713047.804726] [c712f6n06:37273:0]         ucp_mm.c:71   UCX  TRACE de-registering memh[0]=0x10040f2da28 from md[6]
[1570713047.804732] [c712f6n06:37273:0]    ucp_request.c:195  UCX  REQ   req 0x10041241e80: mem dereg buffer 1/2 md_map 0x40
[1570713047.804735] [c712f6n06:37273:0]         ucp_mm.c:71   UCX  TRACE de-registering memh[0]=0x10040f2da28 from md[6]
[1570713047.804740] [c712f6n06:37273:0]  ucp_request.inl:95   UCX  REQ   completing send request 0x10041241e80 (0x10041241f68) ----c-- Input/output error

Here ucx_c712f6n06_37273.log is the whole ucx trace from the failing rank. I think the application was trying to send the same message buffer to several ranks, i.e. there were several ucp_am_send_nb calls with 'iovec's referring to the same data in this rank. Some of the active messages succeed, but the application asserts when it gets the first status which is not UCS_OK. The test case is running over IB on IBM POWER8, with Mellanox ConnectX-5 IB adapters.

tjcw commented 5 years ago

I now have access to x86 servers. This test still fails with status -3 (I/O error), but IMB gets further before the error happens. The error occurs repeatably when IMB tries an 'Allreduce' for 2048 bytes; this may be where the collectives library starts using a different algorithm for the allreduce, which could cause new ranks to start communicating (and therefore ucx wireup processing). It would be impractical to turn on ucx trace for the whole IMB run, but I may be able to reduce the run and turn trace on. This is running 144 ranks on 9 nodes, using openucx 'master' at commit 4042503042ac9132bb505dd675d93b506eb868f7 . Running a subset of IMB and with ucx trace on gives 2 failing ranks (0 and 16, one each on the first 2 nodes), the last few lines of trace on each failing rank are

[1571140750.742561] [f4n03:14620:0]        mpool.inl:81   UCX  TRACE Returning obj=0x7fc0f97a9308 to mpool=0x361f708
[1571140750.742567] [f4n03:14620:0]        mpool.inl:82   UCX  TRACE Returning obj=0x7fc0f97a9308 to mpool name=rc_send_desc
[1571140750.742572] [f4n03:14620:0]        mpool.inl:81   UCX  TRACE Returning obj=0x7fc0f97a72c8 to mpool=0x361f708
[1571140750.742577] [f4n03:14620:0]        mpool.inl:82   UCX  TRACE Returning obj=0x7fc0f97a72c8 to mpool name=rc_send_desc
[1571140750.742583] [f4n03:14620:0]        mpool.inl:81   UCX  TRACE Returning obj=0x7fc0f97a5288 to mpool=0x361f708
[1571140750.742588] [f4n03:14620:0]        mpool.inl:82   UCX  TRACE Returning obj=0x7fc0f97a5288 to mpool name=rc_send_desc
[1571140750.749725] [f4n03:14620:0]    ib_mlx5_log.c:139  UCX  DEBUG Local protection on mlx5_0:1/IB (synd 0x4 vend 0x56 hw_synd 0/0)
[1571140750.749725] [f4n03:14620:0]    ib_mlx5_log.c:139  UCX  DEBUG DCI QP 0x2763f wqe[13]: opcode SEND
[1571140750.749760] [f4n03:14620:0]    ucp_request.c:195  UCX  REQ   req 0x37cfac0: mem dereg buffer 0/2 md_map 0x20
[1571140750.749765] [f4n03:14620:0]         ucp_mm.c:71   UCX  TRACE de-registering memh[0]=0x358e158 from md[5]
[1571140750.749772] [f4n03:14620:0]    ucp_request.c:195  UCX  REQ   req 0x37cfac0: mem dereg buffer 1/2 md_map 0x20
[1571140750.749776] [f4n03:14620:0]         ucp_mm.c:71   UCX  TRACE de-registering memh[0]=0x358e158 from md[5]
[1571140750.749787] [f4n03:14620:0]  ucp_request.inl:96   UCX  REQ   completing send request 0x37cfac0 (0x37cfba8) ----c-- Input/output error
[1571140750.749791] [f4n03:14620:0]  ucp_request.inl:98   UCX  REQ   Driving callback
[1571140750.749808] [f4n04:29729:0]        mpool.inl:82   UCX  TRACE Returning obj=0x7f04041f5c88 to mpool name=rc_send_desc
[1571140750.749814] [f4n04:29729:0]        mpool.inl:81   UCX  TRACE Returning obj=0x7f04041e5a88 to mpool=0x23e5448
[1571140750.749818] [f4n04:29729:0]        mpool.inl:82   UCX  TRACE Returning obj=0x7f04041e5a88 to mpool name=rc_send_desc
[1571140750.749838] [f4n04:29729:0]         ucp_am.c:611  UCX  REQ   allocated request 0x252e5c0
[1571140750.749855] [f4n04:29729:0]         ucp_mm.c:124  UCX  TRACE registered address 0x25547f0 length 32 on md[5] memh[0]=0x234cb98
[1571140750.749859] [f4n04:29729:0]    ucp_request.c:252  UCX  REQ   req 0x252e5c0: mem reg iov 0/2 md_map 0x20/0x20
[1571140750.749862] [f4n04:29729:0]         ucp_mm.c:124  UCX  TRACE registered address 0x254f900 length 2048 on md[5] memh[0]=0x234cb98
[1571140750.749865] [f4n04:29729:0]    ucp_request.c:252  UCX  REQ   req 0x252e5c0: mem reg iov 1/2 md_map 0x20/0x20
[1571140750.758098] [f4n04:29729:0]    ib_mlx5_log.c:139  UCX  DEBUG Local protection on mlx5_0:1/IB (synd 0x4 vend 0x56 hw_synd 0/0)
[1571140750.758098] [f4n04:29729:0]    ib_mlx5_log.c:139  UCX  DEBUG DCI QP 0x9a95 wqe[11]: opcode SEND
[1571140750.758116] [f4n04:29729:0]    ucp_request.c:195  UCX  REQ   req 0x252e5c0: mem dereg buffer 0/2 md_map 0x20
[1571140750.758120] [f4n04:29729:0]         ucp_mm.c:71   UCX  TRACE de-registering memh[0]=0x234cb98 from md[5]
[1571140750.758126] [f4n04:29729:0]    ucp_request.c:195  UCX  REQ   req 0x252e5c0: mem dereg buffer 1/2 md_map 0x20
[1571140750.758129] [f4n04:29729:0]         ucp_mm.c:71   UCX  TRACE de-registering memh[0]=0x234cb98 from md[5]
[1571140750.758133] [f4n04:29729:0]  ucp_request.inl:96   UCX  REQ   completing send request 0x252e5c0 (0x252e6a8) ----c-- Input/output error
[1571140750.758136] [f4n04:29729:0]  ucp_request.inl:98   UCX  REQ   Driving callback