openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

UCT interface not thread safe on Cray #2388

Open hjelmn opened 6 years ago

hjelmn commented 6 years ago

I am working on a UCT BTL for RMA in Open MPI. It works with single-threaded RMA tests, but I am seeing various errors when running with two or more threads. I specified thread safety for the worker:

ucs_status = uct_worker_create (module->ucs_async, UCS_THREAD_MODE_MULTI, &module->uct_worker);

Putting locks around all UCT calls (uct_ep_put_zcopy, uct_worker_progress, etc.) fixes the issue.

This is with the UCX master branch configured with --enable-mt.

Example error:

[nid00014:54017:0:54017] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace ====
    0  /lustre/ttscratch1/hjelmn/build/ucx_master/lib/libucs.so.0(+0x1d930) [0x2aaabdb2d930]
    1  /lustre/ttscratch1/hjelmn/build/ucx_master/lib/libucs.so.0(+0x1db74) [0x2aaabdb2db74]
===================
(gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x00002aaabd6a491a in uct_ugni_progress (arg=0x8678b0) at ugni/rdma/ugni_rdma_iface.c:143
#2  0x00002aaabd223d4a in ucs_callbackq_dispatch (cbq=<optimized out>)
    at /lustre/ttscratch1/hjelmn/build/ucx_master/include/ucs/datastruct/callbackq.h:208
#3  uct_worker_progress (worker=<optimized out>) at /lustre/ttscratch1/hjelmn/build/ucx_master/include/uct/api/uct.h:1631
#4  mca_btl_ucx_flush (btl=0x866230, endpoint=0x0) at ../../../../../opal/mca/btl/ucx/btl_ucx_rdma.c:135
#5  0x00002aab4007849e in ompi_osc_rdma_sync_rdma_complete (sync=0x9ec430) at ../../../../../ompi/mca/osc/rdma/osc_rdma.h:587
#6  ompi_osc_rdma_flush (target=1, win=<optimized out>) at ../../../../../ompi/mca/osc/rdma/osc_rdma_passive_target.c:58
#7  0x00002aaaaaf98d51 in PMPI_Win_flush (rank=1, win=0x9dc6b0) at pwin_flush.c:59
#8  0x0000000000406b8c in bw_orig_flush (a=0x8685e0) at rmamt_bw.c:473
#9  0x0000000000403c5a in main (argc=10, argv=0x7fffffff5988) at rmamt_bw.c:217

Note this BTL is not intended to target ugni in the long run. It is intended for mlx5, but we do not have the appropriate software installed on our ConnectX-5 system at this time. Once that is installed I can test the same code there to see whether the bug is in common UCT code or in the ugni code.

yosefe commented 6 years ago

@hjelmn multi-threading is not currently supported by the UCT layer

hjelmn commented 6 years ago

Ah, ok. The documentation says it supports it:

https://github.com/openucx/ucx/wiki/UCT-Design#thread-safety

hjelmn commented 6 years ago

BTW, the UCX benchmark also fails with the same error:

mpirun -n 2 -N 1 --mca btl ^ucx ./ucx_perftest -t put_bw -x ugni_rdma -d ARIES:0 -T 32
+--------------+-----------------------------+---------------------+-----------------------+
|              |       latency (usec)        |   bandwidth (MB/s)  |  message rate (msg/s) |
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
| # iterations | typical | average | overall |  average |  overall |   average |   overall |
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
[nid00014:7317 :0:7347] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace ====
    0  /lustre/ttscratch1/hjelmn/ucx-git/src/ucs/.libs/libucs.so.0(+0x1d930) [0x2aaaab853930]
    1  /lustre/ttscratch1/hjelmn/ucx-git/src/ucs/.libs/libucs.so.0(+0x1db74) [0x2aaaab853b74]
===================
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 7317 on node nid00014 exited on signal 11 (Segmentation fault).
hjelmn commented 6 years ago

Also, I have to point out that if UCT is not thread safe, there is no reason the ugni TL should be locking at all.