Open hjelmn opened 6 years ago
@hjelmn multi-threading is not currently supported by the UCT layer.
Ah, ok. The documentation says it supports it:
https://github.com/openucx/ucx/wiki/UCT-Design#thread-safety
BTW, the UCX benchmark is also failing with the same error:
mpirun -n 2 -N 1 --mca btl ^ucx ./ucx_perftest -t put_bw -x ugni_rdma -d ARIES:0 -T 32
+--------------+-----------------------------+---------------------+-----------------------+
|              |       latency (usec)        |   bandwidth (MB/s)  |  message rate (msg/s) |
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
| # iterations | typical | average | overall |  average |  overall |   average |   overall |
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
[nid00014:7317 :0:7347] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace ====
0 /lustre/ttscratch1/hjelmn/ucx-git/src/ucs/.libs/libucs.so.0(+0x1d930) [0x2aaaab853930]
1 /lustre/ttscratch1/hjelmn/ucx-git/src/ucs/.libs/libucs.so.0(+0x1db74) [0x2aaaab853b74]
===================
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 7317 on node nid00014 exited on signal 11 (Segmentation fault).
Also, I have to point out that if UCT is not thread-safe, there is no reason for the ugni TL to be locking at all.
I am working on a UCT BTL for RMA in Open MPI. It works with single-threaded RMA tests, but I am seeing various errors when running with two or more threads. I specified thread safety for the worker:
Putting locks around all UCT calls (uct_ep_put_zcopy, uct_worker_progress, etc.) fixes the issue.
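For anyone hitting the same thing, here is a minimal sketch of the kind of workaround described above: one mutex serializing every call into a layer that is assumed not to be thread-safe. It is not the actual BTL code. `fake_put` and `fake_progress` are hypothetical stand-ins for calls such as uct_ep_put_zcopy and uct_worker_progress, and `run_locked_calls` is a made-up driver so the sketch can be built without UCX.

```c
#include <assert.h>
#include <pthread.h>

#define MAX_THREADS 64

/* Shared state of the pretend transport. It is deliberately not atomic:
 * it is only safe to touch while holding transport_lock. */
static long pending_ops;
static long completed_ops;

/* Hypothetical stand-in for a non-thread-safe post call
 * (e.g. uct_ep_put_zcopy). This is NOT the UCX API. */
static void fake_put(void)
{
    pending_ops++;
}

/* Hypothetical stand-in for a non-thread-safe progress call
 * (e.g. uct_worker_progress). This is NOT the UCX API. */
static void fake_progress(void)
{
    if (pending_ops > 0) {
        pending_ops--;
        completed_ops++;
    }
}

/* A single lock guards every entry point into the transport. */
static pthread_mutex_t transport_lock = PTHREAD_MUTEX_INITIALIZER;

static void locked_put(void)
{
    pthread_mutex_lock(&transport_lock);
    fake_put();
    pthread_mutex_unlock(&transport_lock);
}

static void locked_progress(void)
{
    pthread_mutex_lock(&transport_lock);
    fake_progress();
    pthread_mutex_unlock(&transport_lock);
}

/* Each thread posts an operation and then progresses, iters times. */
static void *thread_main(void *arg)
{
    int iters = *(int *)arg;
    for (int i = 0; i < iters; i++) {
        locked_put();
        locked_progress();
    }
    return NULL;
}

/* Runs nthreads threads against the locked wrappers and returns the
 * number of completed operations. */
long run_locked_calls(int nthreads, int iters)
{
    pthread_t threads[MAX_THREADS];
    int started = 0;

    assert(nthreads >= 1 && nthreads <= MAX_THREADS);
    pending_ops = 0;
    completed_ops = 0;

    for (int i = 0; i < nthreads; i++) {
        if (pthread_create(&threads[started], NULL, thread_main, &iters) != 0) {
            break; /* keep going with the threads that did start */
        }
        started++;
    }
    for (int i = 0; i < started; i++) {
        pthread_join(threads[i], NULL);
    }

    /* Drain anything still pending; all worker threads have exited. */
    while (pending_ops > 0) {
        locked_progress();
    }
    return completed_ops;
}
```

Calling `run_locked_calls(4, 1000)` should report every posted operation as completed. A single global lock is the bluntest option and serializes all threads, so it is a way to confirm the diagnosis rather than a performance fix.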
This is with UCX master branch configured with --enable-mt.
Example error:
Note that this BTL is not intended to target ugni in the long run. It is intended for mlx5, but we do not have the appropriate software installed on our ConnectX-5 system at this time. Once that is installed, I can test the same code there to see whether the bug is in the common UCT code or in the ugni code.