devreal opened this issue 3 years ago (status: Open)
can you please try setting -mca pml_ucx_progress_iterations 1?
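For example, on the mpirun command line (a sketch; the application and its arguments are just placeholders):
$ mpirun -n 4 --mca pml ucx --mca pml_ucx_progress_iterations 1 ./testing_dpotrf <args>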
I will rerun the experiments with pml_ucx_progress_iterations 1, but I don't see how that can help with the long calls into ucp_put_nbi, since there is only one thread calling into MPI and the puts don't require anything else to progress. Will report back :)
potentially, put_nbi takes a long time because of internal UCX request memory allocation (you would also see memory usage grow), because osc/ucx does not progress frequently enough to release completed put_nbi operations. pml_ucx_progress_iterations 1 should make pml/ucx yield more progress cycles to osc/ucx.
Sorry for the delay, I had to do some more experiments. First, setting pml_ucx_progress_iterations 1 does not make a difference; the results are the same.
I did run with UCX profiling enabled and had a hard time interpreting the information. Unfortunately, the aggregated information does not provide maximums, only averages, which makes it hard to find outliers. I also ran with logging enabled, but I'm not sure how to interpret the output of ucx_read_prof for log profiles. Assuming that the number following the function name is the runtime in milliseconds, I see several high runtimes between 4 and 6 ms:
0.013 uct_ep_put_zcopy 6.702 { rma_basic.c:63 ucp_rma_basic_progress_put()
6.702 }
This is the maximum I could find in the traces. Unfortunately, it does not say what is happening there.
Is my interpretation of the profiling output correct here? Any chance I could get more detailed information out of a run? I don't mind adding additional instrumentation points but I'm not sure where to put them :)
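For reference, the profiling runs were enabled roughly like this (a sketch; the file-name pattern and the run command are placeholders):
$ export UCX_PROFILE_MODE=log,accum
$ export UCX_PROFILE_FILE=ucx.%h.%p.prof
$ mpirun -n 4 ./testing_dpotrf <args>
$ ucx_read_prof ucx.<host>.<pid>.prof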
Yes, the output interpretation is correct. Which UCT transport(s) are being used? (You can add UCX_LOG_LEVEL=info to print them.) I would not expect it to take so long for an IB transport, since it doesn't wait for anything.
Here is the output of UCX_LOG_LEVEL=info with PaRSEC on 4 nodes. Anything that strikes you?
[1624984398.083715] [r39c2t5n1:22453:0] ucp_worker.c:1720 UCX INFO ep_cfg[0]: tag(self/memory rc_mlx5/mlx5_0:1 rc_mlx5/mlx5_1:1);
[1624984398.083927] [r39c2t5n3:22452:0] ucp_worker.c:1720 UCX INFO ep_cfg[0]: tag(self/memory rc_mlx5/mlx5_0:1 rc_mlx5/mlx5_1:1);
[1624984398.083977] [r39c2t5n4:22213:0] ucp_worker.c:1720 UCX INFO ep_cfg[0]: tag(self/memory rc_mlx5/mlx5_0:1 rc_mlx5/mlx5_1:1);
[1624984398.083983] [r39c2t5n2:22162:0] ucp_worker.c:1720 UCX INFO ep_cfg[0]: tag(self/memory rc_mlx5/mlx5_0:1 rc_mlx5/mlx5_1:1);
[1624984398.153330] [r39c2t5n4:22213:0] ucp_worker.c:1720 UCX INFO ep_cfg[1]: tag(rc_mlx5/mlx5_0:1 rc_mlx5/mlx5_1:1);
[1624984398.154154] [r39c2t5n2:22162:0] ucp_worker.c:1720 UCX INFO ep_cfg[1]: tag(rc_mlx5/mlx5_0:1 rc_mlx5/mlx5_1:1);
[1624984398.156997] [r39c2t5n3:22452:1] ucp_worker.c:1720 UCX INFO ep_cfg[1]: tag(rc_mlx5/mlx5_0:1 rc_mlx5/mlx5_1:1);
[1624984398.157904] [r39c2t5n1:22453:1] ucp_worker.c:1720 UCX INFO ep_cfg[1]: tag(rc_mlx5/mlx5_0:1 rc_mlx5/mlx5_1:1);
[1624984398.162001] [r39c2t5n3:22452:0] ucp_worker.c:1720 UCX INFO ep_cfg[2]: tag(rc_mlx5/mlx5_0:1 rc_mlx5/mlx5_1:1);
[1624984398.171165] [r39c2t5n4:22213:1] ucp_worker.c:1720 UCX INFO ep_cfg[2]: tag(rc_mlx5/mlx5_0:1 rc_mlx5/mlx5_1:1);
[1624984398.254931] [r39c2t5n1:22453:2] ucp_worker.c:1720 UCX INFO ep_cfg[2]: tag(rc_mlx5/mlx5_0:1 rc_mlx5/mlx5_1:1);
[1624984398.264687] [r39c2t5n1:22453:2] ucp_worker.c:1720 UCX INFO ep_cfg[0]: rma(self/memory posix/memory sysv/memory); amo(rc_mlx5/mlx5_1:1);
[1624984398.264696] [r39c2t5n4:22213:2] ucp_worker.c:1720 UCX INFO ep_cfg[0]: rma(rc_mlx5/mlx5_1:1); amo(rc_mlx5/mlx5_1:1);
[1624984398.264683] [r39c2t5n2:22162:1] ucp_worker.c:1720 UCX INFO ep_cfg[0]: rma(rc_mlx5/mlx5_1:1); amo(rc_mlx5/mlx5_1:1);
[1624984398.264686] [r39c2t5n3:22452:2] ucp_worker.c:1720 UCX INFO ep_cfg[0]: rma(rc_mlx5/mlx5_1:1); amo(rc_mlx5/mlx5_1:1);
[1624984398.273323] [r39c2t5n1:22453:2] ucp_worker.c:1720 UCX INFO ep_cfg[1]: rma(rc_mlx5/mlx5_1:1); amo(rc_mlx5/mlx5_1:1);
[1624984398.279303] [r39c2t5n2:22162:1] ucp_worker.c:1720 UCX INFO ep_cfg[1]: rma(self/memory posix/memory sysv/memory); amo(rc_mlx5/mlx5_1:1);
[1624984398.280913] [r39c2t5n2:22162:1] ucp_worker.c:1720 UCX INFO ep_cfg[2]: rma(rc_mlx5/mlx5_1:1); amo(rc_mlx5/mlx5_1:1);
#+++++ cores detected : 32
#+++++ nodes x cores + gpu : 4 x 32 + 0 (128+0)
#+++++ thread mode : THREAD_SERIALIZED
#+++++ P x Q : 2 x 2 (4/4)
#+++++ M x N x K|NRHS : 30720 x 30720 x 1
#+++++ MB x NB : 128 x 128
[1624984398.284071] [r39c2t5n3:22452:2] ucp_worker.c:1720 UCX INFO ep_cfg[1]: rma(self/memory posix/memory sysv/memory); amo(rc_mlx5/mlx5_1:1);
[1624984398.285763] [r39c2t5n3:22452:2] ucp_worker.c:1720 UCX INFO ep_cfg[2]: rma(rc_mlx5/mlx5_1:1); amo(rc_mlx5/mlx5_1:1);
[1624984398.287430] [r39c2t5n4:22213:2] ucp_worker.c:1720 UCX INFO ep_cfg[1]: rma(self/memory posix/memory sysv/memory); amo(rc_mlx5/mlx5_1:1);
[****] TIME(s) 3.66817 : dpotrf PxQ= 2 2 NB= 128 N= 30720 : 2634.596259 gflops - ENQ&PROG&DEST 3.66848 : 2634.377315 gflops - ENQ 0.00006 - DEST 0.00024
[1624984402.377128] [r39c2t5n3:22452:0] ucp_worker.c:1720 UCX INFO ep_cfg[0]: rma(rc_mlx5/mlx5_0:1); amo(rc_mlx5/mlx5_0:1);
[1624984402.377731] [r39c2t5n2:22162:0] ucp_worker.c:1720 UCX INFO ep_cfg[0]: rma(rc_mlx5/mlx5_0:1); amo(rc_mlx5/mlx5_0:1);
[1624984402.383505] [r39c2t5n4:22213:0] ucp_worker.c:1720 UCX INFO ep_cfg[0]: rma(rc_mlx5/mlx5_0:1); amo(rc_mlx5/mlx5_0:1);
[1624984402.385741] [r39c2t5n1:22453:1] ucp_worker.c:1720 UCX INFO ep_cfg[2]: rma(rc_mlx5/mlx5_1:1); amo(rc_mlx5/mlx5_1:1);
[1624984402.386517] [r39c2t5n3:22452:0] ucp_worker.c:1720 UCX INFO ep_cfg[1]: rma(rc_mlx5/mlx5_0:1 posix/memory sysv/memory); amo(rc_mlx5/mlx5_0:1);
[1624984402.386587] [r39c2t5n2:22162:0] ucp_worker.c:1720 UCX INFO ep_cfg[1]: rma(rc_mlx5/mlx5_0:1 posix/memory sysv/memory); amo(rc_mlx5/mlx5_0:1);
[1624984402.388995] [r39c2t5n1:22453:0] ucp_worker.c:1720 UCX INFO ep_cfg[0]: rma(rc_mlx5/mlx5_0:1 posix/memory sysv/memory); amo(rc_mlx5/mlx5_0:1);
[1624984402.391044] [r39c2t5n3:22452:0] ucp_worker.c:1720 UCX INFO ep_cfg[3]: rma(rc_mlx5/mlx5_1:1 posix/memory sysv/memory); amo(rc_mlx5/mlx5_1:1);
[1624984402.391484] [r39c2t5n2:22162:2] ucp_worker.c:1720 UCX INFO ep_cfg[3]: rma(rc_mlx5/mlx5_1:1 posix/memory sysv/memory); amo(rc_mlx5/mlx5_1:1);
[1624984402.393319] [r39c2t5n2:22162:2] ucp_worker.c:1720 UCX INFO ep_cfg[4]: rma(rc_mlx5/mlx5_1:1); amo(rc_mlx5/mlx5_1:1);
[1624984402.393740] [r39c2t5n4:22213:1] ucp_worker.c:1720 UCX INFO ep_cfg[2]: rma(rc_mlx5/mlx5_1:1); amo(rc_mlx5/mlx5_1:1);
[1624984402.395155] [r39c2t5n3:22452:1] ucp_worker.c:1720 UCX INFO ep_cfg[4]: rma(rc_mlx5/mlx5_1:1); amo(rc_mlx5/mlx5_1:1);
[1624984402.397034] [r39c2t5n4:22213:0] ucp_worker.c:1720 UCX INFO ep_cfg[1]: rma(rc_mlx5/mlx5_0:1 posix/memory sysv/memory); amo(rc_mlx5/mlx5_0:1);
[1624984402.397423] [r39c2t5n1:22453:0] ucp_worker.c:1720 UCX INFO ep_cfg[1]: rma(rc_mlx5/mlx5_0:1); amo(rc_mlx5/mlx5_0:1);
[1624984402.401135] [r39c2t5n4:22213:1] ucp_worker.c:1720 UCX INFO ep_cfg[3]: rma(rc_mlx5/mlx5_1:1 posix/memory sysv/memory); amo(rc_mlx5/mlx5_1:1);
[1624984402.402516] [r39c2t5n1:22453:0] ucp_worker.c:1720 UCX INFO ep_cfg[3]: rma(rc_mlx5/mlx5_1:1 posix/memory sysv/memory); amo(rc_mlx5/mlx5_1:1);
I added some more instrumentation points into the UCX code but the results are more confusing than helpful:
0.033 uct_ep_put_zcopy 4.938 { rma_basic.c:63 ucp_rma_basic_progress_put()
0.020 UCT_DC_MLX5_CHECK_RES 0.053 { dc_mlx5_ep.c:476 uct_dc_mlx5_ep_put_zcopy()
0.053 }
0.007 UCT_DC_MLX5_IFACE_TXQP_GET 0.020 { dc_mlx5_ep.c:477 uct_dc_mlx5_ep_put_zcopy()
0.020 }
0.007 uct_rc_mlx5_ep_fence_put 0.006 { dc_mlx5_ep.c:478 uct_dc_mlx5_ep_put_zcopy()
0.006 }
0.007 uct_dc_mlx5_iface_zcopy_post 0.033 { dc_mlx5_ep.c:481 uct_dc_mlx5_ep_put_zcopy()
0.033 }
4.740 UCT_TL_EP_STAT_OP 0.020 { dc_mlx5_ep.c:486 uct_dc_mlx5_ep_put_zcopy()
0.020 }
0.026 }
Note that the individual calls take only a few microseconds, but the offset of UCT_TL_EP_STAT_OP is 4.74 ms (if my interpretation is correct). I don't see anything happening between the two functions, so the only explanation I can come up with right now is that it is a scheduling issue. There is no oversubscription on the node, though; I will have to investigate that further.
I am seeing similarly long runtimes for other functions in the traces I have collected (ucs_pgtable_lookup, for example). Maybe I am chasing the wrong issue here (although the original observation of mixing RMA and P2P leading to poor performance is consistent).
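For reference, the extra instrumentation points I added are roughly of this form, based on the UCS profiling macros (a sketch only; the actual arguments of the wrapped macro are omitted here):
#include <ucs/profile/profile.h>  /* provides UCS_PROFILE_CODE */

/* inside uct_dc_mlx5_ep_put_zcopy() in dc_mlx5_ep.c: wrap an existing statement
   so it shows up as a named region in the profile log */
UCS_PROFILE_CODE("UCT_DC_MLX5_CHECK_RES") {
    /* the original UCT_DC_MLX5_CHECK_RES(...) invocation goes here, unchanged */
}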
@devreal can you try using a UCX build in release mode (it will not support profiling, but it will make UCT_TL_EP_STAT_OP a no-op)?
@yosefe With release mode, do you mean --disable-debug and --disable-profiling?
yeah, you can use the script ./contrib/configure-release (a wrapper for ./configure) which turns off all the debug stuff.
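i.e. roughly (the install prefix is just an example):
$ ./contrib/configure-release --prefix=$HOME/opt/ucx
$ make -j install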
There is no difference in release mode: I see the same behavior for the version using RMA versus the ones not using RMA.
Is my interpretation of the trace data correct that there is unaccounted time between uct_dc_mlx5_iface_zcopy_post and UCT_TL_EP_STAT_OP?
Can you try the same with btl/uct and osc/rdma? osc/ucx still has some serious issues that need to be resolved.
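Something along these lines should select them (a sketch; the shared-memory BTL is named sm on master and vader on older releases, and the memory-domain list may need adjusting for your system):
$ mpirun --mca osc rdma --mca btl self,sm,uct --mca btl_uct_memory_domains all ./testing_dpotrf <args>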
I am having a hard time using osc/rdma on that machine: I'm seeing a segfault in ompi_osc_rdma_lock_all_atomic (if that's fixed in latest master I will try to update and run again).
Hmmm. I can take a look. It was working. Not sure what could have broken.
Though I did add support for 1.10 in master. Not sure what difference that will make.
Running with osc/rdma without UCX support (probably tcp instead) I get a segfault in MPI_Win_lock_all:
#5 main () (at 0x000000000040119d)
#4 PMPI_Win_lock_all () at build/ompi/mpi/c/profile/pwin_lock_all.c:56 (at 0x0000155554ebe9f3)
#3 ompi_osc_rdma_lock_all_atomic (mpi_assert=<optimized out>, win=0x60e260) at ompi/mca/osc/rdma/osc_rdma_passive_target.c:355 (at 0x0000155554fd9b02)
#2 ompi_osc_rdma_lock_acquire_shared () at ompi/mca/osc/rdma/osc_rdma_lock.h:311 (at 0x0000155554fd7a3c)
#1 ompi_osc_rdma_lock_btl_fop (wait_for_completion=true, result=0x7fffffffa128, operand=<optimized out>, op=1, address=<optimized out>, peer=0x8b7f50, module=0x155555301320) at ompi/mca/osc/rdma/osc_rdma_lock.h:113 (at 0x0000155554fd7a3c)
#0 ompi_osc_rdma_btl_fop (cbcontext=0x0, cbdata=0x0, cbfunc=0x0, wait_for_completion=true, result=0x7fffffffa128, flags=0, operand=<optimized out>, op=1, address_handle=<optimized out>, address=<optimized out>, endpoint=<optimized out>, btl_index=<optimized out>, module=0x155555301320) at ompi/mca/osc/rdma/osc_rdma_lock.h:75 (at 0x0000155554fd7a3c)
The issue is that selected_btl is NULL.
If I enable btl/uct by setting --mca btl_uct_memory_domains all (which seems to be required for btl/uct to work), I get an assert in a simple test app I'm using:
#20 main () (at 0x000000000040120c)
#19 PMPI_Rget () at build/ompi/mpi/c/profile/prget.c:82 (at 0x0000155554eae4b3)
#18 ompi_osc_rdma_rget (origin_addr=0xa6f070, origin_count=1000000, origin_datatype=0x4042a0, source_rank=<optimized out>, source_disp=0, source_count=1000000, source_datatype=0x4042a0, win=0x772d20, request=0x7fffffffa1e8) at ompi/mca/osc/rdma/osc_rdma_comm.c:909 (at 0x0000155554fc348b)
#17 ompi_osc_rdma_module_sync_lookup (peer=0x7fffffffa0d0, target=1, module=0x9f90b0) at ompi/mca/osc/rdma/osc_rdma.h:505 (at 0x0000155554fc348b)
#16 ompi_osc_rdma_module_peer (peer_id=1, module=0x9f90b0) at ompi/mca/osc/rdma/osc_rdma.h:364 (at 0x0000155554fc348b)
#15 ompi_osc_rdma_peer_lookup (module=module@entry=0x9f90b0, peer_id=peer_id@entry=1) at ompi/mca/osc/rdma/osc_rdma_peer.c:312 (at 0x0000155554fdaeb1)
#14 ompi_osc_rdma_peer_lookup_internal (peer_id=1, module=0x9f90b0) at ompi/mca/osc/rdma/osc_rdma_peer.c:288 (at 0x0000155554fdaeb1)
#13 ompi_osc_rdma_peer_setup (module=module@entry=0x9f90b0, peer=0xe46470) at ompi/mca/osc/rdma/osc_rdma_peer.c:161 (at 0x0000155554fda81d)
#12 ompi_osc_get_data_blocking (module=module@entry=0x9f90b0, btl_index=<optimized out>, endpoint=0xe41c40, source_address=source_address@entry=23455759704080, source_handle=source_handle@entry=0x15553835a1b8, data=data@entry=0x7fffffffa018, len=8) at ompi/mca/osc/rdma/osc_rdma_comm.c:121 (at 0x0000155554fc0195)
#11 ompi_osc_rdma_progress (module=0x155554fbaf60) at ompi/mca/osc/rdma/osc_rdma.h:409 (at 0x0000155554fc0195)
#10 opal_progress () at opal/runtime/opal_progress.c:224 (at 0x00001555544b5383)
#9 mca_btl_uct_component_progress () at opal/mca/btl/uct/btl_uct_component.c:623 (at 0x0000155554559dcf)
#8 mca_btl_uct_tl_progress (starting_index=<optimized out>, tl=<optimized out>) at opal/mca/btl/uct/btl_uct_component.c:566 (at 0x0000155554559dcf)
#7 mca_btl_uct_tl_progress (tl=<optimized out>, starting_index=<optimized out>) at opal/mca/btl/uct/btl_uct_component.c:572 (at 0x000015555455999e)
#6 mca_btl_uct_context_progress (context=0x62a970) at opal/mca/btl/uct/btl_uct_device_context.h:168 (at 0x000015555455999e)
#5 mca_btl_uct_device_handle_completions (dev_context=<optimized out>) at opal/mca/btl/uct/btl_uct_device_context.h:150 (at 0x000015555455999e)
#4 ompi_osc_get_data_complete (btl=<optimized out>, endpoint=<optimized out>, local_address=<optimized out>, local_handle=<optimized out>, context=<optimized out>, data=<optimized out>, status=-1) at ompi/mca/osc/rdma/osc_rdma_comm.c:49 (at 0x0000155554fbaf94)
#3 ompi_osc_get_data_complete (btl=<optimized out>, endpoint=<optimized out>, local_address=<optimized out>, local_handle=<optimized out>, context=<optimized out>, data=<optimized out>, status=-1) at ompi/mca/osc/rdma/osc_rdma_comm.c:53 (at 0x0000155554fbaf94)
#2 __assert_fail () from /lib64/libc.so.6 (at 0x0000155554813de6)
#1 __assert_fail_base.cold.0 () from /lib64/libc.so.6 (at 0x0000155554805b09)
#0 abort () from /lib64/libc.so.6 (at 0x0000155554805b0e)
I don't have the resources at this time to debug either of these issues...
Here are two more interesting data points running the same PaRSEC/DPLASMA benchmarks with MPT and MPICH:
1) MPT (HPE's MPI implementation) does not use UCX but sits on ibverbs directly:
The lines are identical for all three variants. Hence: No UCX, no problem.
2) Running on MPICH (recent master) built against UCX:
Here we see the exact same pattern as with Open MPI. This suggests to me that the problem is neither in Open MPI's pml/ucx nor in osc/ucx but inside UCX itself.
@yosefe Any more ideas how to track this down?
@devreal, can you please try to rerun the test with UCX_RC_TX_POLL_ALWAYS=y? This will help to check the theory that the originator is flooded by unexpected messages.
@brminich
can you please try to rerun the test with UCX_RC_TX_POLL_ALWAYS=y? This will help to check the theory that the originator is flooded by unexpected messages.
Thanks for the suggestion. Setting UCX_RC_TX_POLL_ALWAYS=y does help somewhat (~50% speedup), but the performance is still an order of magnitude lower than without RMA.
One observation that I am not sure came across earlier: performance actually drops further for larger tile sizes (with constant matrix size), which a) reduces the number of messages/transfers and b) increases the per-transfer size. That too seems to be an argument against the theory that a flood of messages is causing the issue.
Any other UCX env variables I could try? There are too many to tinker with all of them, but I can certainly do a sweep over a selected subset; I simply cannot judge which ones are worth a try.
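For reference, ucx_info -c dumps the full list of UCX configuration variables with their current values, which is what I would pick a subset from (the grep filter is just an example):
$ ucx_info -c | grep -i -e RC_ -e RNDV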
@devreal, is this MPI extension publicly available, so I could try to reproduce it locally?
Another difference is that we support multirail for p2p operations while we do not support it for RMA yet. How many HCAs per node do you have? If you have more than one mlnx device, can you please set UCX_MAX_RNDV_RAILS=1 and run the app without RMA? Would it match the RMA results?
@brminich I tried running both benchmarks with UCX_MAX_RNDV_RAILS=1 and it didn't make any difference.
There is no need to use the experimental MPI feature I'm developing. All of the measurements here were done using a branch of the PaRSEC runtime that mirrors communication, either through another pair of send/recv or through a put into a window. The benchmark itself is the Cholesky factorization test of DPLASMA.
I set up a DPLASMA branch that contains PaRSEC with the RMA-mirror. You can get it with:
git clone --recursive -b dplasma_rma_parsec https://bitbucket.org/schuchart/dplasma/ dplasma_git_schuchart_fakerma
To build:
$ mkdir build
$ cd build
$ cmake .. -DDPLASMA_PRECISIONS=d
You'll need MKL to be available (by setting MKLROOT) and HWLOC (if it's not in your default search path, you can point CMake to it through -DHWLOC_ROOT=...).
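For example (the paths are placeholders for wherever MKL and hwloc live on your system):
$ export MKLROOT=/path/to/mkl
$ cmake .. -DDPLASMA_PRECISIONS=d -DHWLOC_ROOT=/path/to/hwloc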
You can then run the benchmark through:
# compute square process grid (needed before computing N)
Q=$((${NUM_NODES} / $(echo "scale=0; sqrt ( ${NUM_NODES} ) " | bc -l)))
P=$((${NUM_NODES} / $Q))
# the per-node matrix size
NPN=$((15*1024))
# total matrix size
N=$(($P*$NPN))
# tile size
NB=128
mpirun -n $NUM_NODES build/tests/testing_dpotrf -N $N -v -c $NTHREADS -NB $NB -P $P -Q $Q
On Hawk, I'm running with NTHREADS=32 and one process per node. Let me know if you need any support in building or running the benchmark.
@devreal, how do you specify whether you want to run the PaRSEC P2P or the PaRSEC RMA version?
@brminich Sorry, I forgot to mention that. The version in the branch I gave you contains the RMA version. The other variants are in different branches. They are all ABI compatible and can simply be interposed by setting LD_LIBRARY_PATH when running the DPLASMA benchmark. I didn't have time to combine them into one before going on vacation.
If you want to compare with the duplicated send/recv, you can use the following PaRSEC branch:
$ git clone --branch topic/revamp_comm_engine_symbols_doublesend https://bitbucket.org/schuchart/parsec.git parsec_doublesend
The baseline (without duplicated traffic) would be:
$ git clone --branch topic/revamp_comm_engine_symbols https://bitbucket.org/schuchart/parsec.git parsec_vanilla
In both cases you can use a simple cmake && make to build the code; no special arguments to CMake should be necessary if MPI is in your PATH.
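Concretely, a run with the doublesend variant would look roughly like this (the library path is an assumption; adjust it to wherever the PaRSEC libraries end up after the build):
$ cd parsec_doublesend && mkdir build && cd build && cmake .. && make
$ export LD_LIBRARY_PATH=/path/to/parsec_doublesend/build/lib:$LD_LIBRARY_PATH
$ mpirun -n $NUM_NODES build/tests/testing_dpotrf -N $N -v -c $NTHREADS -NB $NB -P $P -Q $Q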
@devreal, do you see a noticeable degradation only starting from ~16 nodes? Is there any valid parameter combination that would show the difference at a smaller scale? (The smaller the scale, the better.)
@devreal, ping
@brminich Sorry I was on vacation with no access to the system. I played around with different problem/tile sizes but couldn't reproduce it at lower scales. I also noticed that the problem does not occur before 64 nodes (49 nodes still yield the expected performance). However, at 81 the RMA performance is bad too. There seems to be some threshold.
Is there a hash table or something similar whose size I could tune to see if that has any impact?
@devreal, do I understand correctly that you see a performance drop of the RMA-based implementation compared to the 2-sided version? If yes, what do you mean by "at 81 the RMA performance is bad too"?
Do you have adaptive routing configured on the system? Can you please try to add UCX_IB_AR_ENABLE=y?
@brminich Sorry for the delay, it took me some time to get runs though on the machine...
Do I understand correctly that you see a performance drop of the RMA-based implementation compared to the 2-sided version?
Correct. The experiments are set up such that the RMA version performs the same send/recv communication as vanilla PaRSEC plus a mirror of the traffic through MPI RMA, while the send/recv version mirrors the traffic using a second pair of send/recv. The version mirroring traffic through RMA shows poor performance starting at 64 nodes, while the version mirroring through send/recv shows the same performance as vanilla PaRSEC.
If yes, what do you mean by "at 81 the RMA performance is bad too"?
Sorry that was a typo: it should have been "81 nodes", meaning that the problem not just occurs on 64 nodes but on scales beyond that.
Do you have adaptive routing configured on the system?
It is enabled yes. I don't think I have the powers to change that, unfortunately.
Can you please try to add UCX_IB_AR_ENABLE=y?
If I set that I get the following error:
[1628215239.795079] [r3c3t1n1:343788:0] ib_mlx5.c:836 UCX ERROR AR=y was requested for SL=auto, but could not detect AR mask for SLs. Please, set SL manually with AR on mlx5_1:1, SLs with AR support = { <none> }, SLs without AR support = { <none> }
How do I get the SL mask?
@devreal, do you know if AR is configured on all SLs? If you do not know the exact SLs where AR is configured, you can try to find out this way:
Then please use the UCX_IB_SL env variable to set a specific SL for UCX.
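Once you know an SL with AR enabled, the run would look like this (SL 1 is just an example value):
$ export UCX_IB_AR_ENABLE=y
$ export UCX_IB_SL=1
$ mpirun -n $NUM_NODES build/tests/testing_dpotrf <args>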
@devreal is this related to the severe performance variation I encountered on Expanse (AMD Rome) with DPLASMA, Open MPI, and UCX? I had reported it to and discussed it with @bosilca; I also encountered similar problems with MPICH and UCX.
@omor1 Not sure what your variations look like; I will talk to George. It's unlikely to be related though, since this issue concerns MPI RMA, which is normally not used in DPLASMA.
@brminich Sorry for the delay. Unfortunately, I don't have sudo privileges on that system. I will try to find another system to run the benchmarks on to verify whether this is specific to this machine.
@devreal I was seeing significant (as much as 4x) run-to-run variation with dgemm; using btl/openib got rid of this massive variance, but I ran into other issues with it (there's a reason it's not maintained/removed). George mentioned he sent this to you several months ago, in late May. I will retest and play around with some of the UCX env vars mentioned here to see if they make a difference.
@omor1, is there a corresponding issue for this performance variance?
I have been working on an experimental extension to MPI that is built on top of osc/ucx. Unfortunately, during experiments with PaRSEC I found that the combination of osc/ucx and pml/ucx leads to a significant performance degradation. I have not been able to reproduce this with a small example but I have been able to observe this behavior in a modified PaRSEC runtime where the data payload is transferred through both P2P and RMA operations.
In the PaRSEC runtime, a single thread manages the communication of all task inputs/outputs by sending active messages and posting nonblocking sends and receives. The active messages may potentially lead to a high number of unexpected messages. In the figure below, this is PaRSEC vanilla. A certain loss in per-node performance in such a relatively small Cholesky factorization with small tiles is expected, as more nodes require more active messages and communication becomes dominant.
I have modified the runtime such that the sent payload is also put into a preallocated window, essentially doubling the amount of data transferred (PaRSEC P2P+RMA). This leads to a significant drop in performance at 64 nodes when using pml/ucx. With pml/ob1 (probably using btl/tcp) the baseline is much lower, but the slope matches that of vanilla PaRSEC, outperforming pml/ucx at 64 nodes.
To make sure this is not a bandwidth issue, I used a second modified version in which the data is sent twice using send/recv (PaRSEC 2xP2P). In this case, the performance is mostly similar to the vanilla runtime, which leads me to conclude that this is not a bandwidth issue.
I had manually instrumented osc/ucx to warn about long calls into UCX and found that calls to ucp_put_nbi repeatedly took several milliseconds (up to approx. 20 ms for a single call), which is likely the reason for the performance drop, as other processes starve while one is stuck in a put. However, I would like to know what is happening there, since I want to combine P2P and RMA going forward. My theory is that the unexpected messages are somehow handled inside the call to put. Interestingly though, the effect seems to get worse when the transfer size is increased (e.g., by increasing the tile size from 128 to 320). Any idea how I can find out what is happening inside UCX?
The system under test is an HPE Apollo with dual-socket AMD Rome 64-core processors and an HDR200 interconnect. Everything is compiled with GCC 10.2.0.
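For completeness, the manual instrumentation in osc/ucx mentioned above is essentially just a wall-clock check around the put, along these lines (a sketch; the 1 ms threshold and the variable names are simplified):
double t_start = MPI_Wtime();
ucs_status_t status = ucp_put_nbi(ep, origin_addr, len, remote_addr, rkey);
double t_elapsed = MPI_Wtime() - t_start;
if (t_elapsed > 1e-3) {  /* warn about calls longer than 1 ms */
    fprintf(stderr, "long ucp_put_nbi call: %.3f ms (status %d)\n", t_elapsed * 1e3, (int)status);
}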
This was tested with a recent Open MPI master (from two weeks ago; I have not seen any changes since that would have an impact here). I have observed similar behavior with 4.1.1.
UCX 1.10.0 configuration: