openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

RC vs MLX5_RC performance in SHMEM #1218

Open shamisp opened 7 years ago

shamisp commented 7 years ago

Issue: The RC transport demonstrates better latency than MLX5 in an OpenSHMEM 8-byte ping-wait-pong benchmark.

Debug: The root cause is the number of CQEs pulled per poll_cq call: RC polls multiple CQEs at a time, while MLX5 polls them one by one.
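For illustration, here is a minimal sketch of the difference, written against plain libibverbs (the function names are made up and the real mlx5 transport parses CQEs directly rather than calling ibv_poll_cq, so treat this only as a picture of "batched vs. one-at-a-time" draining):

```c
#include <infiniband/verbs.h>

#define TX_MAX_POLL 16  /* stands in for the UCT_IB_TX_MAX_POLL knob */

/* verbs RC style: drain up to TX_MAX_POLL completions per progress call */
static int drain_cq_batched(struct ibv_cq *cq)
{
    struct ibv_wc wc[TX_MAX_POLL];
    return ibv_poll_cq(cq, TX_MAX_POLL, wc);  /* completions consumed this call */
}

/* mlx5 style as described above: consume at most one completion per call */
static int drain_cq_single(struct ibv_cq *cq)
{
    struct ibv_wc wc;
    return ibv_poll_cq(cq, 1, &wc);
}
```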

shamisp commented 7 years ago

For the benchmarking I used the osu_oshm_put benchmark (a modified version of the OSU OpenSHMEM ping-pong benchmark, with an 8-byte put). The benchmark polls directly on the 8 bytes. You can also use wait(), but then remove the progress call from the wait.

(or you can use mlx5 instead of RC with UCT_IB_TX_MAX_POLL=1)
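For reference, a minimal sketch of the ping-wait-pong pattern described above, assuming a two-PE run and direct polling on the 8-byte target (this is illustrative only, not the actual osu_oshm_put source):

```c
#include <shmem.h>

/* symmetric 8-byte flag; volatile so the direct-poll loop is not optimized out */
static volatile long flag = 0;

int main(void)
{
    shmem_init();
    int me    = shmem_my_pe();
    int other = 1 - me;
    long iter, niters = 10000;

    for (iter = 1; iter <= niters; iter++) {
        if (me == 0) {
            shmem_long_p((long *)&flag, iter, other);  /* 8-byte put: ping */
            while (flag != iter) { }                   /* poll directly on the 8 bytes */
        } else {
            while (flag != iter) { }                   /* wait for the ping */
            shmem_long_p((long *)&flag, iter, other);  /* 8-byte put: pong */
        }
    }

    shmem_finalize();
    return 0;
}
```

Timing the loop on PE 0 and dividing by 2 * niters gives the one-way 8-byte put latency.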

ompi-ucx/bin/mpirun -np 2 -npernode 1 -H xxx,yyy --mca pml ucx --mca spml ucx --mca btl '^openib,sm,vader,tcp,self' -x UCX_TLS=rc,ud -x UCX_IB_TX_MAX_POLL=1  osu-micro-benchmarks-5.3/openshmem/osu_oshm_put heap

vs

ompi-ucx/bin/mpirun -np 2 -npernode 1 -H xxx,yyy --mca pml ucx --mca spml ucx --mca btl '^openib,sm,vader,tcp,self' -x UCX_TLS=rc,ud  osu-micro-benchmarks-5.3/openshmem/osu_oshm_put heap

yosefe commented 7 years ago

I can reproduce the problem, but adding more polling does not help mlx5 (even though less polling makes verbs worse).

yosefe commented 7 years ago

@shamisp can you try UCX_TLS=\rc,ud_mlx5 vs. UCX_TLS=\rc_mlx5,ud_mlx5? Looks like it happens because of the ud_verbs progress loops (which we intend to remove anyway).

shamisp commented 7 years ago

@yosefe Hmm... why does ud_verbs have no impact on RC?

UCX_TLS=rc_mlx5,ud,mm -> 1.85 usec
UCX_TLS=rc_mlx5,ud_mlx5,mm -> 1.71 usec
UCX_TLS=rc,ud,mm -> 1.71 usec
UCX_TLS=rc,ud_mlx5,mm -> 1.74 usec

yosefe commented 7 years ago

@shamisp As we discussed with @sameerkm in the past, the latency measurement is not accurate; its granularity is determined by the polling interval. In some cases, extra SW overhead may (paradoxically) result in a better latency number, if it makes the packet arrive closer to the end of the polling cycle.
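A toy model of this effect (illustrative numbers only, not measurements from the benchmark): the receiver can only observe the data at discrete polling instants, so the measured time is rounded up to the next poll; if extra overhead shifts the polling phase so that the data lands just before a poll instead of just after one, the reported number improves.

```c
#include <stdio.h>

/* first polling instant at or after time t (all times in ns) */
static long next_poll(long t, long period, long phase)
{
    long poll = phase;
    while (poll < t) {
        poll += period;
    }
    return poll;
}

int main(void)
{
    long arrival = 1000;  /* assumed: data really lands after 1000 ns */
    long period  = 300;   /* assumed: one polling cycle takes 300 ns  */

    /* polls at 0, 300, 600, 900, 1200 -> data observed at 1200 ns */
    printf("no extra overhead:   measured %ld ns\n",
           next_poll(arrival, period, 0));

    /* extra SW overhead shifts the polling phase by 100 ns:
     * polls at 100, 400, 700, 1000 -> observed at 1000 ns, a "better" number */
    printf("with extra overhead: measured %ld ns\n",
           next_poll(arrival, period, 100));
    return 0;
}
```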

shamisp commented 7 years ago

The question is how we prove that this is the case.

yosefe commented 7 years ago

Adding @alex-mikheev. We are making optimizations to remove UD from the progress loop when it's unused. If after that we see better performance with rc_x than with rc, I think we can close the issue.

shamisp commented 7 years ago

I want to rerun all tests with UCT_IB_TX_MAX_POLL=1 and see what the numbers are. I think this is one of the main differences that has been identified.

yosefe commented 7 years ago

According to my experiments, polling multiple times in rc_mlx5 does not improve the performance.

shamisp commented 7 years ago

Got it... interesting.


shamisp commented 7 years ago

The same experiment with UCT_IB_TX_MAX_POLL=1:

UCX_TLS=rc_mlx5,ud,mm -> 1.82 usec
UCX_TLS=rc_mlx5,ud_mlx5,mm -> 1.75 usec
UCX_TLS=rc,ud,mm -> 1.70 usec
UCX_TLS=rc,ud_mlx5,mm -> 1.74 usec