kokkos / kokkos-comm

Experimental MPI Wrapper for Kokkos
https://kokkos.org/kokkos-comm/

OSU latency tests are not strictly equivalent #104

Closed dssgabriel closed 1 week ago

dssgabriel commented 2 weeks ago

Our raw MPI implementation has a barrier which the KokkosComm one does not. This leads to very different results (at least on my machine).

Compiling the current OSU latency bench in "Release" with MPICH 4.1.1 gives me the following results:

Current implementation

```
1: Test command: /usr/lib64/mpich/bin/mpiexec "-n" "2" "./perf_test-main"
1: Working Directory: /home/dossantosg/dev/kokkos-comm/build/release/perf_tests
1: Test timeout computed to be: 10000000
1: 2024-07-04T10:05:04+02:00
1: Running ./perf_test-main
1: Run on (16 X 4500 MHz CPU s)
1: CPU Caches:
1:   L1 Data 48 KiB (x8)
1:   L1 Instruction 32 KiB (x8)
1:   L2 Unified 1280 KiB (x8)
1:   L3 Unified 18432 KiB (x1)
1: Load Average: 0.48, 0.55, 0.67
1: ***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
1: -----------------------------------------------------------------------------------------------------------------------
1: Benchmark Time CPU Iterations UserCounters...
1: -----------------------------------------------------------------------------------------------------------------------
1: benchmark_osu_latency_KokkosComm_isendirecv/1/manual_time 0.295 us 0.567 us 2291565 bytes=2
1: benchmark_osu_latency_KokkosComm_isendirecv/2/manual_time 0.300 us 0.579 us 2318277 bytes=4
1: benchmark_osu_latency_KokkosComm_isendirecv/4/manual_time 0.303 us 0.575 us 2344989 bytes=8
1: benchmark_osu_latency_KokkosComm_isendirecv/8/manual_time 0.294 us 0.553 us 2312104 bytes=16
1: benchmark_osu_latency_KokkosComm_isendirecv/16/manual_time 0.281 us 0.543 us 2497346 bytes=32
1: benchmark_osu_latency_KokkosComm_isendirecv/32/manual_time 0.329 us 0.592 us 2187558 bytes=64
1: benchmark_osu_latency_KokkosComm_isendirecv/64/manual_time 0.328 us 0.596 us 2125078 bytes=128
1: benchmark_osu_latency_KokkosComm_isendirecv/128/manual_time 0.365 us 0.633 us 1936094 bytes=256
1: benchmark_osu_latency_KokkosComm_isendirecv/256/manual_time 0.409 us 0.674 us 1713554 bytes=512
1: benchmark_osu_latency_KokkosComm_isendirecv/512/manual_time 0.442 us 0.710 us 1581854 bytes=1.024k
1: benchmark_osu_latency_KokkosComm_isendirecv/1000/manual_time 0.520 us 0.816 us 1353214 bytes=2k
1: benchmark_osu_latency_MPI_isendirecv/1/manual_time 0.737 us 0.989 us 936550 bytes=2
1: benchmark_osu_latency_MPI_isendirecv/2/manual_time 0.739 us 0.989 us 950361 bytes=4
1: benchmark_osu_latency_MPI_isendirecv/4/manual_time 0.732 us 0.984 us 964262 bytes=8
1: benchmark_osu_latency_MPI_isendirecv/8/manual_time 0.730 us 0.983 us 949770 bytes=16
1: benchmark_osu_latency_MPI_isendirecv/16/manual_time 0.706 us 0.940 us 992884 bytes=32
1: benchmark_osu_latency_MPI_isendirecv/32/manual_time 0.734 us 0.969 us 953091 bytes=64
1: benchmark_osu_latency_MPI_isendirecv/64/manual_time 0.726 us 0.965 us 963782 bytes=128
1: benchmark_osu_latency_MPI_isendirecv/128/manual_time 0.767 us 1.02 us 904111 bytes=256
1: benchmark_osu_latency_MPI_isendirecv/256/manual_time 0.769 us 1.03 us 909894 bytes=512
1: benchmark_osu_latency_MPI_isendirecv/512/manual_time 0.789 us 1.07 us 868943 bytes=1.024k
1: benchmark_osu_latency_MPI_isendirecv/1000/manual_time 0.890 us 1.18 us 796200 bytes=2k
1: benchmark_osu_latency_KokkosComm_sendrecv/1/manual_time 0.241 us 0.507 us 2903506 bytes=2
1: benchmark_osu_latency_KokkosComm_sendrecv/2/manual_time 0.240 us 0.506 us 2919416 bytes=4
1: benchmark_osu_latency_KokkosComm_sendrecv/4/manual_time 0.241 us 0.509 us 2901200 bytes=8
1: benchmark_osu_latency_KokkosComm_sendrecv/8/manual_time 0.243 us 0.511 us 2880014 bytes=16
1: benchmark_osu_latency_KokkosComm_sendrecv/16/manual_time 0.249 us 0.518 us 2762902 bytes=32
1: benchmark_osu_latency_KokkosComm_sendrecv/32/manual_time 0.251 us 0.523 us 2890051 bytes=64
1: benchmark_osu_latency_KokkosComm_sendrecv/64/manual_time 0.247 us 0.514 us 2817314 bytes=128
1: benchmark_osu_latency_KokkosComm_sendrecv/128/manual_time 0.263 us 0.538 us 2661033 bytes=256
1: benchmark_osu_latency_KokkosComm_sendrecv/256/manual_time 0.297 us 0.587 us 2362166 bytes=512
1: benchmark_osu_latency_KokkosComm_sendrecv/512/manual_time 0.334 us 0.629 us 2098353 bytes=1.024k
1: benchmark_osu_latency_KokkosComm_sendrecv/1000/manual_time 0.452 us 0.749 us 1535763 bytes=2k
1: benchmark_osu_latency_MPI_sendrecv/1/manual_time 0.690 us 0.964 us 999977 bytes=2
1: benchmark_osu_latency_MPI_sendrecv/2/manual_time 0.664 us 0.930 us 1013927 bytes=4
1: benchmark_osu_latency_MPI_sendrecv/4/manual_time 0.674 us 0.939 us 1043795 bytes=8
1: benchmark_osu_latency_MPI_sendrecv/8/manual_time 0.676 us 0.943 us 1028589 bytes=16
1: benchmark_osu_latency_MPI_sendrecv/16/manual_time 0.685 us 0.960 us 1020056 bytes=32
1: benchmark_osu_latency_MPI_sendrecv/32/manual_time 0.706 us 0.988 us 998717 bytes=64
1: benchmark_osu_latency_MPI_sendrecv/64/manual_time 0.701 us 0.982 us 997674 bytes=128
1: benchmark_osu_latency_MPI_sendrecv/128/manual_time 0.725 us 1.01 us 969073 bytes=256
1: benchmark_osu_latency_MPI_sendrecv/256/manual_time 0.736 us 1.02 us 954773 bytes=512
1: benchmark_osu_latency_MPI_sendrecv/512/manual_time 0.768 us 1.05 us 921625 bytes=1.024k
1: benchmark_osu_latency_MPI_sendrecv/1000/manual_time 0.851 us 1.14 us 827252 bytes=2k
1/1 Test #1: perf_test-main ................... Passed 68.37 sec
```

KokkosComm appears to be much faster than raw MPI here (roughly 1.7x to 2.9x, depending on the message size and variant).
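For anyone who wants to recompute that ratio from the `manual_time` column, here is a minimal sketch. The `LatencyPair` struct and the `slowdown` and `ratio_range` helpers are hypothetical names made up for illustration; they are not part of KokkosComm or Google Benchmark.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// One message size from the report: manual_time (in us) of both variants.
// Hypothetical helper type, not part of KokkosComm.
struct LatencyPair {
    double kokkoscomm_us;
    double mpi_us;
};

// A ratio above 1 means the raw MPI variant took longer than KokkosComm.
inline double slowdown(const LatencyPair& p) {
    return p.mpi_us / p.kokkoscomm_us;
}

// Smallest and largest ratio over a non-empty set of rows.
inline std::pair<double, double> ratio_range(const std::vector<LatencyPair>& rows) {
    assert(!rows.empty());
    double lo = slowdown(rows.front());
    double hi = lo;
    for (const auto& row : rows) {
        lo = std::min(lo, slowdown(row));
        hi = std::max(hi, slowdown(row));
    }
    return {lo, hi};
}
```

Feeding it rows copied from the log, for example `{0.295, 0.737}` for `isendirecv/1` and `{0.241, 0.690}` for `sendrecv/1`, gives the per-size ratios without eyeballing the table.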

However, if we remove the barrier in the MPI implementation and increase the message size range to 131,072 (above the eager/rendezvous protocol switch), we get much closer results:

No barrier and higher max message size

```
1: Test command: /usr/lib64/mpich/bin/mpiexec "-n" "2" "./perf_test-main"
1: Working Directory: /home/dossantosg/dev/kokkos-comm/build/release/perf_tests
1: Test timeout computed to be: 10000000
1: 2024-07-04T10:30:43+02:00
1: Running ./perf_test-main
1: Run on (16 X 4500 MHz CPU s)
1: CPU Caches:
1:   L1 Data 48 KiB (x8)
1:   L1 Instruction 32 KiB (x8)
1:   L2 Unified 1280 KiB (x8)
1:   L3 Unified 18432 KiB (x1)
1: Load Average: 0.24, 0.22, 0.27
1: ***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
1: -------------------------------------------------------------------------------------------------------------------------
1: Benchmark Time CPU Iterations UserCounters...
1: -------------------------------------------------------------------------------------------------------------------------
1: benchmark_osu_latency_KokkosComm_isendirecv/1/manual_time 0.311 us 0.582 us 2291675 bytes=2
1: benchmark_osu_latency_KokkosComm_isendirecv/2/manual_time 0.306 us 0.588 us 2278004 bytes=4
1: benchmark_osu_latency_KokkosComm_isendirecv/4/manual_time 0.310 us 0.589 us 2259080 bytes=8
1: benchmark_osu_latency_KokkosComm_isendirecv/8/manual_time 0.305 us 0.588 us 2309790 bytes=16
1: benchmark_osu_latency_KokkosComm_isendirecv/16/manual_time 0.308 us 0.589 us 2293957 bytes=32
1: benchmark_osu_latency_KokkosComm_isendirecv/32/manual_time 0.340 us 0.614 us 2094313 bytes=64
1: benchmark_osu_latency_KokkosComm_isendirecv/64/manual_time 0.347 us 0.620 us 2050861 bytes=128
1: benchmark_osu_latency_KokkosComm_isendirecv/128/manual_time 0.377 us 0.643 us 1875787 bytes=256
1: benchmark_osu_latency_KokkosComm_isendirecv/256/manual_time 0.398 us 0.652 us 1770834 bytes=512
1: benchmark_osu_latency_KokkosComm_isendirecv/512/manual_time 0.448 us 0.706 us 1592269 bytes=1.024k
1: benchmark_osu_latency_KokkosComm_isendirecv/1024/manual_time 0.535 us 0.829 us 1317132 bytes=2.048k
1: benchmark_osu_latency_KokkosComm_isendirecv/2048/manual_time 0.681 us 0.975 us 1042584 bytes=4.096k
1: benchmark_osu_latency_KokkosComm_isendirecv/4096/manual_time 0.831 us 1.13 us 840202 bytes=8.192k
1: benchmark_osu_latency_KokkosComm_isendirecv/8192/manual_time 1.34 us 1.64 us 523542 bytes=16.384k
1: benchmark_osu_latency_KokkosComm_isendirecv/16384/manual_time 2.29 us 2.60 us 304826 bytes=32.768k
1: benchmark_osu_latency_KokkosComm_isendirecv/32768/manual_time 3.74 us 4.05 us 183694 bytes=65.536k
1: benchmark_osu_latency_KokkosComm_isendirecv/65536/manual_time 5.37 us 5.63 us 130552 bytes=131.072k
1: benchmark_osu_latency_KokkosComm_isendirecv/131072/manual_time 8.31 us 8.56 us 84056 bytes=262.144k
1: benchmark_osu_latency_MPI_isendirecv/1/manual_time 0.270 us 0.525 us 2668300 bytes=2
1: benchmark_osu_latency_MPI_isendirecv/2/manual_time 0.296 us 0.551 us 2363747 bytes=4
1: benchmark_osu_latency_MPI_isendirecv/4/manual_time 0.278 us 0.532 us 2434239 bytes=8
1: benchmark_osu_latency_MPI_isendirecv/8/manual_time 0.267 us 0.518 us 2631995 bytes=16
1: benchmark_osu_latency_MPI_isendirecv/16/manual_time 0.268 us 0.517 us 2634188 bytes=32
1: benchmark_osu_latency_MPI_isendirecv/32/manual_time 0.312 us 0.571 us 2243459 bytes=64
1: benchmark_osu_latency_MPI_isendirecv/64/manual_time 0.303 us 0.555 us 2263544 bytes=128
1: benchmark_osu_latency_MPI_isendirecv/128/manual_time 0.335 us 0.604 us 2105784 bytes=256
1: benchmark_osu_latency_MPI_isendirecv/256/manual_time 0.359 us 0.617 us 1911858 bytes=512
1: benchmark_osu_latency_MPI_isendirecv/512/manual_time 0.391 us 0.671 us 1797920 bytes=1.024k
1: benchmark_osu_latency_MPI_isendirecv/1024/manual_time 0.475 us 0.757 us 1472023 bytes=2.048k
1: benchmark_osu_latency_MPI_isendirecv/2048/manual_time 0.657 us 0.937 us 1062522 bytes=4.096k
1: benchmark_osu_latency_MPI_isendirecv/4096/manual_time 0.784 us 1.07 us 875927 bytes=8.192k
1: benchmark_osu_latency_MPI_isendirecv/8192/manual_time 1.30 us 1.58 us 543917 bytes=16.384k
1: benchmark_osu_latency_MPI_isendirecv/16384/manual_time 2.29 us 2.57 us 307345 bytes=32.768k
1: benchmark_osu_latency_MPI_isendirecv/32768/manual_time 3.90 us 4.20 us 180666 bytes=65.536k
1: benchmark_osu_latency_MPI_isendirecv/65536/manual_time 5.45 us 5.69 us 126047 bytes=131.072k
1: benchmark_osu_latency_MPI_isendirecv/131072/manual_time 8.45 us 8.69 us 82479 bytes=262.144k
1: benchmark_osu_latency_KokkosComm_sendrecv/1/manual_time 0.253 us 0.525 us 2789368 bytes=2
1: benchmark_osu_latency_KokkosComm_sendrecv/2/manual_time 0.261 us 0.528 us 2720577 bytes=4
1: benchmark_osu_latency_KokkosComm_sendrecv/4/manual_time 0.262 us 0.530 us 2725477 bytes=8
1: benchmark_osu_latency_KokkosComm_sendrecv/8/manual_time 0.264 us 0.534 us 2676273 bytes=16
1: benchmark_osu_latency_KokkosComm_sendrecv/16/manual_time 0.274 us 0.553 us 2518242 bytes=32
1: benchmark_osu_latency_KokkosComm_sendrecv/32/manual_time 0.271 us 0.547 us 2528843 bytes=64
1: benchmark_osu_latency_KokkosComm_sendrecv/64/manual_time 0.264 us 0.539 us 2668844 bytes=128
1: benchmark_osu_latency_KokkosComm_sendrecv/128/manual_time 0.278 us 0.569 us 2456403 bytes=256
1: benchmark_osu_latency_KokkosComm_sendrecv/256/manual_time 0.283 us 0.579 us 2446539 bytes=512
1: benchmark_osu_latency_KokkosComm_sendrecv/512/manual_time 0.330 us 0.633 us 2198542 bytes=1.024k
1: benchmark_osu_latency_KokkosComm_sendrecv/1024/manual_time 0.463 us 0.764 us 1504927 bytes=2.048k
1: benchmark_osu_latency_KokkosComm_sendrecv/2048/manual_time 0.609 us 0.913 us 1169746 bytes=4.096k
1: benchmark_osu_latency_KokkosComm_sendrecv/4096/manual_time 0.807 us 1.11 us 864433 bytes=8.192k
1: benchmark_osu_latency_KokkosComm_sendrecv/8192/manual_time 1.33 us 1.64 us 529714 bytes=16.384k
1: benchmark_osu_latency_KokkosComm_sendrecv/16384/manual_time 2.30 us 2.61 us 303424 bytes=32.768k
1: benchmark_osu_latency_KokkosComm_sendrecv/32768/manual_time 3.84 us 4.15 us 181786 bytes=65.536k
1: benchmark_osu_latency_KokkosComm_sendrecv/65536/manual_time 5.48 us 5.73 us 128899 bytes=131.072k
1: benchmark_osu_latency_KokkosComm_sendrecv/131072/manual_time 8.33 us 8.59 us 84304 bytes=262.144k
1: benchmark_osu_latency_MPI_sendrecv/1/manual_time 0.226 us 0.482 us 3081394 bytes=2
1: benchmark_osu_latency_MPI_sendrecv/2/manual_time 0.225 us 0.484 us 3195518 bytes=4
1: benchmark_osu_latency_MPI_sendrecv/4/manual_time 0.216 us 0.472 us 3191217 bytes=8
1: benchmark_osu_latency_MPI_sendrecv/8/manual_time 0.212 us 0.467 us 3317286 bytes=16
1: benchmark_osu_latency_MPI_sendrecv/16/manual_time 0.249 us 0.523 us 2823328 bytes=32
1: benchmark_osu_latency_MPI_sendrecv/32/manual_time 0.265 us 0.544 us 2653429 bytes=64
1: benchmark_osu_latency_MPI_sendrecv/64/manual_time 0.270 us 0.550 us 2602826 bytes=128
1: benchmark_osu_latency_MPI_sendrecv/128/manual_time 0.271 us 0.555 us 2573691 bytes=256
1: benchmark_osu_latency_MPI_sendrecv/256/manual_time 0.298 us 0.587 us 2453177 bytes=512
1: benchmark_osu_latency_MPI_sendrecv/512/manual_time 0.331 us 0.617 us 2096311 bytes=1.024k
1: benchmark_osu_latency_MPI_sendrecv/1024/manual_time 0.433 us 0.719 us 1615782 bytes=2.048k
1: benchmark_osu_latency_MPI_sendrecv/2048/manual_time 0.575 us 0.863 us 1225928 bytes=4.096k
1: benchmark_osu_latency_MPI_sendrecv/4096/manual_time 0.765 us 1.05 us 916542 bytes=8.192k
1: benchmark_osu_latency_MPI_sendrecv/8192/manual_time 1.24 us 1.53 us 561453 bytes=16.384k
1: benchmark_osu_latency_MPI_sendrecv/16384/manual_time 2.24 us 2.54 us 314079 bytes=32.768k
1: benchmark_osu_latency_MPI_sendrecv/32768/manual_time 3.76 us 4.06 us 181342 bytes=65.536k
1: benchmark_osu_latency_MPI_sendrecv/65536/manual_time 5.41 us 5.66 us 131464 bytes=131.072k
1: benchmark_osu_latency_MPI_sendrecv/131072/manual_time 8.62 us 8.87 us 81134 bytes=262.144k
1/1 Test #1: perf_test-main ................... Passed 117.57 sec
```
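To illustrate why a barrier on only one side can skew the comparison, here is a rough, MPI-free sketch. It assumes the barrier ran inside the manually timed region, which this issue does not state explicitly. `timed_exchange_us` is a hypothetical helper; `exchange` stands in for the send/recv pair and `sync` for the barrier.

```cpp
#include <chrono>

// Time one exchange in microseconds. If sync_inside_timer is true, the
// synchronization step is charged to the measurement; otherwise it runs
// before the clock starts. Hypothetical helper for illustration only.
template <class Exchange, class Sync>
double timed_exchange_us(Exchange&& exchange, Sync&& sync, bool sync_inside_timer) {
    using clock = std::chrono::steady_clock;
    if (!sync_inside_timer) {
        sync();  // synchronize first, outside the measured interval
    }
    const auto start = clock::now();
    if (sync_inside_timer) {
        sync();  // the cost of synchronizing is added to the reported latency
    }
    exchange();
    const auto stop = clock::now();
    return std::chrono::duration<double, std::micro>(stop - start).count();
}
```

With a fixed synchronization cost, the `sync_inside_timer == true` variant reports that cost on top of every message size, which would matter most for the sub-microsecond small-message rows.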

Testing with OpenMPI 4.1.1 yields similar results (marginally slower than MPICH), although I don't think sub-microsecond differences are meaningful.

It may be worth increasing the maximum message size even further, perhaps to something that does not fit in the last-level cache (e.g. >128 MiB, since server CPU caches can be quite large). We could also consider implementing the multi-threaded and multi-process latency variants, as well as the (bidirectional) bandwidth benchmarks, so we can better characterize KokkosComm's performance.
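As a rough sketch of what such a sweep could look like, the helper below doubles the message size until it exceeds a given cache size. `sweep_past_cache` is a hypothetical name, the sizes are in bytes, and how they would be wired into the benchmark registration is left open.

```cpp
#include <cstddef>
#include <vector>

// Doubling sweep of message sizes in bytes: 1, 2, 4, ... ending with the
// first power of two strictly larger than llc_bytes. Hypothetical helper;
// assumes llc_bytes is small enough that doubling does not overflow.
inline std::vector<std::size_t> sweep_past_cache(std::size_t llc_bytes) {
    std::vector<std::size_t> sizes;
    std::size_t n = 1;
    while (true) {
        sizes.push_back(n);
        if (n > llc_bytes) {
            break;  // the last entry no longer fits in the cache
        }
        n *= 2;
    }
    return sizes;
}
```

For the 18432 KiB L3 reported in the logs above, `sweep_past_cache(18432u * 1024u)` would end at 32 MiB; a server part with a larger cache would need a correspondingly larger bound.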

nicoleavans commented 1 week ago

These are the offending barriers:

https://github.com/kokkos/kokkos-comm/blob/5c14fd24cbea486543eb63f2df6b123a3d1f837e/perf_tests/test_osu_latency_sendrecv.cpp#L35

&

https://github.com/nicoleavans/kokkos-comm/blob/5c14fd24cbea486543eb63f2df6b123a3d1f837e/perf_tests/test_osu_latency_isendirecv.cpp#L38

I will remove them for completeness. Thank you!