openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

UCX Error and Slow OSU Benchmark Performance on RHEL9 #9404

Open yangliu2009 opened 1 year ago

yangliu2009 commented 1 year ago

I installed openmpi 4.0.7 and 4.1.5 on RHEL 9. When I run the OSU Benchmarks, I get the error messages below. Can these messages be safely ignored? If so, how can I suppress them?

[node1804:235828] common_ucx.c:404 waiting for 1 disconnect requests
[node1804:235828] common_ucx.h:153 ucp_disconnect_nb failed: 1, Operation in progress
[node1804:235828] common_ucx.c:440 disconnecting from rank 7
[1696387831.419127] [node1804:235828:0]           flush.c:56   UCX  ERROR req 0x24b9300: error during flush: Connection reset by remote peer
[1696387831.419133] [node1804:235828:0]           flush.c:56   UCX  ERROR req 0x24b9300: error during flush: Connection reset by remote peer
[node1804:235828] common_ucx.c:444  Error: ucp_disconnect_nb(7) failed: Connection reset by remote peer
[node1804:235828] common_ucx.c:404 waiting for 0 disconnect requests

OSU benchmark performance is also worse on RHEL9 than on our current RHEL7 system. openmpi/4.0.7 on RHEL7 doesn't use UCX, while openmpi on RHEL9 does. I am not sure whether the OSU benchmarks are a good way to measure UCX performance. If they are, is there a way to improve openmpi performance on RHEL9?

Below is the openmpi/4.0.7 performance on RHEL7:
# OSU MPI Allgather Latency Test v5.6.3
# Size       Avg Latency(us)
1                       8.11
2                       8.09
4                       8.99
8                      10.07
16                     10.97
32                     12.17
64                     14.74
128                    22.77
256                    37.48
512                    69.06
1024                  104.94
2048                  281.41
4096                  360.24
8192                  531.00
16384                 837.46
32768                1496.04
65536                3063.41
131072               6136.27
262144              11629.43
524288              22941.44
1048576             55586.29

Below is the openmpi/4.0.7 performance on RHEL9:

# OSU MPI Allgather Latency Test v5.6.3
# Size       Avg Latency(us)
1                       9.83
2                      10.03
4                      10.44
8                      11.04
16                     12.50
32                     14.19
64                     17.73
128                    25.30
256                    39.88
512                    68.69
1024                  161.68
2048                  276.32
4096                  499.27
8192                  995.36
16384                1999.96
32768                3992.45
65536                7998.37
131072              15678.02
262144              31147.47
524288              62086.76
1048576            123985.42

Below is the openmpi/4.1.5 performance on RHEL9:

# OSU MPI Allgather Latency Test v5.6.3
# Size       Avg Latency(us)
1                       9.75
2                       9.94
4                      10.28
8                      10.91
16                     12.23
32                     13.91
64                     17.33
128                    24.69
256                    39.53
512                   107.36
1024                  156.13
2048                  273.92
4096                  499.03
8192                  995.29
16384                1998.06
32768                3993.13
65536                8012.16
131072              16044.28
262144              31338.91
524288              62296.19
1048576            124212.83
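
To make the gap easier to see, here is a minimal sketch that computes the RHEL9/RHEL7 latency ratio for a few message sizes. The rows are copied from the two openmpi/4.0.7 tables above; the helper names `parse_osu` and `slowdown` are only illustrative and are not part of the OSU Benchmarks or UCX.

```python
def parse_osu(text):
    """Parse OSU output into {message_size: avg_latency_us}, skipping '#' lines."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        size, latency = line.split()
        result[int(size)] = float(latency)
    return result


def slowdown(baseline, candidate):
    """Return {size: candidate / baseline} for sizes present in both runs."""
    return {
        size: candidate[size] / baseline[size]
        for size in sorted(baseline)
        if size in candidate
    }


# A few rows copied from the openmpi/4.0.7 tables above (not the full output).
rhel7 = parse_osu("""
# Size       Avg Latency(us)
1                       8.11
1024                  104.94
8192                  531.00
1048576             55586.29
""")

rhel9 = parse_osu("""
# Size       Avg Latency(us)
1                       9.83
1024                  161.68
8192                  995.36
1048576            123985.42
""")

ratios = slowdown(rhel7, rhel9)
for size, ratio in ratios.items():
    print(f"{size:>8} bytes: RHEL9 is {ratio:.2f}x the RHEL7 latency")
```

For these rows the ratio grows with message size: small messages are only modestly slower on RHEL9, while the 1 MiB allgather takes more than twice as long.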

yosefe commented 1 year ago

@yangliu2009 can you please fill in the details according to the new issue template?