Open oleotiger opened 2 years ago
@oleotiger if the IP configuration of the RoCE network device changes, it may cause a GID_CHANGE event. Some changes may cause running applications that use UCX to stop working, if the IP/GID they were using is affected.
I'm sure that the IP configuration of the RoCE network device did not change while the application was running. Could there be any other reason for UCX to print the warning?
Can you try to capture the dmesg log from the period when the UCX application was running?
First I cleared the dmesg log with `dmesg -C`.
Then I started the UCX application, and the warning `ib_device.c:482 UCX WARN IB Async event on mlx5_0: GID table change on port 1` was printed.
`dmesg -T` shows an empty dmesg log.
@jgunthorpe why could such an event be generated without any network change?
Most likely there is a configuration change. For instance an IPv6 temporary privacy address roll over. Monitor for changes with iproute to confirm.
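For example, `ip monitor address` prints address changes as they happen, and two `ip -json address show` snapshots taken before and after a run can be compared offline. Below is a minimal Python sketch of that comparison. It assumes iproute2's JSON output with its `ifname` and `addr_info` fields; the function names are made up for illustration and are not part of any tool.

```python
import json


def parse_addresses(ip_json_text):
    """Return a set of (ifname, family, "address/prefixlen") tuples.

    Assumes the text is the output of `ip -json address show`.
    """
    entries = set()
    for link in json.loads(ip_json_text):
        ifname = link.get("ifname", "?")
        for info in link.get("addr_info", []):
            if "local" not in info:
                continue  # skip entries without a local address
            address = f'{info["local"]}/{info.get("prefixlen", "?")}'
            entries.add((ifname, info.get("family", "?"), address))
    return entries


def diff_addresses(before_text, after_text):
    """Return (added, removed) address tuples between two snapshots."""
    before = parse_addresses(before_text)
    after = parse_addresses(after_text)
    return sorted(after - before), sorted(before - after)
```

If both lists come back empty while the warning still appears, an address change on the netdev is less likely to be the trigger.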
I don't think an IPv6 temporary privacy address rolled over while UCX was running.
I disabled IPv6 by setting `net.ipv6.conf.all.disable_ipv6=1`, but the warning didn't disappear.
well, we can reduce the log level of these warnings to DIAG (hidden by default).
We can hide the message by changing the log level, but I want to find the real reason that causes UCX to print these warnings. Are there any other ideas or ways to locate the problem?
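One thing that could help narrow it down is to snapshot the GID table itself before and after a run and see which entry moved. The following is a rough Python sketch. It assumes the usual Linux sysfs layout, where each GID entry is a file named by its index under `/sys/class/infiniband/<device>/ports/<port>/gids/`; the helper names are hypothetical.

```python
import os


def read_gid_table(gid_dir):
    """Map GID index -> GID string for every non-zero entry in gid_dir."""
    table = {}
    for name in os.listdir(gid_dir):
        if not name.isdigit():
            continue
        try:
            with open(os.path.join(gid_dir, name)) as f:
                gid = f.read().strip()
        except OSError:
            continue  # some slots may not be readable
        # Empty slots read as all zeros; keep only populated entries.
        if gid and set(gid) - set("0:"):
            table[int(name)] = gid
    return table


def diff_gid_tables(old, new):
    """Return (index, old_gid, new_gid) for every entry that differs."""
    changes = []
    for idx in sorted(set(old) | set(new)):
        if old.get(idx) != new.get(idx):
            changes.append((idx, old.get(idx), new.get(idx)))
    return changes
```

For the device in the warning, that would mean calling `read_gid_table("/sys/class/infiniband/mlx5_0/ports/1/gids")` once before starting the application and once after the warning shows up, then passing both results to `diff_gid_tables`.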
According to the previous comments, there is no obvious reason for these warnings. Please open a case for Nvidia networking support.
OK, I have opened a case in the Mellanox community. This issue can be closed. Thank you for your reply.
Also seeing this issue on a fresh CentOS Stream 8 host with the same UCX version installed. The warnings are printed to the terminal only when an MPI application is running. Using the system-provided Open MPI 4.1.1. Another weird thing is that MPI is using `--mca btl self,vader`, so it is not even using the Mellanox device in this system.
Describe the bug
When I run the OSU benchmark compiled with HPC-X, I get this warning:
[1635835013.823013] [node181:6471 :async] ib_device.c:475 UCX WARN IB Async event on mlx5_0: GID table change on port 1
I found issue 1845, where someone said it's not a bug. Why is there a warning message with HPC-X? By the way, I'm working with RoCE, and the Mellanox network card is working in Ethernet mode. OpenSM has not been restarted by anyone.
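To correlate these warnings with `dmesg -T` or other logs, it can help to pull the timestamp, device, and port out of each line. Here is a small Python sketch that does so for lines shaped like the one above. It assumes the first bracketed field is Unix epoch seconds, which the value above is consistent with; the regular expression and function name are illustrative only.

```python
import re
from datetime import datetime, timezone

# Matches lines shaped like the warning quoted above.
UCX_GID_EVENT = re.compile(
    r"\[(?P<ts>\d+\.\d+)\]\s+"
    r"\[(?P<host>[^:\]]+):(?P<pid>\d+)\s*:(?P<thread>\w+)\]\s+"
    r"\S+\s+UCX\s+WARN\s+"
    r"IB Async event on (?P<dev>\S+): GID table change on port (?P<port>\d+)"
)


def parse_gid_events(lines):
    """Return one dict per 'GID table change' warning found in lines."""
    events = []
    for line in lines:
        m = UCX_GID_EVENT.search(line)
        if m is None:
            continue
        events.append({
            # Assumption: the leading field is epoch seconds.
            "time": datetime.fromtimestamp(float(m["ts"]), tz=timezone.utc),
            "host": m["host"],
            "pid": int(m["pid"]),
            "device": m["dev"],
            "port": int(m["port"]),
        })
    return events
```

Feeding the application's output through `parse_gid_events` gives a list of events whose times can be lined up against address or GID table changes recorded on the host.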
Steps to Reproduce
`ucx_info -v`
Setup and versions
rdma-core-54mlnx1-1.54103.aarch64
MLNX_OFED_LINUX-5.4-1.0.3.0
`ibstat` or `ibv_devinfo -vv` command
Additional information (depending on the issue)
`ucx_info -d` to show transports and devices recognized by UCX