openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

"CM private data buffer is too small to pack UCP endpoint info" - a warning that should be heeded? #9368

Closed rrgargeya closed 1 year ago

rrgargeya commented 1 year ago

Describe the bug

I see messages such as "wireup_cm.c:333 UCX DIAG CM private data buffer is too small to pack UCP endpoint info, ep 0xXXX service data version 0, size 11, address length 70, cm 0xXXXX max_conn_priv 54"

Is this message a bug, or is it a warning to which one should pay attention?

Steps to Reproduce

UCX_LOG_LEVEL=INFO
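The reproduction step above amounts to setting one environment variable for the process that uses UCX. A minimal sketch of doing that from a launcher script follows; the helper name `run_with_ucx_log_level` is hypothetical, and the child command here is only a stand-in that echoes the variable, not a real UCX application.

```python
import os
import subprocess
import sys


def run_with_ucx_log_level(command, level="info"):
    """Run `command` with UCX_LOG_LEVEL set in its environment (hypothetical helper)."""
    env = dict(os.environ)          # copy, so the parent environment is untouched
    env["UCX_LOG_LEVEL"] = level    # the variable named in the report
    return subprocess.run(command, env=env, capture_output=True, text=True, check=False)


# Stand-in child process that only prints the variable it received.
# Replace this command with the actual UCX-based program to reproduce the message.
result = run_with_ucx_log_level(
    [sys.executable, "-c", "import os; print(os.environ['UCX_LOG_LEVEL'])"]
)
print(result.stdout.strip())
```

With a real UCX application as the command, any DIAG lines would be expected in the captured output rather than on the terminal, since the sketch captures stdout and stderr.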

Setup and versions

$ ibstat
CA 'mlx5_0'
    CA type: MT4119
    Number of ports: 1
    Firmware version: 16.32.1010
    Hardware version: 0
    Node GUID: 0xec0d9a0300431604
    System image GUID: 0xec0d9a0300431604
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 100
        Base lid: 17
        LMC: 0
        SM lid: 1
        Capability mask: 0x2651e848
        Port GUID: 0xec0d9a0300431604
        Link layer: InfiniBand
CA 'mlx5_1'
    CA type: MT4119
    Number of ports: 1
    Firmware version: 16.32.1010
    Hardware version: 0
    Node GUID: 0xec0d9a0300431605
    System image GUID: 0xec0d9a0300431604
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 100
        Base lid: 104
        LMC: 0
        SM lid: 1
        Capability mask: 0x2651e848
        Port GUID: 0xec0d9a0300431605
        Link layer: InfiniBand
CA 'mlx5_2'
    CA type: MT4119
    Number of ports: 1
    Firmware version: 16.32.1010
    Hardware version: 0
    Node GUID: 0xec0d9a0300431624
    System image GUID: 0xec0d9a0300431624
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 100
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x00010000
        Port GUID: 0xee0d9afffe431624
        Link layer: Ethernet
CA 'mlx5_3'
    CA type: MT4119
    Number of ports: 1
    Firmware version: 16.32.1010
    Hardware version: 0
    Node GUID: 0xec0d9a0300431625
    System image GUID: 0xec0d9a0300431624
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 100
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x00010000
        Port GUID: 0xee0d9afffe431625
        Link layer: Ethernet

Additional information (depending on the issue)

brminich commented 1 year ago

Is there any failure or other symptom? This message is not a bug: it is a diag trace, not an error. UCX is able to shrink the address when this situation happens.
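The numbers in the reported message can be checked by hand. The sketch below parses the three sizes out of the log line and computes how far the payload overshoots the buffer. It assumes, as a reading of the message text rather than something confirmed in this thread, that the data to pack is "size" plus "address length" and that the trace is printed when that sum exceeds max_conn_priv. The helper names are hypothetical and are not part of UCX.

```python
import re

# Message text copied from the report; the pointer values are redacted there.
LOG_LINE = (
    "wireup_cm.c:333 UCX DIAG CM private data buffer is too small to pack "
    "UCP endpoint info, ep 0xXXX service data version 0, size 11, "
    "address length 70, cm 0xXXXX max_conn_priv 54"
)


def parse_cm_diag(line):
    """Extract the three sizes from the diag message (hypothetical helper)."""
    pattern = re.compile(
        r"size (?P<size>\d+), address length (?P<addr_len>\d+), "
        r"cm \S+ max_conn_priv (?P<max_priv>\d+)"
    )
    match = pattern.search(line)
    if match is None:
        raise ValueError("line does not look like the CM private data message")
    return {name: int(value) for name, value in match.groupdict().items()}


def private_data_shortfall(fields):
    """Bytes over the limit, ASSUMING the payload is size + address length."""
    needed = fields["size"] + fields["addr_len"]
    return max(0, needed - fields["max_priv"])


fields = parse_cm_diag(LOG_LINE)
shortfall = private_data_shortfall(fields)
print(fields, shortfall)
```

Under that assumption the payload would be 11 + 70 = 81 bytes against a 54-byte limit, so the address would need to lose at least 27 bytes to fit, which is consistent with the statement that UCX shrinks the address in this situation.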

rrgargeya commented 1 year ago

No, there is no failure. But I was wondering whether the DIAG trace indicates a possible problem and whether I should do something about it. Am I to gather from your response that this trace message is harmless and that UCX is taking countermeasures?