sonic-net / sonic-buildimage

Scripts which perform an installable binary image build for SONiC
Other
739 stars 1.43k forks source link

[DHCP6RELAY] Error in logs "ERR dhcp_relay#dhcp6relay: bind: Failed to bind to socket" #9945

Closed ihorchekh closed 2 years ago

ihorchekh commented 2 years ago

Description Error in the log. Reproducing from time to time.

Steps to reproduce the issue:

  1. Run test pc/test_po_update.py py.test pc/test_po_update.py --inventory="../ansible/inventory,../ansible/veos" --host-pattern r-tigon-04 --module-path ../ansible/library/ --testbed r-tigon-04-t0-64 --testbed_file ../ansible/testbed.csv --log-cli-level debug --show-capture=no -ra --showlocals --clean-alluredir --alluredir=/tmp/allure-results --allure_server_addr="10.215.11.120" --skip_sanity

  2. Check result

Describe the results you received:

Test passed with error in logs

Jan 25 23:38:33.292623 r-tigon-04 ERR dhcp_relay#dhcp6relay: bind: Failed to bind to socket

Describe the results you expected: Test passed without errors in logs

Additional information you deem important:

Output of show version:

`SONiC.202111.9-9b4d80115_Internal`

Attach debug file sudo generate_dump:

sonic_dump_r-tigon-04_20220201_031800.tar.gz

zhangyanzhao commented 2 years ago

A new lag is configured which caused this error message, can be ignored safely.

zhangyanzhao commented 2 years ago

Kelly, please help to double check whether this is expected during the test.

liat-grozovik commented 2 years ago

@kellyyeh kindly reminder to provide the needed info.

kellyyeh commented 2 years ago

I am not able to reproduce the issue. Can we confirm if this issue is still happening?

kellyyeh commented 2 years ago

@ihorchekh This error message usually occurs when there are vlan(s) not configured with ipv6 helpers. Could you please check if this is the case for this issue?

ihorchekh commented 2 years ago

@kellyyeh The test is not related to VLANs and DHCP helpers. It removes all members and IP from a port-channel and the interface itself, and in the end, returns all back. We tried to run the test again and the issue was not reproduced. But this error appears on other tests during our nightly regression:

root@arc-switch1025:/home/admin# zcat /var/log/syslog.*.gz | grep "Failed to bind to socket"
Apr 14 01:19:04.584352 arc-switch1025 ERR dhcp_relay#dhcp6relay: bind: Failed to bind to socket
Apr 14 01:44:17.735089 arc-switch1025 ERR dhcp_relay#dhcp6relay: bind: Failed to bind to socket
Apr 13 22:20:40.757450 arc-switch1025 ERR dhcp_relay#dhcp6relay: bind: Failed to bind to socket
Apr 13 22:24:10.163913 arc-switch1025 ERR dhcp_relay#dhcp6relay: bind: Failed to bind to socket
Apr 13 19:56:25.148412 arc-switch1025 ERR dhcp_relay#dhcp6relay: bind: Failed to bind to socket
Apr 14 05:03:49.379104 arc-switch1025 ERR dhcp_relay#dhcp6relay: bind: Failed to bind to socket
Apr 13 16:31:39.785202 arc-switch1025 ERR dhcp_relay#dhcp6relay: bind: Failed to bind to socket
Apr 13 16:36:01.893685 arc-switch1025 ERR dhcp_relay#dhcp6relay: bind: Failed to bind to socket
Apr 14 04:01:11.015635 arc-switch1025 ERR dhcp_relay#dhcp6relay: bind: Failed to bind to socket
yucgu commented 2 years ago

Hi @kellyyeh is it possible to cause the failure of the dhcp6 test case: test_dhcp_relay_after_link_flap in https://github.com/Azure/sonic-mgmt/blob/master/tests/dhcp_relay/test_dhcpv6_relay.py

the error logs for the test case shows the UDP port in the received packet is not the expected "547". as instead, it's 53658 (0xD19A), looks it's modified by the dhcp-relay. from the UDP usage dump, the dhcp-relay is using that port (53658): root@# netstat -tulpn | grep dhcp6 udp6 0 0 :::53658 ::: 9369/dhcp6relay
udp6 0 0 fc02:1000::1:547 :::
9369/dhcp6relay root@#

it doesn't always happen, when the case passes, there is no the bind failure error log, but when the case fails, we can see that bind failure error log.

Thanks

kellyyeh commented 2 years ago

Hi @yucgu, could you elaborate on the image version and platform this test failure happened on?

yucgu commented 2 years ago

Hi @kellyyeh, the image is the 202012, and the platform is cisco-8000.

However, even with the failure to bind with the standard UDP port (547), using non standard source ports are fine according to dhcp relay RFC (shown below) as long as source port sub-option is specified.

While the test script expects the UDP port always to be the 547, when the dhcp-relay uses the non-standard UDP port, should the test case code support that scenario and let the case pass?

From RFC: " 5.2. DHCPv6

The IPv6 relay agent MUST include the "Relay Source Port Option" when

it uses a non-DHCP UDP port (not 547) to communicate to a DHCPv6

server or an upstream IPv6 relay agent. Also, when an IPv6 relay

agent detects that a downstream relay agent uses a non-DHCP UDP port

in the packet, it MUST record the port number in the "Downstream

Source Port" field of this option. If this option is included to

indicate only the local non-DHCP UDP port usage and the port is not

being used by a downstream relay agent, the Downstream Source Port

field MUST be set to zero. "

kellyyeh commented 2 years ago

Thanks @yucgu for sharing your findings. We want to keep udp port to be 547 as it is the standard port. We also saw a few testcase failures during our test runs, but it is very inconsistent and cannot be reproduced when running the test manually. Could you share how you reproduced the error?

yucgu commented 2 years ago

Thanks @kellyyeh for checking it. Yes, looks it's an intermittent issue, and it's not easy to reproduce on our testbeds either, it occasionally happened on our nightly test runs, and i have to keep reloading the whole system multiple times to reproduce the issue. if it's okay after system bootup, to re-run the testcase or to reload the dhcp_relay container doesn't help reproduce it. and if the issue happens after the bootup, to restart the dhcp_relay container can resolve the issue.

is it same with the issue discussed in this topic? will the solution in progress be able to resolve the dhcp testcase failure as well? if it's not same, do we need to track it with a separate topic? thanks

yucgu commented 2 years ago

Hi @kellyyeh, is there a plan for the issue? thanks

kellyyeh commented 2 years ago

@yucgu yes the solution will be able to resolve the dhcp testcase failure. The fix is in progress