Closed ihorchekh closed 2 years ago
A new LAG was configured, which caused this error message; it can be ignored safely.
Kelly, please help double-check whether this is expected during the test.
@kellyyeh kind reminder to provide the needed info.
I am not able to reproduce the issue. Can we confirm if this issue is still happening?
@ihorchekh This error message usually occurs when there are VLAN(s) not configured with IPv6 helpers. Could you please check if this is the case for this issue?
@kellyyeh The test is not related to VLANs and DHCP helpers. It removes all members and IP from a port-channel and the interface itself, and in the end, returns all back. We tried to run the test again and the issue was not reproduced. But this error appears on other tests during our nightly regression:
root@arc-switch1025:/home/admin# zcat /var/log/syslog.*.gz | grep "Failed to bind to socket"
Apr 14 01:19:04.584352 arc-switch1025 ERR dhcp_relay#dhcp6relay: bind: Failed to bind to socket
Apr 14 01:44:17.735089 arc-switch1025 ERR dhcp_relay#dhcp6relay: bind: Failed to bind to socket
Apr 13 22:20:40.757450 arc-switch1025 ERR dhcp_relay#dhcp6relay: bind: Failed to bind to socket
Apr 13 22:24:10.163913 arc-switch1025 ERR dhcp_relay#dhcp6relay: bind: Failed to bind to socket
Apr 13 19:56:25.148412 arc-switch1025 ERR dhcp_relay#dhcp6relay: bind: Failed to bind to socket
Apr 14 05:03:49.379104 arc-switch1025 ERR dhcp_relay#dhcp6relay: bind: Failed to bind to socket
Apr 13 16:31:39.785202 arc-switch1025 ERR dhcp_relay#dhcp6relay: bind: Failed to bind to socket
Apr 13 16:36:01.893685 arc-switch1025 ERR dhcp_relay#dhcp6relay: bind: Failed to bind to socket
Apr 14 04:01:11.015635 arc-switch1025 ERR dhcp_relay#dhcp6relay: bind: Failed to bind to socket
Hi @kellyyeh, could this be the cause of the failure of the DHCPv6 test case test_dhcp_relay_after_link_flap in https://github.com/Azure/sonic-mgmt/blob/master/tests/dhcp_relay/test_dhcpv6_relay.py?
The error logs for the test case show that the UDP port in the received packet is not the expected 547; instead, it is 53658 (0xD19A), which looks like it was modified by the dhcp-relay.
From the UDP socket dump, the dhcp-relay is using that port (53658):
root@# netstat -tulpn | grep dhcp6
udp6 0 0 :::53658 ::: 9369/dhcp6relay
udp6 0 0 fc02:1000::1:547 ::: 9369/dhcp6relay
root@#
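A minimal sketch of what is likely happening, assuming the bind failure is an address-in-use condition (port 54700 stands in for 547, which would require root, and the use of local IPv6 loopback is an assumption): once bind() fails, the relay keeps using the unbound socket, and the kernel assigns an ephemeral source port on the first sendto(), which would explain the high port such as 53658 in the netstat output above.

```python
import socket

PORT = 54700  # stand-in for DHCPv6's port 547 (binding to 547 needs root)

# First socket claims the port, as another process or a leftover
# socket might on the switch.
holder = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
holder.bind(("::1", PORT))

# The relay's bind() then fails with EADDRINUSE...
relay = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
try:
    relay.bind(("::1", PORT))
except OSError as e:
    print(f"bind: Failed to bind to socket ({e.strerror})")

# ...but sendto() on the unbound socket still works: the kernel
# assigns an ephemeral source port instead of the standard one.
relay.sendto(b"ping", ("::1", PORT))
ephemeral = relay.getsockname()[1]
print(f"packets now leave from port {ephemeral}, not {PORT}")

holder.close()
relay.close()
```

This is only an illustration of the socket-level mechanism, not the dhcp6relay code itself; the actual reason the port is already taken at boot is what the thread below is trying to pin down.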
It doesn't always happen: when the case passes, there is no bind failure in the log, but when the case fails, we can see that bind failure error log.
Thanks
Hi @yucgu, could you elaborate on the image version and platform this test failure happened on?
Hi @kellyyeh, the image is 202012, and the platform is cisco-8000.
However, even when binding to the standard UDP port (547) fails, using a non-standard source port is acceptable according to the DHCP relay RFC (quoted below), as long as the source port sub-option is included.
While the test script expects the UDP port to always be 547, when the dhcp-relay uses a non-standard UDP port, should the test case code support that scenario and let the case pass?
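If the test were to tolerate that behavior, the check could be relaxed along these lines. This is a sketch only, not the actual sonic-mgmt code; the function and parameter names are illustrative:

```python
DHCPV6_SERVER_PORT = 547  # standard DHCPv6 server/relay-agent port


def source_port_acceptable(sport: int, has_relay_port_option: bool) -> bool:
    """Accept the standard port unconditionally; accept a non-standard
    source port only when the Relay Source Port Option is present in
    the relay-forward message, as the RFC text below requires."""
    return sport == DHCPV6_SERVER_PORT or has_relay_port_option
```

For example, `source_port_acceptable(53658, True)` would let the observed case pass, while `source_port_acceptable(53658, False)` would still flag a relay that uses a non-standard port without signalling it.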
From RFC: " 5.2. DHCPv6
The IPv6 relay agent MUST include the "Relay Source Port Option" when
it uses a non-DHCP UDP port (not 547) to communicate to a DHCPv6
server or an upstream IPv6 relay agent. Also, when an IPv6 relay
agent detects that a downstream relay agent uses a non-DHCP UDP port
in the packet, it MUST record the port number in the "Downstream
Source Port" field of this option. If this option is included to
indicate only the local non-DHCP UDP port usage and the port is not
being used by a downstream relay agent, the Downstream Source Port
field MUST be set to zero. "
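For reference, the option in question is the Relay Source Port Option defined in RFC 8357, which assigns DHCPv6 option code 135 with a fixed 2-byte payload. A minimal encoder sketch:

```python
import struct

OPTION_RELAY_PORT = 135  # DHCPv6 option code assigned by RFC 8357


def relay_source_port_option(downstream_port: int = 0) -> bytes:
    """Encode option-code, option-len (always 2), and the Downstream
    Source Port field. Per the RFC text quoted above, the field is
    zero when only the local non-DHCP UDP port usage is signalled."""
    if not 0 <= downstream_port <= 0xFFFF:
        raise ValueError("port out of range")
    return struct.pack("!HHH", OPTION_RELAY_PORT, 2, downstream_port)
```

For instance, a relay that itself binds to an ephemeral port but has no downstream relay would emit `relay_source_port_option(0)`, while one seeing a downstream relay on port 53658 would emit `relay_source_port_option(53658)`.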
Thanks @yucgu for sharing your findings. We want to keep the UDP port at 547, as it is the standard port. We also saw a few test case failures during our test runs, but they are very inconsistent and cannot be reproduced when running the test manually. Could you share how you reproduced the error?
Thanks @kellyyeh for checking it. Yes, it looks like an intermittent issue, and it's not easy to reproduce on our testbeds either; it happens occasionally on our nightly test runs, and I had to keep reloading the whole system multiple times to reproduce it. If everything is fine after system boot-up, re-running the test case or reloading the dhcp_relay container doesn't help reproduce it; and when the issue does appear after boot-up, restarting the dhcp_relay container resolves it.
Is this the same as the issue discussed in this topic? Will the solution in progress resolve the DHCP test case failure as well? If it's not the same, do we need to track it in a separate topic? Thanks
Hi @kellyyeh, is there a plan for the issue? thanks
@yucgu Yes, the solution will resolve the DHCP test case failure as well. The fix is in progress.
Description: Error in the log, reproducing from time to time.
Steps to reproduce the issue:
1. Run test pc/test_po_update.py:
py.test pc/test_po_update.py --inventory="../ansible/inventory,../ansible/veos" --host-pattern r-tigon-04 --module-path ../ansible/library/ --testbed r-tigon-04-t0-64 --testbed_file ../ansible/testbed.csv --log-cli-level debug --show-capture=no -ra --showlocals --clean-alluredir --alluredir=/tmp/allure-results --allure_server_addr="10.215.11.120" --skip_sanity
2. Check result
Describe the results you received:
Test passed with error in logs
Describe the results you expected: Test passed without errors in logs
Additional information you deem important:
Output of show version:
Attach debug file (sudo generate_dump): sonic_dump_r-tigon-04_20220201_031800.tar.gz