sonic-net / sonic-mgmt

Configuration management examples for SONiC
Other
194 stars 713 forks source link

[dhcp-relay] Test fails with Random Source Port #2381

Open irene-pan1202 opened 3 years ago

irene-pan1202 commented 3 years ago

Description DHCP relay test fails with Random Source Port. What usage scenarios need to change the source port to a non-standard port? As described in pull request https://github.com/Azure/sonic-mgmt/pull/2136 SNAT feature will change source port. Why does SNAT change the dhcp packet? Not only for ip packet? Does this test case belong to common?

Steps to reproduce the issue:

  1. Run Tests With Random Source Port

Describe the results you received:

 "E                   \"dhcp_relay_test.DHCPTest ... FAIL\", ", 
 "E                   \"\", ", 
 "E                   \"======================================================================\", ", 
 "E                   \"FAIL: dhcp_relay_test.DHCPTest\", ", 
 "E                   \"----------------------------------------------------------------------\", ", 
 "E                   \"Traceback (most recent call last):\", ", 
 "E                   \"  File \\\"ptftests/dhcp_relay_test.py\\\", line 541, in runTest\", ", 
 "E                   \"    self.verify_relayed_discover()\", ", 
 "E                   \"  File \\\"ptftests/dhcp_relay_test.py\\\", line 423, in verify_relayed_discover\", ", 
 "E                   \"    \\\"Failed: Discover count of %d != %d\\\" % (discover_count, num_expected_packets))\", ", 
 "E                   \"AssertionError: Failed: Discover count of 0 != 1\", ", 

Describe the results you expected:

Additional information you deem important:

tahmed-dev commented 3 years ago

@irene-pan1202 Thanks for reporting this issue. Can you please attach the tech support? Also, what are the ACL ruled defined on the DUT when the test failed?

irene-pan1202 commented 3 years ago

@tahmed-dev the ACL rule defined source port is standard port 67 or 68, the test failed due to the DUT can't receive dhcp packet with non-standard port.

tahmed-dev commented 3 years ago

@irene-pan1202 the ACl rules should check for dest port only and not src port. There was change here on master to remove the ACL on src port. Which version are you running? Also, can you please provide the output of sudo iptables -S from your DUT?

noaOrMlnx commented 3 years ago

@tahmed-dev This test fails with same reason on 201911 image with hash 500395c.

tahmed-dev commented 3 years ago

@noaOrMlnx, can you please share the output of sudo iptables -S from the DUT where the test failed? Also, please share the route on the same DUT.

noaOrMlnx commented 3 years ago

@tahmed-dev

-P INPUT ACCEPT
-P FORWARD ACCEPT
-P OUTPUT ACCEPT
-A INPUT -s 127.0.0.1/32 -i lo -j ACCEPT
-A INPUT -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A INPUT -p icmp -m icmp --icmp-type 8 -j ACCEPT
-A INPUT -p icmp -m icmp --icmp-type 0 -j ACCEPT
-A INPUT -p icmp -m icmp --icmp-type 3 -j ACCEPT
-A INPUT -p icmp -m icmp --icmp-type 11 -j ACCEPT
-A INPUT -p udp -m udp --dport 67:68 -j ACCEPT
-A INPUT -p udp -m udp --dport 546:547 -j ACCEPT
-A INPUT -p tcp -m tcp --dport 179 -j ACCEPT
-A INPUT -p tcp -m tcp --sport 179 -j ACCEPT
-A INPUT -d 10.1.0.32/32 -j DROP
-A INPUT -d 10.210.24.0/32 -j DROP
-A INPUT -d 192.168.0.1/32 -j DROP
-A INPUT -d 10.0.0.62/32 -j DROP
-A INPUT -d 10.0.0.58/32 -j DROP
-A INPUT -d 10.0.0.60/32 -j DROP
-A INPUT -d 10.0.0.56/32 -j DROP
-A INPUT -d 10.10.10.0/32 -j DROP
-A INPUT -m ttl --ttl-lt 2 -j ACCEPT
tahmed-dev commented 3 years ago

Thanks @noaOrMlnx , I see that the accept rule are based upon the dport only. Does DHCP test cases with standard sports 67/68 pass on the same DUT? The reason I ask is that a failure can also be due to lack of bgp routes where dhcp packets could potentially travel through mgmt interface. This could be simply checked using show ip route <DHCP server IP> and also show ip bgp sum to validate packet will travel via uplink interfaces.

liat-grozovik commented 3 years ago

@noaOrMlnx and @ayurkiv-nvda can you please reply to the above? we still see this failure from time to time

stephengzh commented 3 years ago

@tahmed-dev Hi, I met the same issue as both of theirs. My iptables, show ip route and show ip bgp sum. Please check! Cheers! (master 595)

root@str-msn2700-01:/home/admin# iptables -S -P INPUT ACCEPT -P FORWARD ACCEPT -P OUTPUT ACCEPT -A INPUT -s 127.0.0.1/32 -i lo -j ACCEPT -A INPUT -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT -A INPUT -p icmp -m icmp --icmp-type 8 -j ACCEPT -A INPUT -p icmp -m icmp --icmp-type 0 -j ACCEPT -A INPUT -p icmp -m icmp --icmp-type 3 -j ACCEPT -A INPUT -p icmp -m icmp --icmp-type 11 -j ACCEPT -A INPUT -p udp -m udp --dport 67:68 -j ACCEPT -A INPUT -p udp -m udp --dport 546:547 -j ACCEPT -A INPUT -p tcp -m tcp --dport 179 -j ACCEPT -A INPUT -p tcp -m tcp --sport 179 -j ACCEPT -A INPUT -d 10.1.0.32/32 -j DROP -A INPUT -d 172.28.55.0/32 -j DROP -A INPUT -d 192.168.0.1/32 -j DROP -A INPUT -d 10.0.0.58/32 -j DROP -A INPUT -d 10.0.0.60/32 -j DROP -A INPUT -d 10.0.0.56/32 -j DROP -A INPUT -d 10.0.0.62/32 -j DROP -A INPUT -m ttl --ttl-lt 2 -j ACCEPT root@str-msn2700-01:/home/admin# show ip route 192.0.0.2 Routing entry for 0.0.0.0/0 Known via "bgp", distance 20, metric 0, best Last update 00:00:05 ago

Routing entry for 0.0.0.0/0 Known via "static", distance 200, metric 0 Last update 00:15:13 ago 172.28.55.1, via eth0, weight 1

root@str-msn2700-01:/home/admin# show ip route 192.0.0.1 Routing entry for 0.0.0.0/0 Known via "bgp", distance 20, metric 0, best Last update 00:00:09 ago

Routing entry for 0.0.0.0/0 Known via "static", distance 200, metric 0 Last update 00:15:17 ago 172.28.55.1, via eth0, weight 1

root@str-msn2700-01:/home/admin# show ip route 192.0.0.3 Routing entry for 0.0.0.0/0 Known via "bgp", distance 20, metric 0, best Last update 00:00:12 ago

Routing entry for 0.0.0.0/0 Known via "static", distance 200, metric 0 Last update 00:15:20 ago 172.28.55.1, via eth0, weight 1

root@str-msn2700-01:/home/admin# show ip route 192.0.0.4 Routing entry for 0.0.0.0/0 Known via "bgp", distance 20, metric 0, best Last update 00:00:13 ago

Routing entry for 0.0.0.0/0 Known via "static", distance 200, metric 0 Last update 00:15:21 ago 172.28.55.1, via eth0, weight 1

root@str-msn2700-01:/home/admin# show ip bgp sum

IPv4 Unicast Summary: BGP router identifier 10.1.0.32, local AS number 65100 vrf-id 0 BGP table version 140277 RIB entries 12807, using 2458944 bytes of memory Peers 4, using 87264 KiB of memory Peer groups 4, using 256 bytes of memory

Neighbhor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd NeighborName


10.0.0.57 4 64600 3496 124791 0 0 62736 00:15:12 6400 ARISTA01T1 10.0.0.59 4 64600 21996 54945 0 0 4834 00:00:51 6400 ARISTA02T1 10.0.0.61 4 64600 38340 67483 0 0 7261 00:01:02 6400 ARISTA03T1 10.0.0.63 4 64600 38587 49940 0 0 3868 00:00:16 6400 ARISTA04T1

stephengzh commented 3 years ago

More detailed, in file https://github.com/Azure/sonic-mgmt/blob/master/ansible/roles/test/files/ptftests/dhcp_relay_test.py#L539. Here has four verifications which match the dhcp_relay. I try to comment some verifications to check, something interesting. The verification part with mark "{failed}" will fail, the rest two verifications with mark "{pass}" will pass.

def runTest(self):
    self.client_send_discover(self.dest_mac_address, self.client_udp_src_port)

{failed} self.verify_relayed_discover() self.server_send_offer() {pass} self.verify_offer_received() self.client_send_request(self.dest_mac_address, self.client_udp_src_port) {failed} self.verify_relayed_request() self.server_send_ack() {pass} self.verify_ack_received()

shlomibitton commented 2 years ago

Hi @tahmed-dev,

I analyzed this issue and the root cause for it is a wrong route to the DHCP server from the uplink interfaces. When toggling the uplink interfaces as part of some test cases (test_dhcp_relay_after_link_flap or test_dhcp_relay_start_with_uplinks_down) the routes are removed and then re-created. The default route should be from the PortChannels or any L3 interfaces but sometimes the route to the DHCP servers is from the mgmt interface, even though it should change back to the PortChannels after they go up. I added a piece of code on the test to verify the routes to all servers after toggling the interfaces is from the PortChannels, if it is not, fail the test and print out the route in the kernel. Important to mention, the test is waiting 40 seconds after toggling the interface and the validation I added is sampling the routes for 120 seconds, only if after 120+40 seconds the route to the server is not by one of the PortChannels I failed the test. This is sometimes reproduced and sometimes it is not, let’s say probability of 1/20 runs.

The code I added:

ROUTES_RECOVER_TIMEOUT = 120

ROUTES_RECOVER_WAIT_INTERVAL = 10

def validate_dut_routes_recovered(duthost, dhcp_servers, uplink_interfaces):

    for dhcp_server in dhcp_servers:

        route_recover_attempt = 0

        while route_recover_attempt < ROUTES_RECOVER_TIMEOUT/ROUTES_RECOVER_WAIT_INTERVAL:

            result = duthost.shell("ip route get {}".format(dhcp_server))

            assert result['rc'] == 0, "Failed to get ip routes from DUT"

            if any(uplink_interface in str(result['stdout']) for uplink_interface in uplink_interfaces):

                break

            else:

                route_recover_attempt += 1

                assert route_recover_attempt < ROUTES_RECOVER_TIMEOUT/ROUTES_RECOVER_WAIT_INTERVAL,\

                       "Failed to recover route to DHCP server: {}".format(dhcp_server)

                time.sleep(ROUTES_RECOVER_WAIT_INTERVAL)

And the assertion I get: AssertionError: Failed to recover route to DHCP server: 192.0.0.4