sonic-net / sonic-mgmt


[DNX][202205] testQosSaiLossyQueue: Sees RX_DRP increment when filling the VOQ #11682

Open arista-nwolfe opened 9 months ago

arista-nwolfe commented 9 months ago

testQosSaiLossyQueue fails with the following exception:

FAIL: sai_qos_tests.LossyQueueTest
----------------------------------------------------------------------
Traceback (most recent call last):
  File "saitests/py3/sai_qos_tests.py", line 3430, in runTest
    assert(recv_counters[cntr] <= recv_counters_base[cntr] + COUNTER_MARGIN)
AssertionError

This indicates that RX_DRP incremented while we were filling up the VOQ: recv_counters_base: 321813, recv_counters: 533016
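To make the size of the jump concrete, here is a minimal sketch of the same check the assertion performs, using the two counter values from this run. The `COUNTER_MARGIN` value below is a placeholder assumption for illustration; the real value is whatever `sai_qos_tests.py` defines.

```python
# Counter values copied from the failing run reported above.
recv_counters_base = 321813
recv_counters = 533016

# Assumption: placeholder only; the real COUNTER_MARGIN lives in sai_qos_tests.py.
COUNTER_MARGIN = 2

# Number of RX_DRP increments observed while the VOQ was being filled.
rx_drp_delta = recv_counters - recv_counters_base
print("RX_DRP delta:", rx_drp_delta)

# Same comparison as the failing assert in runTest.
within_margin = recv_counters <= recv_counters_base + COUNTER_MARGIN
print("within margin:", within_margin)
```

The delta is 211203 packets, so the failure is a sustained drop rather than a few stray packets, and no small margin would absorb it.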

We see RX_DRPs because the port-channel goes down while we're sending the packets. The packets then have no destination and are therefore dropped.

The port-channel goes down because this test requires disabling TX on the egress port (a member of a port-channel):

self.sai_thrift_port_tx_disable(self.dst_client, asic_type, [dst_port_id])
https://github.com/sonic-net/sonic-mgmt/blob/202205/tests/saitests/py3/sai_qos_tests.py#L3386

With TX disabled, LACP packets stop egressing, so after the server side misses 3 LACP packets (60-90s) the LAG is torn down.

I timed how long it takes the test to send all its packets (2,396,544) to fill up the VOQ:

Sending Packets 2024-02-09 22:49:53.234339
Packets Finished 2024-02-09 22:55:25.925242

It takes over 5 minutes to send these packets, so the LAG has plenty of time to hit the LACP timeout.
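As a rough check of that timing, a minimal sketch that computes the send duration from the two timestamps and compares it against the upper end of the 60-90s LACP window mentioned above. The variable names are illustrative and not taken from the test code.

```python
from datetime import datetime

# Timestamps copied from the run above.
FMT = "%Y-%m-%d %H:%M:%S.%f"
start = datetime.strptime("2024-02-09 22:49:53.234339", FMT)
end = datetime.strptime("2024-02-09 22:55:25.925242", FMT)

packets_sent = 2_396_544   # packet count reported above
LACP_TIMEOUT_S = 90        # upper end of the 60-90s window quoted above

# Total time spent sending, and the resulting average send rate.
elapsed_s = (end - start).total_seconds()
rate_pps = packets_sent / elapsed_s

print(f"elapsed: {elapsed_s:.1f} s")
print(f"average rate: {rate_pps:.0f} packets/s")
print("send outlasts LACP timeout:", elapsed_s > LACP_TIMEOUT_S)
```

The send takes about 332.7 s, several times longer than the LACP timeout, so the LAG goes down well before the last packet is sent.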

I'm able to see this issue just by disabling TX and waiting:

(Pdb) self.sai_thrift_port_tx_disable(self.dst_client, asic_type, [dst_port_id])
...
Feb  9 22:35:22.592837 cmp314-3 NOTICE swss0#orchagent: :- updatePortOperStatus: Port PortChannel102 oper state set from up to down

kenneth-arista commented 9 months ago

@arlakshm