Open congh-nvidia opened 2 months ago
@XuChen-MSFT Could you please review this issue? Thanks.
@vaibhavhd @ryanzhu706 Can you please help take a look?
My question is: should we just ignore the syslog errors when neighbor_advertiser fails, or should we fix the test or the neighbor_advertiser so that it always succeeds?
We should not ignore the errors from neighbor_advertiser that you posted in the description.
Please reach out to Xu to understand why errors were ignored.
Hi @XuChen-MSFT, could you please explain why the errors were ignored in your PR (https://github.com/sonic-net/sonic-mgmt/pull/7993)?
@XuChen-MSFT Do we see similar failure in our test? From what I saw, it looks like a testbed/network issue to me
@congh-nvidia, previously it failed occasionally, and we didn't find any impact after checking, so we considered it a flaky error and ignored it. There was no special reason for ignoring it.
Hi @XuChen-MSFT, from my understanding we should not ignore the neighbor_advertiser error, because the neighbor_advertiser is exactly what this test case wants to validate. This is the comment in the test:
# checking decap after vxlan set/unset is to make sure that deletion of vxlan
# tunnel and CPA ACLs won't negatively impact ipinip tunnel & decap mechanism
# Hence a new decap config is not applied to the device in this case. This is
# to avoid creating new tables and test ipinip decap with default loaded config
So it looks like we should remove the error ignores and investigate why the neighbor_advertiser sometimes fails. @vaibhavhd please correct me if I'm wrong.
@vaibhavhd Can you please comment?
Issue Description
In the decap test (decap/test_decap.py), when the parameter vxlan=set_unset, it uses /usr/local/bin/neighbor_advertiser to set the vxlan tunnel. But sometimes the neighbor_advertiser fails and throws syslog errors, which are captured by loganalyzer:
The test fails due to the loganalyzer errors. This is the line that calls the neighbor_advertiser: https://github.com/sonic-net/sonic-mgmt/blob/master/tests/decap/test_decap.py#L188 In PR (https://github.com/sonic-net/sonic-mgmt/pull/7993), "module_ignore_errors = True" was added to this line, meaning that the failure is sometimes expected and can be ignored. I don't quite understand this, because from PR (https://github.com/sonic-net/sonic-mgmt/pull/5834) it looks like the neighbor_advertiser needs to execute successfully for the test gap to be covered.
My question is: should we just ignore the syslog errors when neighbor_advertiser fails, or should we fix the test or the neighbor_advertiser so that it always succeeds?
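For the second option, one possible direction is to retry the call a bounded number of times and fail the test only if every attempt fails, instead of ignoring the result. The following is only a sketch under assumptions: `run_with_retries` and `run_cmd` are hypothetical names, and `run_cmd` is assumed to be a zero-argument callable that returns a dict with an "rc" key, which is not taken from the actual test code.

```python
import time


def run_with_retries(run_cmd, attempts=3, delay=1.0):
    """Call run_cmd() until it reports rc == 0 or the attempts run out.

    run_cmd is a hypothetical zero-argument callable that returns a dict
    containing an "rc" key (assumed shape, not the real test helper).
    """
    last = None
    for attempt in range(1, attempts + 1):
        last = run_cmd()
        if last.get("rc") == 0:
            return last
        # Wait before the next attempt, but not after the final one.
        if attempt < attempts:
            time.sleep(delay)
    raise RuntimeError(
        "command failed after {} attempts: {}".format(attempts, last)
    )
```

Note that a failed attempt would still write its syslog errors, so a retry alone may not be enough to keep loganalyzer from failing the test.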
Results you see
I did some further analysis of the neighbor_advertiser failure. First, the test starts a ferret daemon in the ptf, listening on port 448. Then the neighbor_advertiser tries to post an HTTP request to the ferret daemon and expects a good response. When the neighbor_advertiser fails, the TCP connection is not successfully established. This is the tcpdump on the dut; 10.210.25.49 is the dut switch and 10.215.30.182 is the ptf:
Here the connection is reset by the dut after the ACK from ferret daemon is received.
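To help separate a network-level problem from a failure inside the neighbor_advertiser itself, a plain TCP probe toward the ferret port could be run before the call. This is a minimal sketch: `tcp_port_reachable` is a hypothetical helper, and running it from the dut against the ptf address on port 448 is an assumption about how it would be used.

```python
import socket


def tcp_port_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) can be established."""
    try:
        # create_connection performs the full TCP handshake or raises OSError
        # (refused, reset, unreachable, or timed out).
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A call such as `tcp_port_reachable("10.215.30.182", 448)` would only show whether the handshake completes at that moment; it does not exercise the HTTP request itself.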
This is the tcpdump from a successful request:
Results you expected to see
Either the syslog errors should be ignored, or the neighbor_advertiser should always succeed.
Is it platform specific
generic
Relevant log output
No response
Output of
show version
Attach files (if any)
No response