Open congh-nvidia opened 1 year ago
From the log posted, it appears that the delay was on the DUT side (API server). @congh-nvidia can you elaborate more on the reason you suspect it could also be a network delay?
@prgeor should we allow the time allowance to be increased from 2 seconds to 3 seconds?
Hi @yxieca, from the log we can see there was an unexpected delay between WatchdogImplBase.arm and WatchdogImplBase.get_remaining_time, but we can't tell where the time was spent: on the DUT or on the network. We would need to capture the HTTP packets to see when each request arrives at the DUT. But since we can't control the HTTP server response time or the network delay, it is better to divide the API test and the watchdog functional test. In the functional test, we can call the API directly on the DUT to avoid such noise.
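To illustrate the idea of calling the API locally, here is a minimal sketch of a timing check that measures its own elapsed time with a monotonic clock instead of relying on a fixed allowance alone. `FakeWatchdog` and `check_remaining_time` are hypothetical names made up for this example; the fake only borrows the `arm` / `get_remaining_time` method names seen in the log and is not the sonic_platform implementation.

```python
import time


class FakeWatchdog:
    """Hypothetical stand-in for a platform watchdog (illustration only)."""

    def __init__(self):
        self._deadline = None

    def arm(self, seconds):
        # Record the expiry time and report the timeout that was applied.
        self._deadline = time.monotonic() + seconds
        return seconds

    def get_remaining_time(self):
        # -1 when not armed is an assumption of this sketch.
        if self._deadline is None:
            return -1
        return max(0, int(self._deadline - time.monotonic()))


def check_remaining_time(watchdog, timeout, allowance):
    """Arm the watchdog, read the remaining time, and compare it against
    the time that actually elapsed locally between the two calls."""
    start = time.monotonic()
    armed = watchdog.arm(timeout)
    remaining = watchdog.get_remaining_time()
    elapsed = time.monotonic() - start
    lower_bound = timeout - elapsed - allowance
    return armed == timeout and lower_bound <= remaining <= timeout


if __name__ == "__main__":
    print(check_remaining_time(FakeWatchdog(), timeout=10, allowance=2))
```

Because both calls and the clock reads happen in the same process, the HTTP round trip no longer contributes to the measured interval.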
Description
The test cases in tests/platform_tests/api/test_watchdog.py occasionally fail in our regression, and after debugging, we believe it is a test issue. In the platform API tests, there is an HTTP server running on the DUT, and the test running in the sonic-mgmt docker sends a POST to the server to invoke the platform API. This mechanism normally works for the platform API tests. But for the watchdog tests, timing is a critical factor, and the remote call mechanism cannot guarantee when the DUT receives the POST or when the platform API is actually executed on the DUT. This may lead to some test failures. To debug the issue, I added some debug output to the function do_platform_api of platform_api_server.py.
When the test case test_periodic_arm failed, we checked the syslog on the DUT:
Aug 1 15:06:06.934116 r-liger-02 INFO pmon#platform_api_server.py: 2023-08-01 12:06:06.933825
Aug 1 15:06:06.934116 r-liger-02 INFO pmon#platform_api_server.py: <bound method WatchdogImplBase.arm of <sonic_platform.watchdog.WatchdogType2 object at 0x7f90f5e3f8b0>>
Aug 1 15:06:06.934116 r-liger-02 INFO pmon#platform_api_server.py: 10
Aug 1 15:06:06.934515 r-liger-02 INFO pmon#supervisord: platform_api_server 10.215.13.7 - - [01/Aug/2023 12:06:06] "POST /platform/chassis/watchdog/arm HTTP/1.1" 200 -
Aug 1 15:06:08.940676 r-liger-02 INFO pmon#platform_api_server.py: 2023-08-01 12:06:08.940535
Aug 1 15:06:08.940676 r-liger-02 INFO pmon#platform_api_server.py: <bound method WatchdogImplBase.get_remaining_time of <sonic_platform.watchdog.WatchdogType2 object at 0x7f90f5e3f8b0>>
Aug 1 15:06:08.940676 r-liger-02 INFO pmon#platform_api_server.py: 8
Aug 1 15:06:08.940772 r-liger-02 INFO pmon#supervisord: platform_api_server 10.215.13.7 - - [01/Aug/2023 12:06:08] "POST /platform/chassis/watchdog/get_remaining_time HTTP/1.1" 200 -
Aug 1 15:06:10.005622 r-liger-02 INFO pmon#platform_api_server.py: 2023-08-01 12:06:10.005395 # the time WatchdogImplBase.arm is returned
Aug 1 15:06:10.005622 r-liger-02 INFO pmon#platform_api_server.py: <bound method WatchdogImplBase.arm of <sonic_platform.watchdog.WatchdogType2 object at 0x7f90f5e3f8b0>>
Aug 1 15:06:10.005622 r-liger-02 INFO pmon#platform_api_server.py: 10
Aug 1 15:06:10.005894 r-liger-02 INFO pmon#supervisord: platform_api_server 10.215.13.7 - - [01/Aug/2023 12:06:10] "POST /platform/chassis/watchdog/arm HTTP/1.1" 200 -
Aug 1 15:06:13.078013 r-liger-02 INFO pmon#platform_api_server.py: 2023-08-01 12:06:13.077748 # the time WatchdogImplBase.get_remaining_time is returned
Aug 1 15:06:13.078013 r-liger-02 INFO pmon#platform_api_server.py: <bound method WatchdogImplBase.get_remaining_time of <sonic_platform.watchdog.WatchdogType2 object at 0x7f90f5e3f8b0>>
Aug 1 15:06:13.078013 r-liger-02 INFO pmon#platform_api_server.py: 7
Aug 1 15:06:13.078290 r-liger-02 INFO pmon#supervisord: platform_api_server 10.215.13.7 - - [01/Aug/2023 12:06:13] "POST /platform/chassis/watchdog/get_remaining_time HTTP/1.1" 200 -
We can see that WatchdogImplBase.get_remaining_time is called right after WatchdogImplBase.arm in the test case (https://github.com/sonic-net/sonic-mgmt/blob/master/tests/platform_tests/api/test_watchdog.py#L169-L170), but the two API calls returned at 12:06:10.005395 and 12:06:13.077748. This unexpected delay of about 3 seconds caused the test to fail.
This delay could be caused by the HTTP server or the network, and it is difficult to debug or control. Longer delays could even cause an unexpected reboot because the watchdog is not disarmed in time (we keep seeing this in our regression). We also found that there is no specific test for the watchdog CLI command (watchdogutil).
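For reference, the size of the gap can be checked directly from the two timestamps in the syslog excerpt. The snippet below is only a small arithmetic sketch: the timestamps are copied from the log, the 2-second value is the allowance mentioned in this thread, and the variable names are made up for illustration. How the test itself applies the allowance is not reproduced here.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

# Timestamps printed by the debug output for the failing arm / get_remaining_time pair.
arm_returned = datetime.strptime("2023-08-01 12:06:10.005395", FMT)
remaining_returned = datetime.strptime("2023-08-01 12:06:13.077748", FMT)

# Allowance discussed in this thread (assumed to be in seconds).
TIME_ALLOWANCE = 2

delay = (remaining_returned - arm_returned).total_seconds()
print(f"delay between the two calls: {delay:.6f} s")  # 3.072353 s
print(f"exceeds allowance: {delay > TIME_ALLOWANCE}")  # True
```

The result also matches the values in the log: the watchdog was armed with 10 seconds and reported 7 seconds remaining.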
We are planning to update the watchdog test by:
If there are any concerns, please share your comments, thanks.
Steps to reproduce the issue:
Additional information you deem important: