sonic-net / sonic-mgmt

Configuration management examples for SONiC
Other
195 stars 715 forks source link

The platform API watchdog tests occasionally fail due to some delay when remotely calling the APIs #9237

Open congh-nvidia opened 1 year ago

congh-nvidia commented 1 year ago

Description

The test cases in tests/platform_tests/api/test_watchdog.py occasionally fail in our regression, and after our debugging, we believe it is a test issue. In the platform api tests, there is a http server running on the dut. And the test running in sonic-mgmt docker sends POST to the server to invoke the platform api. This mechanism normally works with the platform api tests. But for the watchdog tests, timing is a critical factor, and the remote call mechanism cannot guarantee when the dut receives the POST and when the platform api is actually executed on the dut. And this may lead to some test failures. To debug the issue, I have added some debug info in the function do_platform_api of platform_api_server.py:

    def do_platform_api(self):
        ...
        ...
        try:
            res = getattr(obj, api)(*args)
            # Print some info in syslog for debugging
            syslog.syslog(syslog.LOG_INFO, str(datetime.datetime.now()))  # print current time
            syslog.syslog(syslog.LOG_INFO, str(getattr(obj, api)))  # print which api is called
            syslog.syslog(syslog.LOG_INFO, str(res))  # print the return value
        except NotImplementedError:
            syslog.syslog(syslog.LOG_WARNING, "API '{}' not implemented".format(api))

        self.send_response(200)
        self.send_header('Content-Type', 'application/json')
        self.end_headers()
        self.wfile.write(json.dumps({'res': res}, default=obj_serialize).encode('utf-8'))

And when the test case test_periodic_arm failed, we can check the syslog on dut:

Aug 1 15:06:06.934116 r-liger-02 INFO pmon#platform_api_server.py: 2023-08-01 12:06:06.933825 Aug 1 15:06:06.934116 r-liger-02 INFO pmon#platform_api_server.py: <bound method WatchdogImplBase.arm of <sonic_platform.watchdog.WatchdogType2 object at 0x7f90f5e3f8b0>> Aug 1 15:06:06.934116 r-liger-02 INFO pmon#platform_api_server.py: 10 Aug 1 15:06:06.934515 r-liger-02 INFO pmon#supervisord: platform_api_server 10.215.13.7 - - [01/Aug/2023 12:06:06] "POST /platform/chassis/watchdog/arm HTTP/1.1" 200 - Aug 1 15:06:08.940676 r-liger-02 INFO pmon#platform_api_server.py: 2023-08-01 12:06:08.940535 Aug 1 15:06:08.940676 r-liger-02 INFO pmon#platform_api_server.py: <bound method WatchdogImplBase.get_remaining_time of <sonic_platform.watchdog.WatchdogType2 object at 0x7f90f5e3f8b0>> Aug 1 15:06:08.940676 r-liger-02 INFO pmon#platform_api_server.py: 8 Aug 1 15:06:08.940772 r-liger-02 INFO pmon#supervisord: platform_api_server 10.215.13.7 - - [01/Aug/2023 12:06:08] "POST /platform/chassis/watchdog/get_remaining_time HTTP/1.1" 200 - Aug 1 15:06:10.005622 r-liger-02 INFO pmon#platform_api_server.py: 2023-08-01 12:06:10.005395 # the time WatchdogImplBase.arm is returned Aug 1 15:06:10.005622 r-liger-02 INFO pmon#platform_api_server.py: <bound method WatchdogImplBase.arm of <sonic_platform.watchdog.WatchdogType2 object at 0x7f90f5e3f8b0>> Aug 1 15:06:10.005622 r-liger-02 INFO pmon#platform_api_server.py: 10 Aug 1 15:06:10.005894 r-liger-02 INFO pmon#supervisord: platform_api_server 10.215.13.7 - - [01/Aug/2023 12:06:10] "POST /platform/chassis/watchdog/arm HTTP/1.1" 200 - Aug 1 15:06:13.078013 r-liger-02 INFO pmon#platform_api_server.py: 2023-08-01 12:06:13.077748 # the time WatchdogImplBase.get_remaining_time is returned Aug 1 15:06:13.078013 r-liger-02 INFO pmon#platform_api_server.py: <bound method WatchdogImplBase.get_remaining_time of <sonic_platform.watchdog.WatchdogType2 object at 0x7f90f5e3f8b0>> Aug 1 15:06:13.078013 r-liger-02 INFO pmon#platform_api_server.py: 7 Aug 1 15:06:13.078290 r-liger-02 INFO pmon#supervisord: platform_api_server 10.215.13.7 - - [01/Aug/2023 12:06:13] "POST /platform/chassis/watchdog/get_remaining_time HTTP/1.1" 200 -

We can see the WatchdogImplBase.get_remaining_time was called right after the WatchdogImplBase.arm in the test case(https://github.com/sonic-net/sonic-mgmt/blob/master/tests/platform_tests/api/test_watchdog.py#L169-L170) but the time of the two APIs got returned were 12:06:10.005395 and 12:06:13.077748, and there was an 3 seconds unexpected delay which caused the test to fail:

Failed: test_periodic_arm: new remaining timeout 7 seconds should be greater than the previous remaining timeout 8 seconds by 2 seconds

This delay could be caused by the http server or the network, it's difficult to debug and control it. And there may be longer delays which could even cause some unexpected reboot(not disarm the watchdog in time, keep seeing this in our regression). And also, we found that there is no specific test for the watchdog cli command(watchdogutil).

We are planning to update the watchdog test by:

  1. Implement new test cases by using the watchdogutil cli, which will also cover the api functions for they are invoked in the cli.
  2. In the new cases, run a shell script on the dut for the test body to avoid the network delay, and then the http server is not needed.
  3. Simplify the watchdog api cases to only test the basic input/output of the api, move all the functional flows to the new cases.

If there are any concerns, please share your comments, thanks.

Steps to reproduce the issue:

  1. Run the test tests/platform_tests/api/test_watchdog.py in a loop, the reproduction change is about 1/15 on our platforms.

Additional information you deem important:

**Output of `show version`:**

```
(paste your output here)
```

**Attach debug file `sudo generate_dump`:**

```
(paste your output here)
```
yxieca commented 1 year ago

From the log posted, it appears that the delay was on the DUT side (API server). @congh-nvidia can you elaborate more on the reason you suspect it could also be a network delay?

@prgeor should we allow the time allowance to be increased from 2 seconds to 3 seconds?

congh-nvidia commented 1 year ago

Hi @yxieca, from the log we can see there was an unexpected delay between WatchdogImplBase.arm and WatchdogImplBase.get_remaining_time, but we can't tell where the time was spent - on the dut or on the network. We need to capture the http packets to see when it arrives at the dut. But since we can't control the http server response time and the network delay, it better to devide the api test and the watchdog functional test. And in the functional test, we can directly call the api on the dut to avoid such noise.