thin-edge / thin-edge.io

The open edge framework for lightweight IoT devices
https://thin-edge.io
Apache License 2.0
221 stars 55 forks source link

flaky test: custom_sub_topics_tedge-mapper-az.robot #2799

Closed Bravo555 closed 7 months ago

Bravo555 commented 7 months ago

Describe the bug

Running the test suite on my machine always fails.

To Reproduce

  1. Run custom_sub_topics_tedge-mapper-az.robot suite directly.
  2. The second test (Publish measurements to unsubscribed topic) will fail.

Expected behavior The second test should succeed.

Additional context

I've found following differences between report from the CI, and tests on my machine:

CI:

image

my machine:

image

For some reason, mqtt-logger invocation on my machine is repeated about 16 times, with a delay of 2 second between each retry. Incidentally, I found a 2 second delay in paramiko, the SSH library used for connecting to the docker test device:

image

I wonder if for some reason I happen to be hitting this timeout.

Also found that if the keyword succeeds, then the command is executed only once. Looks like it's retried if the condition fails, but I can't find where the code for this logic is placed.

reubenmiller commented 7 months ago

Giving a bit of background here on the test framework...

The test frameworks (both the robotframework-c8y and robotframework-devicelibrary use "smart assertions", where an assertion is automatically retried x amount of times until a total duration (timeout) has been used. System tests are generally very flakey as they rely on interactions with real servers where the performance can vary (e.g. a device can be registered in the platform anywhere between 1s to 10s, depending on how loaded the server is...(not exact numbers but more to illustrate the point). So having the smart assertions helps cut down on this flakiness, to make the overall system tests more reliable.

You can think of the "smart assertions" as "eventually I expect this condition to be true" rather than "the condition should be true now!".

The retry mechanism which coordinates how often the failed assertion is retried is defined for the robotframework-devicelibrary, here: https://github.com/reubenmiller/device-test-core/blob/main/device_test_core/retry.py.

So back to the actual problem you are facing...

If you look at the test output, the mqtt-logger is still finding an MQTT message within in the search space even with an explicit 5 second delay between publishing the last measurement...so there are two explanations for this:

  1. tedge-mapper-c8y is taking more than 5 seconds to process MQTT message (very unlikely imo)
  2. The test assertion is looking too far back from the test border, and it is seeing a message published from the previous test case (as the default dateFrom value used by the Should Have MQTT Messages assertion uses the timestamp of the start of the test, rounded down to the nearest second, and since the message you are the assertion is detecting is 0.6 seconds paste the rounded down time, it would suggested that the rounded down time is the main issue.

You have a few options to fix this, but the core of the problem comes down to the fact that the tests cases are not isolated enough from each other (as the tests are using a suite setup instead of a test setup).

You could solve this by being more explicit with the dateFrom that is being used in the assertion:

Option 1: Use filter on the mqtt message assertion

Should Have MQTT Messages    az/messages/events/#   minimum=0    maximum=0    message_contains=temperature

Option 2: Be explicit with the time range used in the mqtt message assertion

Publish measurements to unsubscribed topic
    ${date_from}=    Get Unix Timestamp
    Execute Command    tedge mqtt pub te/device/main///m/ '{"temperature": 10}'
    Sleep    5s    reason=If a message is not published in 5s, it will never be published.
    Should Have MQTT Messages    az/messages/events/#   minimum=0    maximum=0    date_from=${date_from}
Bravo555 commented 7 months ago

The test frameworks (both the robotframework-c8y and robotframework-devicelibrary use "smart assertions", where an assertion is automatically retried x amount of times until a total duration (timeout) has been used.

Ah, that makes sense, the logic for the retry was not in Should Have MQTT Messages keyword itself, but used to configure how Robot Framework executes the keywords. Regardless, the rounding of the timestamp leading to inclusion of previous message is also something i suspected, and your examples show this will be an easy fix. Thanks!