thin-edge / thin-edge.io

The open edge framework for lightweight IoT devices
https://thin-edge.io
Apache License 2.0
218 stars 54 forks source link

flaky unittest: bridge_reconnects_successfully_after_local_connection_interrupted #3021

Open reubenmiller opened 1 month ago

reubenmiller commented 1 month ago

Describe the bug

Flaky unit tests:

Though a workaround has been added in a PR, which adds a sleep before interrupting the connection.

Fail

Below shows an example when the test fails.

    Finished `test` profile [unoptimized + debuginfo] target(s) in 0.50s
    Starting 1 test across 2 binaries (52 skipped; run ID: 22061f7c-17a5-42de-a646-e3e1ac124df8, nextest profile: default)
       START             tedge_mqtt_bridge::bridge bridge_reconnects_successfully_after_local_connection_interrupted

running 1 test
TEST: local_broker_port: 59103
TEST: cloud_broker_port: 59104
[2024-07-29T11:17:56Z INFO  tedge_mqtt_bridge::health] MQTT bridge connected to cloud broker
[2024-07-29T11:17:56Z INFO  tedge_mqtt_bridge] Bridge cloud connection "cloud" subscribing to [Filter = s/ds, Qos = AtLeastOnce]
[2024-07-29T11:17:56Z INFO  tedge_mqtt_bridge::health] MQTT bridge connected to local broker
[2024-07-29T11:17:56Z INFO  tedge_mqtt_bridge] Bridge cloud connection "local" subscribing to [Filter = c8y/s/us, Qos = AtLeastOnce]
[2024-07-29T11:17:56Z ERROR tedge_mqtt_bridge::health] MQTT bridge failed to connect to local broker: Mqtt state: Io error: Custom { kind: ConnectionAborted, error: "connection closed by peer" }
[2024-07-29T11:17:56Z ERROR tedge_mqtt_bridge::health] MQTT bridge failed to connect to local broker: I/O: connection closed by peer
[2024-07-29T11:17:56Z INFO  tedge_mqtt_bridge::health] MQTT bridge connected to local broker
[2024-07-29T11:17:56Z INFO  tedge_mqtt_bridge] Bridge cloud connection "local" subscribing to [Filter = c8y/s/us, Qos = AtLeastOnce]
thread 'bridge_reconnects_successfully_after_local_connection_interrupted' panicked at crates/extensions/tedge_mqtt_bridge/tests/bridge.rs:241:58:
called `Result::unwrap()` on an `Err` value: timed-out waiting for received message

Caused by:
    deadline has elapsed
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
test bridge_reconnects_successfully_after_local_connection_interrupted ... FAILED

failures:

failures:
    bridge_reconnects_successfully_after_local_connection_interrupted

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 3 filtered out; finished in 5.06s

        FAIL [   5.064s] tedge_mqtt_bridge::bridge bridge_reconnects_successfully_after_local_connection_interrupted
   Canceling due to test failure
------------
     Summary [   5.065s] 1 test run: 0 passed, 1 failed, 52 skipped
        FAIL [   5.064s] tedge_mqtt_bridge::bridge bridge_reconnects_successfully_after_local_connection_interrupted
error: test run failed

Pass

Below shows an example when the test passes.

    Finished `test` profile [unoptimized + debuginfo] target(s) in 0.47s
    Starting 1 test across 2 binaries (52 skipped; run ID: 8c895c93-30fb-4ea6-98c1-2dc2cd7d9299, nextest profile: default)
       START             tedge_mqtt_bridge::bridge bridge_reconnects_successfully_after_local_connection_interrupted

running 1 test
TEST: local_broker_port: 59091
TEST: cloud_broker_port: 59092
[2024-07-29T11:17:55Z INFO  tedge_mqtt_bridge::health] MQTT bridge connected to cloud broker
[2024-07-29T11:17:55Z INFO  tedge_mqtt_bridge] Bridge cloud connection "cloud" subscribing to [Filter = s/ds, Qos = AtLeastOnce]
[2024-07-29T11:17:55Z INFO  tedge_mqtt_bridge::health] MQTT bridge connected to local broker
[2024-07-29T11:17:55Z INFO  tedge_mqtt_bridge] Bridge cloud connection "local" subscribing to [Filter = c8y/s/us, Qos = AtLeastOnce]
[2024-07-29T11:17:55Z ERROR tedge_mqtt_bridge::health] MQTT bridge failed to connect to local broker: Mqtt state: Io error: Custom { kind: ConnectionAborted, error: "connection closed by peer" }
[2024-07-29T11:17:55Z ERROR tedge_mqtt_bridge::health] MQTT bridge failed to connect to local broker: I/O: Connection reset by peer (os error 54)
[2024-07-29T11:17:55Z INFO  tedge_mqtt_bridge::health] MQTT bridge connected to local broker
[2024-07-29T11:17:55Z INFO  tedge_mqtt_bridge] Bridge cloud connection "local" subscribing to [Filter = c8y/s/us, Qos = AtLeastOnce]
test bridge_reconnects_successfully_after_local_connection_interrupted ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 3 filtered out; finished in 0.05s

        PASS [   0.056s] tedge_mqtt_bridge::bridge bridge_reconnects_successfully_after_local_connection_interrupted
------------
     Summary [   0.056s] 1 test run: 1 passed, 52 skipped

Notes

To Reproduce

A flaky test can be reproduced by re-running the tests until it fails. Sometimes it takes more than ~30 attempts

RUNS=0; while cargo nextest run -p tedge_mqtt_bridge "bridge_reconnects_successfully_after_local_connection_interrupted" --nocapture; do RUNS=$((RUNS+1)) ; echo repeating test run $RUNS; done; echo "Failed after $RUNS runs"

Expected behavior

The test should not fail sporadically.

Screenshots

Environment (please complete the following information):

Additional context

reubenmiller commented 1 month ago

https://github.com/thin-edge/thin-edge.io/pull/3027 introduced running the Rust tests using cargo-nextest which allows marking specific tests as flaky so they are automatically retried on failure, as this will hopefully be a reliable workaround (as the fixed sleep workaround also sporadically failed).

Below is the new console output showing the flakiness of the bridge_reconnects_successfully_after_local_connection_interrupted test:

Caused by:
    0: timed-out waiting for received message
    1: deadline has elapsed

   RETRY 2/5 [         ] tedge_mqtt_bridge::bridge bridge_reconnects_successfully_after_local_connection_interrupted
  TRY 2 PASS [   0.264s] tedge_mqtt_bridge::bridge bridge_reconnects_successfully_after_local_connection_interrupted
------------
     Summary [  31.118s] 1294 tests run: 1294 passed (1 flaky, 1 leaky), 3 skipped
   FLAKY 2/5 [   0.264s] tedge_mqtt_bridge::bridge bridge_reconnects_successfully_after_local_connection_interrupted
    Finished report saved to codecov.json

Reference CI Run: