☂ Fix flaky tests

ggwpez commented 2 years ago

Some tests just randomly fail in the CI.

Known bad with last confirmation date:

[ ] follow_report_multiple_pruned_block https://gitlab.parity.io/parity/mirrors/substrate/-/jobs/2540531
[ ] follow_forks_pruned_block https://gitlab.parity.io/parity/mirrors/substrate/-/jobs/2529507
[ ] returns_status_for_pruned_blocks https://gitlab.parity.io/parity/mirrors/substrate/-/jobs/2522151
[ ] can_sync_small_non_best_forks https://gitlab.parity.io/parity/mirrors/substrate/-/jobs/2635997
[ ] telemetry_works on commit 890451221db37176e13cb1a306246f02de80590a assertion failed: self.wait().unwrap().success() (see comment)
[x] beefy_reports_equivocations CI 2023-06-14 - fix https://github.com/paritytech/substrate/pull/14382
[ ] response_headers_invalid_call CI 2023-05-29
[x] ensure_parallel_execution: Expected duration 6697ms to be less than 6000ms (2023-06-13) - fix paritytech/polkadot#7390
[x] execute_queue_doesnt_stall_with_varying_executor_params: Expected duration 12912ms to be less than or equal to 12000ms (2023-06-13) - fix paritytech/polkadot#7390
[ ] sc-offchain api::http::tests::response_header_invalid_call CI log (2023-06-16)

Maybe flaky, maybe fixed 👻:

[ ] running_the_node_works_and_can_be_interrupted error-log.txt 8a9f48bcf0c9f92949082535d77c12166522bb2f
[ ] notifications_back_pressure https://github.com/paritytech/polkadot-sdk/issues/537

Fixed:

[x] temp_base_path_works https://github.com/paritytech/substrate/pull/13505 error-log.txt
[x] subscribe_and_unsubscribe_to_justifications
[x] syncs_header_only_forks https://github.com/paritytech/substrate/issues/12607
[x] babe authoring_blocks https://gitlab.parity.io/parity/mirrors/substrate/-/jobs/2101500 https://github.com/paritytech/substrate/pull/13199

niklasad1 commented 2 years ago

Ok, I see.

It might be quite tricky to find "free" ports to use for libp2p. A first step would be to ensure that the CLI tests assign unique ports for libp2p.
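One common way to get a port that is free at the time of asking is to bind to port 0 and let the OS pick one. The sketch below is only an illustration of that idea, not what the CLI tests currently do; `ReservedPort` is a hypothetical helper, and how the port is then handed to the node (for example through a command-line flag) is left open. There is still a small race between releasing the port and the node binding it, so this reduces collisions rather than ruling them out.

```rust
use std::io;
use std::net::{Ipv4Addr, TcpListener};

/// Hypothetical helper: keeps a listener open on an OS-assigned port so that
/// no other test can be handed the same port while the reservation is alive.
pub struct ReservedPort {
    listener: TcpListener,
    port: u16,
}

impl ReservedPort {
    /// Binding to port 0 asks the OS for any currently unused port.
    pub fn reserve() -> io::Result<Self> {
        let listener = TcpListener::bind((Ipv4Addr::LOCALHOST, 0))?;
        let port = listener.local_addr()?.port();
        Ok(Self { listener, port })
    }

    /// The port number the OS assigned.
    pub fn port(&self) -> u16 {
        self.port
    }

    /// Closes the listener and returns the port, to be called right before
    /// spawning the process that should bind it.
    pub fn release(self) -> u16 {
        drop(self.listener);
        self.port
    }
}
```

Reservations that are held at the same time always carry different ports, so a test that needs several nodes can reserve all ports first and release each one just before starting the corresponding node.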

ggwpez commented 1 year ago

The babe test authoring_blocks failed again.
tests::authoring_blocks' panicked at 'importing block failed: ClientImport("Slot number must increase: parent slot: 1669811014, this slot: 1669811014")

bkchr commented 1 year ago

telemetry_works seems to be another flaky test.

When running:

while cargo test --release -p node-cli --test telemetry; do true; done

The test will fail at some point. When adding some more "debugging" the following error is shown:

[test-utils/cli/src/lib.rs:236] self.wait().unwrap() = ExitStatus(
    unix_wait_status(
        139,
    ),
)

This indicates that the spawned Substrate process is dying because of a segmentation fault. I assume the underlying problem is not related to telemetry, as it happens while shutting down the node. (It may still be related to telemetry and only happen because the worker is doing something that it shouldn't be doing.)
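For reference, this is how the raw value 139 reads as a segfault. The sketch assumes the Linux encoding of wait statuses (low 7 bits hold the terminating signal, bit 0x80 is the core-dump flag, and the exit code sits in the next byte); `WaitOutcome` and `decode_wait_status` are illustrative names, not part of the test utilities.

```rust
/// Simplified view of a raw wait status, assuming the Linux encoding.
#[derive(Debug, PartialEq, Eq)]
pub enum WaitOutcome {
    /// The process called exit (or returned from main) with this code.
    Exited { code: i32 },
    /// The process was killed by a signal.
    Signaled { signal: i32, core_dumped: bool },
    /// Stopped/continued statuses, which are not decoded in this sketch.
    Other,
}

pub fn decode_wait_status(raw: i32) -> WaitOutcome {
    let term_signal = raw & 0x7f;
    if term_signal == 0 {
        // Normal exit: the exit code is stored in bits 8..16.
        WaitOutcome::Exited { code: (raw >> 8) & 0xff }
    } else if term_signal != 0x7f {
        // Killed by a signal: bit 0x80 tells whether a core dump was written.
        WaitOutcome::Signaled {
            signal: term_signal,
            core_dumped: (raw & 0x80) != 0,
        }
    } else {
        WaitOutcome::Other
    }
}
```

With this decoding, 139 = 0x80 + 11, so the process was terminated by signal 11 (SIGSEGV on Linux) and the core-dump flag was set.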

ggwpez commented 1 year ago

> telemetry_works seems to be another flaky test.

Confirmed and added. Should we comment it out until fixed?

bkchr commented 1 year ago

> Confirmed and added. Should we comment it out until fixed?

I haven't seen it in CI so far, only locally on my machine. It also didn't seem to be reproducible in debug, so maybe it isn't a problem on slower machines or whatever. I would like to keep it there until we have seen reports of it failing in CI.

ggwpez commented 1 year ago

Ping: two more tests added: beefy_reports_equivocations and response_headers_invalid_call.

ggwpez commented 1 year ago

Not sure who to ping for ensure_parallel_execution and execute_queue_doesnt_stall_with_varying_executor_params.
Git shows that @mrcnski and @s0me0ne-unkn0wn have worked with the code, maybe one of you?

s0me0ne-unkn0wn commented 1 year ago

@ggwpez both tests are driven by calculated timeouts, which are flaky by nature; we just didn't expect CI runners to have such a significant divergence in performance :/

I'll look into it.

ggwpez commented 1 year ago

In this case it is not about CI runners; it failed on Gav's PC.
In general, I don't know if we can assert timeouts without running on fixed hardware. Maybe just remove those checks? Or only run that last timing check when a CI env variable is present.

s0me0ne-unkn0wn commented 1 year ago

We could get rid of them, of course, but we would still want to check somehow that queues are behaving as expected, that is, that they are running jobs in parallel, not sequentially, and that they can kill workers and spawn new ones depending on conditions, and that's hardly achievable if not relying on timeouts. To only run them in CI sounds like a legit idea. Maybe limiting them to the testnet profile is enough?

bkchr commented 1 year ago

> Maybe limiting them to the testnet profile is enough?

We don't compile tests with this profile. We could add some special env variable. However, while looking at the test, could we not just spawn both invocations and check if the test process has started two child processes (the workers)?
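To illustrate the general idea of checking "both are running at once" instead of measuring total duration, here is a rough sketch using threads as a stand-in for the worker processes. It is not the actual queue test: `Rendezvous`, `run_parallel` and `run_sequential` are hypothetical names. Each job reports its arrival and waits until all expected jobs have arrived; the timeout only bounds the failure path, so the passing case does not depend on how fast the machine is.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// Meeting point that jobs call into while they are running.
pub struct Rendezvous {
    arrived: Mutex<usize>,
    all_here: Condvar,
    expected: usize,
}

impl Rendezvous {
    pub fn new(expected: usize) -> Arc<Self> {
        Arc::new(Self { arrived: Mutex::new(0), all_here: Condvar::new(), expected })
    }

    /// Called from inside a job. Returns true if every expected job arrived
    /// before this job gave up waiting. A job that times out returns false,
    /// so the overall check is the AND over all jobs.
    pub fn arrive(&self, patience: Duration) -> bool {
        let mut arrived = self.arrived.lock().unwrap();
        *arrived += 1;
        self.all_here.notify_all();
        let (arrived, _timed_out) = self
            .all_here
            .wait_timeout_while(arrived, patience, |n| *n < self.expected)
            .unwrap();
        *arrived >= self.expected
    }
}

/// Stand-in for an executor that runs jobs concurrently.
pub fn run_parallel(jobs: usize, patience: Duration) -> bool {
    let rv = Rendezvous::new(jobs);
    let handles: Vec<_> = (0..jobs)
        .map(|_| {
            let rv = Arc::clone(&rv);
            thread::spawn(move || rv.arrive(patience))
        })
        .collect();
    // Join every thread before evaluating, so none is left running.
    let results: Vec<bool> = handles.into_iter().map(|h| h.join().unwrap()).collect();
    results.into_iter().all(|ok| ok)
}

/// Stand-in for a (broken) executor that runs jobs one after another.
pub fn run_sequential(jobs: usize, patience: Duration) -> bool {
    let rv = Rendezvous::new(jobs);
    let results: Vec<bool> = (0..jobs).map(|_| rv.arrive(patience)).collect();
    results.into_iter().all(|ok| ok)
}
```

With a generous patience value the parallel variant returns true as soon as both jobs are in flight, while the sequential variant returns false because the first job never sees the second one arrive. Whether something equivalent can be wired into the real workers is an open question.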

s0me0ne-unkn0wn commented 1 year ago

> We don't compile tests with this profile

I believe we do (at least for Polkadot): https://github.com/paritytech/polkadot/blob/master/scripts/ci/gitlab/pipeline/test.yml#L44

ggwpez commented 1 year ago

Even that does not guarantee that we are actually in CI. Only an env var would, and that should be easy to implement.
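A minimal sketch of what the env var gating could look like. The variable name `CI` is an assumption (many CI systems set it, but the project may prefer its own), and `timing_checks_enabled` / `assert_within` are hypothetical helpers, not existing code in the tests.

```rust
use std::env;
use std::ffi::{OsStr, OsString};
use std::time::Duration;

/// Reads the gating variable. The name `CI` is an assumption.
pub fn ci_var() -> Option<OsString> {
    env::var_os("CI")
}

/// Timing checks run only when the variable is present and non-empty.
pub fn timing_checks_enabled(ci_var: Option<&OsStr>) -> bool {
    matches!(ci_var, Some(value) if !value.is_empty())
}

/// Asserts the duration bound only when timing checks are enabled.
/// Returns true if the check ran and false if it was skipped.
pub fn assert_within(elapsed: Duration, limit: Duration, ci_var: Option<&OsStr>) -> bool {
    if !timing_checks_enabled(ci_var) {
        return false;
    }
    assert!(
        elapsed <= limit,
        "Expected duration {}ms to be less than or equal to {}ms",
        elapsed.as_millis(),
        limit.as_millis()
    );
    true
}
```

In a test this would be called as `assert_within(start.elapsed(), limit, ci_var().as_deref())`, so that on a developer machine the functional part of the test still runs and only the final timing assertion is skipped.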

paritytech / polkadot-sdk

☂ Fix flaky tests #48