paritytech / polkadot-sdk

The Parity Polkadot Blockchain SDK
https://polkadot.network/
1.77k stars 633 forks source link

☂ Fix flaky tests #48

Open ggwpez opened 2 years ago

ggwpez commented 2 years ago

Some tests just randomly fail in the CI.

Known bad with last confirmation date:

Maybe flaky, maybe fixed :ghost::

Fixed:

niklasad1 commented 2 years ago

Ok, I see.

This might be quite tricky find "free" ports to use for libp2p. A first step would be to ensure that the CLI tests assigns unique ports for libp2p.

ggwpez commented 1 year ago

The babe test authoring_blocks failed again.
tests::authoring_blocks' panicked at 'importing block failed: ClientImport("Slot number must increase: parent slot: 1669811014, this slot: 1669811014")

bkchr commented 1 year ago

telemetry_works seems to be another flaky test.

When running:

while cargo test --release -p node-cli --test telemetry; do true; done

The test will fail at some point. When adding some more "debugging" the following error is shown:

[test-utils/cli/src/lib.rs:236] self.wait().unwrap() = ExitStatus(
    unix_wait_status(
        139,
    ),
)

This indicates at the spawned Substrate process is dying because of some segmentation fault. I assume the underlying problem is not related to telemetry, as it happens on shutting down the node. (Maybe still related to telemetry and only happens because the worker is doing something that it shouldn't be doing)

ggwpez commented 1 year ago

telemetry_works seems to be another flaky test.

Confirmed and added. Should we comment it until fixed?

bkchr commented 1 year ago

Confirmed and added. Should we comment it until fixed?

I don't have seen it in CI so far, only locally on my machine. Also in debug it didn't seemed to be reproducible, so maybe on slower machines or whatever it isn't a problem. I would like to keep it there until we have seen reports of it failing in CI.

ggwpez commented 1 year ago

Ping: Two more tests added; beefy_reports_equivocations and response_headers_invalid_call.

ggwpez commented 1 year ago

Not sure who to ping for ensure_parallel_execution and execute_queue_doesnt_stall_with_varying_executor_params.
Git shows that @mrcnski and @s0me0ne-unkn0wn have worked with the code, maybe one of you?

s0me0ne-unkn0wn commented 1 year ago

@ggwpez both tests are driven by calculated timeouts which is flaky by nature, we just didn't expect CI runners to have such a significant divergence in performance :/

I'll look into it.

ggwpez commented 1 year ago

This case it is not about CI runners, it failed on Gav's PC.
In general I dont know if we can assert timeouts without running it on fixed hardware. Maybe just remove those checks? Or only run that last timing check when a CI Env variable is present.

s0me0ne-unkn0wn commented 1 year ago

We could get rid of them, of course, but we would still want to check somehow that queues are behaving as expected, that is, that they are running jobs in parallel, not sequentially, and that they can kill workers and spawn new ones depending on conditions, and that's hardly achievable if not relying on timeouts. To only run them in CI sounds like a legit idea. Maybe limiting them to the testnet profile is enough?

bkchr commented 1 year ago

Maybe limiting them to the testnet profile is enough?

We don't compile test with this profile. We could add some special env variable. However, while looking at the test, could we not just spawn both invocations and check if the test process has started two child processes (the workers)?

s0me0ne-unkn0wn commented 1 year ago

We don't compile test with this profile

I believe we do (at least for Polkadot): https://github.com/paritytech/polkadot/blob/master/scripts/ci/gitlab/pipeline/test.yml#L44

ggwpez commented 1 year ago

Even that does not guarantee that we are actually in CI. Only an env var would, and should be easy to implement.