valkey-io / valkey-glide

An open source Valkey client library that supports Valkey and Redis open source 6.2, 7.0 and 7.2. Valkey GLIDE is designed for reliability, optimized performance, and high-availability, for Valkey and Redis OSS based applications. GLIDE is a multi language client library, written in Rust with programming language bindings, such as Java and Python
Apache License 2.0

CI Testing: Various flakiness during the CI tests #2365

Open ikolomi opened 1 month ago

ikolomi commented 1 month ago

Describe the bug

More often than not, CI tests need to be rerun in order to pass. In addition, warnings appear during the tests. This situation is unhealthy and creates uncertainty about the code quality. We need an owner for CI testing who can maintain this system and make it flourish.

Expected Behavior

All tests pass in the CI, with no unexplained warnings in the logs.

Current Behavior

Tests are flaky and require reruns, and there are unexplained logs and warnings during the tests.

Reproduction Steps

Happens during PR checks.

Possible Solution

No response

Additional Information/Context

No response

Client version used

Don't care

Engine type and version

Don't care

OS

Don't care

Language

TypeScript

Language Version

Don't care

Cluster information

No response

Logs

No response

Other information

No response

avifenesh commented 1 month ago

glide.cluster.CommandTests.fcall_readonly_function(): FAILURE 0.212s

CommandTests > fcall readonly function FAILED
    org.opentest4j.AssertionFailedError: expected: <true> but was: <false>
        at app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
        at app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
        at app//org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63)
        at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36)
        at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:31)
        at app//org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:183)
        at app//glide.cluster.CommandTests.fcall_readonly_function(CommandTests.java:1672)

Yury-Fridlyand commented 1 month ago

glide.cluster.CommandTests.fcall_readonly_function fixed in #2350

Yury-Fridlyand commented 1 month ago

GLIDE code rust tests are flaky: https://github.com/valkey-io/valkey-glide/actions/runs/11297525637/job/31424597153?pr=2439

standalone_client_tests::test_read_from_replica_round_robin_do_not_read_from_disconnected_replica

```
test standalone_client_tests::test_read_from_replica_round_robin_do_not_read_from_disconnected_replica ...
2024-10-11T18:38:05.348584Z DEBUG logger_core: connection creation - Attempting connection to Host: `localhost`, Port: 46063
2024-10-11T18:38:05.348720Z DEBUG logger_core: connection creation - Attempting connection to Host: `localhost`, Port: 35599
2024-10-11T18:38:05.348840Z DEBUG logger_core: connection creation - Attempting connection to Host: `localhost`, Port: 40739
2024-10-11T18:38:05.349043Z DEBUG logger_core: connection creation - Attempting connection to Host: `localhost`, Port: 42883
2024-10-11T18:38:05.349247Z INFO redis::aio::connection: Creating TCP connection for node: "localhost:46063", IP: ::1
2024-10-11T18:38:05.349279Z INFO redis::aio::connection: Creating TCP connection for node: "localhost:46063", IP: 127.0.0.1
2024-10-11T18:38:05.349503Z INFO redis::aio::connection: Creating TCP connection for node: "localhost:35599", IP: ::1
2024-10-11T18:38:05.349532Z INFO redis::aio::connection: Creating TCP connection for node: "localhost:35599", IP: 127.0.0.1
2024-10-11T18:38:05.349674Z INFO redis::aio::connection: Creating TCP connection for node: "localhost:40739", IP: ::1
2024-10-11T18:38:05.349720Z INFO redis::aio::connection: Creating TCP connection for node: "localhost:40739", IP: 127.0.0.1
2024-10-11T18:38:05.349864Z INFO redis::aio::connection: Creating TCP connection for node: "localhost:42883", IP: ::1
2024-10-11T18:38:05.349908Z INFO redis::aio::connection: Creating TCP connection for node: "localhost:42883", IP: 127.0.0.1
2024-10-11T18:38:05.351826Z DEBUG logger_core: connection creation - Connection to localhost:46063 created
2024-10-11T18:38:05.362656Z DEBUG logger_core: connection creation - Connection to localhost:35599 created
2024-10-11T18:38:05.362833Z DEBUG logger_core: connection creation - Connection to localhost:40739 created
2024-10-11T18:38:05.363301Z DEBUG logger_core: connection creation - Connection to localhost:42883 created
2024-10-11T18:38:05.364378Z DEBUG logger_core: StandaloneClient - connection checker has triggered reconnect
2024-10-11T18:38:05.364407Z DEBUG logger_core: reconnect - starting
thread 'standalone_client_tests::test_read_from_replica_round_robin_do_not_read_from_disconnected_replica' panicked at tests/test_standalone_client.rs:242:9:
assertion `left == right` failed
  left: [2, 3]
 right: [3, 3]
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
FAILED
```

avifenesh commented 4 weeks ago

Node tests time out from time to time, and the point of failure changes between runs.

avifenesh commented 3 weeks ago

https://github.com/valkey-io/valkey-glide/actions/runs/11477149214/job/31938639591?pr=2500

Java CI: "This step has been truncated due to its large size. Download the full logs from the menu once the workflow run has completed."

avifenesh commented 3 weeks ago

> GLIDE code rust tests are flaky: https://github.com/valkey-io/valkey-glide/actions/runs/11297525637/job/31424597153?pr=2439
>
> standalone_client_tests::test_read_from_replica_round_robin_do_not_read_from_disconnected_replica

Fixed in one of my open PRs.

avifenesh commented 2 weeks ago

https://github.com/valkey-io/valkey-glide/actions/runs/11612733982/job/32336884037?pr=2558

Script tests are timing out in Node. The tests include a sleep time of at least 1500, plus a loop with a 4000 timeout limit. It is enough for the machine to be under a little load: the tests run slowly or wait, and then fail. When a test fails this way, the timeout error comes from both the framework and the test. Tests with long sleeps need a higher test timeout, or they need to sleep less.
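
One way to keep the framework timeout ahead of a test's own waits is to derive it from those waits instead of hard-coding it. The sketch below is only an illustration: `TimingPlan`, `testTimeoutMs`, the slack factor of 2 and the 5000 floor are hypothetical and not part of GLIDE, and it assumes the 1500 and 4000 figures above are milliseconds.

```typescript
// Hypothetical helper, not part of GLIDE or of any test framework.
interface TimingPlan {
    sleepsMs: number[]; // fixed sleeps the test performs (assumed to be ms)
    pollLimitMs: number; // upper bound of the polling loop in the test
}

// Returns a framework timeout that leaves headroom over the worst-case
// time the test can spend sleeping and polling on a loaded machine.
function testTimeoutMs(
    plan: TimingPlan,
    slackFactor = 2, // assumed headroom multiplier
    floorMs = 5000, // assumed minimum timeout
): number {
    if (slackFactor < 1) {
        throw new RangeError("slackFactor must be >= 1");
    }
    const worstCaseMs =
        plan.sleepsMs.reduce((sum, ms) => sum + ms, 0) + plan.pollLimitMs;
    return Math.max(floorMs, Math.ceil(worstCaseMs * slackFactor));
}

// The figures mentioned above: one sleep of 1500 and a loop limit of 4000.
const scriptTestTimeout = testTimeoutMs({ sleepsMs: [1500], pollLimitMs: 4000 });
```

If the suite runs on Jest, a value like `scriptTestTimeout` could be passed as the third argument of `it(name, fn, timeout)`, so that only the test's own loop limit fires when the machine is slow.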

avifenesh commented 2 weeks ago

https://github.com/valkey-io/valkey-glide/actions/runs/11612834623/job/32337281537

Occasionally, tests on self-hosted runners get `Error: File was unable to be removed Error: EACCES: permission denied, unlink ...`. This is probably caused by tests interfering with each other on the host.

avifenesh commented 2 weeks ago

https://github.com/valkey-io/valkey-glide/actions/runs/11643552677/job/32424432651#step:5:1010
https://github.com/valkey-io/valkey-glide/actions/runs/11643552677/job/32424430639#step:5:1062

Tests with the wait command in Node are always flaky. The pace of the run appears to affect the result: sometimes a command reaches one of the replicas, and sometimes it does not. In a multi, the client doesn't wait for a response to the wait command; it returns immediately. Since Node is a single process, it makes sense that sometimes a command reaches a replica right before the wait and sometimes the pipeline is already clean, especially when running in a very slow environment.

These commands need to be removed from the transaction tests.
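
As a rough illustration of that removal, the sketch below filters timing-dependent commands out of a list of transaction steps before their results are asserted. The `TransactionStep` shape and the helper name are hypothetical and do not reflect how GLIDE's transaction tests are actually structured.

```typescript
// Hypothetical shape for one step of a transaction test; not GLIDE's test API.
interface TransactionStep {
    command: string;
    expected: unknown;
}

// Commands whose reply depends on replication timing rather than on the data.
const TIMING_DEPENDENT = new Set(["WAIT"]);

// Keeps only the steps whose expected reply is stable from run to run.
function withoutTimingDependentSteps(
    steps: TransactionStep[],
): TransactionStep[] {
    return steps.filter(
        (step) => !TIMING_DEPENDENT.has(step.command.toUpperCase()),
    );
}

const steps: TransactionStep[] = [
    { command: "SET", expected: "OK" },
    { command: "wait", expected: 1 },
    { command: "GET", expected: "value" },
];
const stableSteps = withoutTimingDependentSteps(steps);
```

The wait command could still be covered by a separate, non-transactional test that tolerates a range of replica counts.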

avifenesh commented 2 weeks ago

Generally, as the full run shows, and as a result of the minimal testing we had until this point, we're seeing many failures. Some are the result of flakiness, and some are simply bugs that were never tested. @ikolomi I think we might need to freeze for one day, divide the suites among the team, and stop the run until we know we are green. I don't think releasing while we are red is something we want to accept. The decision you make now is what every RM will do; you are the first, and you will set the tone. Please consider it.