shotover / shotover-proxy

L7 data-layer proxy
https://docs.shotover.io
Apache License 2.0
87 stars 18 forks source link

redis_int_tests: Fix intermittent CI failure #1792

Closed rukai closed 1 month ago

rukai commented 1 month ago

intermittent CI failures have gotten a bit out of hand lately. This PR addresses the most frequent failure, which is why I'm suddenly touching redis instead of kafka.

The last time I looked into this I ended up quite confused, but that was 2 years ago and the solution was a lot easier to find this time around.

This PR extends the fix in https://github.com/shotover/shotover-proxy/pull/745 with two extra changes:

1

We run CLUSTER NODES to fetch a master node in get_master_id. While redis is starting up CLUSTER NODES can return bizarre results where replicas are reported as masters. The results include more masters than slaves, which is impossible because the cluster is configured with 3 masters and 3 slaves. After a while this resolves itself.

Here is an example of such an output from CLUSTER NODES:

48a7231015d6f654ddc0dfa1901fe8e54018ba4f 172.16.1.5:6379@16379 master - 0 1730264125000 4 connected
b6f810c14390f85583579b07c684c153d9a734e5 172.16.1.6:6379@16379 master - 0 1730264126000 5 connected
d84c529dbb7eac1a521352e4e788688cafd977dc 172.16.1.7:6379@16379 slave a601ae6cbcbd31e45af2d07697b02c095411a1a8 0 1730264126889 2 connected
a601ae6cbcbd31e45af2d07697b02c095411a1a8 172.16.1.3:6379@16379 master - 0 1730264125000 2 connected 5461-10922
29f55eab0de305e5edc450fc1ff9aa83b0d29d9f 172.16.1.2:6379@16379 master - 0 1730264127892 1 connected 0-5460
bd8db891ce0bbdd833307d8ffc02fb6eb3532603 172.16.1.4:6379@16379 myself,master - 0 1730264125000 3 connected 10923-16383

The fix to this problem is to extend the retry logic to call get_master_id again to refetch the master ID in case it was incorrectly reported.

2

Increase the timeout from ~5s to 30s, on my machine after the above fix, very very rarely we would hit this 5s mark, but after increasing to 30s I have never seen a failure even after 100 runs of the test. While I was at it, I rewrote the timeout logic to be easier to read and more accurate.

codspeed-hq[bot] commented 1 month ago

CodSpeed Performance Report

Merging #1792 will not alter performance

Comparing rukai:fix_intermittent_redis_failure (7ded6fa) with main (5f20395)

Summary

✅ 38 untouched benchmarks