Intermittent CI failures have gotten a bit out of hand lately.
This PR addresses the most frequent failure, which is why I'm suddenly touching redis instead of kafka.
The last time I looked into this I ended up quite confused, but that was two years ago and the solution was much easier to find this time around.
This PR extends the fix in https://github.com/shotover/shotover-proxy/pull/745 with two extra changes:

1. We run `CLUSTER NODES` to fetch a master node in `get_master_id`. While redis is starting up, `CLUSTER NODES` can return bizarre results where replicas are reported as masters: the results include more masters than slaves, which is impossible because the cluster is configured with 3 masters and 3 slaves. After a while this resolves itself.

   Here is an example of such an output from `CLUSTER NODES`:

   The fix is to extend the retry logic to call `get_master_id` again, refetching the master ID in case it was incorrectly reported (see the first sketch after this list).
2. Increase the timeout from ~5s to 30s. On my machine, after the above fix, we would very rarely still hit the 5s limit, but after increasing it to 30s I have never seen a failure, even after 100 runs of the test. While I was at it, I rewrote the timeout logic to be easier to read and more accurate (see the second sketch after this list).
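The retry change is roughly the following shape. This is a minimal, self-contained sketch, not the actual shotover test code: the redis connection is replaced by a `fetch_cluster_nodes` closure, and `run_against_master` is a hypothetical name for whatever the harness does with the master ID. The point it illustrates is that `get_master_id` is called again on every retry, instead of caching an ID from a possibly-bogus early `CLUSTER NODES` response.

```rust
use std::{thread::sleep, time::Duration};

/// Hypothetical stand-in for the helper that reads `CLUSTER NODES` output and
/// picks a node flagged as a master. The real test harness issues the command
/// over a redis connection; here the raw output comes from a closure so the
/// sketch stays self-contained.
fn get_master_id(fetch_cluster_nodes: &dyn Fn() -> String) -> Option<String> {
    fetch_cluster_nodes()
        .lines()
        // Field 3 of each `CLUSTER NODES` line is the comma-separated flags.
        .find(|line| {
            line.split_whitespace()
                .nth(2)
                .map_or(false, |flags| flags.split(',').any(|f| f == "master"))
        })
        // Field 1 is the node id.
        .and_then(|line| line.split_whitespace().next().map(str::to_owned))
}

/// Retry loop that refetches the master ID on every attempt, so an ID that was
/// reported incorrectly while redis was still starting up is never reused.
fn run_against_master<T>(
    fetch_cluster_nodes: &dyn Fn() -> String,
    mut operation: impl FnMut(&str) -> Result<T, String>,
    attempts: u32,
) -> Result<T, String> {
    let mut last_err = String::from("no attempts made");
    for _ in 0..attempts {
        // Refetch instead of caching the ID across retries.
        if let Some(master_id) = get_master_id(fetch_cluster_nodes) {
            match operation(&master_id) {
                Ok(value) => return Ok(value),
                Err(err) => last_err = err,
            }
        }
        sleep(Duration::from_millis(100));
    }
    Err(last_err)
}
```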
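The timeout rewrite is in the same spirit as this sketch (again hypothetical, not the shotover implementation): a single `Instant`-based deadline rather than counting sleep iterations, so elapsed time is measured accurately and the 30s bound is explicit.

```rust
use std::{
    thread::sleep,
    time::{Duration, Instant},
};

/// Hypothetical sketch of the reworked timeout loop: poll a condition against a
/// single wall-clock deadline instead of counting sleep iterations. This is both
/// easier to read and more accurate, because time spent inside `poll` (not just
/// inside `sleep`) counts toward the timeout.
fn wait_for<T>(
    timeout: Duration,                   // e.g. Duration::from_secs(30)
    mut poll: impl FnMut() -> Option<T>, // returns Some(..) once the condition holds
) -> Result<T, String> {
    let deadline = Instant::now() + timeout;
    loop {
        if let Some(value) = poll() {
            return Ok(value);
        }
        if Instant::now() >= deadline {
            return Err(format!("condition not met within {timeout:?}"));
        }
        sleep(Duration::from_millis(100));
    }
}
```

A caller would pass `Duration::from_secs(30)` and a closure wrapping whatever readiness check the test already performs (hypothetical here, since the actual check lives in the shotover test code).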