There are a couple of use cases that need to be tested:
If KeyDB is down when the server starts, we currently retry 20 times and then return an error. That error should be handled better, and a dedicated error page should be shown.
If KeyDB goes down and then comes back up with persistent storage configured (using AOF), the server gets an error. The intent is that the server and test runners recover and continue working on their current jobs, so something is broken there.
Verify that test runners come back online after KeyDB goes down and then comes back up.
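
The startup case (20 retries, then an error) could be tested with something like the sketch below. It is not the server's actual code: `wait_for_store`, `StoreUnavailableError`, and the injected `ping` callable are hypothetical names, and the delay between attempts is an assumption. Only the count of 20 attempts comes from the description above. Raising a distinct exception type is one way to let the server map this failure to a specific error page.

```python
import time
from typing import Callable, Optional


class StoreUnavailableError(Exception):
    """Hypothetical error: the store could not be reached after all retries."""

    def __init__(self, attempts: int, last_error: Optional[Exception]):
        super().__init__(f"store unreachable after {attempts} attempts: {last_error}")
        self.attempts = attempts
        self.last_error = last_error


def wait_for_store(
    ping: Callable[[], None],
    max_attempts: int = 20,          # matches the 20 retries mentioned above
    delay_seconds: float = 1.0,      # assumed delay, not taken from the server
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call ping() until it succeeds; return the attempt number that worked.

    ping is expected to raise ConnectionError while the store is down.
    Raises StoreUnavailableError once max_attempts have all failed.
    """
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        try:
            ping()
            return attempt
        except ConnectionError as exc:
            last_error = exc
            if attempt < max_attempts:
                sleep(delay_seconds)
    raise StoreUnavailableError(max_attempts, last_error)
```

In a test, `ping` can be a stub that fails a fixed number of times, and `sleep` can be replaced with a no-op so the test does not actually wait. The same stub approach may also help with the reconnect cases, by simulating a store that goes down and later comes back.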