palantir / atlasdb

Transactional Distributed Database Layer
https://palantir.github.io/atlasdb/
Apache License 2.0
45 stars 7 forks source link

improvement: Fix flakiness on CassandraClientPoolTest resilientToRollingRestarts method #7036

Closed LucasIME closed 3 months ago

LucasIME commented 3 months ago

General

Before this PR:

This test was previously flaky (example), failing to assert that the node we forced the failure on was present on the blacklist.

There were a few problems that caused the test to behave in such way:

  1. CassandraClientPoolImpl has a scheduled task that keeps trying to refresh its own pool
  2. That tries to set the pool values to the current server list from config
  3. But our CassandraService used in these tests is mocked. And the value from the config is only set on setupThriftServers, which is called from clientPoolWith, which is always set to an empty set on the clientPoolWithServersInCurrentPool on the method we were using for test setup.
  4. That means that we were prone to the following race:
  5. We run runNoopWithRetryOnHost and the CASS_SERVER_1 added to the blacklist, but haven't yet checked the assertion
  6. Background task refresh pool task runs, trying to update pool to a an empty list (since we haven't provided the servers addresses)
  7. absentServers here is then all the servers in our currentPool, since desiredServers is an empty map. So we then call Cassandra.removePool here
  8. But the problem is that when you remove a server from the cassandra pool, it also gets removed from the blacklist. So now our blacklist is empty
  9. We then try to run the assertion here but the blacklist is already empty, so we fail

The relevant change is this PR is only to set the hosts to be returned by cassandra.getCurrentServerListFromConfig to match what we're actually using in the tests, so it doesn't get reset by the background task.

nit: when(config.autoRefreshNodes()).thenReturn(false); was not necessarily needed, since it was already returning false. But adding it for clarity, since when investigating the behaviour wasn't obvious to me: despite the default value for such config being True, if you don't provide the return for a mocked a boolean method, it will return false.

After this PR:

==COMMIT_MSG== CassandraClientPoolTest resilientToRollingRestarts() method shouldn't be flaky anymore. ==COMMIT_MSG==