palantir / atlasdb

Transactional Distributed Database Layer
https://palantir.github.io/atlasdb/
Apache License 2.0

improvement: Sleep between attempts to validate invariants in the workload server, and retry #6994

Closed jeremyk-91 closed 4 months ago

jeremyk-91 commented 4 months ago

General

Before this PR: We attempt to validate each workflow only once. Faults have stopped by this point, but Cassandra or Timelock may still be recovering from them.

After this PR:

==COMMIT_MSG== If an exception is thrown when validating an invariant, we sleep for five seconds and then attempt validation again. This likely reduces the false positive rate of the workload server. ==COMMIT_MSG==
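For reviewers who want the shape of the change without opening the diff, the retry behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the class name `RetryingValidation`, the method `validateWithRetry`, and the `maxAttempts` cap are all hypothetical, and the sketch only retries on unchecked exceptions. The only detail taken from the description is the sleep between failed attempts.

```java
import java.time.Duration;

// Hypothetical sketch: names and the attempt cap are assumptions, not AtlasDB APIs.
public final class RetryingValidation {
    private RetryingValidation() {}

    /**
     * Runs the validation until it succeeds or maxAttempts is reached,
     * sleeping between failed attempts. Returns the number of attempts used,
     * or rethrows the most recent failure if no attempt succeeds.
     */
    public static int validateWithRetry(Runnable validation, int maxAttempts, Duration sleepBetweenAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        RuntimeException lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                validation.run();
                return attempt;
            } catch (RuntimeException e) {
                // The backing services may still be recovering; remember the failure and retry.
                lastFailure = e;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(sleepBetweenAttempts.toMillis());
                } catch (InterruptedException e) {
                    // Preserve the interrupt flag and stop retrying.
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        throw lastFailure;
    }
}
```

A caller might use it as `RetryingValidation.validateWithRetry(() -> checkInvariants(), 10, Duration.ofSeconds(5))`, where `checkInvariants` and the limit of ten attempts are placeholders chosen for illustration.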

Priority: P2

Concerns / possible downsides (what feedback would you like?):

Is documentation needed?: No.

Compatibility

Skipping section as this only affects the workload server.

Testing and Correctness

What, if any, assumptions are made about the current state of the world? If they change over time, how will we find out?: That we care about 50 second recovery times.

What was existing testing like? What have you done to improve it?: Not too much: CI should check that this doesn't generally break things, and then it's a question

If this PR contains complex concurrent or asynchronous code, is it correct? The onus is on the PR writer to demonstrate this.: Nothing in particular

If this PR involves acquiring locks or other shared resources, how do we ensure that these are always released?: No

Execution

Skipping section: workload server only.

Scale

Skipping section: workload server only.

Development Process

Where should we start reviewing?: The one file

If this PR is in excess of 500 lines excluding versions lock-files, why does it not make sense to split it?:

Please tag any other people who should be aware of this PR: @jeremyk-91 @sverma30 @raiju

jeremyk-91 commented 4 months ago

Yep, that makes a lot of sense! I'll merge this first in the interest of having the framework improve for short-term testing, and open a follow-up that sorts some of this out.