palantir / atlasdb

Transactional Distributed Database Layer
https://palantir.github.io/atlasdb/
Apache License 2.0
45 stars 7 forks source link

[Antithesis] Retry Verification Artifacts #7035

Closed jeremyk-91 closed 3 months ago

jeremyk-91 commented 3 months ago

General

Before this PR: The current Antithesis retry logic will retry even if exceptions are thrown during the execution of the reporting function. @LucasIME pointed out on the previous PR that this is not ideal, because it precludes reporters that throw exceptions, and also might do unnecessary work if there's accidentally a bug in the reporter.

After this PR:

==COMMIT_MSG== Antithesis retry logic now only retries failures in checking the invariant, not in reporting the violations produced by checking an invariant. ==COMMIT_MSG==

Priority: low P2

Concerns / possible downsides (what feedback would you like?):

Is documentation needed?: No

Compatibility

Does this PR create any API breaks (e.g. at the Java or HTTP layers) - if so, do we have compatibility?: YES described above, though InvariantReporter is an internal API so I'm not too worried.

Does this PR change the persisted format of any data - if so, do we have forward and backward compatibility?: No

The code in this PR may be part of a blue-green deploy. Can upgrades from previous versions safely coexist? (Consider restarts of blue or green nodes.): Workload Server is not deployed in a blue-green configuration.

Does this PR rely on statements being true about other products at a deployment - if so, do we have correct product dependencies on these products (or other ways of verifying that these statements are true)?: N/A

Does this PR need a schema migration? No

Testing and Correctness

What, if any, assumptions are made about the current state of the world? If they change over time, how will we find out?: Reporters throwing isn't something we want to retry. So far these are primarily logging or writing metrics so they shouldn't throw.

What was existing testing like? What have you done to improve it?: Added a test for the specific case, updated existing tests to match new behaviour

If this PR contains complex concurrent or asynchronous code, is it correct? The onus is on the PR writer to demonstrate this.: N/A

If this PR involves acquiring locks or other shared resources, how do we ensure that these are always released?: N/A

Execution

How would I tell this PR works in production? (Metrics, logs, etc.): Invariants continue to work

Has the safety of all log arguments been decided correctly?: N/A

Will this change significantly affect our spending on metrics or logs?: No

How would I tell that this PR does not work in production? (monitors, etc.): Invariants break

If this PR does not work as expected, how do I fix that state? Would rollback be straightforward?: Rollback

If the above plan is more complex than “recall and rollback”, please tag the support PoC here (if it is the end of the week, tag both the current and next PoC):

Scale

Not deployed to production.

Development Process

Where should we start reviewing?: Invariant / InvariantReporter, then the AntithesisWorkflowValidatorRunnerTest.

If this PR is in excess of 500 lines excluding versions lock-files, why does it not make sense to split it?: N/A

Please tag any other people who should be aware of this PR: @jeremyk-91 @sverma30 @raiju