palantir / atlasdb

Transactional Distributed Database Layer
https://palantir.github.io/atlasdb/
Apache License 2.0

[Antithesis] Multiple Busy Cell Workflow for Sweep #7020

Closed: mdaudali closed this pull request 4 months ago

mdaudali commented 4 months ago

General

Before this PR: We have a couple of heavy-write workflows, but they don't spread their writes across the token range, nor are the writes actually that heavy.

This means we can't test things like Sweep's resilience to failing Cassandra nodes, or whether Sweep can keep up in the presence of heavy writes.

After this PR: A new workflow that performs many writes spread across many cells.
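To illustrate the write pattern described above, here is a minimal Java sketch of planning writes across many distinct rows. It is not the workflow's actual code: the class `BusyCellWriteSketch`, the `planWrites` method, the `CellWrite` record and all parameters are hypothetical, rows are plain strings rather than AtlasDB byte-array row names, and it only plans writes rather than calling any AtlasDB API. It assumes each row maps to its own Cassandra partition, so that under a hashing partitioner such as Murmur3Partitioner many random row names land on tokens spread across the ring rather than on a few hot partitions.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

final class BusyCellWriteSketch {

    /** Hypothetical cell write: row and column identify the cell, value is its payload. */
    record CellWrite(String row, String column, byte[] value) {}

    private BusyCellWriteSketch() {}

    /**
     * Plans numRows distinct random rows with cellsPerRow columns each.
     * Assumption: each row is its own partition, so distinct random rows
     * hash to tokens spread across the ring.
     */
    static List<CellWrite> planWrites(Random random, int numRows, int cellsPerRow, int valueSize) {
        // LinkedHashSet keeps insertion order, so a given seed always yields the same plan.
        Set<String> rows = new LinkedHashSet<>();
        while (rows.size() < numRows) {
            rows.add("row-" + Long.toHexString(random.nextLong()));
        }

        List<CellWrite> writes = new ArrayList<>(numRows * cellsPerRow);
        for (String row : rows) {
            for (int c = 0; c < cellsPerRow; c++) {
                // Random payload of a fixed size for each cell.
                byte[] value = new byte[valueSize];
                random.nextBytes(value);
                writes.add(new CellWrite(row, "col-" + c, value));
            }
        }
        return writes;
    }
}
```

Seeding the `Random` makes a given run reproducible, which matters when a failure found by the fuzzer needs to be replayed.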


Priority: P2

Concerns / possible downsides (what feedback would you like?): Too many writes? This will come with changes to our internal Atlas TombstoneOverwhelmingEquivalent error, tuning down the threshold at which we fail on too many Atlas tombstones (just not in this PR).

Unlike other workflows, I lean heavily into randomisation here so that the fuzzer can explore interesting code paths. Is there anywhere I should add more or less randomisation?
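One way to lean into randomisation, sketched under assumptions: draw each workflow parameter from a bounded range using a randomness source the fuzzer can control. Here `java.util.Random` stands in for that source, and `RandomisedWorkflowConfig`, its `Config` fields and the bounds are hypothetical illustrations, not values from this PR.

```java
import java.util.Random;

final class RandomisedWorkflowConfig {

    /** Hypothetical knobs for a busy-cell workflow run. */
    record Config(int iterations, int rowsPerIteration, int cellsPerRow, int valueSizeBytes) {}

    private RandomisedWorkflowConfig() {}

    /** Uniform integer in [minInclusive, maxInclusive]. */
    static int between(Random random, int minInclusive, int maxInclusive) {
        return minInclusive + random.nextInt(maxInclusive - minInclusive + 1);
    }

    /** Draws every parameter from the same source, so one seed fixes the whole configuration. */
    static Config draw(Random random) {
        // Java evaluates arguments left to right, so the draw order is deterministic.
        return new Config(
                between(random, 1, 50),      // iterations (hypothetical bound)
                between(random, 1, 1_000),   // rows per iteration (hypothetical bound)
                between(random, 1, 20),      // cells per row (hypothetical bound)
                between(random, 0, 4_096));  // value size in bytes (hypothetical bound)
    }
}
```

Keeping each parameter bounded lets the fuzzer explore different workflow shapes, such as a few iterations over many rows versus many iterations over a few rows, without any single draw producing an unbounded load.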

Is the logging format correct?

Testing and Correctness

What, if any, assumptions are made about the current state of the world? If they change over time, how will we find out?: None

What was existing testing like? What have you done to improve it?: Added tests

Execution

How would I tell this PR works in production? (Metrics, logs, etc.): It does not fail Antithesis tests! Or, if it does, it catches legitimate problems.

Development Process

Where should we start reviewing?: MBCW

If this PR is in excess of 500 lines excluding versions lock-files, why does it not make sense to split it?: N/A

Please tag any other people who should be aware of this PR: @jeremyk-91 @sverma30 @raiju