Before this PR:
We have a couple of heavy-write workflows, but they do not spread writes across the token range, and their write load is not actually heavy.
This means we can't test things like Sweep's resilience to failing Cassandra nodes, or whether Sweep can keep up in the presence of heavy writes.
After this PR:
Adds a new workflow that performs many writes across many cells.
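For reviewers who want a picture of the write pattern, here is a minimal sketch of randomised writes spread over many cells. It is not the code in this PR: the function name `write_across_cells`, the dict-backed `store`, and all parameters are hypothetical, and a plain in-memory dict stands in for the real key-value service.

```python
import random


def write_across_cells(store, rng, num_rows, num_cols, num_writes):
    """Perform num_writes writes to randomly chosen (row, column) cells.

    store is a hypothetical in-memory stand-in for the real table.
    rng is a seeded random.Random so a run can be replayed.
    """
    for i in range(num_writes):
        # Pick a random cell; repeated picks overwrite the previous value,
        # which is what produces many versions of the same cell.
        cell = (rng.randrange(num_rows), rng.randrange(num_cols))
        store[cell] = i
    return store


store = write_across_cells({}, random.Random(0), num_rows=100, num_cols=10, num_writes=1000)
```

Passing the random source in explicitly keeps a run reproducible for a given seed while still letting a fuzzer vary the seed to reach different interleavings.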
==COMMIT_MSG==
==COMMIT_MSG==
Priority: P2
Concerns / possible downsides (what feedback would you like?):
Too many writes?
This will be accompanied by changes to our internal Atlas TombstoneOverwhelmingEquivalent error that lower the threshold at which we fail on too many Atlas tombstones, just not in this PR.
Unlike other workflows, this one leans heavily on randomisation so that the fuzzer can explore interesting code paths. Is there anywhere I should add more or less randomisation?
Is the logging format correct?
Testing and Correctness
What, if any, assumptions are made about the current state of the world? If they change over time, how will we find out?:
None
What was existing testing like? What have you done to improve it?:
Added tests
Execution
How would I tell this PR works in production? (Metrics, logs, etc.):
It does not fail Antithesis tests! Or it does, but the failures catch legitimate problems.
Development Process
Where should we start reviewing?:
MBCW
If this PR is in excess of 500 lines excluding versions lock-files, why does it not make sense to split it?:
N/A
Please tag any other people who should be aware of this PR:
@jeremyk-91
@sverma30
@raiju