Open Yuval-Ariel opened 5 months ago
For the IsDone assertion failure, my findings are as follows. When the stress test resumes and finds a trace file and a state file with a seq number prefix (e.g. 0.state), it tries to verify that the trace file holds all the changes made to the db since that seq number. It does so by replaying the trace file and counting the put, delete and rangedelete operations it contains. It then compares that count to the number of seq numbers the db has advanced since the seq number on the file (the IsDone assertion). The assumption that the gap in seq numbers equals the number of write and delete operations recorded in the trace file seems very weak to me. The txn db, for example, doesn't uphold it, and there seem to be other cases where the assumption is false.
Since we want to keep the trace file for debugging, the workaround for now is to skip this assertion and only print the numbers (num_write_ops and max_write_ops) for future debugging.
The workaround will be done in https://github.com/speedb-io/speedb/pull/816
Further documentation is in ~/expected_err on instance 173.
There are several issues:
ExpectedStateTraceRecordHandler IsDone Assertion
During FileExpectedStateManager::Restore(DB* db), an ExpectedStateTraceRecordHandler is created with max_write_ops, which is the gap between the current db seq number and the seq number found on the state file. The trace record replayer replays the trace file and increments num_write_ops on each write operation. IsDone checks that num_write_ops == max_write_ops. With use_txn=1 and txn_write_policy=1, num_write_ops is only about half of max_write_ops. This always happens.
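The mismatch can be illustrated with a small standalone model. This is only a sketch: `OpType`, `CountingHandler` and `Replay` are hypothetical names and do not reproduce the db_stress classes. It assumes the handler counts put, delete and rangedelete records and that the expected count is the seq number gap.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for a trace record type; not the Speedb type.
enum class OpType { kPut, kDelete, kDeleteRange, kOther };

// Hypothetical, simplified model of the counting handler described above.
class CountingHandler {
 public:
  explicit CountingHandler(uint64_t max_write_ops)
      : max_write_ops_(max_write_ops) {}

  // Count only the operations that the replay treats as writes.
  void Handle(OpType op) {
    if (op == OpType::kPut || op == OpType::kDelete ||
        op == OpType::kDeleteRange) {
      ++num_write_ops_;
    }
  }

  // Strict check: the traced writes must exactly cover the seq number gap.
  bool IsDone() const { return num_write_ops_ == max_write_ops_; }

  // Workaround style: report the counters instead of asserting on them.
  std::string Describe() const {
    return "num_write_ops=" + std::to_string(num_write_ops_) +
           " max_write_ops=" + std::to_string(max_write_ops_);
  }

 private:
  uint64_t num_write_ops_ = 0;
  uint64_t max_write_ops_;
};

// Replays a trace against the gap between the saved and current seq numbers.
CountingHandler Replay(const std::vector<OpType>& trace, uint64_t saved_seqno,
                       uint64_t db_seqno) {
  assert(db_seqno >= saved_seqno);  // the gap must not be negative
  CountingHandler handler(db_seqno - saved_seqno);
  for (OpType op : trace) {
    handler.Handle(op);
  }
  return handler;
}
```

For example, `Replay({OpType::kPut, OpType::kDelete}, 0, 4)` models a run where two traced writes advanced the seq number by four: `IsDone()` returns false and `Describe()` yields `num_write_ops=2 max_write_ops=4`, which is the kind of output the workaround would print instead of asserting.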
use_txn with reopen > 1 fails with corruption
After reopen, a new state is created with a seq number > 0. This happens since we added the trace_ops flag. However, when db_stress is rerun on the db, it recognizes that there is a state to recover from and a trace file too, but the seq number from the db is 0, which is lower than the one in the state (see https://github.com/speedb-io/speedb/blob/5fd550e2440ec9d9117ca4d7d2a693cffe6a4df7/db_stress_tool/expected_state.cc#L630-L631 ). Cmd to reproduce:
Txn with write policy 1 creates a trace file which is unreadable.
All of the above came up after we added #797, which creates a trace file by default.
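For the second issue (use_txn with reopen > 1), the seq number ordering that the restore path depends on can be sketched as below. This is a hedged model, not the code at the linked lines: `RestoreAction`, `RestorePlan` and `PlanRestore` are hypothetical names, and the only assumption taken from the description above is that a db seq number lower than the saved one is treated as an error.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical outcome of comparing the two seq numbers; not a Speedb type.
enum class RestoreAction { kNothingToReplay, kReplayTrace, kInconsistent };

struct RestorePlan {
  RestoreAction action;
  uint64_t ops_to_replay;  // meaningful only for kReplayTrace
};

// saved_seqno: seq number taken from the state/trace file name prefix.
// db_seqno:    latest seq number reported by the reopened db.
RestorePlan PlanRestore(uint64_t saved_seqno, uint64_t db_seqno) {
  if (db_seqno < saved_seqno) {
    // The saved state claims to be newer than the db itself. This models the
    // case reported above, where the db reports 0 but the state has seq > 0.
    return {RestoreAction::kInconsistent, 0};
  }
  if (db_seqno == saved_seqno) {
    // The db has not moved past the saved state, so there is nothing to do.
    return {RestoreAction::kNothingToReplay, 0};
  }
  assert(db_seqno > saved_seqno);  // the unsigned subtraction cannot wrap
  return {RestoreAction::kReplayTrace, db_seqno - saved_seqno};
}
```

With a saved seq number of 5 and a db seq number of 0, `PlanRestore(5, 0)` returns `kInconsistent`, which mirrors the reported failure. Keeping the comparison in one small function makes it easy to log both values when the inconsistent branch is taken.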