petuhovskiy opened 1 week ago
What is the impact? Does the compute just reconnect, or is something worse happening?
This error is a safety measure against data corruption: it checks that each WAL write to disk starts exactly where the previous write ended. The check is quite low-level (it covers disk writes only), so when it fires it means something is wrong with the consensus algorithm that ordered those writes. In theory this could be a serious consensus issue, and the impact could be big.
But in the example (logs attached) the impact is minimal: the compute hasn't changed, and the error was triggered only by unusual network latencies.
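For illustration, here is a minimal sketch of the kind of contiguity check described above. The names (`WalWriter`, `write_lsn`) are hypothetical and are not taken from wal_storage.rs:

```rust
/// Hypothetical sketch of a WAL contiguity check; not the actual
/// wal_storage.rs code, just an illustration of the invariant.
struct WalWriter {
    /// End position (LSN) of the last WAL record written to disk.
    write_lsn: u64,
}

impl WalWriter {
    /// Append a WAL chunk that claims to start at `start_lsn`.
    fn write(&mut self, start_lsn: u64, buf: &[u8]) -> Result<(), String> {
        // Safety check: every write must start exactly where the
        // previous one ended, otherwise the ordering imposed by the
        // consensus algorithm was violated somewhere above us.
        if start_lsn != self.write_lsn {
            return Err(format!(
                "non-contiguous WAL write: expected start at {}, got {}",
                self.write_lsn, start_lsn
            ));
        }
        // ... write `buf` to disk here ...
        self.write_lsn += buf.len() as u64;
        Ok(())
    }
}
```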
The current problem is that this error appears quite often in the logs (~10-20 times per week, usually after a redeploy), and it's impossible to verify that every occurrence is harmless. The idea is to fix the instances caused by network latency and ensure they are covered by the simulation tests. This will help us catch genuinely bad things that might otherwise hide behind this error.
Chatted with Arthur: this behaviour is generally expected, but the check should indeed also be present at a higher level, not only in wal_storage.rs. To be extra careful we could also enumerate walproposer connections and compare those numbers, but that is a more invasive change.
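A rough sketch of the connection-enumeration idea, assuming a hypothetical per-safekeeper counter; none of these names come from the actual codebase:

```rust
/// Hypothetical sketch of the "enumerate walproposer connections" idea:
/// each new connection gets a monotonically increasing id, and writes
/// carrying a stale id are rejected before they reach wal_storage.rs.
struct SafekeeperState {
    /// Id of the most recently accepted walproposer connection.
    current_conn_id: u64,
}

impl SafekeeperState {
    /// Called when a new walproposer connection is established.
    fn register_connection(&mut self) -> u64 {
        self.current_conn_id += 1;
        self.current_conn_id
    }

    /// Higher-level check: drop writes from superseded connections so a
    /// delayed packet from an old connection can't interleave with the
    /// new one and trip the low-level contiguity check.
    fn accept_write(&self, conn_id: u64) -> Result<(), String> {
        if conn_id != self.current_conn_id {
            return Err(format!(
                "write from stale connection {} (current is {})",
                conn_id, self.current_conn_id
            ));
        }
        Ok(())
    }
}
```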
Understanding why the simulation doesn't complain about this would also be great.
Steps to reproduce
As I understand it, this can happen when the compute<->sk connection breaks without a compute restart. If the network lags for a bit, a scenario like this can happen:
We need to investigate why this exact error is not reproduced in the simulation tests.
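As a starting point, here is a toy deterministic test (not Neon's actual simulation framework) showing how a retransmitted write after a network hiccup would trip the contiguity check sketched earlier. The exact sequence of events is an assumption; the issue does not spell it out:

```rust
/// Toy reproduction using the hypothetical `WalWriter` from the sketch
/// above. The scenario (resend from an old LSN after reconnect) is an
/// assumption, not a confirmed trace of the production failure.
#[test]
fn retransmit_after_reconnect_trips_check() {
    let mut wal = WalWriter { write_lsn: 0 };

    // Normal write on the first connection.
    wal.write(0, b"record-1").unwrap();

    // The connection breaks; the compute reconnects and, not knowing
    // how much of the previous write landed, resends from LSN 0.
    let err = wal.write(0, b"record-1").unwrap_err();
    assert!(err.contains("non-contiguous WAL write"));
}
```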
Expected result
Shouldn't happen. Logs shouldn't have this error (?)
Actual result
Sometimes happens
Environment
any
Logs, links
https://neonprod.grafana.net/goto/bvL2_sQSg?orgId=1