superfly / litefs

FUSE-based file system for replicating SQLite databases across a cluster of machines
Apache License 2.0

Seeing "malformed database" error in WAL replicas #142

Closed dangra closed 1 year ago

dangra commented 1 year ago

The error isn't fixed by resyncing the replica, and it is intermittent: it comes and goes while replication is happening.

2022-10-26T17:48:35Z app[4536d5c4] iad [info]    ** (Exqlite.Error) database disk image is malformed

benbjohnson commented 1 year ago

Per our Slack conversation, this issue might come from using an old SQLite client version. Waiting on an update to confirm either way. 👍

dangra commented 1 year ago

It doesn't seem to be related to the client version. I isolated it in this repro: https://github.com/dangra/litefs-bug-repro/tree/main/wal-replication-1

benbjohnson commented 1 year ago

@dangra Thanks for the repro. I was thinking that I only needed a lock on WAL_READ0_LOCK since I'm always checkpointing on the replica side. However, it looks like WAL_READ0_LOCK through WAL_READ4_LOCK all need to be locked. I pushed up a PR: https://github.com/superfly/litefs/pull/148
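
For anyone following along, here is a rough sketch of where those five read locks live. It is based on the lock-byte layout described in SQLite's WAL file format documentation (lock bytes start at offset 120 of the `-shm` file; read lock `i` is lock index `3 + i`). The Go names below (`walLockOffset`, `walReadLock`, `readLockOffsets`) are hypothetical, chosen for illustration, and this is not the code from the PR.

```go
package main

import "fmt"

// Lock-byte layout of the SQLite WAL-index (-shm) file, per the SQLite
// WAL format documentation. The identifiers are illustrative only and
// are not taken from LiteFS.
const (
	walLockOffset  = 120 // byte offset of the first lock byte in the -shm file
	walWriteLock   = 0   // WAL_WRITE_LOCK
	walCkptLock    = 1   // WAL_CKPT_LOCK
	walRecoverLock = 2   // WAL_RECOVER_LOCK
	walNReader     = 5   // number of read-mark slots (WAL_READ_LOCK(0)..(4))
)

// walReadLock returns the lock index for read-mark slot i (0 <= i < walNReader).
func walReadLock(i int) int { return 3 + i }

// readLockOffsets returns the -shm byte offset of every read lock.
// A replica-side writer that wants to keep all readers out would need
// to hold each of these, not just the first one.
func readLockOffsets() []int64 {
	offsets := make([]int64, 0, walNReader)
	for i := 0; i < walNReader; i++ {
		offsets = append(offsets, int64(walLockOffset+walReadLock(i)))
	}
	return offsets
}

func main() {
	for i, off := range readLockOffsets() {
		fmt.Printf("WAL_READ%d_LOCK -> shm byte %d\n", i, off)
	}
}
```

A reader may end up using any of the five read-mark slots, so holding only the first slot's lock would leave the other four open, which fits the behavior described above.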

However, I'm seeing really long latency (e.g. ~6s) for replica reads from your example. At first I thought it was lock contention but even when I drop the producer's write frequency to 1 write/sec it still shows high latency. I'm investigating that now.