superfly / litefs

FUSE-based file system for replicating SQLite databases across a cluster of machines
Apache License 2.0
3.78k stars 89 forks source link

Recover on matching TXID with checksum mismatch #138

Closed kentcdodds closed 1 year ago

kentcdodds commented 1 year ago

I just deployed LiteFS to den and maa and I'm getting a bunch of logs that I'm concerned about:

2022-10-25T22:15:01.166 app[6c9dc779] maa [info] existing primary found (2a69c631), connecting as replica

2022-10-25T22:15:01.286 app[2a69c631] den [info] stream connected

2022-10-25T22:15:01.287 app[2a69c631] den [info] send frame<ltx>: db="sqlite.db" tx=0000000000000252-0000000000000252 size=49320

2022-10-25T22:15:01.420 app[6c9dc779] maa [info] replica disconected, retrying: process ltx stream frame: position mismatch on db "sqlite.db": 0000000000000251/ce552e44f23fbbdd <> 0000000000000251/e5f3198f64a74ef3

2022-10-25T22:15:02.670 app[6c9dc779] maa [info] existing primary found (2a69c631), connecting as replica

2022-10-25T22:15:02.790 app[2a69c631] den [info] stream connected

2022-10-25T22:15:02.924 app[6c9dc779] maa [info] replica disconected, retrying: process ltx stream frame: position mismatch on db "sqlite.db": 0000000000000251/ce552e44f23fbbdd <> 0000000000000251/e5f3198f64a74ef3

2022-10-25T22:15:04.182 app[6c9dc779] maa [info] existing primary found (2a69c631), connecting as replica

2022-10-25T22:15:04.437 app[6c9dc779] maa [info] replica disconected, retrying: process ltx stream frame: position mismatch on db "sqlite.db": 0000000000000251/ce552e44f23fbbdd <> 0000000000000251/e5f3198f64a74ef3

2022-10-25T22:15:04.753 app[2a69c631] den [info] HEAD / 200 89545 - 49.571 ms

2022-10-25T22:15:04.755 app[2a69c631] den [info] GET /healthcheck 200 - - 64.283 ms

2022-10-25T22:15:05.658 app[6c9dc779] maa [info] existing primary found (2a69c631), connecting as replica

2022-10-25T22:15:05.780 app[2a69c631] den [info] stream connected

2022-10-25T22:15:05.780 app[2a69c631] den [info] send frame<ltx>: db="sqlite.db" tx=0000000000000252-0000000000000252 size=49320

2022-10-25T22:15:05.913 app[6c9dc779] maa [info] replica disconected, retrying: process ltx stream frame: position mismatch on db "sqlite.db": 0000000000000251/ce552e44f23fbbdd <> 0000000000000251/e5f3198f64a74ef3

2022-10-25T22:15:07.138 app[6c9dc779] maa [info] existing primary found (2a69c631), connecting as replica

2022-10-25T22:15:07.259 app[2a69c631] den [info] stream connected

2022-10-25T22:15:07.262 app[2a69c631] den [info] send frame<ltx>: db="sqlite.db" tx=0000000000000252-0000000000000252 size=49320

2022-10-25T22:15:07.394 app[6c9dc779] maa [info] replica disconected, retrying: process ltx stream frame: position mismatch on db "sqlite.db": 0000000000000251/ce552e44f23fbbdd <> 0000000000000251/e5f3198f64a74ef3

2022-10-25T22:15:08.615 app[6c9dc779] maa [info] existing primary found (2a69c631), connecting as replica

2022-10-25T22:15:08.735 app[2a69c631] den [info] stream connected

2022-10-25T22:15:08.737 app[2a69c631] den [info] send frame<ltx>: db="sqlite.db" tx=0000000000000252-0000000000000252 size=49320

2022-10-25T22:15:08.875 app[6c9dc779] maa [info] replica disconected, retrying: process ltx stream frame: position mismatch on db "sqlite.db": 0000000000000251/ce552e44f23fbbdd <> 0000000000000251/e5f3198f64a74ef3

2022-10-25T22:15:10.107 app[6c9dc779] maa [info] existing primary found (2a69c631), connecting as replica

2022-10-25T22:15:10.228 app[2a69c631] den [info] stream connected

2022-10-25T22:15:10.228 app[2a69c631] den [info] send frame<ltx>: db="sqlite.db" tx=0000000000000252-0000000000000252 size=49320

2022-10-25T22:15:10.362 app[6c9dc779] maa [info] replica disconected, retrying: process ltx stream frame: position mismatch on db "sqlite.db": 0000000000000251/ce552e44f23fbbdd <> 0000000000000251/e5f3198f64a74ef3

I'm not sure what these mean 🤔

benbjohnson commented 1 year ago

@kentcdodds That's an interesting error you caught there:

2022-10-25T22:15:01.420 app[6c9dc779] maa [info] replica disconected, retrying: process ltx stream frame: position mismatch on db "sqlite.db": 0000000000000251/ce552e44f23fbbdd <> 0000000000000251/e5f3198f64a74ef3

It's saying that the two nodes are at the same transaction ID but have different checksums. This can happen if a change isn't replicated before a node is shutdown but it looks like it's not recovering on its own like it should. I think I found the issue and I can push a fix tomorrow.

benbjohnson commented 1 year ago

This should be fixed with https://github.com/superfly/litefs/pull/158

I'm going to be doing some more extensive testing tomorrow. I'm going to close now but let me know if you run into this issue again.