Closed: ayuhito closed this issue 1 year ago
@ayuhito Thanks for reporting this. There's a window of time where a database file can be created but doesn't have any data yet. It looks like the snapshot functionality isn't handling that correctly. I'll add a fix to skip over databases without data.
My setup is very similar to litefs-indie but uses Knex and better-sqlite3. We have two regions, and I also don't have any persistent volumes set up. Is it necessary to have volumes if I don't mind blasting my DB on every deploy?
You shouldn't need persistent volumes if you're not worried about losing your data if all nodes go down.
@ayuhito This may have been resolved by #184. However, I also added #185 to improve logging for snapshots, which should give a bit more info. If you still have your cluster running, can you try it out? The Docker image is flyio/litefs:pr-185.
@benbjohnson, a new challenger seems to have appeared.
```shell
2022-11-15T22:52:17.699 app[031c32e2] sjc [info] replica disconnected with error, retrying: process ltx stream frame: peek ltx header: unmarshal header: invalid LTX file
```
My first deployment returned the `commit record required` error, but that was because of the rolling-restart behaviour: since I'm running two instances, one of them was still running the old version. Do you suggest something like the bluegreen deployment strategy so that deploys work with LiteFS in an isolated manner for the DBs? Or will new versions try to interact with old VMs?

I can imagine this being an issue when migrations are deployed and a VM falls out of sync during a deployment, causing application errors and possibly health-check failures.
After a full restart, things were fine for a short while until I got these invalid LTX file errors when the second instance tried to deploy.
Thanks for your work! Really appreciate it!
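For context on the bluegreen question above, Fly.io lets you select the deployment strategy in fly.toml (a minimal sketch, assuming current flyctl config syntax; check the fly.toml reference for your version):

```toml
# fly.toml (fragment): use a bluegreen deploy so new machines are only
# promoted once healthy, instead of a rolling restart mixing versions.
[deploy]
  strategy = "bluegreen"
```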
@ayuhito I believe the `invalid LTX file` error was in your original post as well. It's caused by the server disconnecting due to the `commit record required` error. I'll do some more testing on my side to try to reproduce. If you have an example app that I can try, that would help too.
I tried one other possible fix (https://github.com/superfly/litefs/pull/187). Can you give this a whirl? I'll backfill some tests if this actually works. :)
The Docker image is: flyio/litefs:pr-187
@benbjohnson, ah I'm extremely tired right now so I missed the repeated error in the logs. Apologies :sweat_smile:
Unfortunately, #187 didn't work. Looking at the logs, the first instance deployed fine but things broke deploying the second instance.
I'll try to get a minimal reproduction for you by forking litefs-indie and stripping out Prisma in favour of Knex for simplicity. However, I probably can't get it to you today, unfortunately.
@ayuhito No problem! Thanks for all the help trying to debug this issue.
@ayuhito I was able to reproduce on my side. I'll get it fixed up. 👍
@ayuhito Ok, it looks like this PR fixes the issue. The database size is kept in memory, but it looks like I missed a spot where it should be updated when receiving an LTX file from the primary. When a deploy happens without a volume, the old replica becomes the primary and the new replica requests a snapshot (since it doesn't have a persistent volume), and that's where it hits this missing page count issue.
I've done a bunch of deploys and I haven't hit it again. I'm going to close the issue for now but let me know if you still experience the issue. Thanks again for all your help!!
@benbjohnson, it works great! Thank you so much for your time!
I've been running into some errors that boil down to:
Full Logs
```shell
2022-11-11T17:26:02.733 app[50616fcd] fra [info] Starting init (commit: 81d5330)...
2022-11-11T17:26:02.761 app[50616fcd] fra [info] Preparing to run: `docker-entrypoint.sh litefs -- node ./start.js` as root
2022-11-11T17:26:02.793 app[50616fcd] fra [info] 2022/11/11 17:26:02 listening on [fdaa:0:9732:a7b:86:5061:6fcd:2]:22 (DNS: [fdaa::3]:53)
2022-11-11T17:26:02.841 app[50616fcd] fra [info] LiteFS v0.3.0-beta1, commit=f40cf36930e1d0b259bbbdb72c3fd3a508ff1936
2022-11-11T17:26:02.841 app[50616fcd] fra [info] config file read from /app/litefs.yml
2022-11-11T17:26:02.841 app[50616fcd] fra [info] Using Consul to determine primary
2022-11-11T17:26:03.963 app[50616fcd] fra [info] initializing consul: key= url=https://:e56d6a0c-6c59-e30e-fc67-3e412f38afb9@consul-fra-3.fly-shared.net/fontsource-6r85yqled4r92pvl/ hostname=50616fcd advertise-url=http://50616fcd.vm.fontsource.internal:20202
2022-11-11T17:26:03.970 app[50616fcd] fra [info] LiteFS mounted to: /litefs/data
2022-11-11T17:26:03.970 app[50616fcd] fra [info] http server listening on: http://localhost:20202
2022-11-11T17:26:03.970 app[50616fcd] fra [info] waiting to connect to cluster
2022-11-11T17:26:03.972 app[50616fcd] fra [info] existing primary found (6414fe27), connecting as replica
2022-11-11T17:26:04.205 app[6414fe27] sjc [info] stream connected
2022-11-11T17:26:04.205 app[6414fe27] sjc [info] transaction file for txid 0000000000000001 no longer available, resetting to snapshot
2022-11-11T17:26:04.205 app[6414fe27] sjc [info] http: error: stream error: db="sqlite.db" err=stream ltx (tx 0): write ltx snapshot file: encode ltx header: commit record required
2022-11-11T17:26:04.205 app[6414fe27] sjc [info] stream disconnected
2022-11-11T17:26:04.279 app[50616fcd] fra [info] replica disconnected, retrying: process ltx stream frame: peek ltx header: unmarshal header: invalid LTX file
2022-11-11T17:26:05.398 app[50616fcd] fra [info] existing primary found (6414fe27), connecting as replica
... (the same connect/error/disconnect cycle repeats at 17:26:05, 17:26:07, and 17:26:08) ...
2022-11-11T17:26:08.383 app[50616fcd] fra [info] replica disconnected, retrying: process ltx stream frame: peek ltx header: unmarshal header: invalid LTX file
... and so on ...
```

_(Note: the logs indicate an older LiteFS version, but the same issue happened with the newer one as well.)_

My setup is very similar to litefs-indie but uses Knex and better-sqlite3. We have two regions, and I also don't have any persistent volumes set up. Is it necessary to have volumes if I don't mind blasting my DB on every deploy?

GH Actions Repository

This has been tested against the latest Docker image flyio/litefs:sha-fabf62d.

Originally posted in https://github.com/superfly/litefs/issues/167#issuecomment-1312060227