superfly / litefs

FUSE-based file system for replicating SQLite databases across a cluster of machines
Apache License 2.0
4.01k stars 96 forks source link

Commit Record Required #173

Closed ayuhito closed 1 year ago

ayuhito commented 1 year ago

I've been running into some errors which boils down to:

http: error: stream error: db="sqlite.db" err=stream ltx (tx 0): write ltx snapshot file: encode ltx header: commit record required
Full Logs ```shell 2022-11-11T17:26:02.733 app[50616fcd] fra [info] Starting init (commit: 81d5330)... 2022-11-11T17:26:02.761 app[50616fcd] fra [info] Preparing to run: `docker-entrypoint.sh litefs -- node ./start.js` as root 2022-11-11T17:26:02.793 app[50616fcd] fra [info] 2022/11/11 17:26:02 listening on [fdaa:0:9732:a7b:86:5061:6fcd:2]:22 (DNS: [fdaa::3]:53) 2022-11-11T17:26:02.841 app[50616fcd] fra [info] LiteFS v0.3.0-beta1, commit=f40cf36930e1d0b259bbbdb72c3fd3a508ff1936 2022-11-11T17:26:02.841 app[50616fcd] fra [info] config file read from /app/litefs.yml 2022-11-11T17:26:02.841 app[50616fcd] fra [info] Using Consul to determine primary 2022-11-11T17:26:03.963 app[50616fcd] fra [info] initializing consul: key= url=https://:e56d6a0c-6c59-e30e-fc67-3e412f38afb9@consul-fra-3.fly-shared.net/fontsource-6r85yqled4r92pvl/ hostname=50616fcd advertise-url=http://50616fcd.vm.fontsource.internal:20202 2022-11-11T17:26:03.970 app[50616fcd] fra [info] LiteFS mounted to: /litefs/data 2022-11-11T17:26:03.970 app[50616fcd] fra [info] http server listening on: http://localhost:20202 2022-11-11T17:26:03.970 app[50616fcd] fra [info] waiting to connect to cluster 2022-11-11T17:26:03.972 app[50616fcd] fra [info] existing primary found (6414fe27), connecting as replica 2022-11-11T17:26:04.205 app[6414fe27] sjc [info] stream connected 2022-11-11T17:26:04.205 app[6414fe27] sjc [info] transaction file for txid 0000000000000001 no longer available, resetting to snapshot 2022-11-11T17:26:04.205 app[6414fe27] sjc [info] http: error: stream error: db="sqlite.db" err=stream ltx (tx 0): write ltx snapshot file: encode ltx header: commit record required 2022-11-11T17:26:04.205 app[6414fe27] sjc [info] stream disconnected 2022-11-11T17:26:04.279 app[50616fcd] fra [info] replica disconnected, retrying: process ltx stream frame: peek ltx header: unmarshal header: invalid LTX file 2022-11-11T17:26:05.398 app[50616fcd] fra [info] existing primary found (6414fe27), connecting as replica 2022-11-11T17:26:05.473 app[6414fe27] sjc [info] stream connected 2022-11-11T17:26:05.473 app[6414fe27] sjc [info] transaction file for txid 0000000000000001 no longer available, resetting to snapshot 2022-11-11T17:26:05.473 app[6414fe27] sjc [info] http: error: stream error: db="sqlite.db" err=stream ltx (tx 0): write ltx snapshot file: encode ltx header: commit record required 2022-11-11T17:26:05.473 app[6414fe27] sjc [info] stream disconnected 2022-11-11T17:26:05.546 app[50616fcd] fra [info] replica disconnected, retrying: process ltx stream frame: peek ltx header: unmarshal header: invalid LTX file 2022-11-11T17:26:07.073 app[50616fcd] fra [info] existing primary found (6414fe27), connecting as replica 2022-11-11T17:26:07.148 app[6414fe27] sjc [info] stream connected 2022-11-11T17:26:07.148 app[6414fe27] sjc [info] transaction file for txid 0000000000000001 no longer available, resetting to snapshot 2022-11-11T17:26:07.148 app[6414fe27] sjc [info] http: error: stream error: db="sqlite.db" err=stream ltx (tx 0): write ltx snapshot file: encode ltx header: commit record required 2022-11-11T17:26:07.148 app[6414fe27] sjc [info] stream disconnected 2022-11-11T17:26:07.221 app[50616fcd] fra [info] replica disconnected, retrying: process ltx stream frame: peek ltx header: unmarshal header: invalid LTX file 2022-11-11T17:26:08.235 app[50616fcd] fra [info] existing primary found (6414fe27), connecting as replica 2022-11-11T17:26:08.310 app[6414fe27] sjc [info] stream connected 2022-11-11T17:26:08.310 app[6414fe27] sjc [info] transaction file for txid 0000000000000001 no longer available, resetting to snapshot 2022-11-11T17:26:08.310 app[6414fe27] sjc [info] http: error: stream error: db="sqlite.db" err=stream ltx (tx 0): write ltx snapshot file: encode ltx header: commit record required 2022-11-11T17:26:08.310 app[6414fe27] sjc [info] stream disconnected 2022-11-11T17:26:08.383 app[50616fcd] fra [info] replica disconnected, retrying: process ltx stream frame: peek ltx header: unmarshal header: invalid LTX file ... and so on... ``` _(note the logs indicate an older LiteFS version but the same issue happened with the newer one anyways)_

My setup is very similar to litefs-indie but uses Knex and better-sqlite3. We have two regions and I also don't have any persistent volumes setup. Is it necessary to have volumes if I don't mind blasting my DB every deploy?

GH Actions Repository

This has been tested against the latest Docker version flyio/litefs:sha-fabf62d.

Originally posted in https://github.com/superfly/litefs/issues/167#issuecomment-1312060227

benbjohnson commented 1 year ago

@ayuhito Thanks for reporting this. There's a window of time where a database file can be created but doesn't have any data yet. It looks like the snapshot functionality isn't handling that correctly. I'll add in a fix for it to skip over databases without data.

benbjohnson commented 1 year ago

My setup is very similar to litefs-indie but uses Knex and better-sqlite3. We have two regions and I also don't have any persistent volumes setup. Is it necessary to have volumes if I don't mind blasting my DB every deploy?

You shouldn't need persistent volumes if you're not worried about losing your data if all nodes go down.

benbjohnson commented 1 year ago

@ayuhito This may have been resolved by #184. However, I also added #185 to improve logging for snapshots so it'll help give a bit more info. If you still have your cluster running, can you try it out please? The Docker image is flyio/litefs:pr-185.

ayuhito commented 1 year ago

@benbjohnson, a new challenger seems to have appeared.

2022-11-15T22:52:17.699 app[031c32e2] sjc [info] replica disconnected with error, retrying: process ltx stream frame: peek ltx header: unmarshal header: invalid LTX file
Full Logs ```shell 2022-11-15T22:52:06.064 runner[031c32e2] sjc [info] Starting instance 2022-11-15T22:52:06.501 runner[031c32e2] sjc [info] Configuring virtual machine 2022-11-15T22:52:06.504 runner[031c32e2] sjc [info] Pulling container image 2022-11-15T22:52:10.189 runner[031c32e2] sjc [info] Unpacking image 2022-11-15T22:52:11.582 runner[031c32e2] sjc [info] Preparing kernel init 2022-11-15T22:52:11.871 runner[031c32e2] sjc [info] Configuring firecracker 2022-11-15T22:52:11.960 runner[031c32e2] sjc [info] Starting virtual machine 2022-11-15T22:52:12.185 app[031c32e2] sjc [info] Starting init (commit: 81d5330)... 2022-11-15T22:52:12.213 app[031c32e2] sjc [info] Preparing to run: `docker-entrypoint.sh litefs -- node ./start.js` as root 2022-11-15T22:52:12.239 app[031c32e2] sjc [info] 2022/11/15 22:52:12 listening on [fdaa:0:9732:a7b:b385:31c:32e2:2]:22 (DNS: [fdaa::3]:53) 2022-11-15T22:52:12.288 app[031c32e2] sjc [info] LiteFS commit=20d0f7e7f27d3e052e10ae08aeaf1fddac84d04f 2022-11-15T22:52:12.288 app[031c32e2] sjc [info] Using Consul to determine primary 2022-11-15T22:52:12.288 app[031c32e2] sjc [info] config file read from /app/litefs.yml 2022-11-15T22:52:15.692 app[031c32e2] sjc [info] initializing consul: key= url=https://:e56d6a0c-6c59-e30e-fc67-3e412f38afb9@consul-fra-3.fly-shared.net/fontsource-6r85yqled4r92pvl/ hostname=031c32e2 advertise-url=http://031c32e2.vm.fontsource.internal:20202 2022-11-15T22:52:15.702 app[031c32e2] sjc [info] LiteFS mounted to: /litefs/data 2022-11-15T22:52:15.702 app[031c32e2] sjc [info] http server listening on: http://localhost:20202 2022-11-15T22:52:15.702 app[031c32e2] sjc [info] waiting to connect to cluster 2022-11-15T22:52:16.006 app[031c32e2] sjc [info] existing primary found (dc1f4bd5), connecting as replica 2022-11-15T22:52:16.241 app[dc1f4bd5] fra [info] stream connected 2022-11-15T22:52:16.241 app[dc1f4bd5] fra [info] transaction file for txid 0000000000000001 no longer available, writing snapshot 2022-11-15T22:52:16.241 app[dc1f4bd5] fra [info] writing snapshot "sqlite.db" @ 0000000000000004 2022-11-15T22:52:16.241 app[dc1f4bd5] fra [info] http: error: stream error: db="sqlite.db" err=stream ltx (0000000000000001): write ltx snapshot file: encode ltx header: commit record required 2022-11-15T22:52:16.241 app[dc1f4bd5] fra [info] stream disconnected 2022-11-15T22:52:16.318 app[031c32e2] sjc [info] replica disconnected with error, retrying: process ltx stream frame: peek ltx header: unmarshal header: invalid LTX file 2022-11-15T22:52:17.548 app[031c32e2] sjc [info] existing primary found (dc1f4bd5), connecting as replica 2022-11-15T22:52:17.623 app[dc1f4bd5] fra [info] stream connected 2022-11-15T22:52:17.623 app[dc1f4bd5] fra [info] transaction file for txid 0000000000000001 no longer available, writing snapshot 2022-11-15T22:52:17.623 app[dc1f4bd5] fra [info] writing snapshot "sqlite.db" @ 0000000000000004 2022-11-15T22:52:17.623 app[dc1f4bd5] fra [info] http: error: stream error: db="sqlite.db" err=stream ltx (0000000000000001): write ltx snapshot file: encode ltx header: commit record required 2022-11-15T22:52:17.623 app[dc1f4bd5] fra [info] stream disconnected 2022-11-15T22:52:17.699 app[031c32e2] sjc [info] replica disconnected with error, retrying: process ltx stream frame: peek ltx header: unmarshal header: invalid LTX file 2022-11-15T22:52:18.857 app[031c32e2] sjc [info] existing primary found (dc1f4bd5), connecting as replica 2022-11-15T22:52:18.932 app[dc1f4bd5] fra [info] stream connected 2022-11-15T22:52:18.932 app[dc1f4bd5] fra [info] transaction file for txid 0000000000000001 no longer available, writing snapshot 2022-11-15T22:52:18.932 app[dc1f4bd5] fra [info] writing snapshot "sqlite.db" @ 0000000000000004 2022-11-15T22:52:18.932 app[dc1f4bd5] fra [info] http: error: stream error: db="sqlite.db" err=stream ltx (0000000000000001): write ltx snapshot file: encode ltx header: commit record required 2022-11-15T22:52:18.933 app[dc1f4bd5] fra [info] stream disconnected 2022-11-15T22:52:19.009 app[031c32e2] sjc [info] replica disconnected with error, retrying: process ltx stream frame: peek ltx header: unmarshal header: invalid LTX file 2022-11-15T22:52:20.166 app[031c32e2] sjc [info] existing primary found (dc1f4bd5), connecting as replica 2022-11-15T22:52:20.240 app[dc1f4bd5] fra [info] stream connected 2022-11-15T22:52:20.240 app[dc1f4bd5] fra [info] transaction file for txid 0000000000000001 no longer available, writing snapshot 2022-11-15T22:52:20.241 app[dc1f4bd5] fra [info] writing snapshot "sqlite.db" @ 0000000000000004 2022-11-15T22:52:20.241 app[dc1f4bd5] fra [info] http: error: stream error: db="sqlite.db" err=stream ltx (0000000000000001): write ltx snapshot file: encode ltx header: commit record required 2022-11-15T22:52:20.241 app[dc1f4bd5] fra [info] stream disconnected 2022-11-15T22:52:20.317 app[031c32e2] sjc [info] replica disconnected with error, retrying: process ltx stream frame: peek ltx header: unmarshal header: invalid LTX file 2022-11-15T22:52:21.792 app[031c32e2] sjc [info] existing primary found (dc1f4bd5), connecting as replica 2022-11-15T22:52:21.866 app[dc1f4bd5] fra [info] stream connected 2022-11-15T22:52:21.867 app[dc1f4bd5] fra [info] transaction file for txid 0000000000000001 no longer available, writing snapshot 2022-11-15T22:52:21.867 app[dc1f4bd5] fra [info] writing snapshot "sqlite.db" @ 0000000000000004 2022-11-15T22:52:21.867 app[dc1f4bd5] fra [info] http: error: stream error: db="sqlite.db" err=stream ltx (0000000000000001): write ltx snapshot file: encode ltx header: commit record required 2022-11-15T22:52:21.867 app[dc1f4bd5] fra [info] stream disconnected 2022-11-15T22:52:21.944 app[031c32e2] sjc [info] replica disconnected with error, retrying: process ltx stream frame: peek ltx header: unmarshal header: invalid LTX file 2022-11-15T22:52:23.336 app[031c32e2] sjc [info] existing primary found (dc1f4bd5), connecting as replica 2022-11-15T22:52:23.410 app[dc1f4bd5] fra [info] stream connected 2022-11-15T22:52:23.410 app[dc1f4bd5] fra [info] transaction file for txid 0000000000000001 no longer available, writing snapshot 2022-11-15T22:52:23.410 app[dc1f4bd5] fra [info] writing snapshot "sqlite.db" @ 0000000000000004 2022-11-15T22:52:23.411 app[dc1f4bd5] fra [info] http: error: stream error: db="sqlite.db" err=stream ltx (0000000000000001): write ltx snapshot file: encode ltx header: commit record required 2022-11-15T22:52:23.411 app[dc1f4bd5] fra [info] stream disconnected 2022-11-15T22:52:23.487 app[031c32e2] sjc [info] replica disconnected with error, retrying: process ltx stream frame: peek ltx header: unmarshal header: invalid LTX file 2022-11-15T22:52:24.685 app[031c32e2] sjc [info] existing primary found (dc1f4bd5), connecting as replica 2022-11-15T22:52:24.759 app[dc1f4bd5] fra [info] stream connected 2022-11-15T22:52:24.759 app[dc1f4bd5] fra [info] writing snapshot "sqlite.db" @ 0000000000000004 nger available, writing snapshot 2022-11-15T22:52:24.759 app[dc1f4bd5] fra [info] writing snapshot "sqlite.db" @ 0000000000000004 2022-11-15T22:52:24.760 app[dc1f4bd5] fra [info] http: error: stream error: db="sqlite.db" err=stream ltx (0000000000000001): write ltx snapshot file: encode ltx header: commit record required 2022-11-15T22:52:24.760 app[dc1f4bd5] fra [info] stream disconnected 2022-11-15T22:52:24.836 app[031c32e2] sjc [info] replica disconnected with error, retrying: process ltx stream frame: peek ltx header: unmarshal header: invalid LTX file ```

My first deployment returned with the commit record error, but that was because of the rolling restart behaviour since I'm running two instances. One of them was still running the old version. Do you suggest something like the bluegreen deployment strategy to work with LiteFS in an isolated manner for the DBs? Or will new versions try to interact with old VMs? I can imagine this being an issue when migrations are deployed and a VM falls out of sync in deployment causing application errors and possibly health check fails.

After a full restart, things were fine for a short while until I got these invalid LTX file errors when the second instance tried to deploy.

Thanks for your work! Really appreciate it!

benbjohnson commented 1 year ago

@ayuhito I believe the invalid LTX file error was in your original post as well. That's caused because the server is disconnecting because of the commit record required error. I'll do some more testing on my side to try to reproduce. If you have an example app that I can try, that would help too.

I tried one other possible fix (https://github.com/superfly/litefs/pull/187). Can you give this a whirl? I'll backfill some tests if this actually works. :)

The Docker image is: flyio/litefs:pr-187

ayuhito commented 1 year ago

@benbjohnson, ah I'm extremely tired right now so I missed the repeated error in the logs. Apologies :sweat_smile:

Full Logs ```shell 2022-11-16T00:00:49.147 app[b871e8d6] fra [info] Starting init (commit: 81d5330)... 2022-11-16T00:00:49.173 app[b871e8d6] fra [info] Preparing to run: `docker-entrypoint.sh litefs -- node ./start.js` as root 2022-11-16T00:00:49.201 app[b871e8d6] fra [info] 2022/11/16 00:00:49 listening on [fdaa:0:9732:a7b:8a:b871:e8d6:2]:22 (DNS: [fdaa::3]:53) 2022-11-16T00:00:49.254 app[b871e8d6] fra [info] LiteFS commit=711005b51b46f17caec55a4af60459b53b52d0c7 2022-11-16T00:00:49.254 app[b871e8d6] fra [info] config file read from /app/litefs.yml 2022-11-16T00:00:49.254 app[b871e8d6] fra [info] Using Consul to determine primary 2022-11-16T00:00:51.571 app[b871e8d6] fra [info] initializing consul: key= url=https://:e56d6a0c-6c59-e30e-fc67-3e412f38afb9@consul-fra-3.fly-shared.net/fontsource-6r85yqled4r92pvl/ hostname=b871e8d6 advertise-url=http://b871e8d6.vm.fontsource.internal:20202 2022-11-16T00:00:51.580 app[b871e8d6] fra [info] LiteFS mounted to: /litefs/data 2022-11-16T00:00:51.580 app[b871e8d6] fra [info] http server listening on: http://localhost:20202 2022-11-16T00:00:51.580 app[b871e8d6] fra [info] waiting to connect to cluster 2022-11-16T00:00:52.155 app[b871e8d6] fra [info] existing primary found (75c0f15a), connecting as replica 2022-11-16T00:00:52.166 app[75c0f15a] fra [info] stream connected 2022-11-16T00:00:52.166 app[75c0f15a] fra [info] transaction file for txid 0000000000000001 no longer available, writing snapshot 2022-11-16T00:00:52.166 app[75c0f15a] fra [info] writing snapshot "sqlite.db" @ 0000000000000008 2022-11-16T00:00:52.174 app[b871e8d6] fra [info] snapshot received for "sqlite.db", removing other ltx files: 0000000000000001-0000000000000008.ltx 2022-11-16T00:00:52.175 app[b871e8d6] fra [info] connected to cluster, ready 2022-11-16T00:00:52.175 app[b871e8d6] fra [info] starting subprocess: node [./start.js] 2022-11-16T00:00:52.179 app[b871e8d6] fra [info] waiting for signal or subprocess to exit 2022-11-16T00:00:52.413 app[b871e8d6] fra [info] Found primary instance in .primary file: 75c0f15a 2022-11-16T00:00:52.414 app[b871e8d6] fra [info] Instance (b871e8d6) in fra is not primary (the primary instance is 75c0f15a). Skipping migrations. 2022-11-16T00:00:52.414 app[b871e8d6] fra [info] Starting app... 2022-11-16T00:00:55.143 app[b871e8d6] fra [info] Remix App Server started at http://localhost:3000 (http://172.19.17.90:3000) 2022-11-16T00:01:35.731 runner[81db970b] sjc [info] Shutting down virtual machine 2022-11-16T00:01:35.812 app[81db970b] sjc [info] Sending signal SIGINT to main child process w/ PID 520 2022-11-16T00:01:35.814 app[81db970b] sjc [info] sending signal to exec process 2022-11-16T00:01:35.814 app[81db970b] sjc [info] waiting for exec process to close 2022-11-16T00:01:35.814 app[81db970b] sjc [info] signal received, litefs shutting down 2022-11-16T00:01:35.816 app[81db970b] sjc [info] replica disconnected with error, retrying: next frame: context canceled 2022-11-16T00:01:35.816 app[81db970b] sjc [info] litefs shut down complete 2022-11-16T00:01:35.878 runner[75c0f15a] fra [info] Shutting down virtual machine 2022-11-16T00:01:35.890 app[75c0f15a] fra [info] stream disconnected 2022-11-16T00:01:35.982 app[75c0f15a] fra [info] Sending signal SIGINT to main child process w/ PID 521 2022-11-16T00:01:35.985 app[b871e8d6] fra [info] replica disconnected, retrying 2022-11-16T00:01:35.984 app[75c0f15a] fra [info] sending signal to exec process 2022-11-16T00:01:35.984 app[75c0f15a] fra [info] waiting for exec process to close 2022-11-16T00:01:35.984 app[75c0f15a] fra [info] signal received, litefs shutting down 2022-11-16T00:01:35.984 app[75c0f15a] fra [info] stream disconnected 2022-11-16T00:01:35.985 app[75c0f15a] fra [info] exiting primary, destroying lease 2022-11-16T00:01:36.457 app[75c0f15a] fra [info] exit status 1: fusermount: failed to unmount /litefs/data: Device or resource busy 2022-11-16T00:01:36.638 app[81db970b] sjc [info] Starting clean up. 2022-11-16T00:01:37.407 app[75c0f15a] fra [info] Starting clean up. 2022-11-16T00:01:38.169 app[b871e8d6] fra [info] primary lease acquired, advertising as http://b871e8d6.vm.fontsource.internal:20202 2022-11-16T00:01:39.247 runner[dbc7740c] sjc [info] Starting instance 2022-11-16T00:01:39.577 runner[dbc7740c] sjc [info] Configuring virtual machine 2022-11-16T00:01:39.579 runner[dbc7740c] sjc [info] Pulling container image 2022-11-16T00:01:45.398 runner[dbc7740c] sjc [info] Unpacking image 2022-11-16T00:01:46.792 runner[dbc7740c] sjc [info] Preparing kernel init 2022-11-16T00:01:47.105 runner[dbc7740c] sjc [info] Configuring firecracker 2022-11-16T00:01:47.300 runner[dbc7740c] sjc [info] Starting virtual machine 2022-11-16T00:01:47.539 app[dbc7740c] sjc [info] Starting init (commit: 81d5330)... 2022-11-16T00:01:47.567 app[dbc7740c] sjc [info] Preparing to run: `docker-entrypoint.sh litefs -- node ./start.js` as root 2022-11-16T00:01:47.593 app[dbc7740c] sjc [info] 2022/11/16 00:01:47 listening on [fdaa:0:9732:a7b:b385:dbc7:740c:2]:22 (DNS: [fdaa::3]:53) 2022-11-16T00:01:47.644 app[dbc7740c] sjc [info] config file read from /app/litefs.yml 2022-11-16T00:01:47.644 app[dbc7740c] sjc [info] LiteFS commit=711005b51b46f17caec55a4af60459b53b52d0c7 2022-11-16T00:01:47.644 app[dbc7740c] sjc [info] Using Consul to determine primary 2022-11-16T00:01:49.723 app[dbc7740c] sjc [info] initializing consul: key= url=https://:e56d6a0c-6c59-e30e-fc67-3e412f38afb9@consul-fra-3.fly-shared.net/fontsource-6r85yqled4r92pvl/ hostname=dbc7740c advertise-url=http://dbc7740c.vm.fontsource.internal:20202 2022-11-16T00:01:49.730 app[dbc7740c] sjc [info] LiteFS mounted to: /litefs/data 2022-11-16T00:01:49.730 app[dbc7740c] sjc [info] http server listening on: http://localhost:20202 2022-11-16T00:01:49.730 app[dbc7740c] sjc [info] waiting to connect to cluster 2022-11-16T00:01:50.045 app[dbc7740c] sjc [info] existing primary found (b871e8d6), connecting as replica 2022-11-16T00:01:50.284 app[b871e8d6] fra [info] stream connected 2022-11-16T00:01:50.284 app[b871e8d6] fra [info] transaction file for txid 0000000000000001 no longer available, writing snapshot 2022-11-16T00:01:50.284 app[b871e8d6] fra [info] writing snapshot "sqlite.db" @ 0000000000000008 2022-11-16T00:01:50.284 app[b871e8d6] fra [info] http: error: stream error: db="sqlite.db" err=stream ltx (0000000000000001): write ltx snapshot file: encode ltx header: commit record required 2022-11-16T00:01:50.284 app[b871e8d6] fra [info] stream disconnected 2022-11-16T00:01:50.364 app[dbc7740c] sjc [info] replica disconnected with error, retrying: process ltx stream frame: peek ltx header: unmarshal header: invalid LTX file 2022-11-16T00:01:52.467 app[dbc7740c] sjc [info] existing primary found (b871e8d6), connecting as replica 2022-11-16T00:01:52.542 app[b871e8d6] fra [info] stream connected 2022-11-16T00:01:52.542 app[b871e8d6] fra [info] transaction file for txid 0000000000000001 no longer available, writing snapshot 2022-11-16T00:01:52.542 app[b871e8d6] fra [info] writing snapshot "sqlite.db" @ 0000000000000008 2022-11-16T00:01:52.542 app[b871e8d6] fra [info] http: error: stream error: db="sqlite.db" err=stream ltx (0000000000000001): write ltx snapshot file: encode ltx header: commit record required 2022-11-16T00:01:52.542 app[b871e8d6] fra [info] stream disconnected 2022-11-16T00:01:52.621 app[dbc7740c] sjc [info] replica disconnected with error, retrying: process ltx stream frame: peek ltx header: unmarshal header: invalid LTX file 2022-11-16T00:01:53.861 app[dbc7740c] sjc [info] existing primary found (b871e8d6), connecting as replica 2022-11-16T00:01:53.936 app[b871e8d6] fra [info] stream connected 2022-11-16T00:01:53.936 app[b871e8d6] fra [info] transaction file for txid 0000000000000001 no longer available, writing snapshot 2022-11-16T00:01:53.936 app[b871e8d6] fra [info] writing snapshot "sqlite.db" @ 0000000000000008 2022-11-16T00:01:53.937 app[b871e8d6] fra [info] http: error: stream error: db="sqlite.db" err=stream ltx (0000000000000001): write ltx snapshot file: encode ltx header: commit record required 2022-11-16T00:01:53.937 app[b871e8d6] fra [info] stream disconnected 2022-11-16T00:01:54.016 app[dbc7740c] sjc [info] replica disconnected with error, retrying: process ltx stream frame: peek ltx header: unmarshal header: invalid LTX file 2022-11-16T00:01:55.333 app[dbc7740c] sjc [info] existing primary found (b871e8d6), connecting as replica 2022-11-16T00:01:55.407 app[b871e8d6] fra [info] stream connected 2022-11-16T00:01:55.407 app[b871e8d6] fra [info] transaction file for txid 0000000000000001 no longer available, writing snapshot 2022-11-16T00:01:55.407 app[b871e8d6] fra [info] writing snapshot "sqlite.db" @ 0000000000000008 2022-11-16T00:01:55.407 app[b871e8d6] fra [info] http: error: stream error: db="sqlite.db" err=stream ltx (0000000000000001): write ltx snapshot file: encode ltx header: commit record required 2022-11-16T00:01:55.407 app[b871e8d6] fra [info] stream disconnected 2022-11-16T00:01:55.486 app[dbc7740c] sjc [info] replica disconnected with error, retrying: process ltx stream frame: peek ltx header: unmarshal header: invalid LTX file 2022-11-16T00:01:56.639 app[dbc7740c] sjc [info] existing primary found (b871e8d6), connecting as replica 2022-11-16T00:01:56.713 app[b871e8d6] fra [info] stream connected 2022-11-16T00:01:56.713 app[b871e8d6] fra [info] transaction file for txid 0000000000000001 no longer available, writing snapshot 2022-11-16T00:01:56.713 app[b871e8d6] fra [info] writing snapshot "sqlite.db" @ 0000000000000008 2022-11-16T00:01:56.713 app[b871e8d6] fra [info] http: error: stream error: db="sqlite.db" err=stream ltx (0000000000000001): write ltx snapshot file: encode ltx header: commit record required 2022-11-16T00:01:56.713 app[b871e8d6] fra [info] stream disconnected 2022-11-16T00:01:56.793 app[dbc7740c] sjc [info] replica disconnected with error, retrying: process ltx stream frame: peek ltx header: unmarshal header: invalid LTX file 2022-11-16T00:01:58.099 app[dbc7740c] sjc [info] existing primary found (b871e8d6), connecting as replica 2022-11-16T00:01:58.174 app[b871e8d6] fra [info] stream connected 2022-11-16T00:01:58.174 app[b871e8d6] fra [info] transaction file for txid 0000000000000001 no longer available, writing snapshot 2022-11-16T00:01:58.174 app[b871e8d6] fra [info] writing snapshot "sqlite.db" @ 0000000000000008 2022-11-16T00:01:58.174 app[b871e8d6] fra [info] http: error: stream error: db="sqlite.db" err=stream ltx (0000000000000001): write ltx snapshot file: encode ltx header: commit record required 2022-11-16T00:01:58.174 app[b871e8d6] fra [info] stream disconnected 2022-11-16T00:01:58.254 app[dbc7740c] sjc [info] replica disconnected with error, retrying: process ltx stream frame: peek ltx header: unmarshal header: invalid LTX file 2022-11-16T00:01:59.573 app[dbc7740c] sjc [info] existing primary found (b871e8d6), connecting as replica 2022-11-16T00:01:59.648 app[b871e8d6] fra [info] stream connected 2022-11-16T00:01:59.648 app[b871e8d6] fra [info] transaction file for txid 0000000000000001 no longer available, writing snapshot 2022-11-16T00:01:59.648 app[b871e8d6] fra [info] writing snapshot "sqlite.db" @ 0000000000000008 2022-11-16T00:01:59.648 app[b871e8d6] fra [info] http: error: stream error: db="sqlite.db" err=stream ltx (0000000000000001): write ltx snapshot file: encode ltx header: commit record required 2022-11-16T00:01:59.648 app[b871e8d6] fra [info] stream disconnected 2022-11-16T00:01:59.727 app[dbc7740c] sjc [info] replica disconnected with error, retrying: process ltx stream frame: peek ltx header: unmarshal header: invalid LTX file 2022-11-16T00:02:01.015 app[dbc7740c] sjc [info] existing primary found (b871e8d6), connecting as replica 2022-11-16T00:02:01.090 app[b871e8d6] fra [info] stream connected 2022-11-16T00:02:01.090 app[b871e8d6] fra [info] transaction file for txid 0000000000000001 no longer available, writing snapshot 2022-11-16T00:02:01.090 app[b871e8d6] fra [info] writing snapshot "sqlite.db" @ 0000000000000008 2022-11-16T00:02:01.090 app[b871e8d6] fra [info] http: error: stream error: db="sqlite.db" err=stream ltx (0000000000000001): write ltx snapshot file: encode ltx header: commit record required 2022-11-16T00:02:01.090 app[b871e8d6] fra [info] stream disconnected 2022-11-16T00:02:01.169 app[dbc7740c] sjc [info] replica disconnected with error, retrying: process ltx stream frame: peek ltx header: unmarshal header: invalid LTX file 2022-11-16T00:02:02.325 app[dbc7740c] sjc [info] existing primary found (b871e8d6), connecting as replica 2022-11-16T00:02:02.400 app[b871e8d6] fra [info] stream connected 2022-11-16T00:02:02.400 app[b871e8d6] fra [info] transaction file for txid 0000000000000001 no longer available, writing snapshot 2022-11-16T00:02:02.400 app[b871e8d6] fra [info] writing snapshot "sqlite.db" @ 0000000000000008 2022-11-16T00:02:02.400 app[b871e8d6] fra [info] http: error: stream error: db="sqlite.db" err=stream ltx (0000000000000001): write ltx snapshot file: encode ltx header: commit record required 2022-11-16T00:02:02.400 app[b871e8d6] fra [info] stream disconnected 2022-11-16T00:02:02.480 app[dbc7740c] sjc [info] replica disconnected with error, retrying: process ltx stream frame: peek ltx header: unmarshal header: invalid LTX file 2022-11-16T00:02:03.726 app[dbc7740c] sjc [info] existing primary found (b871e8d6), connecting as replica 2022-11-16T00:02:03.801 app[b871e8d6] fra [info] stream connected 2022-11-16T00:02:03.801 app[b871e8d6] fra [info] transaction file for txid 0000000000000001 no longer available, writing snapshot 2022-11-16T00:02:03.802 app[b871e8d6] fra [info] writing snapshot "sqlite.db" @ 0000000000000008 2022-11-16T00:02:03.802 app[b871e8d6] fra [info] http: error: stream error: db="sqlite.db" err=stream ltx (0000000000000001): write ltx snapshot file: encode ltx header: commit record required 2022-11-16T00:02:03.802 app[b871e8d6] fra [info] stream disconnected 2022-11-16T00:02:03.882 app[dbc7740c] sjc [info] replica disconnected with error, retrying: process ltx stream frame: peek ltx header: unmarshal header: invalid LTX file 2022-11-16T00:02:05.080 app[dbc7740c] sjc [info] existing primary found (b871e8d6), connecting as replica 2022-11-16T00:02:05.154 app[b871e8d6] fra [info] stream connected 2022-11-16T00:02:05.154 app[b871e8d6] fra [info] transaction file for txid 0000000000000001 no longer available, writing snapshot 2022-11-16T00:02:05.155 app[b871e8d6] fra [info] writing snapshot "sqlite.db" @ 0000000000000008 2022-11-16T00:02:05.155 app[b871e8d6] fra [info] http: error: stream error: db="sqlite.db" err=stream ltx (0000000000000001): write ltx snapshot file: encode ltx header: commit record required 2022-11-16T00:02:05.155 app[b871e8d6] fra [info] stream disconnected 2022-11-16T00:02:05.234 app[dbc7740c] sjc [info] replica disconnected with error, retrying: process ltx stream frame: peek ltx header: unmarshal header: invalid LTX file ```

Unfortunately, #187 didn't work. Looking at the logs, the first instance deployed fine but things broke deploying the second instance.

I'll try to get a minimal reproduction for you forking litefs-indie and strip out Prisma for knex for simplicity, however, I probably can't get it to you today, unfortunately.

benbjohnson commented 1 year ago

@ayuhito No problem! Thanks for all the help trying to debug this issue.

benbjohnson commented 1 year ago

@ayuhito I was able to reproduce on my side. I'll get it fixed up. 👍

benbjohnson commented 1 year ago

@ayuhito Ok, it looks like this PR fixes the issue. The database size is kept in-memory but it looks like I missed a spot where it should update when receiving an LTX file from a primary. Then when a deploy happens without a volume then the old replica becomes primary and the new replica tries to snapshot (since it doesn't have a persistent volume) and it hits this missing page count issue.

I've done a bunch of deploys and I haven't hit it again. I'm going to close the issue for now but let me know if you still experience the issue. Thanks again for all your help!!

ayuhito commented 1 year ago

@benbjohnson, it works great! Thank you so much for your time!