superfly / litefs

FUSE-based file system for replicating SQLite databases across a cluster of machines
Apache License 2.0

Getting non-matching LTX checksum on fresh volume #134

Closed kentcdodds closed 1 year ago

kentcdodds commented 1 year ago

https://github.com/kentcdodds/kentcdodds.com/actions/runs/3316512422/jobs/5478478215

cannot open store: open databases: open database("sqlite.db"): verify database file: database checksum (e3d3906d74cc0273) does not match latest LTX checksum (0000000000000000)

This volume is brand new and completely empty. @benbjohnson said this is a bug that needs fixing and asked me to open this issue. More context at https://www.youtube.com/watch?v=vTNPJGKqsYQ

Thanks!

benbjohnson commented 1 year ago

@kentcdodds Thanks for writing this up. I realized I had an old version (pr-109) still on the litefs-example which it looks like you have in your Dockerfile as well. Sorry about that. Can you try changing this line here to:

FROM flyio/litefs:0.2 AS litefs

I'm surprised to see that error on a brand new volume as it happens when LiteFS is validating the existing database state. Can you retry with the new litefs version and let me know if you still have the same issue?

kentcdodds commented 1 year ago

Thanks! I've still got the same issue: https://github.com/kentcdodds/kentcdodds.com/actions/runs/3322212924/jobs/5491067040

kentcdodds commented 1 year ago

I'm a bit stuck on deploying LiteFS until this is resolved. Any ideas?

benbjohnson commented 1 year ago

@kentcdodds The error is strange because it is essentially saying that the database state on disk exists (checksum e3d3906d74cc0273) but there's no associated replication data (checksum 0000000000000000). However, you're seeing that error even when you deploy with a clean volume so there shouldn't be any database state.

2022-10-25T15:35:44Z   [info]cannot open store: open databases: open database("sqlite.db"): verify database file: database checksum (e3d3906d74cc0273) does not match latest LTX checksum (0000000000000000)

Can you try removing the volumes on your staging setup, redeploying, and seeing whether you still get the same error?
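To tell the case described above (a database file on disk but no replication data) apart from a genuine mismatch, both checksums can be pulled out of the log line. This is a minimal shell sketch that assumes the error text has exactly the format quoted in this issue; the function name `classify_checksum_error` and its output labels are made up for illustration and are not part of LiteFS.

```sh
# Hypothetical helper: classify a LiteFS "checksum does not match" log line.
# Assumes the message format quoted in this issue.
classify_checksum_error() {
  line=$1
  # Extract the 16-hex-digit checksums from the two parenthesized fields.
  db_sum=$(printf '%s\n' "$line" | sed -n 's/.*database checksum (\([0-9a-f]\{16\}\)).*/\1/p')
  ltx_sum=$(printf '%s\n' "$line" | sed -n 's/.*latest LTX checksum (\([0-9a-f]\{16\}\)).*/\1/p')

  if [ -z "$db_sum" ] || [ -z "$ltx_sum" ]; then
    echo "unrecognized"
  elif [ "$ltx_sum" = "0000000000000000" ]; then
    # Database state exists but there is no associated replication data.
    echo "no-ltx-data"
  else
    # Both sides have state, and they disagree.
    echo "checksum-mismatch"
  fi
}
```

Passing the log line from this comment should report the all-zero case, which points at something having written the database file outside of LiteFS.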

kentcdodds commented 1 year ago

I think I've figured out what's going on. When I create the new volume, my old (pre-LiteFS) app restarts and applies migrations to the new db in the volume, which is what causes this issue.

What I'm trying now is to deploy a version of my app that does not do anything to the database so then I can have that one running when I recreate the volume, and then deploy the litefs version. Will let you know what happens.
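A quick way to confirm that nothing has written to a volume before the first LiteFS deploy is to check that its mount directory is empty. The sketch below is only an illustration: `volume_is_empty` is a hypothetical name, the path is whatever your volume is mounted at, and ignoring `lost+found` assumes the volume is an ext-style filesystem that creates that directory on its own.

```sh
# Hypothetical pre-deploy guard: succeeds only if the directory has no entries.
# Returns 2 if the path is not a directory at all.
volume_is_empty() {
  dir=$1
  [ -d "$dir" ] || return 2
  # List everything (including dotfiles) except the filesystem's own lost+found.
  entries=$(ls -A "$dir" | grep -v -x 'lost+found' || true)
  [ -z "$entries" ]
}
```

For example, `volume_is_empty /data || echo "volume already has files"` would flag a database file left behind by an older app version (`/data` is a placeholder path).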

benbjohnson commented 1 year ago

Ok, cool. Thanks for digging into it more. I also created an issue for keeping litefs running on error so it's easier to ssh in and debug the state. https://github.com/superfly/litefs/issues/136

benbjohnson commented 1 year ago

I pushed up a PR for it so it's available at pr-137 in Docker now. That'll keep litefs running even if it hits some kind of error on startup so the fly instance will be accessible via ssh.

kentcdodds commented 1 year ago

Good news! It's running now.

Now I'm going to try to create more regions. It just occurred to me that I'll need to create volumes for the regions first, right? If I try to deploy my app to a region without a persistent volume, things will break, right?

kentcdodds commented 1 year ago

Interestingly, I added a volume to maa, and then tried to add a region there and got this error message:

Error App 'kcd-staging' uses volumes to control regions. Add or remove volumes to change region placement.

So I just scaled up to a count of 2 and maa started right up! Wahoo! Thanks a ton for the help!

Now I just need to figure out how to determine the primary region via that .primary file and then I think I should be ready to go with this to prod!
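For the `.primary` lookup, here is a minimal shell sketch. It rests on an assumption that is not confirmed anywhere in this thread: that the `.primary` file sits in the LiteFS mount directory, exists only on replicas, and contains the primary node's hostname, so its absence means the current node is the primary. The function name `litefs_primary` and the mount path argument are placeholders.

```sh
# Hypothetical helper: report the current LiteFS primary.
# Assumption: <mount_dir>/.primary exists only on replicas and holds the
# primary's hostname; if it is missing, this node is treated as the primary.
litefs_primary() {
  mount_dir=$1
  if [ -f "$mount_dir/.primary" ]; then
    # Replica: print the primary's hostname as recorded in the file.
    cat "$mount_dir/.primary"
    return 0
  fi
  # No .primary file: print nothing and signal "this node is the primary".
  return 1
}
```

Usage might look like `if p=$(litefs_primary /litefs); then echo "replica; primary is $p"; else echo "this node is the primary"; fi`, where `/litefs` stands in for your mount directory. Note that the file would give a hostname, not a region name, so mapping it to a region is a separate step.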

kentcdodds commented 1 year ago

I got this again on a new deploy of the app:

2022-10-25T22:33:59.223 app[18d0f7c2] den [info] ERROR: cannot open store: open databases: open database("sqlite.db"): verify database file: database checksum (f13013272ddb586c) does not match latest LTX checksum (da9624ecbb43ad42)

I'm not sure what I'm doing wrong :(

kentcdodds commented 1 year ago

Here's the failed build, not sure how useful it'll be: https://github.com/kentcdodds/kentcdodds.com/actions/runs/3324709637/jobs/5496681848

benbjohnson commented 1 year ago

@kentcdodds This is a known bug that can occur on restart with the rollback journal. I have a fix for this one. We should have a v0.3.0 release coming early next week that will have WAL support and stability fixes in it.

kentcdodds commented 1 year ago

In case it's helpful, I tried the SHA release of litefs just now and got the same error:

https://github.com/kentcdodds/kentcdodds.com/actions/runs/3351908722/jobs/5553715931#step:6:78

benbjohnson commented 1 year ago

@kentcdodds Thanks for trying it. Is this running on a clean volume or the existing one?

kentcdodds commented 1 year ago

Existing one

benbjohnson commented 1 year ago

I added a possible fix for this with https://github.com/superfly/litefs/pull/157. Although, depending on the exact nature of the issue, https://github.com/superfly/litefs/pull/158 could help too. It's hard to say without looking at the data files in the LiteFS directory.

This may resolve the issue on the existing volume but if it's a bug that was resolved by https://github.com/superfly/litefs/pull/158 then you'll need to wipe the volume and start with a clean database.

I'm going to close this for now but please reopen if you hit the issue again. Thanks, @kentcdodds!

AlexBlokh commented 1 year ago

Well, my production instance is now dead after being up for 7 months, seemingly due to this issue.

I have 1 container and 1 volume

AlexBlokh commented 1 year ago

I'm also unable to connect to the instance; it's in a pending state.

benbjohnson commented 1 year ago

@AlexBlokh I'm sorry to hear that. Do you know what version of LiteFS you were running?

Also, you may be able to recover your underlying database. If you copy out the database file and WAL file to a different directory using SQLite's standard names, you can open it in SQLite and then run an integrity check to ensure it's valid:

# Replace $LITEFS_DATA_DIR & $DBNAME with your appropriate values.
# You only need to copy the "wal" file if you're using WAL mode and if the file exists.
$ cp $LITEFS_DATA_DIR/dbs/$DBNAME/database /tmp/db
$ cp $LITEFS_DATA_DIR/dbs/$DBNAME/wal /tmp/db-wal

# Open using SQLite & run an integrity check.
$ sqlite3 /tmp/db
sqlite> PRAGMA integrity_check;

If it returns ok then the underlying database is valid. If it returns errors then you'll need to recover from a backup.
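The same steps can be wrapped into two small shell functions so the WAL copy is skipped automatically when the file is absent. This is only a sketch of the procedure above: the function names `copy_litefs_db` and `check_db` are hypothetical, the directory layout (`dbs/$DBNAME/database` and `dbs/$DBNAME/wal`) is taken from the commands above, and `check_db` assumes the `sqlite3` CLI is installed.

```sh
# Copy the LiteFS-internal files to SQLite's standard file names.
# Usage: copy_litefs_db DATA_DIR DBNAME DEST_DIR
copy_litefs_db() {
  src="$1/dbs/$2"
  dest=$3
  mkdir -p "$dest" || return 1
  cp "$src/database" "$dest/db" || return 1
  # The "wal" file only exists in WAL mode, so copy it only when present.
  if [ -f "$src/wal" ]; then
    cp "$src/wal" "$dest/db-wal" || return 1
  fi
}

# Run SQLite's integrity check on the copy.
# Succeeds only if the check prints exactly "ok".
check_db() {
  result=$(sqlite3 "$1" 'PRAGMA integrity_check;') || return 1
  [ "$result" = "ok" ]
}
```

For example: `copy_litefs_db "$LITEFS_DATA_DIR" "$DBNAME" /tmp/recovery && check_db /tmp/recovery/db && echo "database is valid"`. Working on a copy keeps the original files untouched while you inspect them.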