rails / solid_queue

Database-backed Active Job backend
MIT License
1.77k stars 103 forks source link

SQLite queue database corruption #324

Open andycroll opened 2 weeks ago

andycroll commented 2 weeks ago

I'm seeing a strange error. I dropped @fractaledmind a note and he suggested I post here.

W, [2024-09-05T18:49:32.074281 #809854]  WARN -- : SolidQueue-0.3.4 Error deregistering Supervisor (1.1ms)  process_id: 68, pid: 564289, hostname: "HOST", last_heartbeat_at: Sat, 31 Aug 2024 15:16:55.474066000 UTC +00:00, claimed_size: 0, pruned: {:process=>#<SolidQueue::Process id: 68>, :pruned=>true, :claimed_size=>0, :error=>#<ActiveRecord::StatementInvalid: SQLite3::CorruptException: database disk image is malformed>, :exception=>["ActiveRecord::StatementInvalid", "SQLite3::CorruptException: database disk image is malformed"], :exception_object=>#<ActiveRecord::StatementInvalid: SQLite3::CorruptException: database disk image is malformed>}, error: "ActiveRecord::StatementInvalid SQLite3::CorruptException: database disk image is malformed"

In sqlite3 client.

sqlite> .open production-queue.sqlite3
sqlite> pragma integrity_check;
*** in database main ***
On tree page 523 cell 23: Rowid 146 out of order
NULL value in solid_queue_jobs.created_at
NULL value in solid_queue_jobs.updated_at
row 146 missing from index index_solid_queue_jobs_for_alerting
row 146 missing from index index_solid_queue_jobs_for_filtering
row 146 missing from index index_solid_queue_jobs_on_finished_at
row 146 missing from index index_solid_queue_jobs_on_active_job_id
row 146 missing from index index_solid_queue_jobs_on_class_name
NULL value in solid_queue_jobs.created_at
NULL value in solid_queue_jobs.updated_at
row 147 missing from index index_solid_queue_jobs_for_alerting
row 147 missing from index index_solid_queue_jobs_for_filtering
row 147 missing from index index_solid_queue_jobs_on_finished_at
row 147 missing from index index_solid_queue_jobs_on_active_job_id
row 147 missing from index index_solid_queue_jobs_on_class_name
NULL value in solid_queue_jobs.created_at
NULL value in solid_queue_jobs.updated_at
row 148 missing from index index_solid_queue_jobs_for_alerting
row 148 missing from index index_solid_queue_jobs_for_filtering
row 148 missing from index index_solid_queue_jobs_on_finished_at
row 148 missing from index index_solid_queue_jobs_on_active_job_id
row 148 missing from index index_solid_queue_jobs_on_class_name
NULL value in solid_queue_jobs.created_at
NULL value in solid_queue_jobs.updated_at
row 149 missing from index index_solid_queue_jobs_for_alerting
row 149 missing from index index_solid_queue_jobs_for_filtering
row 149 missing from index index_solid_queue_jobs_on_finished_at
row 149 missing from index index_solid_queue_jobs_on_active_job_id
row 149 missing from index index_solid_queue_jobs_on_class_name
Error: stepping, database disk image is malformed (11)
sqlite> pragma integrity_check;
*** in database main ***
On tree page 523 cell 23: Rowid 146 out of order
NULL value in solid_queue_jobs.created_at
NULL value in solid_queue_jobs.updated_at
row 146 missing from index index_solid_queue_jobs_for_alerting
row 146 missing from index index_solid_queue_jobs_for_filtering
row 146 missing from index index_solid_queue_jobs_on_finished_at
row 146 missing from index index_solid_queue_jobs_on_active_job_id
row 146 missing from index index_solid_queue_jobs_on_class_name
Error: stepping, database disk image is malformed (11)

Doing a .clone doesn't seem to fix.

I'm not in "real" production and this is a job queue and thus somewhat destroyable, so I'm going to take that approach.

I've also seen similar corruption on my local development environment too akin to that mentioned by @marcoroth on X.

https://x.com/marcoroth_/status/1831045885423497223

In that case I just destroyed and recreated the file and moved on.

alxvernier commented 1 week ago

@rosa yes, and now I can't reproduce it 🫠

rosa commented 1 week ago

Complete guess, but I wonder if the WAL is corrupted and hasn't been checkpointed back to main db file yet. Maybe we should try manually check pointing to get everything in the main db file, then run the integrity check πŸ€·β€β™‚οΈ

Looks like @npezza93's WAL file is empty, does this mean that whatever was in there should now be in the main db file?

yes, and now I can't reproduce it 🫠

😭 😭 I've been also trying myself, hibernating/suspending my laptop in different ways with different values for hibernate mode even, and nothing, I never get the SQLite3::CorruptException.

rosa commented 1 week ago

I still haven't managed to reproduce this 😭 But I've come up with a fix for the suspended processes continuing to run with deleted process records: https://github.com/rails/solid_queue/pull/337. I don't know if that would stop the SQLite3::CorruptException errors because I haven't managed to get those, but it will at least stop the UPDATE queries that get rolled back because the record being updated is gone.

npezza93 commented 1 week ago

@rosa Awesome ill give the branch a try!

marcoroth commented 1 week ago

I just got a SQLite3::CorruptException shortly after waking up my Mac πŸ‘€ Going to collect whatever could be useful...

One weird thing I noticed, my mac was lagging for a bit and then I saw the exception. My guess is that it could have to do when the system is running low on available resources.

marcoroth commented 1 week ago

I think there are two issues, this SQLite3::CorruptException I hit just now is closer to what @alvernier is describing. It happened shortly after waking up my mac, which happened to me for the first time now. The other ones I encountered I could easily reproduce and didn't involve waking up my MacBook.

But after it crashed this time I could run rails s again and it wasn't complaining about the corrupted file. The pragma check is confirming that the database is actually in tact.

❯ sqlite3
SQLite version 3.43.2 2023-10-10 13:08:14
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite> .open ./storage/development-queue.sqlite3
sqlite> pragma integrity_check;
ok

Which leads me to believe that these are two separate issues having very similar symptoms.

App: RubyVideo on fb0a8c7 Ruby: 3.3.4 Rails: 7.2.0 SolidQueue: 0.5.0 sqlite3: 3.43.2 2023-10-10 13:08:14 1b37c146ee9ebb7acd0160c0ab1fd11017a419fa8a3187386ed8cb32b709aapl (64-bit) sqlite3 gem: sqlite3-2.0.3-x86_64-darwin

Here are the full-logs, from rails s to ctrl-c-ing to exception.

I tried to annotate them to give more context. Note: My system time is on CEST (UTC+2), db queries are on UTC.

Time (CEST) What happened Logs Line Number
~21:02:10 Starting Rails server using rails s L2
21:04:21 First Request hits Rails server L54Β 
~21:04:43Β  I leave my MacBook Β L164
10 mins later (around 21:14:xx Displays should have turned off themesleves (according to my macOS settings) around L264
~23:23:xx I return to my MacBook and unlock it, The /feed.xml is from a local RSS subscription in the RSS reader app, I think it did fetch the file again as soon as I unlocked my mac Β L707
23:23:xx I return to my terminal and hit Ctrl-C the first time L720
23:23:xx Ctrl-C for the second time L721
23:23:xx Ctrl-C for the third time L727
23:23:xx Β Transaction rollback Β L731
23:23:xx SQLite3::CorruptException is thrown Β L735
rosa commented 1 week ago

@marcoroth huge thanks for that πŸ™ That's super helpful, so detailed!

I agree with you; there seem to be two different issues here. @andycroll's and yours in the beginning, and now this that happened to @npezza93, @alvernier and you, related to the wake-up of an asleep Mac. What I don't get about this one is why that error is raised if the DB is not actually corrupted. In fact, in your logs, there are more queries successfully performed after you get that error. I haven't managed to reproduce this, frustratingly, but I'm hoping that whatever is confusing SQLite to think it's corrupted disappears if I avoid the concurrent deletions + updates of these process records πŸ€” Although I can't know for sure because I don't understand why the error happens at all. I'd expect a SQLite3::BusyException there, perhaps, which is all I've managed to get.

In any case, the second error, the one when the Mac wakes up, doesn't worry me as much as the first one because the DB is not actually corrupted and because it wouldn't happen in production.

As for the first, more serious error, I still haven't managed to reproduce anything, not even in version 0.3.4. I've also been trying to get connections to leak across forks and threads with Active Record, but I haven't been able to πŸ˜•

What an annoying bug πŸ˜…

rosa commented 1 week ago

Just a quick update on this one: I've been busy the last couple of days with other work stuff but I plan to come back to this tomorrow. I've spent more time learning about how Active Record prevents connections from leaking across forks as well, as this was something I wasn't familiar with (I only knew it was supposed to handle it). I'll also test a bit more #337 in production and will merge it, and then will dedicate most of the effort to try to reproduce the original error with the real corruption.

rosa commented 1 week ago

@andycroll, if you have a chance, could you let me know for when this error happened to you in production:

I don't really know about SQLite internals, but I'd think corruption that happens because some connection is closed when it shouldn't be would affect tables that are being written at the moment, unlike corruption caused by some underlying disk issue, for example, which would break something randomly. In @marcoroth's error, the table is solid_queue_processes, which is clearly being written by Solid Queue on shutdown, as it does the process cleanup. Other tables written during that process are the ones for ready and claimed executions. But the jobs table is never written then πŸ€” The corruption error in your case:

*** in database main ***
On tree page 523 cell 23: Rowid 146 out of order
NULL value in solid_queue_jobs.created_at
NULL value in solid_queue_jobs.updated_at
row 146 missing from index index_solid_queue_jobs_for_alerting
row 146 missing from index index_solid_queue_jobs_for_filtering
row 146 missing from index index_solid_queue_jobs_on_finished_at
row 146 missing from index index_solid_queue_jobs_on_active_job_id
row 146 missing from index index_solid_queue_jobs_on_class_name
NULL value in solid_queue_jobs.created_at
NULL value in solid_queue_jobs.updated_at

makes me think of a new job being created and that creation aborted midway, so that created_at and updated_at aren't populated, and the indexes aren't written πŸ€” Or, it could also be a job that's getting deleted and the row is still there but the corresponding values are already deleted from the indexes. But this is quite a wild guess because I don't know how this works internally and haven't managed to understand enough today.

I've continued experimenting with the https://github.com/fractaledmind/solid_queue_bughunt app, this time in Solid Queue 0.3.4. In version 0.3.4, the signal handlers (for INT, QUIT, TERM) are restored to their default handlers too soon, before finishing the actual shutdown, when there are still DB operations to perform, so I'm trying to have the process receive a handful of signals after the default handlers have been restored, to see if that trickles any bad behaviour, but so far I haven't been able to.

In version 0.5.0, the one used in the rubyvideo repo, that version already restored the signal handlers after shutdown, as the last thing, but in that case, considering that two Ctrl+C in quick succession (which I think would be two INT signals) were needed to trigger the corruption, it makes me think is something else πŸ€”

flavorjones commented 1 week ago

:wave: Hey all. I spent a chunk of time yesterday trying to reproduce this. My hunch after reading this issue and How To Corrupt An SQLite Database File is that it might have to do with process forking.

So that's where I focused my time, and it paid off. I'm able to easily reproduce at least some form of database corruption using nothing but active record and forking. Some caveats, though:

That said, here's the ad-hoc repro script that reproduces corruption on my system:

https://gist.github.com/flavorjones/1f1dffa6bdd6ab81a227d10d73891f34

The broader context is that I run it in a vanilla rails app with a scaffolded Post model, and I run it like this:

while ./fork-safety.rb ; do echo again ; done

It exits with nonzero status if the sqlite integrity check fails. I see lots of exceptions ("disk I/O error") but those don't always result in corruption, so I run it in a loop and reliably reproduce database corruption in under a minute.

I will post a full Rails app repo later this morning, with more comments, etc. but the corruption only happens when BOTH of these conditions are true:

So if the parent waits for the children to exit: no problem. If the parent closes its connections before it forks: no problem.

Please note that in this scenario the children aren't even using the database connection inherited from the parent process -- Rails is doing the right thing and creating new connections, as instructed in https://www.sqlite.org/howtocorrupt.html, and the problem still happens.

It's still very early in the investigation, but my current hypothesis is that something is going on in shared memory space when the parent exits that then causes problems in the child processes. This may or may not be intended behavior but I'd like to isolate this with a reproduction in C and move the conversation upstream to the sqlite forum.

rosa commented 1 week ago

This is epic; thanks a lot @flavorjones πŸ™‡β€β™€οΈ πŸ†

I think the conditions you describe could very well explain the very first issue from @andycroll. From your two conditions:

the parent has an open sqlite connection when it forks, and the parent exits before the children have finished their work

The first is always true for Solid Queue. The second shouldn't be, generally, as the supervisor waits for its supervised processes before exiting, but depending on how the deployment worked (e.g. grace period and then SIGKILL?), it might have been terminated before the children exited, if it was terminated externally.

However, it wouldn't explain the other issues in macOS (like @marcoroth's example, no the examples where the computer goes to sleep; those are separate for sure), as I haven't been able to get a corrupted DB with your script after over one hour πŸ˜… @marcoroth mentioned:

Is there any chance that this is dependent on the sqlite3 gem variant you get?

I'm running the x86_64-darwin one (and I assume @andycroll is probably also running a x86_64-* variant in production) whereas most of you are probably running a arm64-* variant on your local dev machines.

Maybe it has to do with that... I might try to reproduce with your script and x86_64-darwin on Mac.

flavorjones commented 1 week ago

~One more note amending what I wrote above: I can sometimes reproduce this even when the parent has closed all its connections, and even when it has created zero connections. I honestly don't know what to make of this fact yet and it seems like absolutely bonkers spooky-action-at-a-distance.~

Ignore this, it was a procedural mistake on my part.

flavorjones commented 1 week ago

Repository with a full Rails app reproduction here: https://github.com/flavorjones/2024-09-13-sqlite-corruption

rosa commented 1 week ago

Ok, I finally could reproduce this with @flavorjones's test on macOS. Yesterday when I was testing that I didn't realise I was running this version of SQLite3

1.7.2: arm64-darwin

Today, I've got the corruption error with both:

2.0.4: arm64-darwin
2.0.4: x86_64-darwin

I'm doing other tests now with different versions, there's also a difference if I reduce the number of children to 3 like yesterday.

Update: yes, the number of children makes a difference. With N=10, I get the corruption with older versions of the sqlite3 gem (tested with 1.5.0).

flavorjones commented 1 week ago

Git bisecting, this is the PR/commit where this behavior changed:

https://github.com/sparklemotion/sqlite3-ruby/pull/392 / https://github.com/sparklemotion/sqlite3-ruby/commit/0c24631d5b8c9b2e361b1a0bbfc1823a18793b71

which has to do with the lifecycle of database connections and prepared statements. Getting closer.

flavorjones commented 1 week ago

OK, I understand what's happening. Problems result when an inherited (from the parent process) database connection is sqlite3_close()ed or a prepared statement is sqlite3_finalize()ed in the child process.

In a Rails app it seems much more likely to be a statement being finalized because of the cached statements, but it can happen with the database objects as well.

I think we should be able to work around this now that I understand the mechanism.

fractaledmind commented 1 week ago

OMG! You two are such absolute legends! @flavorjones you are a complete baller. @rosa you are a total magician. I am so grateful to get to work in the general vicinity of you both. I will buy you both some kind of drink in Toronto to show my thanks.

flavorjones commented 1 week ago

OK, I hacked on the sqlite3 adapter so that it closes database connections before forking, and that seems to prevent the data corruption and disk i/o errors. I think we just need to polish the patch and land it in Rails and everything will be OK! Stay tuned.

rosa commented 1 week ago

Aww @fractaledmind you're too kind! This was all @flavorjones, he's the magician πŸͺ„. Plus it's me who's beyond grateful to you and everyone here for the help, it's a privilege to have people like you in the community πŸ™‡πŸ»β€β™€οΈ I'm definitely the one getting you drinks at Rails World! πŸ˜† and Marco, Nick, Andy... πŸ₯ΉπŸ™πŸ»

dhh commented 1 week ago

Epic team work, all 🀘. Sqlite is going to be a much bigger role in Rails apps going forward.

flavorjones commented 6 days ago

See https://github.com/rails/rails/pull/52931 which should address at least one of the issues discussed in this thread (fork-safety). I'd love to get some feedback on it!

flavorjones commented 5 days ago

Changed direction, here's the latest plan, I'd love feedback:

flavorjones commented 3 days ago

@andycroll @marcoroth @fractaledmind I just cut a release candidate of sqlite3-ruby with the fork-safety changes from https://github.com/sparklemotion/sqlite3-ruby/pull/558:

https://github.com/sparklemotion/sqlite3-ruby/releases/tag/v2.1.0.rc1

It's an invasive change, it's been a hard issue to reproduce, and it's challenging to test every possible scenario ... so I would really like to have some folks use it and see if we can find problems and/or build confidence.

I've tried out the solid_queue_bughunt repo and it seems like it's working OK (and I can see the warning messages from sqlite3-ruby in the log which means it's doing something βœ”). But the more scenarios we can test with the better. If you could just drop a note here if you've been able to test (or if you intend to test), I would :heart: you forever.

flavorjones commented 1 day ago

Released https://github.com/sparklemotion/sqlite3-ruby/releases/tag/v2.1.0.rc2 addressing a performance regression in rc1.