groue / GRDB.swift

A toolkit for SQLite databases, with a focus on application development
MIT License
6.78k stars 702 forks source link

0xdead10cc prevention — `checkForSuspensionViolation` on every `Statement.step` #1538

Closed nkbelov closed 5 months ago

nkbelov commented 5 months ago

What did you do?

TestFlight version of an app crashed with 0xdead10cc; stacktrace:

Thread 2:
0   libsystem_kernel.dylib          0x00000001e6f5a654 pread + 8
1   libsqlite3.dylib                0x00000001c876fe70 seekAndRead + 96 (sqlite3.c:44001)
2   libsqlite3.dylib                0x00000001c86e4508 unixRead + 208 (sqlite3.c:44093)
3   libsqlite3.dylib                0x00000001c86f7918 readDbPage + 188 (sqlite3.c:66757)
4   libsqlite3.dylib                0x00000001c8779ee0 getPageNormal + 544 (sqlite3.c:69366)
5   libsqlite3.dylib                0x00000001c87867dc getAndInitPage + 124 (sqlite3.c:79030)
6   libsqlite3.dylib                0x00000001c8783b1c sqlite3BtreeIndexMoveto + 1392 (sqlite3.c:82785)
7   libsqlite3.dylib                0x00000001c870a7b0 sqlite3VdbeExec + 4724 (sqlite3.c:105078)
8   libsqlite3.dylib                0x00000001c8708c80 sqlite3_step + 964 (sqlite3.c:97699)
9   XXXXX                           0x000000010353a2f8 specialized Statement.step<A>(_:) + 24 (Statement.swift:614)
10  XXXXX                           0x000000010353a2f8 specialized DatabaseCursor.next() + 120 (Statement.swift:745)
11  XXXXX                           0x00000001036e08d4 closure #1 in MutablePersistableRecord.upsertAndFetchWithoutCallbacks<A>(_:onConflict:doUpdate:selection:decode:) + 92 (MutablePersistableRecord+Upsert.swift:430)
12  XXXXX                           0x00000001036e1084 partial apply for closure #1 in MutablePersistableRecord.upsertAndFetchWithoutCallbacks<A>(_:onConflict:doUpdate:selection:decode:) + 24 (<compiler-generated>:0)
13  libswiftCore.dylib              0x000000019d480428 withExtendedLifetime<A, B>(_:_:) + 28 (LifetimeManager.swift:44)
14  XXXXX                           0x00000001036e06f0 MutablePersistableRecord.upsertAndFetchWithoutCallbacks<A>(_:onConflict:doUpdate:selection:decode:) + 696 (MutablePersistableRecord+Upsert.swift:429)
15  XXXXX                           0x0000000103092310 specialized closure #1 in PersistableRecord.upsertAndFetchWithCallbacks<A>(_:onConflict:doUpdate:selection:decode:) + 88 (<compiler-generated>:0)
16  XXXXX                           0x0000000103092310 specialized MutablePersistableRecord.aroundInsert(_:insert:) + 88 (<compiler-generated>:0)
17  XXXXX                           0x0000000103092310 protocol witness for MutablePersistableRecord.aroundInsert(_:insert:) in conformance ChatEvent + 88 (<compiler-generated>:0)
18  XXXXX                           0x0000000103092310 specialized PersistableRecord.upsertAndFetchWithCallbacks<A>(_:onConflict:doUpdate:selection:decode:) + 88 (<compiler-generated>:0)
19  XXXXX                           0x0000000103092310 specialized PersistableRecord.upsertWithCallbacks(_:) + 88 (<compiler-generated>:0)
20  XXXXX                           0x0000000103092310 specialized closure #1 in PersistableRecord.upsert(_:) + 88 (<compiler-generated>:0)
21  XXXXX                           0x0000000103092310 specialized MutablePersistableRecord.aroundSave(_:save:) + 88 (<compiler-generated>:0)
22  XXXXX                           0x0000000103092310 protocol witness for MutablePersistableRecord.aroundSave(_:save:) in conformance ChatEvent + 88 (<compiler-generated>:0)
23  XXXXX                           0x0000000103092310 specialized PersistableRecord.upsert(_:) + 88 (<compiler-generated>:0)
24  XXXXX                           0x0000000103092310 closure #1 in closure #1 in MainChatService.setUpDatabaseStream() + 128 (ChatService.swift:351)
25  XXXXX                           0x00000001030b89e4 partial apply for closure #1 in closure #1 in MainChatService.setUpDatabaseStream() + 16 (<compiler-generated>:0)
26  XXXXX                           0x000000010355eb1c specialized Database.inTransaction(_:_:) + 100
27  XXXXX                           0x000000010355ea9c specialized Database.inTransaction(_:_:) + 12
28  XXXXX                           0x0000000103570510 specialized Database.inTransaction(_:_:) + 24 (DatabasePool.swift:802)
29  XXXXX                           0x0000000103570510 closure #1 in DatabasePool.writeInTransaction(_:_:) + 32 (<compiler-generated>:0)
30  XXXXX                           0x0000000103570510 partial apply for closure #1 in DatabasePool.writeInTransaction(_:_:) + 56
31  XXXXX                           0x000000010356dbc0 specialized closure #1 in SerializedDatabase.sync<A>(_:) + 44 (SerializedDatabase.swift:127)
32  XXXXX                           0x0000000103570f90 partial apply for specialized closure #1 in SerializedDatabase.sync<A>(_:) + 20 (<compiler-generated>:0)
33  libswiftDispatch.dylib          0x00000001a7deed4c partial apply for thunk for @callee_guaranteed () -> (@out A, @error @owned Error) + 28 (<compiler-generated>:0)
34  libswiftDispatch.dylib          0x00000001a7df6ea0 thunk for @callee_guaranteed () -> (@out A, @error @owned Error)partial apply + 16
35  libswiftDispatch.dylib          0x00000001a7deec80 closure #1 in closure #1 in OS_dispatch_queue._syncHelper<A>(fn:execute:rescue:) + 192 (Queue.swift:399)
36  libswiftDispatch.dylib          0x00000001a7deebb4 partial apply for thunk for @callee_guaranteed () -> () + 28 (<compiler-generated>:0)
37  libswiftDispatch.dylib          0x00000001a7deeb8c thunk for @escaping @callee_guaranteed () -> () + 28 (<compiler-generated>:0)
38  libdispatch.dylib               0x00000001a68fadd4 _dispatch_client_callout + 20 (object.m:576)
39  libdispatch.dylib               0x00000001a690a2c4 _dispatch_lane_barrier_sync_invoke_and_complete + 56 (queue.c:1100)
40  libswiftDispatch.dylib          0x00000001a7def788 implicit closure #2 in implicit closure #1 in OS_dispatch_queue.asyncAndWait<A>(execute:) + 192
41  libswiftDispatch.dylib          0x00000001a7def6c0 partial apply for implicit closure #2 in implicit closure #1 in OS_dispatch_queue.sync<A>(execute:) + 76 (<compiler-generated>:0)
42  libswiftDispatch.dylib          0x00000001a7def540 OS_dispatch_queue._syncHelper<A>(fn:execute:rescue:) + 404 (Queue.swift:396)
43  libswiftDispatch.dylib          0x00000001a7def37c OS_dispatch_queue.asyncAndWait<A>(execute:) + 140
44  libswiftDispatch.dylib          0x00000001a7def2e8 OS_dispatch_queue.sync<A>(execute:) + 64
45  XXXXX                           0x0000000103569048 specialized SerializedDatabase.sync<A>(_:) + 344 (SerializedDatabase.swift:125)
46  XXXXX                           0x000000010356d4e8 DatabasePool.writeInTransaction(_:_:) + 44 (DatabasePool.swift:801)
47  XXXXX                           0x0000000103091c4c closure #1 in XXXXX.setUpDatabaseStream() + 88 (XXXXX.swift:350)
48  XXXXX                           0x00000001030b92c9 partial apply for closure #1 in XXXXX.setUpDatabaseStream() + 1
49  XXXXX                           0x000000010346ba01 closure #1 in closure #1 in Publisher<>.sink(receiveValue:) + 1
50  XXXXX                           0x000000010346b8e5 partial apply for closure #1 in closure #1 in Publisher<>.sink(receiveValue:) + 1 (<compiler-generated>:0)
51  XXXXX                           0x000000010346b991 specialized thunk for @escaping @callee_guaranteed @Sendable @async () -> (@out A) + 1 (<compiler-generated>:0)
52  XXXXX                           0x000000010346b40d partial apply for specialized thunk for @escaping @callee_guaranteed @Sendable @async () -> (@out A) + 1 (<compiler-generated>:0)
53  libswift_Concurrency.dylib      0x00000001a9d92775 completeTaskWithClosure(swift::AsyncContext*, swift::SwiftError*) + 1 (Task.cpp:463)

What did you expect to happen?

The DB is configured with observesSuspensionNotifications, and the app is set up to send a notification in a SwiftUI handler:

 .onChange(of: phase) { _, newPhase in
            switch newPhase {
            case .background:
                NotificationCenter.default.post(name: Database.suspendNotification, object: self)
     }
}

Expected the statement to either abort through sqlite3_interrupt or not get executed (checkForSuspensionViolation).

Environment

GRDB flavor(s): Stock GRDB version: 6.18.0 Installation method: SPM Xcode version: 15.3 Swift version: 5.10 Platform(s) running GRDB: iOS macOS version running Xcode: Sonoma 14.1.1 (23B81)


Upon inspection of the call path, checkForSuspensionViolation is only called through upon the first step into a statement (if sqlite3_stmt_busy(sqliteStatement) == 0), meaning that multi-step statements will be allowed to re-acquire the file lock even after the DB became suspended. While it seems unlikely that upsert is a multi-step statement, we don't use FTS5, so it still looks like the DB is executing a write after a suspense.

The DB is shared between the main app and a notification service extension, which too sends Database.suspendNotification as appropriate.

Could you please have a look and see if e.g. checkForSuspensionViolation shouldn't be called more often?

groue commented 5 months ago

Hello @nkbelov,

meaning that multi-step statements will be allowed to re-acquire the file lock even after the DB became suspended.

Yes indeed. I thought it was enough to make the check in the first step. If it would be called on each step, that would probably have an impact on performance (I did not perform any benchmark, though).

If the app is executing a multi-steps statement, it should have the time to finish them before the app transitions from the background to the suspended state. If it does not, I'm not sure GRDB should be held responsible. Maybe such apps should perform iterations in batches, for example.

[...] so it still looks like the DB is executing a write after a suspense.

The write lock was still held when the app turned suspended, yet. This might mean that a write is happening. But maybe a write transaction is still open. Maybe a statement was not properly released.

I'm sorry but I do not live in your app, so I can't tell what's happening and I won't make uneducated guesses. Sharing SQLite databases on iOS is hard. I recommend that you audit each and everyone of the writes performed by the app and make a clear understanding of how it could relate with suspension.

Also maybe check that the suspension notification is sent when the app is entering the background - this is a UIApplication concept, pre-SwiftUI. Suspension and resuming is some code I would expect to see in an UIApplicationDelegate. You can put it where you want, but make sure you aren't playing games with the OS — it bites.

groue commented 5 months ago

@nkbelov, some ideas came to me reading your stack trace:

I recommend that you audit each and everyone of the writes performed by the app

Well, the guilty write is just in front of our eyes 😅

At first I assumed that the database was indeed suspended. I was surprised that DatabasePool.writeInTransaction did not throw an error.

Then I realized it won't do so by default. I thus recommend that you add this line in your application, before you open the DatabasePool:

+configuration.defaultTransactionKind = .immediate
 let dbPool = try DatabasePool(path: path, configuration: configuration)

It was recently added to the recommendations for shared databases, because it has other benefits (see #1485). The next GRDB7 version will perform this configuration automatically.

Later on I saw that the INSERT statement performed by upsert was not caught by checkForSuspensionViolation(from:). I'm thus pretty sure that the database was not suspended when the crash occurred.

You should be able to test my advice and assumptions: suspend, and check what happens:

func test_database_suspension_prevents_some_operation() {
    let xxxxx = XXXXX()
    NotificationCenter.default.post(name: Database.suspendNotification, object: self)
    XCTAssertThrows(try xxxxx.setUpDatabaseStream())
}

Note to myself: posting notifications make it impossible to run tests in parallel. GRDB should expose publicly the suspend() and resume() methods, so that one can write tests of database suspension without notifications.

nkbelov commented 5 months ago

Thanks @groue. By all means, the error is very likely in our user code and I've been suspecting that we're incorrectly sending the suspend notifications; I've just noticed this lack of checkForSuspensionViolation for multistep operations while looking through the GRDB source to see what guarantees this flag exactly gives, esp. w.r.t. already enqueued statements.

I'm not intricately familiar with the inner machinery of SQLite, but it might actually turn out that this sole check is enough, since if a multistep statement is being tracked as per here, then sqlite3_interrupt will interrupt it as well, and all statements that were issued before DB suspended will correctly cancel.

If not, however, I wonder if this is an opportunity to enhance this a bit (regardless of the particularities of my specific case), since this situation looks to me akin a Task never checking for cancellation (except for the one time at the beginning), and it seems that the majority of 0xdead10cc triggers arise when code fails to shut itself down on time and continues to (re-)claim resources. Perhaps using an atomic boolean for the isSuspended flag could address the potential performance impact, too.


I will be trying out the recommendations, thanks a lot. The code in question was indeed written before the note on .immediate was published.

groue commented 5 months ago

I've just noticed this lack of checkForSuspensionViolation for multistep operations while looking through the GRDB source to see what guarantees this flag exactly gives, [...]

It is indeed called once before the first step (screenshot).

Capture d’écran 2024-04-29 à 16 13 29

We can say that the 0xdead10cc prevention mechanism of GRBD makes optimistic assumptions today.

Last time I checked, the system was giving a lot of time between the notification of the expiration of the background state and the actual suspension. I wanted to use this delay to let the app successfully commit a short write. That's why we wait for the app to start a "forbidden" statement before throwing an error. Currently those are statements that acquire the write lock, not read-only statements or statements that release the lock.

The comparison with a Task that does not perform all possible checks for cancellation is probably accurate. For example, here is a bad scenario that is not detected:

  1. App starts an immediate transaction.
  2. Database gets suspended, and there's no running statement at this point. There's nothing to interrupt, so app keeps on running.
  3. App performs a very long fetch. We're talking several seconds here. A SELECT is not interrupted by checkForSuspensionViolation, so app keeps on running.
  4. App enters the suspended state: we have a 0xdead10cc crash.

If db would have been suspended during the long iteration, the fetch would have failed thanks to sqlite3_interrupt.

So probably checkForSuspensionViolation could be improved so that the first step of a read-only statement is interrupted when we know for sure that the write lock has been acquired. There is no C function to query this state AFAIK, but the first successful step of a statement that returns false from sqlite3_stmt_readonly should be enough to detect the transition 🤞.

groue commented 5 months ago

If you can think of other nasty scenarios, I'm curious.

groue commented 5 months ago

I'm closing this issue, because I'm pretty sure the app did not suspend the database, leading to the 0xdead10cc exception. Database suspension remains the task of the app, because despite all my attempts at doing it automatically, we eventually face the difficulty of determining when to resume (suspending is easy, it's resuming that's hard).

I'll remember to revisit the topic, though, as discussed above.