launchbadge / sqlx

🧰 The Rust SQL Toolkit. An async, pure Rust SQL crate featuring compile-time checked queries without a DSL. Supports PostgreSQL, MySQL, and SQLite.
Apache License 2.0

`Pool::close` does not always wait for all connections to close #3217

Open madadam opened 5 months ago

madadam commented 5 months ago

Bug Description

Despite what the documentation says, calling Pool::close does not always wait for all the connections to close before returning. This is caused by a bug (actually, a pair of bugs, one of them a race condition) in PoolInner::close.

This issue is hard to detect because it has almost no observable effects. Probably the only db engine where it is detectable is sqlite. We use sqlite in WAL mode, which means there are two additional files next to the main database file: one whose name ends in -wal and the other in -shm. According to the sqlite documentation, those should be deleted when the last connection closes (unless the last connection is read-only).

We use a setup with a pool of read-only connections and a separate, single read-write connection. We first close the read-only pool using Pool::close and only when it returns do we close the read-write connection. The expectation is that the read-write connection is the last connection at that point, so it should perform the WAL checkpoint and delete the two auxiliary files. We noticed this wasn't always the case.

Eventually we traced the problem down to a bug (actually, two bugs) in PoolInner::close which in some cases didn't wait for all the connections to close, so some read-only connections were still open by the time we closed the read-write one. This prevented it from running the WAL checkpoint and deleting the auxiliary files.
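The setup can be sketched roughly as follows (file name, pool size, and pragmas are illustrative, and this assumes the tokio runtime with sqlx's `sqlite` feature enabled; it is not our actual code):

```rust
use sqlx::sqlite::{SqliteConnectOptions, SqliteJournalMode, SqlitePoolOptions};
use sqlx::{Connection, SqliteConnection};

#[tokio::main]
async fn main() -> Result<(), sqlx::Error> {
    // Illustrative database path; WAL mode creates app.db-wal and app.db-shm.
    let opts = SqliteConnectOptions::new()
        .filename("app.db")
        .create_if_missing(true)
        .journal_mode(SqliteJournalMode::Wal);

    // Single read-write connection.
    let writer = SqliteConnection::connect_with(&opts).await?;

    // Pool of read-only connections.
    let readers = SqlitePoolOptions::new()
        .max_connections(4)
        .connect_with(opts.clone().read_only(true))
        .await?;

    // ... use the connections ...

    // Shutdown: close the read-only pool first, and only once that returns...
    readers.close().await;
    // ...close the writer. Being the last connection, it should run the WAL
    // checkpoint and delete the -wal and -shm files. With the bug described
    // here, some reader connections may still be open at this point.
    writer.close().await?;
    Ok(())
}
```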

The first bug is a race condition where idle_conns is not always empty by the time the loop finishes. Any connections still in idle_conns at that point are not closed until the pool itself is dropped.

The second bug is that when idle_conns is drained, the idle connections are temporarily treated as live, which means a semaphore permit is created for them out of thin air. When that permit is released, the semaphore ends up with more available permits than it had initially, which, on the last iteration of the loop, causes the acquire to complete prematurely.

Info

madadam commented 5 months ago

This is my attempt at a fix which seems to work for us:

    pub(super) fn close<'a>(self: &'a Arc<Self>) -> impl Future<Output = ()> + 'a {
        self.mark_closed();

        async move {
            // Take every permit up front. This cannot complete until no
            // connection is checked out, and holding all the permits prevents
            // any new checkouts while we drain below.
            let _permits = self.semaphore.acquire(self.options.max_connections).await;

            // With all permits held, draining the idle queue cannot race with
            // acquires, so every idle connection gets closed here rather than
            // being left until the pool is dropped.
            while let Some(idle) = self.idle_conns.pop() {
                let _ = idle.live.raw.close().await;
            }

            self.num_idle.store(0, Ordering::Release);
            self.size.store(0, Ordering::Release);
        }
    }