Open madadam opened 5 months ago
This is my attempt at a fix which seems to work for us:
pub(super) fn close<'a>(self: &'a Arc<Self>) -> impl Future<Output = ()> + 'a {
self.mark_closed();
async move {
let _permits = self.semaphore.acquire(self.options.max_connections).await;
while let Some(idle) = self.idle_conns.pop() {
let _ = idle.live.raw.close().await;
}
self.num_idle.store(0, Ordering::Release);
self.size.store(0, Ordering::Release);
}
}
Bug Description
Despite what the documentation says, calling
Pool::close
does not actually always wait for all the connections to close before returning. This is because of a bug / race condition inPoolInner::close
.This issue is hard to detect because it has almost no observable effects. Probably the only db engine where this is detectable is sqlite. We use sqlite in WAL mode which means there are two additional files next to the main database file - one whose name ends in
-wal
and the other in-shm
. According to the sqlite documentation, those should be deleted when the last connection closes (unless the last connection is read-only). We use a setup where we have a pool with read-only connection and a separate, single read-write connection. We first close the read-only pool usingPool::close
and only when it returns we close the read-write connection. The expectation is that the read-write connection should be the last connection at that point and so it should perform the WAL checkpoint and delete the two auxiliary files. We noticed this wasn't always the case. Eventually we traced this issue down to a bug (actually, two bugs) inPoolInner::close
which in some cases didn't wait for all the connections to close and so there were still some read-only connections open by the time we closed the read-write one. This prevented it from running the WAL checkpoint and deleting the auxiliary files.The first bug is a race condition when
idle_conns
to not always empty by the time the loop finishes. This means any connections still inidle_conns
are not closed until the pool itself has been dropped.Second bug is that when
idle_conns
is drained, the idle connections are temporarily treated aslive
which means a semaphore permit is created for them out of thin air. Then when this permit is released it causes the semaphore to get more available permits than it had initially which on the last iteration of the loop, causes this acquire to complete prematurely.Info
rustc --version
: rustc 1.77.1 (7cf61ebde 2024-03-27)