AlexLaviolette opened this issue 6 years ago
@AlexLaviolette can you share your slick configuration?
The number of threads used by Slick must be equal to the number of connections in the pool.
I was just using the default configuration for slick. But after reading through https://github.com/slick/slick/issues/1614 and some other discussions and trying several of the suggestions in there (including max_conns = num_threads) I am still running into this issue.
My configuration before was:
slick.dbs {
  default {
    profile="slick.jdbc.MySQLProfile$"
    db {
      driver="com.mysql.jdbc.Driver"
      url=${PLAY_DB_URL}
      user=${PLAY_DB_USER}
      password=${PLAY_DB_PASSWORD}
      leakDetectionThreshold=60000
    }
  }
}
I have since upgraded play-slick from 3.0.1 -> 3.0.3, which is the latest version that I believe supports Play 2.6.x (I am using 2.6.12). I also upgraded slick (from the version bundled with play-slick) from 3.2.1 -> 3.2.3. Unfortunately, it seems like the relevant bug fixes were already present in 3.2.1, but I gave it a shot regardless. This did provide a nice warning message which basically echoes what is said in the discussion:
[warn] slick.util.AsyncExecutor - Having maxConnection > maxThreads can result in deadlocks if transactions or database locks are used.
I configured slick as follows, which stifled the warning message:
slick.dbs {
  default {
    profile="slick.jdbc.MySQLProfile$"
    db {
      driver="com.mysql.jdbc.Driver"
      url=${PLAY_DB_URL}
      user=${PLAY_DB_USER}
      password=${PLAY_DB_PASSWORD}
      leakDetectionThreshold=60000
      numThreads=20
      maxConnections=20
    }
  }
}
However, I am still encountering the error. In the discussion it seemed that the fix only applied to Sequence and AndThen, while I was using flatMap. I am unsure whether the fix didn't cover flatMap because it didn't need fixing or because it simply wasn't fixed.
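(For reference, a minimal sketch of the difference between the two composition styles; the table, column, and query names below are made up purely for illustration:)

import slick.jdbc.MySQLProfile.api._
import scala.concurrent.ExecutionContext.Implicits.global

// andThen composes two actions in order without passing a value between them.
val sequenced: DBIO[Int] =
  sqlu"UPDATE accounts SET balance = balance - 10 WHERE id = 1" andThen
    sqlu"UPDATE accounts SET balance = balance + 10 WHERE id = 2"

// flatMap makes the second action depend on the first's result; a
// for-comprehension over DBIOs desugars to flatMap.
val dependent: DBIO[Int] = for {
  id   <- sql"SELECT id FROM accounts WHERE name = 'alice'".as[Long].head
  rows <- sqlu"UPDATE accounts SET balance = balance - 10 WHERE id = $id"
} yield rows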
So as a test, I switched to using an AndThen query - unfortunately, I still encountered the issue. However, this did eventually surface a new exception I had not seen before, although I'm not sure how relevant it is:
Exception in thread "db-10" java.lang.IllegalArgumentException: requirement failed: count cannot be decreased
at scala.Predef$.require(Predef.scala:277)
at slick.util.ManagedArrayBlockingQueue.$anonfun$decreaseInUseCount$1(ManagedArrayBlockingQueue.scala:54)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
at slick.util.ManagedArrayBlockingQueue.locked(ManagedArrayBlockingQueue.scala:201)
at slick.util.ManagedArrayBlockingQueue.decreaseInUseCount(ManagedArrayBlockingQueue.scala:53)
at slick.util.AsyncExecutor$$anon$2$$anon$1.afterExecute(AsyncExecutor.scala:129)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1157)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Please let me know if anything sticks out to you, and thanks for responding!
@AlexLaviolette thanks for the super-thorough explanations and tests. If you could provide a reproducer it'd be a great help to narrow down the issue.
I've put together a demo application here that replicates the issue https://github.com/AlexLaviolette/slick-demo. It's basically a stripped down version of what we are actually running that only includes an example of the problematic code.
I have that deployed to a single machine (2GB RAM VM), connected to a single mysql db (1GB RAM VM).
The script I run to reproduce it is in the repo under /test/api_load_test.py. I am able to reproduce it running that from a single VPS (8GB RAM, 4 CPUs).
Hopefully this helps, let me know if you need anything else.
I forked this and added the change referenced here to version 3.0.3. Unfortunately, this did not seem to fix my issue.
We are experiencing this same issue. Dropping transactionally resolves the issue for us as well.
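(For clarity, a minimal sketch of what that change looks like; the db handle, table, and statements below are placeholders, not our actual code:)

import slick.jdbc.MySQLProfile.api._

val db = Database.forConfig("slick.dbs.default.db")
val composedAction: DBIO[Unit] = DBIO.seq(
  sqlu"UPDATE t SET x = 1",
  sqlu"UPDATE t SET y = 2"
)

// With transactionally, the whole composition is pinned to one connection
// until commit or rollback:
db.run(composedAction.transactionally)

// Dropping it lets each statement auto-commit, so a connection is not held
// across the entire composition:
db.run(composedAction)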
@AlexLaviolette did you find a resolution for this in the end?
For future searchers, we got to the bottom of this issue. Ours was actually caused by an error on our side.
We'd flatMapped DBIOs and Futures using DBIO.from. The implementation of the method that was returning the Future was changed to use the database, meaning that it had its own transaction. So it looked something like:
val dbio = for {
  a <- getAFromDatabase      // a DBIO
  b <- getBFromDatabase      // a DBIO
  c <- DBIO.from(calculateC) // a Future, but needing to run a DBIO to perform the calculation
  d <- doIt(a, b, c)         // a DBIO
} yield d

x.run(dbio.transactionally)
Because the outer transaction (where the DBIO and Future were combined) was still active while the Future was running, in some cases the inner transaction run to produce the Future result was not able to acquire a connection for that transaction.

Under load, this quickly led to a deadlock - all connections were consumed by the outer transactions, and none of the inner transactions (used by calculateC) were able to acquire a connection from the pool.

This is unlikely to be related to the problem faced by @AlexLaviolette but did have the same symptoms.
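A sketch of one way to avoid that shape, assuming the calculation can be expressed as a DBIO rather than a Future (every name and query below is a placeholder, not the real code):

import slick.jdbc.MySQLProfile.api._
import scala.concurrent.ExecutionContext.Implicits.global

// Placeholder actions standing in for the real methods above.
def getAFromDatabase: DBIO[Int] = sql"SELECT 1".as[Int].head
def getBFromDatabase: DBIO[Int] = sql"SELECT 2".as[Int].head

// Instead of a Future that opens its own transaction (wrapped with DBIO.from),
// express the calculation as a DBIO so it runs on the outer transaction's
// connection and cannot deadlock against it.
def calculateC(a: Int, b: Int): DBIO[Int] = DBIO.successful(a + b)

def doIt(a: Int, b: Int, c: Int): DBIO[Int] =
  sqlu"INSERT INTO results (value) VALUES ($c)"

val dbio = for {
  a <- getAFromDatabase
  b <- getBFromDatabase
  c <- calculateC(a, b) // joins the outer transaction instead of starting its own
  d <- doIt(a, b, c)
} yield d

val db = Database.forConfig("slick.dbs.default.db")
db.run(dbio.transactionally)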
Thanks a lot! I'd been chasing a hang for 8 hours or so and slowly started to understand that it was related to nested transactions. What you describe is exactly what I have, and after your description, working around it was trivial. Thanks mate! Peter.
In my case this was caused by the New Relic Java agent (using Play 2.8.7, Slick 3.3.3 with numThreads = 5, HikariCP 4.0.3, MySQL Connector/J 8.0.22). [Screenshots comparing the behaviour with and without New Relic were attached.]
Currently on Play 2.6 with play-slick 3.0.1, I've been running into an issue during load testing where the app is consuming tons of db connections from the pool and not returning them. I have narrowed it down to the use of transactionally, but have still not been able to pinpoint the exact cause.

As proof of this, when I increase the db connection pool to 340 (our max_connections is 350), I can hit up to 28k rpm on one server. But with a pool of only 100, it almost instantly crashes under the same load, with requests timing out waiting for a connection from the pool. I can confirm that the database transactions themselves are completing quickly. When I enable Hikari connection leak detection, I can see that some connections are taking > 60s to be returned, yet the database is showing nothing over 5 seconds. From everything I've read, we should be able to configure this to run with a much smaller pool. When I remove the use of transactionally, the behaviour of the pool returns to normal.

[Logs were attached: with a pool of 100 and transactionally enabled, with a pool of 340 and transactionally enabled, and some errors seen with a pool of 100 and transactionally enabled.]

New Relic is also showing these requests spending the majority of their time waiting on com.zaxxer.hikari.HikariDataSource.getConnection().
Here is an example of the code that is causing the issue:
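(The snippet below is only a representative sketch of that pattern; the table, query, and method names are placeholders rather than the actual application code.)

import slick.jdbc.MySQLProfile.api._
import scala.concurrent.ExecutionContext.Implicits.global

def findUser(id: Long): DBIO[Option[Long]] =
  sql"SELECT id FROM users WHERE id = $id".as[Long].headOption

def recordEvent(userId: Long): DBIO[Int] =
  sqlu"INSERT INTO events (user_id) VALUES ($userId)"

def handleRequest(db: Database, id: Long) = {
  val action = for {
    user <- findUser(id)
    rows <- user.map(recordEvent).getOrElse(DBIO.successful(0))
  } yield rows

  // Wrapping the composed action in a transaction pins it to one connection
  // for its whole duration; under load this is where connections pile up.
  db.run(action.transactionally)
}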
I am unable to reproduce this at low load, or locally by manually causing the queries to fail. But something about using transactionally here is causing connections to not be returned to the pool for a very long time. There are log entries reporting "Previously reported leaked connection com.mysql.jdbc.JDBC4Connection@795acb1e was returned to the pool (unleaked)", but often only after several minutes.