mysql-net / MySqlConnector

MySQL Connector for .NET
https://mysqlconnector.net
MIT License
1.39k stars 337 forks source link

Connection pool discrepancies #979

Open huraken opened 3 years ago

huraken commented 3 years ago

Hello,

We discovered last few weeks some connection pool issues in our production environment and so we decide to dig a bit more with some load tests. For that we compare MySqlConnector and MySql.Data libraries

Giving some context about the load tests we ran :

Libraries used MySqlConnector version : 1.3.2 MySql.Data (Oracle) : 0.8.24

Infrastructure : Applications and Databases are hosted in AWS cloud For databases, we’re using AWS RDS with Aurora MySQL engine 5.6.10, we have a DB cluster that contains 5 Db instances (1 writer and 4 readers) Applications are containerized with docker (ECS) and we have 6 connections strings defined (3 of them point to the AWS DB cluster reader endpoint and 3 of them point to the AWS DB cluster writer endpoint) Load test I ran a load test that simulate 1K http requests / second, each request is calling a simple api that is making 5 DB calls to the database. The test will ran during 20 minutes : it will take 10 minutes to reach 1K calls and it will continue with this rate 10 more minutes. Those 5 DB calls are using only one connection string to get the relevant data. To simplify this test, I only had one instance of the application (so 1 container)

Connection string Server=***.us-east-1.rds.amazonaws.com;Port=3306;Database=***;Uid=***;Pwd=***;AllowUserVariables=True;ConnectionLifeTime=60;ConnectionTimeout=5;SslMode=None;UseAffectedRows=false;ConnectionIdleTimeout=60;MaximumPoolsize=200;

Example of code used for a call to database

public async Task<TestData> GetDataAsync(Guid field1)
        {
            TestData testData = null;

            var query = @"SELECT `field1`, `field2`
                            FROM `table1` 
                            WHERE `field1` = @field1";

            await using (var sqlConnection = new MySqlConnection(connectionString))
            await using (var sqlCommand = new MySqlCommand(query, sqlConnection))
            {

                sqlCommand.Parameters.Add(new MySqlParameter
                {
                    ParameterName = "@field1",
                    MySqlDbType = MySqlDbType.Binary,
                    Value = field1.ToByteArray()
                });

                await sqlConnection.OpenAsync();

                await using (var sqlDataReader = await sqlCommand.ExecuteReaderAsync())
                {
                    if (await sqlDataReader.ReadAsync())
                    {
                        testData = new TestData();

                        if (sqlDataReader.IsDBNull(0))
                        {
                            return null;
                        }

                        testData.Field1 = sqlDataReader.GetGuid(0);
                        testData.Field2 = !sqlDataReader.IsDBNull(1) ? sqlDataReader.GetString(1) : null;
                    }

                }
            }

            return testData;
        }

Load test results using MySqlConnector library

Number of TCP connections before load test image

Number of TCP connections during load test image

Number of TCP connections after load test image

AWS Database connections image

Load test results using MySql.Data library

Number of TCP connections before load test image

Number of TCP connections during load test image

Number of TCP connections after load test image

AWS Database connections image

Comparing the tests, we can see MySqlConnector reached the maximum pool size (in TCP connections, we have 200 tcp connections for the same connection string + 1 for another connection string) but for MySql.Data library, we only have 48 connections. I’m not sure to understand why there is a big discrepancy of connections between the 2 libraries

PS : I can provide MySqlConnector logs if needed

bgrainger commented 3 years ago

Thanks for the detailed writeup.

My first thought is that using async I/O in MySqlConnector could be contributing to the discrepancy (by allowing greater parallelism and throughput). (MySql.Data doesn't support async at all.)

You could test this by adding ;ForceSynchronous=True to your MySqlConnector connection string; this will make all XyzAsync method calls actually execute synchronously, just like MySql.Data. Does that change the results you see?

bgrainger commented 3 years ago

You could also add ;DeferConnectionReset=true to MySqlConnector, as that will eliminate background connection resetting work that reduces latency, but can result in higher concurrent database session usage.

(Both these suggestions will likely reduce MySqlConnector performance, but may have the side effect of reducing the total number of concurrent active connections. IMO, MySqlConnector is "working as designed" here: it's peaking up to the maximum configured limit in order to serve requests.)

bgrainger commented 3 years ago

Also note that MySql.Data doesn't reset the connection state when MySqlConnection.Open(Async) is called, but MySqlConnector does.

For a fair comparison, add ;ConnectionReset=true (recommended) or ;ConnectionReset=false to both connection strings.

huraken commented 3 years ago

Hello, Thanks for your help @bgrainger I just made a load test using the option ForceSynchronous=True with MySqlConnector, but the results were worst : still have 200 connections opened and also this time I got a bunch of errors related to connection timeout or all pooled connections are in use.

image

Connect Timeout expired. All pooled connections are in use.
 ---> MySqlConnector.MySqlException (0x80004005): Connect Timeout expired. All pooled connections are in use.
 ---> System.OperationCanceledException: The operation was canceled.
   at System.Threading.SemaphoreSlim.Wait(Int32 millisecondsTimeout, CancellationToken cancellationToken)
   at MySqlConnector.Core.ConnectionPool.GetSessionAsync(MySqlConnection connection, Int32 startTickCount, IOBehavior ioBehavior, CancellationToken cancellationToken) in /_/src/MySqlConnector/Core/ConnectionPool.cs:line 139
   at MySqlConnector.MySqlConnection.CreateSessionAsync(ConnectionPool pool, Int32 startTickCount, Nullable`1 ioBehavior, CancellationToken cancellationToken) in /_/src/MySqlConnector/MySqlConnection.cs:line 892
   at MySqlConnector.MySqlConnection.CreateSessionAsync(ConnectionPool pool, Int32 startTickCount, Nullable`1 ioBehavior, CancellationToken cancellationToken) in /_/src/MySqlConnector/MySqlConnection.cs:line 892
   at MySqlConnector.MySqlConnection.OpenAsync(Nullable`1 ioBehavior, CancellationToken cancellationToken) in /_/src/MySqlConnector/MySqlConnection.cs:line 435
Connect Timeout expired.
 ---> MySqlConnector.MySqlException (0x80004005): Connect Timeout expired.
 ---> MySqlConnector.MySqlException (0x80004005): Connect Timeout expired.
 ---> System.ObjectDisposedException: Cannot access a disposed object.
Object name: 'System.Net.Sockets.Socket'.
   at System.Net.Sockets.Socket.GetSocketOption(SocketOptionLevel optionLevel, SocketOptionName optionName)
   at MySqlConnector.Core.ServerSession.OpenTcpSocketAsync(ConnectionSettings cs, ILoadBalancer loadBalancer, IOBehavior ioBehavior, CancellationToken cancellationToken) in /_/src/MySqlConnector/Core/ServerSession.cs:line 979
   at MySqlConnector.Core.ServerSession.OpenTcpSocketAsync(ConnectionSettings cs, ILoadBalancer loadBalancer, IOBehavior ioBehavior, CancellationToken cancellationToken)
   at MySqlConnector.Core.ServerSession.ConnectAsync(ConnectionSettings cs, Int32 startTickCount, ILoadBalancer loadBalancer, IOBehavior ioBehavior, CancellationToken cancellationToken) in /_/src/MySqlConnector/Core/ServerSession.cs:line 403
   at MySqlConnector.Core.ConnectionPool.ConnectSessionAsync(String logMessage, Int32 startTickCount, IOBehavior ioBehavior, CancellationToken cancellationToken) in /_/src/MySqlConnector/Core/ConnectionPool.cs:line 427
   at MySqlConnector.Core.ConnectionPool.GetSessionAsync(MySqlConnection connection, Int32 startTickCount, IOBehavior ioBehavior, CancellationToken cancellationToken) in /_/src/MySqlConnector/Core/ConnectionPool.cs:line 139
   at MySqlConnector.Core.ConnectionPool.GetSessionAsync(MySqlConnection connection, Int32 startTickCount, IOBehavior ioBehavior, CancellationToken cancellationToken) in /_/src/MySqlConnector/Core/ConnectionPool.cs:line 139
   at MySqlConnector.MySqlConnection.CreateSessionAsync(ConnectionPool pool, Int32 startTickCount, Nullable`1 ioBehavior, CancellationToken cancellationToken) in /_/src/MySqlConnector/MySqlConnection.cs:line 892
   at MySqlConnector.MySqlConnection.CreateSessionAsync(ConnectionPool pool, Int32 startTickCount, Nullable`1 ioBehavior, CancellationToken cancellationToken) in /_/src/MySqlConnector/MySqlConnection.cs:line 892
   at MySqlConnector.MySqlConnection.OpenAsync(Nullable`1 ioBehavior, CancellationToken cancellationToken) in /_/src/MySqlConnector/MySqlConnection.cs:line 435

I will try you other suggestions today The ideal world would to keep it fully async and avoid reaching the connection pool limit, for me, a max pool size of 200 connections sounds a lot but maybe not

huraken commented 3 years ago

I made another load test using this connection string

Server=***.us-east-1.rds.amazonaws.com;Port=3306;Database=***;Uid=***;Pwd=***;AllowUserVariables=True;ConnectionLifeTime=60;ConnectionTimeout=5;SslMode=None;UseAffectedRows=false;MaximumPoolsize=200;ForceSynchronous=True;DeferConnectionReset=true;ConnectionReset=true;

The pool didn't reach the connection limit this time and I didn't see errors image

However like you mentioned, it will reduce performance and will not act in an async way I'm a bit confused about what should be the best solution : for example, if we increase the MaxPoolSize to 400 or more, is it still possible to get Connect Timeout expired. All pooled connections are in use. or Connect Timeout expired. depending of the current traffic ?

bgrainger commented 3 years ago

I ran a load test that simulate 1K http requests / second

What is the maximum number of simultaneous HTTP connections your load test reaches? If it's 1000 at the beginning of each second, that's going to be quite different behaviour from a consistent rate of 100 simultaneous connections (that take 10ms to execute each).

The ideal world would to keep it fully async and avoid reaching the connection pool limit

If there are 200 simultaneous HTTP connections, you will get 200 simultaneous MySQL connections, by design. This reduces overall latency of your API. So the number of concurrent HTTP connections is critical to understanding the results.

bgrainger commented 3 years ago

The pool didn't reach the connection limit this time and I didn't see errors

I've found an issue with DeferConnectionReset=false (the default) that can increase the total number of MySQL connections to the server, and also increase P99 latency. Just changing that one connection string setting to DeferConnectionReset=true may be enough to eliminate the discrepancy between MySqlConnector and MySql.Data; can you test with just that?

bgrainger commented 3 years ago

@huraken Can you try your original scenario (with the original connection string) again with MySqlConnector 1.3.8? I'm hoping to see some significant improvements.

huraken commented 3 years ago

Indeed it's not 1000 simultaneous http connections, but an average of 1000 http connections sent over one second. During each http request, it will execute sequentially 5 calls to the database (each of them are similar to the code example provided). Looking at some metrics, it took around 30-40ms to process each http request. I'm not sure if there an easy way to get the number of simultaneous connections for a specific moment

Thanks for this fix, it sounds promising, I will run some load tests next monday with the version 1.3.8 and original connection string

huraken commented 3 years ago

@bgrainger I tried to run a new load test with the version 1.3.8 and it still reach the maximum pool size (200) I don't have errors in application logs related to the connection pool In the screenshot, I recorded the number TCP connections to the database (port 3306) and number of TCP connections from the .net service (port 5000). During the first 10 minutes (ramp up phase), the number of connections increase steadily on both sides, but when it approaches 1K connections per second, I noticed some connection spikes with HTTP concurrent connections that could exceed 200

image

bgrainger commented 3 years ago

I tried to run a new load test with the version 1.3.8 and it still reach the maximum pool size (200)

Based on your connection counts, I think this is actually good and desirable; am I wrong?

root@05bef88afad9:/# ss -t | grep -w "5000" | wc -l
502
root@05bef88afad9:/# ss -t | grep -w "3306" | wc -l
201

There are currently 500 HTTP clients connected to your service. The only (?) thing they're doing is executing MySQL operations, plus some ASP.NET overhead (correct?). For maximum throughput, we'd want as many HTTP handlers as possible to be communicating with the database simultaneously. Thus, reaching MaxPoolSize=200 open connections seems like the desirable state.

To me, the question is why does MySql.Data only open 48 connections to serve 500 concurrent HTTP requests? Is it serving them with the same latency as MySqlConnector?

FWIW, I was not able to create a benchmark that ran successfully with MySql.Data; it started logging dozens of HTTP 5xx responses as soon as I got over ~50 concurrent requests.

huraken commented 3 years ago

With an average of 1000 requests per second, and an average latency of 30ms, I don't really understand why we see 200 + concurrent connections. For example, if we have ~30 concurrent connections that took 30 ms to process each http request, it should be enough to handle 1000 requests per second, maybe I'm missing something

image image

Yes this api is only executing MySQL operations + ASP.NET setup I will try another test with MySql.Data and share the latency results

huraken commented 3 years ago

FYI @bgrainger , I ran another load test with MySql.Data library, and we can see we have more than 200 concurrent HTTP requests but the number of DB connections stays under 50

image

About the latency, it seems a bit higher than MySqlConnector, but not sure if it can explains the difference (we can see a spike of 63ms for the average, but otherwise it stays around 30-40 ms)

image

peppy commented 3 years ago

I'd like to mention that I also had some serious issues with connection pooling (game server with 30-50k concurrent users, ~2kqps) and resorted to manually pooling at our end via a simple ConcurrentQueue and wrapper class around MySqlConnection, disabling the MySqlConnector pooling mechanism completely.

IMG_0267

Since doing so we've not run into any issues (we used to on a regular basis, weird thread pool / memory related deaths caused by database operations stalling). I will try and provide some more details on our specific case since I think it may be quite similar to the one being reported here.

peppy commented 3 years ago

I'm going to provide as much context as I can find from my messages with team members over the issue. This is coming from 8 months ago, and at the time I didn't have the bandwidth to report it to the project (apologies for that - we were trying to fix a live production deploy and at the time grasping at straws).

Our project uses kestrel to serve ~10,000 http requests per second on a single instance. Database requests were pooled, with any database operations that can be run async running in a separate low priority queue.

With increasing user load we noticed failures begin to show, which surfaced as kestrel not being able to server requests. In further investigation this was caused by too many kestrel contexts being active simultaneously (which in itself shouldn't be a huge deal, but led to a weird thread pool starvation issue).

In the process of trying everything to figure out what was going wrong, I tried turning off connection pooling in MySqlConnector. Apart from the extra tcp churn and cpu overhead this caused, it did look to solve the larger issue at hand.

When the issue occurred, database operations would stall on retrieving connections from the MySqlConnector pool:

Hypothesis is that there is some weird async logic in the pool retrieval logic which is causing thread pool starvation.

We tried switching to MySql.Data but it had its own issues so we did not follow that path any further.

Our current solution is to disable pooling in MySqlConnector and use a local pool. @bgrainger asked for the code for this so I'll include it below (should be self-explanatority; i've included all our helper functions but the important part is the PooledConnection class and GetConnection method). Using inside of using(var conn = Database.GetConnection()) is intended:

using System;
using System.Data;
using MySqlConnector;

namespace osu_Bancho.Helpers
{
    internal class PooledConnection : IDisposable
    {
        private readonly MySqlConnection conn;

        public MySqlCommand LastCommand;

        public PooledConnection(MySqlConnection conn)
        {
            this.conn = conn;
            conn.Open();
        }

        public MySqlConnection Connection
        {
            get
            {
                LastCommand?.Dispose();
                LastCommand = null;

                return conn;
            }
        }

        public void Dispose()
        {
            if (Bancho.UseLocalConnectionPool && Database.Connections.Count < Bancho.LocalConnectionPoolMaxConnections)
            {
                if (Connection.State == ConnectionState.Open)
                    Database.Connections.Enqueue(this);
            }
            else
            {
                Connection.Dispose();
            }
        }
    }

    internal static class Database
    {
        private static string connectionString;

        public static string ConnectionString
        {
            get => connectionString;
            set
            {
                if (connectionString == value) return;

                connectionString = value;

                MySqlConnection.ClearAllPools();
                Connections.Clear();
            }
        }

        public static ConcurrentQueue<PooledConnection> Connections = new ConcurrentQueue<PooledConnection>();

        internal static PooledConnection GetConnection()
        {
            if (Bancho.UseLocalConnectionPool && Connections.TryDequeue(out var local) && local.Connection.State == ConnectionState.Open)
            {
                DogStatsd.Increment("database_queries", sampleRate: 0.1, tags: new[] { "retrieval:localpool" });
                return local;
            }

            DogStatsd.Increment("database_queries", sampleRate: 0.1, tags: new[] { "retrieval:connectorpool" });
            return new PooledConnection(new MySqlConnection(ConnectionString));
        }

        internal static MySqlDataReader RunQuery(PooledConnection conn, string sqlString, params MySqlParameter[] parameters)
        {
            MySqlCommand c = conn.Connection.CreateCommand();

            // store command to connection for clean-up before re-use.
            conn.LastCommand = c;

            if (parameters != null)
                c.Parameters.AddRange(parameters);
            c.CommandText = sqlString;
            c.CommandTimeout = 5;
            return c.ExecuteReader();
        }

        internal static object RunQueryOne(PooledConnection conn, string sqlString, params MySqlParameter[] parameters)
        {
            using (MySqlCommand c = conn.Connection.CreateCommand())
            {
                c.Parameters.AddRange(parameters);
                c.CommandText = sqlString;
                c.CommandTimeout = 5;

                using (var reader = c.ExecuteReader())
                {
                    if (!reader.Read())
                        return null;

                    return reader.GetValue(0);
                }
            }
        }

        internal static void RunNonQuery(PooledConnection conn, string sqlString, params MySqlParameter[] parameters)
        {
            using (MySqlCommand c = conn.Connection.CreateCommand())
            {
                c.Parameters.AddRange(parameters);
                c.CommandText = sqlString;
                c.CommandTimeout = 5;
                c.ExecuteNonQuery();
            }
        }

        internal static DataSet RunDataset(PooledConnection conn, string sqlString, params MySqlParameter[] parameters)
        {
            using (var c = conn.Connection.CreateCommand())
            using (var da = new MySqlDataAdapter())
            using (var ds = new DataSet())
            {
                c.CommandText = sqlString;
                c.Parameters.AddRange(parameters);

                da.SelectCommand = c;
                da.Fill(ds);
                return ds;
            }
        }
    }
}

This has scaled to over 200% of the previous capacity where we were running into the aforementioned issues, so we haven't touched it since. Here's a view of "slow" (>50ms to process a request) kestrel contexts before and after this single change:

I know I also did several dotnet trace captures on production and analysed further, which will likely give more insight into this issue, but I can't immediately find details on this. I will post again if I manage to dig these up.

Sorry is this isn't too useful, but if looking into this further I would definitely start by checking thread pool usage during the reported connection spikes and see if there is any correlation. If so, it may be an async/wait root cause. It may also be more likely to show itself if multiple components are using the TPL thread pool.

Also note that I'm not 100% sure this correlates to the issue reported in this thread, but hopefully may serve as a path for further investigation. @huraken I'd propose you give my workaround above a test, but also monitor TPL usage at your end to see if there is potential contention there.

huraken commented 3 years ago

Hello @peppy ,

Thanks a lot for your detailed reply, I'm sure it will be useful

I will try your custom solution to manage DB connections pooling.

But before that I would need to have more visibility using metrics, and I want to understand a few things :

Could you explain a bit more what you mean by "Kestrel contexts" ? How did you get this information ?

About the thread pool metrics, I guess you use the method ThreadPool.GetAvailableThreads and push it regularly in your metrics system ?

In your code example, you're not using any Async methods available in MySqlConnector library, is it expected ?

PS : do you remember what MySqlConnector version did you use at the time ?

peppy commented 3 years ago

Kestrel contexts is specific to my use case. The rest of what you say is correct. We track the latest version of MySqlConnector. You can use async versions of functions (just run directly on the connection; ignore the helper functions I included).

huraken commented 3 years ago

@bgrainger I checked our application logs for one container and I noticed a lot of messages that seems related to creation/close of DB connections Currently we have the setting "ConnectionLifeTime=60" to be able to load balance between existing DB instances, and also if we have a new DB instance that spawns, application will be able to send traffic to this new instance.

I suspect this setting is causing some spike of connections needed in the pool at the same time. Do you know if we can distribute more smoothly the closure of connections: for example, instead of a fixed number like 60 seconds, it will close connections between 60 and 70 seconds when they return to the pool ?

During load testing image

One minute later image

Another question, if we don't use this form to create a MySqlConnection

await using (var sqlConnection = new MySqlConnection(connectionString))
await using (var sqlCommand = new MySqlCommand(query, sqlConnection))
{
        await sqlConnection.OpenAsync();
}

but instead

using (var sqlConnection = new MySqlConnection(connectionString))
using (var sqlCommand = new MySqlCommand(query, sqlConnection))
{
        await sqlConnection.OpenAsync();
}

Could it cause some issue when disposing MySqlConnection instance ?

Thank you

bgrainger commented 3 years ago

Currently we have the setting "ConnectionLifeTime=60" ... I suspect this setting is causing some spike of connections needed in the pool at the same time.

Yes, I suspect that's true. A bunch of connections will be created when the app starts up, be used for a minute, then all die, and another spike of connections will be created.

It would be better to smooth them out, but I'm not sure how to accomplish that with the current options. Ideally, perhaps some jitter strategy would be desired. Perhaps it would be configurable, or perhaps we hardcode it to be ConnectionLifeTime ± Min(60s, Max(10s, ConnectionLifeTime * 10%))?

huraken commented 3 years ago

I was thinking maybe about a new experimental option in the connection string, because if it's hardcoded it could be confusing to understand the ConnectionLifeTime option.

This option (ex: ConnectionLifeTimeJitter) will allow to define a percentage , for example, we'll have the following formula to know the real lifetime for each connection : Real ConnectionLifeTime = ConnectionLifeTime + Random( 0, ConnectionLifeTime * ConnectionLifeTimeJitter %) The default value could be 0 % so it will not enable jitter.

What do you think ?

mchavarriae commented 3 years ago

Hi @peppy do you have an example of how to use your connection pool?

I tried to copy and paste your code but I don't know what is Bancho.UseLocalConnectionPool and Bancho.LocalConnectionPoolMaxConnections?

Thanks

peppy commented 3 years ago

That's unrelated to usage, you should be able to remove or replace that.

mchavarriae commented 3 years ago

Yes you're right. I just commented out. Do you have an example of how to use your pool?

peppy commented 3 years ago

Sorry I don't. If you can't figure out why/what it is you probably should not be using it (and using the MySqlConnector pooling as provided).