mysql-net / MySqlConnector

MySQL Connector for .NET
https://mysqlconnector.net
MIT License
1.39k stars 335 forks source link

Add lock-free connection pool #462

Open bgrainger opened 6 years ago

bgrainger commented 6 years ago

See example from @sebastienros here: https://github.com/sebastienros/MySqlConnector/commit/ffe984fcaff4f7571b19d75f5ae356c3f377c61c

Also implement the optimisation to stagger probes into the pool from @roji: https://github.com/npgsql/npgsql/commit/19c436e41a64392c7e9a068882976880e2d3d96b#diff-33bb36ea0d45f66ae316561892d5b5a9R236

roji commented 6 years ago

Wow, @sebastienros and @bgrainger, I didn't know you were working on a lock-free pool...

The more I work and think about the pooling issue, the more I'm convinced there needs to be a totally common component (https://github.com/dotnet/corefx/issues/26714) shared across providers (in fact, I'm thinking about a generic pooling API that wouldn't even be specific to ADO.NET/databases). I'm planning to spend some time on making my implementation a bit more generic to go in this direction, any reviewing/feedback would be really great (here's my pool). I think that if you guys are planning to work on the pool it may be a thing we should try to do together, and maybe even go towards a shared component.

Anyway, @sebastienros I've taken a quick look, here are some comments:

bgrainger commented 6 years ago

It seems that (by default) MySqlConnector resets connection state every time a pooled connection is opened. If I'm reading the code right, this is an actual database roundtrip that's done every time.

That's correct.

Npgsql used to work like this, until I changed it to "enqueue" the reset messages and to send them as part of the first statement after the connection is reopened. This effectively eliminates a roundtrip for each pooled open/close, which is pretty significant.

My last attempt at improving this was to reset the connection in the background: #178. I think queueing the reset message is a better approach except...

I occasionally receive complaints from users that the pool sometimes broken connections

... performing the reset connection while retrieving a connection from the pool has the very nice side-effect that the connection is actually open when Open returns...

a connection may broken at any point

... unless of course there is a catastrophic network/server failure a nanosecond after Open returns.

In theory, I agree that users should expect that any database call can fail and that appropriate retry logic needs to be implemented. In practice, it seems like there's a lot of code that just assumes everything's going to be just fine if Open returns without throwing an exception; pushing that exception (for a broken pooled connection) down to the first call to ExecuteReader etc. is a performance-vs-reliability tradeoff that (currently) doesn't seem worth making by default.

(I'm all for allowing users to opt into high-performance at the potential cost of correctness or reliability if they determine they absolutely need it. However, it seems best to be "safe" by default. And FWIW MySqlConnector is already faster than Connector/NET for retrieving a pooled connection even with ConnectionReset=true. So MySQL users can get safety and better performance by switching ADO.NET providers, then opt into higher performance if they need it.)

sebastienros commented 6 years ago

It's more of a POC to see how it would benefit performance, but your comments are great to go towards a better final implementation.

bgrainger commented 6 years ago

there needs to be a totally common component shared across providers

I'd love to be able to reuse a high-performance, well-tested, already-debugged component (as opposed to having to develop and support my own implementation).

It would be nice to be able to reuse common code without introducing a new dependency. Are NuGet source-only packages still a thing? (I've not used them in the past, and I'm not sure if they made it into the new world or not.) Git submodules might be an answer? (They're not without their own problems, of course.)

roji commented 6 years ago

(I'm all for allowing users to opt into high-performance at the potential cost of correctness or reliability if they determine they absolutely need it. However, it seems best to be "safe" by default. And FWIW MySqlConnector is already faster than Connector/NET for retrieving a pooled connection even with ConnectionReset=true. So MySQL users can get safety and better performance by switching ADO.NET providers, then opt into higher performance if they need it.)

I agree that there's a bit of a tradeoff here, although it seems to be the SqlClient behavior:

If a connection exists to a server that has disappeared, this connection can be drawn from the pool even if the connection pooler has not detected the severed connection and marked it as invalid. This is the case because the overhead of checking that the connection is still valid would eliminate the benefits of having a pooler by causing another round trip to the server to occur. When this occurs, the first attempt to use the connection will detect that the connection has been severed, and an exception is thrown.

A similar note can be found in the ODBC driver developer guidelines:

When the Driver Manager is pooling connections, it needs to be able to determine if a connection is still working before handing out the connection. Otherwise, the Driver Manager keeps on handing out the dead connection to the application whenever a transient network failure occurs. A new connection attribute has been defined in ODBC 3.x: SQL_ATTR_CONNECTION_DEAD. This is a read-only connection attribute that returns either SQL_CD_TRUE or SQL_CD_FALSE. The value SQL_CD_TRUE means that the connection has been lost, while the value SQL_CD_FALSE means that the connection is still active. (Drivers conforming to earlier versions of ODBC can also support this attribute.)

A driver must implement this option efficiently or it will impair the connection pooling performance. Specifically, a call to get this connection attribute should not cause a round trip to the server. Instead, a driver should just return the last known state of the connection. The connection is dead if the last trip to the server failed, and not dead if the last trip succeeded.

As I said above, in Npgsql this can be somewhat mitigated via keepalive, an opt-in feature which does a roundtrip on connections every X seconds (even when in the pool). The advantage of this is that it does not happen on the code path of Open() or Close() - it's a totally "out-of-band" operation that checks your connection without affecting performance.

But you're right that some users may be surprised here, I'd at least recommend an opt-out from this for high-perf apps.

It would be nice to be able to reuse common code without introducing a new dependency. Are NuGet source-only packages still a thing? (I've not used them in the past, and I'm not sure if they made it into the new world or not.) Git submodules might be an answer? (They're not without their own problems, of course.)

Having this as a source-only nuget is a great idea, I think they still exist (see this thread).

In any case I'll look into making my own pool less coupled to PostgreSQL etc.

bgrainger commented 6 years ago

I agree that there's a bit of a tradeoff here, although it seems to be the SqlClient behavior

Thanks for this pointer; it's helpful.

bgrainger commented 5 years ago

@roji By the time you told me you were planning to work on abstracting a connection pool, I had already ported most of the code over to MySqlConnector (but I didn't mention it, since it was still WIP). I decided to clean it up today and start to run some performance tests: #614.

They also say you shouldn't create an abstraction until you've implemented it 3 times (or something like that) so it might be helpful to see how I needed to tweak the implementation to meet my specific needs. (Hopefully not breaking it in the process.)

bgrainger commented 5 years ago

@sebastienros I've tested these commits with TFB Docker locally, but I don't think I have sufficient system resources (particularly with the overhead of Docker for Windows) to see if the changes make a difference. (That is, I'm think already able to push my system to 100% usage and making the code faster doesn't result in any more requests per second.)

If you're able, can you test the following commits and report back the speeds achieved? I'm interested to know if any individual commit results in a significant performance increase or regression:

roji commented 5 years ago

/cc @divega @ajcvickers @bricelam

It's great to see this, and I'm really looking forward to seeing the perf impact here.

FWIW we don't have to do an actual reusable component with an abstraction - porting the code could be a very good start, and as you say having an implement on 2 providers is a much better place to start from if we do want a reusable component. My plate is quite full at the moment anyway, let's see where your efforts lead - I may also end up taking some of your improvements back into Npgsql at some point (e.g. the LIFO behavior).

bgrainger commented 5 years ago

@sebastienros I've taken those six commits and applied them to (a fork of) FrameworkBenchmarks (using a submodule), if that makes testing easier:

sebastienros commented 5 years ago

I ran all these branches on our local physical machines, which are not the TE ones but the ratio in perf will be the same.

Only the RPS and Average lantency numbers matter in this case

| Description |     RPS | CPU (%) | Memory (MB) | Avg. Latency (ms) | Startup (ms) | First Request (ms) | Latency (ms) | Ratio |
| ----------- | ------- | ------- | ----------- | ----------------- | ------------ | ------------------ | ------------ | ----- |
|   TE master | 106,406 |      93 |         395 |              2.67 |          834 |             491.85 |         1.35 |  1.00 | 
|     c2beb99 | 105,860 |      92 |         389 |              2.68 |          796 |             517.37 |         1.15 |  0.99 |  
|     2863905 | 105,049 |      94 |         392 |              2.70 |          804 |             528.89 |         1.21 |  0.99 | 
|     6ea700c | 108,851 |      93 |         386 |              2.66 |          795 |             517.35 |         1.41 |  1.02 | 
|     0b4ee38 | 105,834 |      92 |         401 |              2.71 |          810 |              525.6 |         1.40 |  0.99 | 
|     51a85d5 | 104,385 |      91 |         391 |              2.74 |          798 |             525.96 |         1.31 |  0.98 | 
|     8881b4b | 105,104 |      98 |         386 |              2.62 |          815 |             509.18 |         1.32 |  0.99 |
sebastienros commented 5 years ago

For future reference (note to myself) I am adding the command line I used to run these:

dotnet run -- --server http://asp-perf-lin:5001 --client http://asp-perf-load:5002 `
-r https://github.com/bgrainger/FrameworkBenchmarks@master#8881b4b `
--docker-file frameworks/CSharp/aspnetcore/aspcore-ado-my.dockerfile `
--docker-image aspnetcore_ado_my `
--docker-context frameworks/CSharp/aspnetcore/ `
--ready-text "Application started." `
--clientName bombardier `
--path /fortunes --port 8080 `
--diff TechEmpower --save 8881b4b
sebastienros commented 5 years ago

If you expected more variance please provide a PR that makes things really bad to ensure I ran these correctly.

bgrainger commented 5 years ago

@sebastienros Here's a PR with Thread.Sleep(1) added; it should reduce performance by an order of magnitude:

https://github.com/bgrainger/FrameworkBenchmarks/commit/f288f97ff51d2110e1954a9b4359854cd8ea6d09

sebastienros commented 5 years ago

It works, which means the previous results are valid

|     RPS | CPU (%) | Memory (MB) | Avg. Latency (ms) | Startup (ms) | First Request (ms) | Latency (ms) | Ratio |
| ------- | ------- | ----------- | ----------------- | ------------ | ------------------ | ------------ | ----- |
| 106,406 |      93 |         395 |              2.67 |          834 |             491.85 |         1.35 |  1.00 |
|  16,731 |      28 |         402 |             24.92 |          791 |             520.86 |         2.37 |  0.16 |
mkromkamp commented 5 years ago

Is this still actively discussed and/or developed? I'm having a certain workload that benefits from higher throughput/rps even it would mean a slight payoff in latency.

If this is still on-going I'm willing to do some testing to see if it could benefit such use-cases in real life. :+1:

bgrainger commented 5 years ago

The code is at https://github.com/bgrainger/MySqlConnector/commits/lock-free-pool (except for the last commit on that branch).

It wasn't clear to me from the results above that this actually improved performance, so given the added complexity I decided not to integrate it yet. If you have a scenario where you can build MySqlConnector from source and test it, I'm very interested to hear your results.

Note that MySqlConnector logs over 200k RPS on the TFB Benchmarks (see aspcore-ado-my) so it's already capable of sustaining high throughput. What kind of numbers are you seeing and what are you hoping to achieve?

mkromkamp commented 5 years ago

The code is at https://github.com/bgrainger/MySqlConnector/commits/lock-free-pool (except for the last commit on that branch).

It wasn't clear to me from the results above that this actually improved performance, so given the added complexity I decided not to integrate it yet. If you have a scenario where you can build MySqlConnector from source and test it, I'm very interested to hear your results.

Thanks for pointing this out. It is also not clear for me if the there are real benifits in the implementation. Just wanted to reach out to see if the conversation would benefit from some additional testing. I will take a look at building the library from source and hook it into my application. If I have any meaningfull information to share I will drop you a message over here.

Note that MySqlConnector logs over 200k RPS on the TFB Benchmarks (see aspcore-ado-my) so it's already capable of sustaining high throughput. What kind of numbers are you seeing and what are you hoping to achieve?

That is impressive, and from looking at the benchmark, it is not clear MySqlConnector is used over here. Might be worth adding this somewhere on the website and/or readme on GitHub. Do you know anything about the setup they are using(MySql config, how many webservers, connection configuration, etc)? Might give some great information how to squeeze every bit of performance out of such a situation :+1:

bgrainger commented 5 years ago

TechEmpower server hardware is at https://www.techempower.com/benchmarks/#section=environment:

Citrine (rounds 16 onward) Three homogeneous Dell R440 servers each equipped with an Intel Xeon Gold 5120 CPU, 32 GB of memory, and an enterprise SSD. Dedicated Cisco 10-gigabit Ethernet switch. Provided by Microsoft.

I believe they use one for the benchmarking client, one for the web framework being tested, and one for the database server. I have not found any information on how they're tuning MySQL for performance in this environment.

bgrainger commented 5 years ago

I will take a look at building the library from source and hook it into my application.

Note that the lock-free code branched from the main library somewhere between 0.49.3 and 0.50.0; there shouldn't be any major performance changes between those two versions but obviously some of the newer features won't be present. It looks like it won't currently merge cleanly.