rebus-org / Rebus.PostgreSql

:bus: PostgreSQL persistence for Rebus
https://mookid.dk/category/rebus

Transport #1

Closed jmkelly closed 7 years ago

jmkelly commented 7 years ago

This is a crack at getting PostgreSql to handle the Rebus transport. As it stands, though, I wouldn't merge this: I hit a wall with how PostgreSql / Rebus handles the db connections, so if anyone has some bright ideas, I'm all ears.

As it stands, in TestPostgreSqlTransportReceivePerformance, if you set the number of messages in the test above PostgreSQL's max_connections value, it will exhaust the connection pool (though it seems to vary exactly when the pool gets exhausted). max_connections is traditionally set to 100 in new installs, so I could just set it higher (in the connection string, I believe), but I think that would be hiding other issues.
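For reference, the server-side cap and the current connection count can both be checked directly against the test database while the test runs (illustrative SQL only):

```sql
-- Show the server-side connection cap (defaults to 100 on a stock install)
SHOW max_connections;

-- Count connections currently in use, to watch the pool being exhausted
SELECT count(*) FROM pg_stat_activity;
```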

Reading up on Npgsql, it should return connections to the pool once they are closed, so I'm thinking it's something to do with my connection logic, which I largely copied from the SqlServer implementation.

The queue pop is implemented using PostgreSQL 9.5's SKIP LOCKED syntax, so this won't work on versions below 9.5, but performance should be up there with SqlServer, if not better.
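The shape of that receive query is roughly the following (a sketch only — the actual table and column names in the PR may differ):

```sql
-- Pop one message atomically: FOR UPDATE SKIP LOCKED (PostgreSQL 9.5+)
-- lets concurrent workers each grab a different row instead of
-- blocking on the same locked one
DELETE FROM messages
WHERE id = (
    SELECT id
    FROM messages
    ORDER BY id
    FOR UPDATE SKIP LOCKED
    LIMIT 1
)
RETURNING headers, body;
```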


Rebus is MIT-licensed. The code submitted in this pull request needs to carry the MIT license too. By leaving this text in, I hereby acknowledge that the code submitted in the pull request has the MIT license and can be merged with the Rebus codebase.

jmkelly commented 7 years ago

This may be related to https://github.com/npgsql/npgsql/issues/1232. That thread gives some good ideas; I'll also update to a newer version of Npgsql to see if it's fixed.

mookid8000 commented 7 years ago

It's super awesome that you are doing this πŸ˜„

I can recall having come across someone complaining that Npgsql had a leak in its connection pool – I have no idea whether it could be related to this issue though

mookid8000 commented 7 years ago

Good to see the tests pass πŸ‘ πŸ˜„

Was it just a matter of increasing the timeout?

mookid8000 commented 7 years ago

PS: it's a little bit slow, but it's not THAT slow....

I receive around 650 msg/s on my machine. IIRC with the SQL Server transport I can receive ~1000 msg/s, so it's definitely in the same league.

jmkelly commented 7 years ago

Turns out that if you don't set the maximum pool size in your connection string, Npgsql seems to assume a value that is far too high. I've set it to 30 (which is supposed to be the default), haven't had problems since, and the tests now pass.
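For anyone following along, the pool cap goes in the Npgsql connection string via the Maximum Pool Size keyword (host, database, and credentials below are placeholder values):

```
Host=localhost;Database=rebus_test;Username=postgres;Password=secret;Maximum Pool Size=30
```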

jmkelly commented 7 years ago

Maybe I need to tweak the pg settings a little. I'm only getting 324 msg/s on a stock-standard pg install on Windows, but 871 msg/s on Sql Server (both on localhost).

mookid8000 commented 7 years ago

If it's OK with you, I'd like to merge it now

mookid8000 commented 7 years ago

....unless you want to tweak first?

mookid8000 commented 7 years ago

Aaah.... check this out:

readonly AsyncBottleneck _bottleneck = new AsyncBottleneck(Environment.ProcessorCount);

What this means (on my machine) is that only four asynchronous receive operations will be allowed at any point in time. Since the Receive method does a couple of awaits, the CPUs can be much better utilized by allowing more async operations – e.g.

readonly AsyncBottleneck _bottleneck = new AsyncBottleneck(20);

makes this:

10000 messages received in 20,1 s - that's 498,3 msg/s
10000 messages received in 17,1 s - that's 584,5 msg/s
10000 messages received in 23,4 s - that's 427,9 msg/s

into this:

10000 messages received in 9,3 s - that's 1079,6 msg/s
10000 messages received in 8,9 s - that's 1129,8 msg/s
10000 messages received in 9,5 s - that's 1047,2 msg/s

mookid8000 commented 7 years ago

The AsyncBottleneck was introduced in the SQL Server transport because otherwise the connection pool would quickly be depleted, resulting in timeouts when trying to get the next connection.
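For readers unfamiliar with the type: conceptually, AsyncBottleneck is just an async semaphore that caps how many receive operations are in flight at once. Here's a minimal sketch of the idea — the names and exact API are illustrative, not Rebus' actual implementation:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Illustrative async throttle: at most maxParallelism callers may hold
// a slot at once; disposing the returned token frees the slot.
class AsyncBottleneck
{
    readonly SemaphoreSlim _semaphore;

    public AsyncBottleneck(int maxParallelism)
    {
        _semaphore = new SemaphoreSlim(maxParallelism, maxParallelism);
    }

    public async Task<IDisposable> Enter(CancellationToken cancellationToken)
    {
        // Asynchronously wait for a free slot instead of blocking a thread
        await _semaphore.WaitAsync(cancellationToken);
        return new Releaser(_semaphore);
    }

    class Releaser : IDisposable
    {
        readonly SemaphoreSlim _semaphore;
        public Releaser(SemaphoreSlim semaphore) => _semaphore = semaphore;
        public void Dispose() => _semaphore.Release();
    }
}
```

A transport's Receive would then wrap its body in something like `using (await _bottleneck.Enter(token)) { ... }`, which is why raising the constructor argument from Environment.ProcessorCount to 20 directly raises the number of concurrent receives.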

jmkelly commented 7 years ago

OK, I set the bottleneck to the same value as SqlServer based on your comment. I thought it would be better set to the processor count, but that was just an early stab and I forgot to change it back (this is my first go at async/await stuff, so it's been a good learning experience).

What code are you getting those perf numbers from? Just want to make sure we're on the same page and talking about the same numbers, mostly because I'm not seeing the speedup you got on the tests I've run.

Happy for you to merge anytime though.

mookid8000 commented 7 years ago

I got the perf numbers simply by running the NizzleName test case with the specified number of messages...

jmkelly commented 7 years ago

Thanks. That was the one I was using as well, but I'm still only getting ~400 msg/s. It may be something to do with that maximum pool setting; I'll have a play with it tomorrow and see if I can get those numbers up a bit.

jmkelly commented 7 years ago

I'll leave as is for the moment.

If I bump up the number of workers on the receiver, I can get a lot more throughput. But I can't seem to explain why sometimes I get ~250 msg/s per worker and other times only ~50 msg/s per worker. I'd say there is something going on inside postgresql, possibly to do with the vacuum. With a few more workers, though, it can get over 1000 msg/s total quite comfortably.

Looking at the graphs provided by the BigSql installation, when we go into the receive phase there are approx 3x as many selects per second as deletes per second, and this ratio stays constant. The number of deletes matches the msg/s processed, so this may be somewhere to look for performance gains.
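The same activity can be watched without the BigSql graphs via PostgreSQL's statistics views — the table name below is an assumption standing in for the transport's actual message table:

```sql
-- Per-table tuple activity: reads vs deletes, plus dead tuples that
-- autovacuum has yet to reclaim (a possible cause of the slow phases)
SELECT seq_tup_read, idx_tup_fetch, n_tup_del, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'messages';
```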

mookid8000 commented 7 years ago

Thanks for the contribution πŸ‘ it's out in 2.1.0 now πŸ˜„

jmkelly commented 7 years ago

Awesome!