taoensso / carmine

Redis client + message queue for Clojure
https://www.taoensso.com/carmine
Eclipse Public License 1.0
1.15k stars 131 forks source link

The atomic macro's extra connection can exhaust thread pools #135

Closed gfredericks closed 9 years ago

gfredericks commented 9 years ago

Intentionally or unintentionally (I can't tell) the way that the atomic macro ends up expanding to nested wcar calls means that any use of atomic will require at least two connections from the pool.

This doesn't sound terrible, but it can have racey effects when using message queue workers whenever there are more workers than the connections in the pool: as many workers as possible acquire their first connection until the pool is exhausted, and then they all block trying to acquire their second connection.

I tweaked the wcar macro to reuse its connection for nested calls, and this caused my tests to pass. I can't seem to successfully run the carmine test suite so I can't verify that it passes as well.

I don't understand the carmine codebase well enough to guess whether this is a satisfactory solution (e.g., is it important for some reason that there be a separate connection used for atomic things?). If it is, I can prepare a PR.

Otherwise, should we add a warning to the message queue docs about the pool size? Or is there another approach to solving the problem?

ptaoussanis commented 9 years ago

Hi Gary,

What makes you think that atomic's using nested wcar calls? I'm looking at the macro now (pre-coffee, granted) - but not seeing any nested connections?

May help to see what change you've made.

Also, have you seen issue #127? Hoping to cut a release for that later today. Seems like the problem you're seeing may be related?

gfredericks commented 9 years ago

Atomic itself does not used nested wcar calls; when I said that I was assuming there was no reason to use atomic unless you intended to call wcar in the body, which would therefore be nested (inside the wcar that atomic itself sets up).

I saw #127 and therefore did a few tests with the 2.4 release of the commons lib, and saw the same behavior.

ptaoussanis commented 9 years ago

atomic takes a body which is executed using the top-level connection. Have you seen the example in the atomic docstring? Maybe it'd help if you could show me some code so I can see what you're trying to do?

gfredericks commented 9 years ago

The body supplied to atomic can (and does, in the case of handle1) have calls to wcar, which open a new connection.

To demonstrate the problem I created this commit which logs connection pool activity, and then added a scratch.clj on this branch which tries to send a single message over a queue with a specified connection pool size.

In the results below, you can see two connections being acquired in the case where the pool is large enough (2), and the execution hangs trying to acquire the second connection in the case where the pool is not large enough (1).

Pool Size 2

$ CARMINE_POOL_SIZE=2 lein do clean, run -m clojure.main scratch.clj 2>/dev/null
BORROW :scratch
RETURN :scratch
2015-Jun-04 12:19:44 +0000 gary1 INFO [taoensso.carmine.message-queue] - Message queue worker starting: :my-queue
BORROW :start-polling-loop-1
RETURN :start-polling-loop-1
BORROW :start-polling-loop-1
RETURN :start-polling-loop-1
BORROW :start-polling-loop-1
RETURN :start-polling-loop-1
BORROW :start-polling-loop-2
RETURN :start-polling-loop-2
BORROW :start-polling-loop-1
RETURN :start-polling-loop-1
BORROW :start-polling-loop-1
RETURN :start-polling-loop-1
Received :my-msg
2015-Jun-04 12:19:45 +0000 gary1 INFO [taoensso.carmine.message-queue] - Message queue worker stopped: :my-queue
BORROW :atomic-outer
BORROW :handle1
RETURN :handle1
RETURN :atomic-outer

Pool Size 1

$ CARMINE_POOL_SIZE=1 lein do clean, run -m clojure.main scratch.clj 2>/dev/null
BORROW :scratch
RETURN :scratch
2015-Jun-04 12:18:52 +0000 gary1 INFO [taoensso.carmine.message-queue] - Message queue worker starting: :my-queue
BORROW :start-polling-loop-1
RETURN :start-polling-loop-1
BORROW :start-polling-loop-1
RETURN :start-polling-loop-1
Received :my-msg
BORROW :atomic-outer
2015-Jun-04 12:18:53 +0000 gary1 INFO [taoensso.carmine.message-queue] - Message queue worker stopped: :my-queue
BORROW :handle1
gfredericks commented 9 years ago

The potential fix I mentioned uses a dynamic var to keep track of the connection for the scope of the wcar call, so that nested calls can reuse it. This commit is my proof-of-concept for that idea.

My issue-135-patched contains that fixed merged into the issue-135 branch from my previous comment. Here is the new output for pool size 1 (which is the same as the output for pool size 2):

$ CARMINE_POOL_SIZE=1 lein do clean, run -m clojure.main scratch.clj 2>/dev/null
BORROW :scratch
RETURN :scratch
2015-Jun-04 15:22:11 +0000 gary1 INFO [taoensso.carmine.message-queue] - Message queue worker starting: :my-queue
BORROW :start-polling-loop-1
RETURN :start-polling-loop-1
Received :my-msg
2015-Jun-04 15:22:11 +0000 gary1 INFO [taoensso.carmine.message-queue] - Message queue worker stopped: :my-queue
BORROW :atomic-outer
RETURN :atomic-outer
ptaoussanis commented 9 years ago

Hey Gary, thanks a lot for the additional info - that helps!

Will try clarify a few things:

Intentionally or unintentionally (I can't tell) the way that the atomic macro ends up expanding to nested wcar calls means that any use of atomic will require at least two connections from the pool.

The atomic macro doesn't expand to nested wcar calls unless you specifically ask it to (i.e. unless you include wcar calls in the atomic macro's body.

Atomic itself does not used nested wcar calls; when I said that I was assuming there was no reason to use atomic unless you intended to call wcar in the body, which would therefore be nested (inside the wcar that atomic itself sets up).

There are indeed reasons to use atomic without wcar calls in the body; that's actually the commonest use case. The body is executed within the context of the atomic macro's (implicit) connection. The atomic docstring illustrates this, but might not be clear enough? I'll try beef up the explanation.

The body supplied to atomic can (and does, in the case of handle1) have calls to wcar, which open a new connection.

So this is actually the fault of handle1, not atomic. It is indeed a fault btw; you'll sometimes want a nested wcar call - but you don't need one here (in handle1). The correct solution is to switch handle1's nested wcar to a with-replies call - I'll do that momentarily.

as many workers as possible acquire their first connection until the pool is exhausted, and then they all block trying to acquire their second connection

This is something I hadn't considered, thanks for bringing it to my attention. It's definitely a bug.

I might also suggest bumping your pool size btw. Even without the 2 conns/thread, worker threads poll (block) by design - so you'll too easily end up connection starved unless your pool limit is >> your worker thread count.

gfredericks commented 9 years ago

Thanks Peter! I think all of this makes sense. I hadn't studied atomic too closely since I wasn't myself writing code that used it, I just took for granted that handle1 was using it correctly. So the documentation may well be sufficient for someone looking to use it directly.

It's definitely also helpful to know that the pool size should be greater than the thread count. I had always figured that was probably a good idea, but not strictly necessary.

Thanks again.

ptaoussanis commented 9 years ago

I just took for granted that handle1 was using it correctly.

That should have been a safe assumption ;-)

Have cut a v2.11.1 release: https://github.com/ptaoussanis/carmine/releases - would you mind confirming that it solves your issue?

Thanks again for the report, cheers! :-)

gfredericks commented 9 years ago

2.11.1 looks to have fixed the problem, thanks!

ptaoussanis commented 9 years ago

Great, thanks for the confirmation.