vapor / async-kit

Sugary extensions for the SwiftNIO library
MIT License

Detect connection pool deadlocks #63

Closed grosch closed 4 years ago

grosch commented 4 years ago

Something like this will hang the client, and in db log we get

LOG: unexpected EOF on client connection with an open transaction

Client.find(id, on: req.db).flatMap {
    // Validate we want to make updates
    req.db.transaction { db in
        // Do stuff
    }
}
I don't really want to create a transaction if I don't have to. In the case where I discovered this I'm doing multiple queries against multiple tables, and only under certain conditions do I then create a transaction and update multiple different tables.

tanner0101 commented 4 years ago

Nesting a transaction should work as you have it there. Which db driver are you using?

grosch commented 4 years ago

Fluent

grosch commented 4 years ago

Sigh. And by Fluent I mean PostgreSQL

grosch commented 4 years ago

This is all it takes to cause a deadlock in the app:

func badMonkey(req: Request) throws -> EventLoopFuture<HTTPStatus> {
    req.db.transaction { db in
        Point.query(on: req.db).all().transform(to: .noContent)
    }
}

Notice how I used req.db inside the transaction.

tanner0101 commented 4 years ago

Hmm, that's different from the original example. It makes sense that would not work, since you should only use the db supplied to the closure while in a transaction. Try refactoring like this:

func badMonkey(req: Request) throws -> EventLoopFuture<HTTPStatus> {
    let points = Point.query(on: req.db).all()
    return req.db.transaction { db in
        points...
    }.transform(to: .noContent)
}

Or this:

func badMonkey(req: Request) throws -> EventLoopFuture<HTTPStatus> {
    req.db.transaction { db in
        Point.query(on: db).all().transform(to: .noContent)
    }
}

To clarify, for this to work you need two connections:

req.db.transaction { db in
    Point.query(on: req.db).all().transform(to: .noContent)
}

One in a transaction and one normal connection. These must be separate connections since once a connection is in a transaction all queries on that connection will be a part of the transaction.

This is deadlocking immediately since FluentPostgresDriver currently defaults to 1 connection max per pool: https://github.com/vapor/fluent-postgres-driver/blob/master/Sources/FluentPostgresDriver/FluentPostgresConfiguration.swift#L22

Increasing that number should stop this from always deadlocking, but I'm not sure you can avoid deadlocks here. Say you up the number to 2 max connections per event loop. Two requests come in at the same time to badMonkey and start the transaction. Now these two requests have consumed two connections and will be deadlocked waiting for another connection. To avoid deadlocks you must never have a pooled resource require requesting another pooled resource in order to release.
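To make that arithmetic concrete, here is a toy bookkeeping sketch of the two-request scenario (hypothetical numbers, not AsyncKit code):

```swift
// Toy bookkeeping of the scenario above (not AsyncKit code).
// Each badMonkey request holds one connection for its transaction and
// needs a second connection for the nested req.db query before it can
// release the first.
let maxConnectionsPerEventLoop = 2
let concurrentRequests = 2

let heldByTransactions = concurrentRequests      // 2 connections held, none releasable yet
let stillNeeded = concurrentRequests             // 2 more acquisitions required to finish
let available = maxConnectionsPerEventLoop - heldByTransactions  // 0 left in the pool

// Every held connection is waiting on an acquisition that can never succeed:
let deadlocked = available == 0 && stillNeeded > 0
print(deadlocked)  // true
```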

Fluent takes care of this if you use db instead of req.db while inside the transaction. Ideally we could provide some sort of error if we detect a deadlock is possible, but I'm not sure how that would be implemented.

grosch commented 4 years ago

Ideally we could provide some sort of error if we detect a deadlock is possible, but I'm not sure how that would be implemented.

Exactly. If you have some way to flag it happened, awesome; otherwise close this. I know I'm supposed to always use the db provided by the transaction, but we run into trouble when updating a large method: we realize, "Oh, I need a transaction here," wrap it, change most req.db references to db, but miss one.

tanner0101 commented 4 years ago

I'll leave this open as a reminder to try and find a way to detect deadlocks.

MrMage commented 4 years ago

Why would this cause deadlocks? My understanding from @tanner0101's explanations was that each query should get a new connection, so this shouldn't cause problems as long as the pool has at least 2 connections.

Also, shouldn't the connection from the first query have been returned to the pool once the code inside the flatMap block is run?


tanner0101 commented 4 years ago

You can create a deadlock by doing something like:

let pool = EventLoopConnectionPool(..., maxConnections: 1)
pool.withConnection { a in
    pool.withConnection { b in
        pool.eventLoop.makeSucceededFuture(())
    }
}

Since the pool only has one connection and we're waiting for another connection from the pool before we can return the first connection, this code can never finish.

Fluent queries, when run on req.db, do pool.withConnection internally. So nesting req.db usage inside a transaction closure (which is inside a pool.withConnection closure) is similar to the above example and will require at least 2 connections.

MrMage commented 4 years ago

Since the pool only has one connection and we're waiting for another connection from the pool before we can return the first connection, this code can never finish.

Would that also apply if the pool has 2 or more connections? Wouldn't the second call to pool.withConnection just request a second connection from the pool in that case? (Personally, I run most of my pools with ~5 connections and drop "stale" connections from time to time when they have not been used for a while.)

tanner0101 commented 4 years ago

Yes, if the pool had a max of two connections, then that code would not be a problem.

To generalize, if you have a pool with n max connections and you request n + 1 connections from the pool, you will deadlock. In my example n = 1 (maxConnections: 1) and the code requires 2 connections from the pool at once to complete. Thus deadlock.

If you have n = 2 (maxConnections: 2), then you could create a deadlock by doing:

let pool = EventLoopConnectionPool(..., maxConnections: 2)
pool.withConnection { first in
    pool.withConnection { second in
        pool.withConnection { third in
            // we have three connections at the same time
            // this is impossible since maxConnections = 2
            print(first, second, third)
            ...
        }
    }
}

MrMage commented 4 years ago

Right, that is the way I would have expected it to work. As I said, my personal "solution" is to just make sure my connection pool is large enough...

tanner0101 commented 4 years ago

Yup that's a good solution. I do think it would be helpful if the connection pool could detect deadlocks to assist in debugging issues around a connection pool being too small. But I'm not sure that will be possible to implement.

MrMage commented 4 years ago

I do think it would be helpful if the connection pool could detect deadlocks to assist in debugging issues around a connection pool being too small. But I'm not sure that will be possible to implement.

Agreed, to both the usefulness and (sadly) the feasibility.

MrLotU commented 4 years ago

From taking a quick peek at the code, wouldn't simply keeping count of available connections - requested connections do the trick? That way if that number becomes 0 and yet another connection is requested, you can fail the future, correct?

Though I might be oversimplifying it here :)
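For illustration, a toy fail-fast pool along those lines might look like this (a hypothetical sketch, not AsyncKit's actual implementation):

```swift
// Hypothetical fail-fast pool: throws as soon as more connections are
// requested than the pool can ever provide, instead of deadlocking.
struct PoolExhaustedError: Error {}

final class CountingPool {
    let maxConnections: Int
    private var leased = 0

    init(maxConnections: Int) { self.maxConnections = maxConnections }

    func withConnection<T>(_ body: (Int) throws -> T) throws -> T {
        guard leased < maxConnections else { throw PoolExhaustedError() }
        leased += 1
        defer { leased -= 1 }
        return try body(leased)  // pass a stand-in "connection" to the closure
    }
}

let pool = CountingPool(maxConnections: 1)
var failedFast = false
do {
    // Nested request exceeds maxConnections while the first is still leased.
    _ = try pool.withConnection { _ in
        try pool.withConnection { _ in "unreachable" }
    }
} catch is PoolExhaustedError {
    failedFast = true  // errored immediately rather than hanging
}
print(failedFast)  // true
```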

MrMage commented 4 years ago

From taking a quick peek at the code, wouldn't simply keeping count of available connections - requested connections do the trick? That way if that number becomes 0 and yet another connection is requested, you can fail the future, correct?

Yes, but then any requests received while all e.g. 5 connections are momentarily in use would simply fail. So we would start to randomly drop connections any time all connections are "taken" rather than simply wait until they become available again.

However, I'd say that having a timeout for this waiting could be a good idea. That way, "deadlocked" requests could e.g. be released after, say, 10 seconds. Not ideal, but better than locking indefinitely, I guess.
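A minimal sketch of that timeout bookkeeping (hypothetical names, not AsyncKit's implementation):

```swift
import Foundation

// Hypothetical waiter bookkeeping: a request queued for a connection is
// failed once its deadline passes, so a deadlocked request errors out
// instead of hanging forever.
struct Waiter {
    let requestedAt: TimeInterval  // seconds since some reference point
    let timeout: TimeInterval      // e.g. 10 seconds
}

/// Returns the waiters whose deadline has passed at `now`.
func expiredWaiters(_ waiters: [Waiter], now: TimeInterval) -> [Waiter] {
    waiters.filter { now - $0.requestedAt > $0.timeout }
}

let waiters = [
    Waiter(requestedAt: 0, timeout: 10),  // deadlocked: will never be served
    Waiter(requestedAt: 8, timeout: 10),  // still within its window
]
let expired = expiredWaiters(waiters, now: 12)
print(expired.count)  // 1: only the first waiter has timed out
```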

MrLotU commented 4 years ago

Yes, but then any requests received while all e.g. 5 connections are momentarily in use would simply fail.

Very good point, totally overlooked that. In that case a timeout does sound better than deadlocking. This would at least free up the connections after x amount of time, instead of at some point deadlocking everything because there are no more connections that can be freed.

mxcl commented 4 years ago

I feel it should not be possible to deadlock Vapor apps if my developers or I accidentally use rq.db rather than db, a common and easy mistake to make; usually we are doing rq.db.

I wonder if it’s possible that once a transaction is started the value of db on rq could be changed by Vapor.

I understand some mitigation has been done in #67, but I'd rather see Vapor make design choices that prevent these kinds of scenarios in the first place. To err is human.

tanner0101 commented 4 years ago

It's a difficult balancing act. Vapor should try to prevent common errors but it should also not get in the way of the developer. If we silently swap out req.db for example, that would prevent you from being able to make any requests outside of the transaction even if you wanted to. I can imagine a few cases where that would be useful.

At least #67 has prevented applications from hanging forever if deadlocks are created. But I agree we can do better.

A few ideas:

"leaky" pooling

I've opened an issue for this here https://github.com/vapor/async-kit/issues/72. This allows the connection pool to temporarily open new connections beyond its maximum if needed.
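As a rough sketch of the idea (hypothetical code, not the #72 implementation): when the pool is exhausted, open a temporary overflow connection instead of queueing the request.

```swift
// Hypothetical "leaky" pool: beyond maxConnections, temporarily open
// extra connections instead of making callers wait (and possibly deadlock).
final class LeakyPool {
    let maxConnections: Int
    private var pooled = 0
    private(set) var overflowOpened = 0

    init(maxConnections: Int) { self.maxConnections = maxConnections }

    func withConnection<T>(_ body: () throws -> T) rethrows -> T {
        if pooled < maxConnections {
            pooled += 1
            defer { pooled -= 1 }
            return try body()
        } else {
            // Pool exhausted: open a temporary connection, closed after use.
            overflowOpened += 1
            return try body()
        }
    }
}

let pool = LeakyPool(maxConnections: 1)
// The nested request that deadlocks a strict 1-connection pool now succeeds:
let result = pool.withConnection {
    pool.withConnection { "done" }
}
print(result, pool.overflowOpened)  // done 1
```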

detect deadlocks

Maybe there is some way for withConnection closures to share context and detect / prevent impending deadlocks. I'm less sure this is possible, but it's worth looking into.

researching other frameworks

How have other web frameworks dealt with this? Node.js / Java would probably be good places to look.

MrLotU commented 4 years ago

detect deadlocks

I think with the async nature of our connection pool this is going to be a tricky one, mostly because requesting n + 1 connections doesn't always mean a deadlock. It could also mean you'd get the (n + 1)th connection once the 1st is done.

researching other frameworks

I took a quick peek at Python's Peewee ORM: they either block for a certain time, block indefinitely, or raise an error when requesting n + 1 connections.

I don't think always raising an error when n + 1 connections are requested is a good idea (see above). We have the timeout measures in place too (except the option to block indefinitely, though you could set the timeout to 24 hours or something stupidly large like that).


I also looked at Node's Sequelize ORM: it accepts timeouts for acquiring a new connection, plus a min and max pool size.

Full options from Sequelize:

| option | description |
|--|--|
| options.pool.max | Maximum number of connections in pool |
| options.pool.min | Minimum number of connections in pool |
| options.pool.idle | The maximum time, in milliseconds, that a connection can be idle before being released. |
| options.pool.acquire | The maximum time, in milliseconds, that the pool will try to get a connection before throwing an error. |
| options.pool.evict | The time interval, in milliseconds, after which sequelize-pool will remove idle connections. |
| options.pool.validate | A function that validates a connection. Called with `client`. The default function checks that `client` is an object and that its state is not `disconnected`. |
| options.pool.maxUses | The number of times a connection can be used before discarding it for a replacement, used for eventual cluster rebalancing. |

For Java I found the Hibernate ORM and the following documentation on connection pooling.

The documentation mentions two options here. c3p0 provides a min and max pool size, as well as a timeout for getting a connection. Proxool accepts a maximum and minimum connection count.


Overall, it seems none of these ORMs fully prevent deadlocks; they just time out once you are already in one. I think this is the approach we should follow too.

mxcl commented 4 years ago

I don’t agree that this is an acceptable compromise, but this is not my project, you have your own list of priorities that you adhere to.

In our app I wrote a wrapper that prevents this from happening and am ok with that solution.

tanner0101 commented 4 years ago

@mxcl is the wrapper in your app following a similar approach to replacing req.db? Also interested for your thoughts on #72 and if that seems like something that would benefit you.

mxcl commented 4 years ago

We’re using https://github.com/candor/sublimate, which is the “synchronous” solution I was talking about ~6 months ago; inside it, the db is swapped out once a transaction begins.

Overall I aim for reliability above all, so that's where the set of trade-offs ends up for me, but I certainly understand that every library and framework has different base policies that affect everything built above them. We’re all experienced in releasing such tooling here, and I have much respect for this project.