vapor / postgres-kit

🐘 Non-blocking, event-driven Swift client for PostgreSQL.

Connection reset by peer not resolving #164

Closed: mxcl closed this issue 3 years ago

mxcl commented 4 years ago

Sorry if wrong repo.

I recently engaged in some conversation about this issue on Discord. We get a few customers a day that experience this issue (where Postgres has dropped the connection, but Fluent doesn’t know and then the query fails).

I would like to fix it, otherwise I can just see one of our investors using our product and freaking out about the “Something went wrong” error it generates in our React front end.

I believe a solution would be: checking if the connection is still open before using it, and if so I can begin exploring that and submit a PR. LMK if I'm in the wrong layer or heading in the wrong direction. Thanks.
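
For illustration, a minimal sketch of that idea in Vapor 3 terms, assuming the pooled PostgreSQLConnection exposes isClosed (later comments in this thread confirm it does); the helper name withHealthyConnection is hypothetical and this is untested:

import Vapor
import FluentPostgreSQL

/// Hypothetical helper: hand the closure a pooled connection only after
/// verifying it still looks open; otherwise fail fast instead of letting
/// the query die with "Connection reset by peer".
func withHealthyConnection<T>(
    on req: Request,
    _ work: @escaping (PostgreSQLConnection) throws -> Future<T>
) -> Future<T> {
    return req.withPooledConnection(to: .psql) { conn -> Future<T> in
        guard !conn.isClosed else {
            // The pool handed out a dead connection; surface a clear error.
            // (A fuller version might request a replacement connection here.)
            return req.eventLoop.newFailedFuture(
                error: Abort(.internalServerError, reason: "Pooled connection was already closed")
            )
        }
        return try work(conn)
    }
}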

lixiaoyu commented 4 years ago

Hello @mxcl, I've run into similar issues recently, but I am using SQLite.

Could you give more detail about "checking the connection"? Do we need to reopen the connection if it is closed?

0xTim commented 4 years ago

@tanner0101 bump

mxcl commented 4 years ago

using sqlite

Well, this makes it seem like a more fundamental issue: how can you lose connections to a local db?

Or it's a coincidence and not the same underlying issue.

Or sqlite connections operate the same.

tanner0101 commented 4 years ago

I believe a solution would be: checking if the connection is still open before using it

@mxcl from what I can tell in Vapor 4 and 3 we already check if a connection is closed before attempting to use it:

4: https://github.com/vapor/async-kit/blob/master/Sources/AsyncKit/ConnectionPool/EventLoopConnectionPool.swift#L179-L180

3: https://github.com/vapor/database-kit/blob/1/Sources/DatabaseKit/ConnectionPool/DatabaseConnectionPool.swift#L88

Vapor 3, however, caches the connection requested from the pool for the lifetime of the request. Maybe that's a problem? Vapor 4 requests and releases a connection from the pool each time a query is run.

Perhaps we need to look into automatic retry functionality in Fluent if we detect a connection has failed due to networking errors?

mxcl commented 4 years ago

Vapor 3, however, caches the connection requested from the pool for the lifetime of the request. Maybe that's a problem?

From my experience of seeing this (locally, where it happens much more often), typically it happens when I have gone to lunch and come back, and then it happens with the first request I make. I'll observe more thoroughly next time it happens and try to provide more info.

Vapor 4 requests and releases a connection from the pool each time a query is run.

I look forward to trying Vapor 4 out, though sadly, since this is a professional project, it may be some time before we can afford the port. Still, it sounds like the change in behavior may help.

tanner0101 commented 4 years ago

typically it happens when I have gone to lunch and come back, and then it happens with the first request I make. I'll observe more thoroughly next time it happens and try to provide more info.

Hmm... that would definitely seem like the connection pool is not correctly noticing the connection is closed then. It could be an issue with how connection.isClosed is determined in v3.

The other thing that might support this is that you're getting Connection reset by peer. In the Postgres code it's checking isClosed before sending Postgres messages over the wire: https://github.com/vapor/postgres-kit/blob/1/Sources/PostgreSQL/Connection/PostgreSQLConnection.swift#L79-L81

If isClosed is true, then you should be getting this error:

PostgreSQLError(identifier: "closed", reason: "Connection is closed.")

isClosed is being determined by channel.closeFuture: https://github.com/vapor/postgres-kit/blob/1/Sources/PostgreSQL/Connection/PostgreSQLConnection.swift#L42-L47

In v4, we calculate this with !channel.isActive. Maybe backporting that to v3 would help?
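
Roughly, the backport would amount to deriving the flag from live channel state inside the v3 driver, something like the sketch below (assuming the connection keeps a non-optional reference to its NIO Channel):

// Sketch of the v3 PostgreSQLConnection change: compute `isClosed` from the
// channel's current state instead of flipping a flag from `closeFuture`.
public var isClosed: Bool {
    // `isActive` becomes false as soon as the channel is no longer connected,
    // so a connection reset by the server is noticed before the next query.
    return !channel.isActive
}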

mdavies-solsys commented 4 years ago

I too am using Vapor 3 in a production environment and running into many of these same errors. We are using MySQL. I will be digging into this further to get more details.

Unrelated to these events, we also recently started experimenting with running a DB proxy in front of a cluster of DBs. When introducing the proxy we kept seeing migrations fail due to the statements being closed, not the connections. I don't know if this is related, but the information may help. I have a screenshot of a packet capture during migrations: in a single packet there is a request both to execute the statement and to close it. It appears that something within the future is causing the statements to close before the response from the DB arrives, and when the DB responds Vapor crashes.

I feel like these are related, and the original issue here may not be a dropped connection from the db, but instead a closed statement before receiving the response. I hope I'm not too off point in this thread!

tanner0101 commented 4 years ago

@mdavies-solsys is that packet capture between Vapor and the DB proxy or the proxy and the DB? I don't see anywhere in code that we're sending a close statement message. We specifically re-use the "unnamed" statement ("") that shouldn't need to be closed.

mdavies-solsys commented 4 years ago

@tanner0101 This was between Vapor and the proxy. I'll let @gazza1davies chime in here as he actually captured the packets. At first glance it seemed that, with both commands sharing the same packet number, MySQL honored only the first while the proxy honored both. We'd love to be of more assistance if possible! The proxy we're using is ProxySQL.

MadGeorge commented 4 years ago

I experience this issue on a regular basis during development. I have a Postgres DB running on a remote server and the Vapor app running on my Mac. The typical case is: I build and run my project from Xcode, leave it for an hour or two, come back, and then every request to routes that touch the database fails with HTTP status 500 and read(descriptor:pointer:size:) failed: Connection reset by peer. Routes that don't query the database keep working. I should move this app to production soon (in the next three days). What should I do to eliminate this issue? I tried rewriting all the database-related code and wrapping every query, save, and delete in req.withPooledConnection(to: .psql) { con in ... }, but it has no effect.

mohok commented 4 years ago

@MadGeorge Exactly the same in my case. Did you solve this problem?

MadGeorge commented 4 years ago

@mohok hi. No, I did not solve this problem for the dev environment, but I can confirm the issue does not reproduce in production. I containerised Vapor via Docker Swarm (two linked containers, vapor + postgresql, on the same host) and Connection reset by peer does not happen. On my Mac I still face this every 1-3 hours, by the way.

MadGeorge commented 4 years ago

@mohok, @tanner0101 I was wrong. Today I found many Connection reset by peer errors in the production environment. The only reason I thought it was gone is that my production server somehow keeps working, reconnects to the DB, and can still respond to requests after the error, whereas in my local environment all requests keep failing once Connection reset by peer is raised, until the server is restarted.

tanner0101 commented 4 years ago

I've tried to recreate this a couple of times now but I haven't been able to. Here's what I did:

1. Clone vapor/api-template and swap from SQLite to Postgres.
2. Boot up a Postgres DB using this docker-compose file.
3. Insert a Todo. Make ~50 requests to get all todos.
4. Wait 1-3 hours.
5. Make more requests.

I expected to see this failure when making more requests after waiting, but it always seems to work. Any additional tips on what I can do to recreate this?

MadGeorge commented 4 years ago

@tanner0101 thank you for the answer.

3. Insert a Todo. Make ~50 requests to get all todos.

It does not look like the exception depends on request count. The typical case is very similar to what @mxcl described:

I have gone to lunch and come back, and then it happens with the first request I make.

The time it takes for requests to start failing is not exact. Sometimes I need to leave the server without requests for an hour, but yesterday I hit read(descriptor:pointer:size:) failed: Connection reset by peer (errno: 104) after a very short interval between successful connections. By feel, it takes leaving the running server without requests for 30 to 60 minutes; the first requests with DB access will then raise the error.

My current application is a mobile API backend. The very first client requests are POST /api/notifications/token and GET /api/categories. The order of these requests is not specified. The first one (../token) fails more often, since it comes first, but sometimes they both fail.

POST /api/notifications/token is handled by this controller:

final class NotificationsCntrl {
    func createOrUpdate(_ req: Request) throws -> Future<DeviceToken> {
        try req.content.decode(DeviceTokenReq.self).flatMap { toSave in
            req.withPooledConnection(to: .psql) { con in
                DeviceToken.query(on: con)
                    .filter(\.deviceID, .equal, toSave.deviceID)
                    .first()
                    .flatMap { found in
                        if let found = found {
                            return found.update(with: toSave).save(on: con)
                        }
                        return DeviceToken(req: toSave).save(on: con)
                }
            }
        }
    }
}

and GET /api/categories looks like:

final class CategoriesCntrl {
    func all(_ req: Request) throws -> Future<[Category]> {
        req.withPooledConnection(to: .psql) { con in
            let productIDs = ProductOption.query(on: con)
                .filter(\.stock, .greaterThan, 0)
                .all()
                .map { $0.map { $0.productID }}

            let categoriesIDs = productIDs.flatMap { ids in
                CategoryProduct.query(on: con)
                    .filter(\.productID ~~ ids)
                    .all()
                    .map { $0.map { $0.categoryID }}
            }

            return categoriesIDs.flatMap { ids in
                Category.query(on: con)
                    .filter(\.id ~~ ids)
                    .all()
            }
        }
    }
}

Before your answer I used image: postgres in my compose file, which leads to PostgreSQL v12. Today I migrated to image: postgres:11 to make sure the PG version difference has no effect. Errors still happen.

I can confirm there is no difference between requests sent to the local app (iMac, Xcode 11.4.1) and to the production server (Ubuntu server with Docker Swarm). Both raise a similar error from time to time.

The only thing I have not tried is a non-dockerised Postgres.

I tried wrapping req.withPooledConnection(to: .psql) like so:

class Wrap {
    private var count = 0

    func pooledConnection<T>(for req: Request, closure: @escaping (PostgreSQLConnection) throws -> Future<T>) -> Future<T> {
        return req.withPooledConnection(to: .psql) { con in
            do {
                return try closure(con)
            } catch {
                if self.count > 2 {
                    return req.eventLoop.newFailedFuture(error: error)
                }

                (try? req.make(Logger.self))?.info("Error: count: \(self.count): \(error.localizedDescription)")
                (try? req.make(Logger.self))?.report(error: error, request: req, verbose: false)

                self.count += 1
                return self.pooledConnection(for: req, closure: closure)
            }
        }
    }
}

and then used it like this:

Wrap().pooledConnection(for: req) { con in

But it does not look like I can ever intercept the target error this way. Either I did it wrong, or the error is raised at a deeper level, or it is intercepted under the hood and sent to the response via newFailedFuture.
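
One likely explanation, sketched below: the failure arrives as a failed future rather than a thrown error, so the do/catch inside the closure never runs; catching on the returned future with catchFlatMap would be needed. A rough, untested variant of the wrapper above (names kept from it, retry-count handling omitted):

// Sketch only: catch the failure on the future instead of with do/catch,
// then retry once on a (hopefully fresh) pooled connection.
func pooledConnectionRetrying<T>(
    for req: Request,
    closure: @escaping (PostgreSQLConnection) throws -> Future<T>
) -> Future<T> {
    return req.withPooledConnection(to: .psql) { con in
        try closure(con)
    }.catchFlatMap { error in
        (try? req.make(Logger.self))?.error("Retrying after error: \(error)")
        // Second and final attempt on a newly requested pooled connection.
        return req.withPooledConnection(to: .psql) { con in
            try closure(con)
        }
    }
}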

If it helps, I'm also attaching my error middleware class. The notable thing is that the discussed error hits the default case of the switch error statement in ErrorBodyMiddleware.default, so it is not an AbortDataError or AbortError and is not Debuggable, since it does not fall into the case let debuggable branch in the Xcode environment.

ErrorBodyMiddleware.swift (middleware that needs to send a custom body response for some error cases):

public final class ErrorBodyMiddleware: Middleware, ServiceType {
    internal struct ErrorResponse: Codable {
        var error: Bool
        var reason: String
        var data: Data?
    }

    public static func makeService(for worker: Container) throws -> ErrorBodyMiddleware {
        return try .default(environment: worker.environment, log: worker.make())
    }

    public static func `default`(environment: Environment, log: Logger) -> ErrorBodyMiddleware {
        return .init { req, error in
            log.report(error: error, verbose: environment.isRelease.not)

            let status: HTTPResponseStatus
            let reason: String
            let headers: HTTPHeaders
            let data: Data?

            switch error {
            case let abort as AbortDataError:
                reason = abort.reason
                status = abort.status
                headers = abort.headers
                data = abort.data
            case let abort as AbortError:
                reason = abort.reason
                status = abort.status
                headers = abort.headers
                data = nil
            case let validation as ValidationError:
                reason = validation.reason
                status = .badRequest
                headers = [:]
                data = nil
            case let debuggable as Debuggable where !environment.isRelease:
                reason = debuggable.reason
                status = .internalServerError
                headers = [:]
                data = nil
            default:
                reason = "Something went wrong."
                status = .internalServerError
                headers = [:]
                data = nil
            }

            // create a Response with appropriate status
            let res = req.response(http: .init(status: status, headers: headers))

            // attempt to serialize the error to json
            do {
                if let data = data {
                    res.http.body = HTTPBody(data: data)
                } else {
                    let errorResponse = ErrorResponse(error: true, reason: reason, data: data)
                    res.http.body = try HTTPBody(data: JSONEncoder().encode(errorResponse))
                }
                res.http.headers.replaceOrAdd(name: .contentType, value: "application/json; charset=utf-8")
            } catch {
                res.http.body = HTTPBody(string: "Oops: \(error)")
                res.http.headers.replaceOrAdd(name: .contentType, value: "text/plain; charset=utf-8")
            }

            emailError(req, res)
            return res
        }
    }

    private static func emailError(_ req: Request, _ res: Response) {
        //...
    }

    private let closure: (Request, Error) -> (Response)

    public init(_ closure: @escaping (Request, Error) -> (Response)) {
        self.closure = closure
    }

    public func respond(to req: Request, chainingTo next: Responder) throws -> Future<Response> {
        let response: Future<Response>
        do {
            response = try next.respond(to: req)
        } catch {
            response = req.eventLoop.newFailedFuture(error: error)
        }
        return response.catchFlatMap { error in
            if let response = error as? ResponseEncodable {
                do {
                    return try response.encode(for: req)
                } catch {
                    return req.future(self.closure(req, error))
                }
            } else {
                return req.future(self.closure(req, error))
            }
        }
    }
}

tanner0101 commented 4 years ago

@MadGeorge thanks for the extra info. I tried again to reproduce this error but I'm still not able to. Here's what I tried:

Databases:

Methods:

Errors reported in vapor logs. No errors surfaced to API. New connections were created automatically.

Connections still worked fine after coming back.

Errors reported in logs and surfaced as 500 to API. Once Wifi was re-enabled, new connections were created and all requests began working again.

Errors reported in logs and surfaced as 500 to API after the DB was shut down. Once the DB was booted back up, new connections were created and all requests began working again.

Queries ran slower but still no API errors.

Nothing changed when using this.


I'm not sure what else to try here. In none of my tests was I able to recreate the Connection reset by peer (errno: 104) error.

schwmi commented 4 years ago

Hi,

Just wanted to add here: we are also getting those errors rarely but consistently (out of ~50,000 commits a day, ~10 errors). We are using PostgreSQL 1.5.0 and Vapor 3.3.2. I access the database via DatabaseConnectionPool and store/load data via Fluent.

Our app is deployed on Heroku (stack 18) with the build pack https://github.com/vapor/heroku-buildpack. Postgres (also hosted on Heroku, standard-0 plan) is version 10.11, with an SSL connection.

MadGeorge commented 4 years ago

@tanner0101 thanks, Tanner. On the prod server I just added a cron task which hits /api/ping every minute. On that endpoint the app performs a light SELECT query against the DB. No errors. But requests from the mobile apps still hit errors from time to time.

I give up. I'm starting to think it is not a Vapor/Postgres or Vapor/Fluent problem; it's starting to look like a Vapor connection problem. I noticed an interesting case: when I move the mobile application into the background, I can see the same Connection reset by peer message in the logs. I will try to ask about it at the Vapor repo. Thank you one more time.

tanner0101 commented 4 years ago

Just wanted to add here: we are also getting those errors rarely but consistently (out of ~50,000 commits a day, ~10 errors).

@schwmi to clarify, after you see this error does your app stop working? The main issue I'm concerned about here is this report by @MadGeorge

build and run my project from Xcode, leave it for an hour or two [...] then every request to routes that touch the database fails with HTTP status 500 and read(descriptor:pointer:size:) failed: Connection reset by peer

To see "connection reset by peer" sometimes is normal. Connections can fail and that is why we use connection pools that are capable of replenishing themselves automatically. However, if failed connections are not being replenished and your app becomes unresponsive, that is definitely a problem.

To further complicate things, the main issue that started this thread is different (we probably should have separated the threads). @mxcl was concerned about surfacing these connection reset messages. Currently if a connection fails (which is normal) during a query, this error will surface as a 500 to the client. Vapor 4 improves its closed connection checking logic so this should be less of a problem. Still, a way to auto-retry failed queries in Fluent could be a good feature to have if you really want to avoid surfacing these errors.

Hopefully this makes sense.


TL;DR:

"connection reset" errors are normal and connection pools should automatically replace closed connections. If this is not happening, that's a problem.

schwmi commented 4 years ago

Thanks for the clarification; the app doesn't stop working after those messages, so no bug on our side.

mxcl commented 4 years ago

Having ported to Vapor 4, I’d say this problem is now in fact worse.

Frequent connection closed log messages followed by that request punting a 500 back to the client.

I'm sure sometimes this is unavoidable, but I've used other backend solutions besides Vapor where I never saw this problem.

MadGeorge commented 4 years ago

@mxcl it can't be... is it true? I am working on the migration to Vapor 4 right now in the hope of fixing this issue. Your news is terrifying me.

tanner0101 commented 4 years ago

@mxcl hmm that's surprising. I would have figured the better conn.isClosed logic would mean the pool would replace dead connections before trying to use them. Can you share your log messages? Is there any way you could temporarily enable --log trace and catch a few of these happening? I'd like to make sure the conn pool logic is working as expected before digging into other solutions like auto retry.

tanner0101 commented 4 years ago

I just did a quick sanity check on the vapor template. To make testing this easy, I used an ELG with only one thread.
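
For anyone wanting to reproduce this, one way to run the template on a single-threaded ELG is sketched below; the exact harness used for this test isn't shown in the thread, and configure(app) refers to the template's own function.

import Vapor

var env = try Environment.detect()
try LoggingSystem.bootstrap(from: &env)

// A single-threaded event loop group: every request lands on the same loop,
// so pooled-connection reuse is easy to follow in the logs.
let elg = MultiThreadedEventLoopGroup(numberOfThreads: 1)
let app = Application(env, .shared(elg))
defer { app.shutdown() }

try configure(app)
try app.run()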

First request to create a connection and store it in the pool:

[ codes.vapor.application ] [ INFO ] GET /todos [request-id: 938F39CB-D79C-4DD8-AA66-B7E5544A97AD] (Vapor/Responder/DefaultResponder.swift:42)
[ codes.vapor.application ] [ DEBUG ] No available connections on this event loop, creating a new one [request-id: 938F39CB-D79C-4DD8-AA66-B7E5544A97AD, database-id: psql] (AsyncKit/ConnectionPool/EventLoopConnectionPool.swift:202)

Second request to ensure the pooled connection is being used:

[ codes.vapor.application ] [ INFO ] GET /todos [request-id: 3B14EA55-36BC-4D30-8A06-5C04546703DE] (Vapor/Responder/DefaultResponder.swift:42)
[ codes.vapor.application ] [ TRACE ] Re-using available connection [request-id: 3B14EA55-36BC-4D30-8A06-5C04546703DE, database-id: psql] (AsyncKit/ConnectionPool/EventLoopConnectionPool.swift:191)

I then ran this command to kill off all active connections:

SELECT 
    pg_terminate_backend(pid) 
FROM 
    pg_stat_activity 
WHERE 
    -- don't kill my own connection!
    pid <> pg_backend_pid()
    -- don't kill the connections to other databases
    AND datname = 'vapor_database';

Finally, a third request that will be offered a dead connection:

[ codes.vapor.application ] [ INFO ] GET /todos [request-id: 33B37791-9B76-4080-A75A-DB29187BCA5A] (Vapor/Responder/DefaultResponder.swift:42)
[ codes.vapor.application ] [ DEBUG ] Pruning available connection that has closed [request-id: 33B37791-9B76-4080-A75A-DB29187BCA5A, database-id: psql] (AsyncKit/ConnectionPool/EventLoopConnectionPool.swift:195)
[ codes.vapor.application ] [ DEBUG ] No available connections on this event loop, creating a new one [request-id: 33B37791-9B76-4080-A75A-DB29187BCA5A, database-id: psql] (AsyncKit/ConnectionPool/EventLoopConnectionPool.swift:202)

As the logs show, the dead connection is correctly identified and pruned. No 500 thrown.

tikhop commented 4 years ago

I'm facing the same issue. Everything works well on my local machine (MacBook Pro), whereas on the production server I'm getting the same error as OP quite often.

[ ERROR ] Uncaught error: read(descriptor:pointer:size:): Connection reset by peer (errno: 104) (PostgresNIO/Connection/PostgresConnection+Connect.swift:47)
[ ERROR ] NIOPostgres error: connection closed [request-id: 440FDC63-5D10-46DB-BB72-4CDC0341D3E5] (App/Middleware/ErrorMiddleware.swift:46)

Here is the full log:

[ TRACE ] Decoded HTTP part: head: HTTPRequestHead {SOME POST REQUEST } (Vapor/HTTP/Server/HTTPServerRequestDecoder.swift:34) 
[ TRACE ] Decoded HTTP part: body: ByteBuffer { readerIndex: 0, writerIndex: 782, readableBytes: 782, capacity: 782, storageCapacity: 2048, slice: _ByteBufferSlice { 461..<1243 }, storage: 0x00007ff1540bbe20 (2048 bytes) } (Vapor/HTTP/Server/HTTPServerRequestDecoder.swift:34) 
[ TRACE ] Decoded HTTP part: end (Vapor/HTTP/Server/HTTPServerRequestDecoder.swift:34) 

[ INFO ] POST /api/v1/auth/siwa [request-id: 0AFC64B0-DB47-48B8-B2BB-A2376FFE5437] (Vapor/Responder/DefaultResponder.swift:42) 

[ DEBUG ] New data received [request-id: 0AFC64B0-DB47-48B8-B2BB-A2376FFE5437] (Vapor/HTTP/EndpointCache.swift:121)
[ DEBUG ] query read auth_authorization_credential filters=[auth_authorization_credential[apple_user_identifier] = Optional("SOMEID")] limits=[count(1)] [database-id: psql, request-id: 0AFC64B0-DB47-48B8-B2BB-A2376FFE5437] (FluentKit/Query/Builder/QueryBuilder.swift:283)

[ TRACE ] Re-using available connection [database-id: psql, request-id: 0AFC64B0-DB47-48B8-B2BB-A2376FFE5437] (AsyncKit/ConnectionPool/EventLoopConnectionPool.swift:191) 

[ DEBUG ] SELECT <<SOME QUERY>>  [database-id: psql, request-id: 0AFC64B0-DB47-48B8-B2BB-A2376FFE5437] (PostgresNIO/PostgresDatabase+Query.swift:124)

[ TRACE ] Encoded: PostgresMessage (P) (PostgresNIO/Message/PostgresMessageEncoder.swift:34)
[ TRACE ] Encoded: PostgresMessage (D) (PostgresNIO/Message/PostgresMessageEncoder.swift:34) 
[ TRACE ] Encoded: PostgresMessage (B) (PostgresNIO/Message/PostgresMessageEncoder.swift:34)
[ TRACE ] Encoded: PostgresMessage (E) (PostgresNIO/Message/PostgresMessageEncoder.swift:34)
[ TRACE ] Encoded: PostgresMessage (S) (PostgresNIO/Message/PostgresMessageEncoder.swift:34)

[ ERROR ] Uncaught error: read(descriptor:pointer:size:): Connection reset by peer (errno: 104) (PostgresNIO/Connection/PostgresConnection+Connect.swift:47)

[ TRACE ] Encoded: PostgresMessage (X) (PostgresNIO/Message/PostgresMessageEncoder.swift:34)
[ TRACE ] Releasing connection [database-id: psql, request-id: 0AFC64B0-DB47-48B8-B2BB-A2376FFE5437] (AsyncKit/ConnectionPool/EventLoopConnectionPool.swift:262) 

[ ERROR ] NIOPostgres error: connection closed [request-id: 0AFC64B0-DB47-48B8-B2BB-A2376FFE5437] (App/Middleware/ErrorMiddleware.swift:46) 

[ TRACE ] Unhandled user event: outputClosed (Vapor/HTTP/Server/HTTPServerRequestDecoder.swift:212)

On the production server I use Ubuntu 18 and Docker; Postgres runs inside a Docker container as well.

Thanks

HashedViking commented 3 years ago

Hello @MadGeorge, @tanner0101 and others, have you found a solution for this problem? I'm experiencing the same on a production Linux server. The app and Postgres DB are dockerised in Swarm mode. Idle connections (~5 min is enough) are being reset by peer. I have had no luck capturing any errors within flatMapError. Trace example:

[ DEBUG ] <<SOME SELECT>> (PostgresNIO/PostgresDatabase+Query.swift:124)
[ TRACE ] Encoded: PostgresMessage (P) (PostgresNIO/Message/PostgresMessageEncoder.swift:34)
[ TRACE ] Encoded: PostgresMessage (D) (PostgresNIO/Message/PostgresMessageEncoder.swift:34)
[ TRACE ] Encoded: PostgresMessage (B) (PostgresNIO/Message/PostgresMessageEncoder.swift:34)
[ TRACE ] Encoded: PostgresMessage (E) (PostgresNIO/Message/PostgresMessageEncoder.swift:34)
[ TRACE ] Encoded: PostgresMessage (S) (PostgresNIO/Message/PostgresMessageEncoder.swift:34)
[ ERROR ] Uncaught error: read(descriptor:pointer:size:): Connection reset by peer (errno: 104) (PostgresNIO/Connection/PostgresConnection+Connect.swift:47)
[ TRACE ] Encoded: PostgresMessage (X) (PostgresNIO/Message/PostgresMessageEncoder.swift:34)
[ TRACE ] Releasing connection [database-id: dbMaster, request-id: CAE06649-6070-453D-ACE7-C6A282573160] (AsyncKit/ConnectionPool/EventLoopConnectionPool.swift:262)
[ ERROR ] NIOPostgres error: connection closed (App/Middleware/EventLimitMiddleware.swift:23)
MadGeorge commented 3 years ago

@HashedViking hi. My solution was... switching to IBM Kitura. No reactiveness in the code, but no connection errors either. The side effect was that I lost money rewriting the entire project. We are dealing with open source, by the way, so we all take the risk consciously.

HashedViking commented 3 years ago

Sorry, that's sad to hear. In my case it's a side-project mobile app that I've been making for more than a year. On the local machine (Mac Pro) everything is OK, but in the cloud it hits me with that nasty bug. The last thing I can think of before switching my entire backend 🤕 is to dig into postgres-kit or postgres-nio and try to find that bug.

MadGeorge commented 3 years ago

On the local machine (Mac Pro) everything is OK, but in the cloud it hits me with that nasty bug.

Exactly the same, but I did hit the same error a couple of times locally after leaving the project running for a long time and then performing a request with an HTTP client. I was never able to reproduce it as a stable case, though.

And I also used Vapor as a mobile backend, JFYI.

0xTim commented 3 years ago

cc @gwynne

HashedViking commented 3 years ago

ok, "6 hours later" I have some ugly solution. I tried to switch the docker image to ubuntu:16.04 (Xenial) - no luck. Then I focused solely on Postgres connections. In my case the problem arises when some connections stay idle for 6+ minutes. For some reason Vapor either don't prune them or prune right after Connection reset by peer error is thrown and 500 error is returned to a client.

What I did:

  1. connect to database via psql: psql -U username -d dbname
  2. show current connections to the db: SELECT pid, state, state_change FROM pg_stat_activity WHERE pid <> pg_backend_pid() AND datname = 'dbname';
  3. To monitor every second, run this right after the command above: \watch 1
  4. on my 2-core Ubuntu Bionic (18.04) server with dockerised Nginx, Vapor 4, and Postgres 12, the idle connection count is capped at 8:

    Tue 01 Dec 2020 07:04:01 PM UTC (every 1s)
    
    pid | state |         state_change
    -----+-------+-------------------------------
    156 | idle  | 2020-12-01 18:58:34.724857+00
    159 | idle  | 2020-12-01 18:59:10.786254+00
    162 | idle  | 2020-12-01 18:59:13.808146+00
    163 | idle  | 2020-12-01 18:59:13.824784+00
    164 | idle  | 2020-12-01 18:59:25.101309+00
    165 | idle  | 2020-12-01 18:59:25.198705+00
    116 | idle  | 2020-12-01 19:01:13.157704+00
    123 | idle  | 2020-12-01 19:01:15.860991+00
    (8 rows)
  5. after manually pruning the connections inside the DB with the command below (as @tanner0101 did earlier), Vapor successfully prunes them from its cache too: SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE pid <> pg_backend_pid() AND datname = 'vapor';
  6. finally, inspired by this stackoverflow post I came up with this snippet in configure.swift:
    app.eventLoopGroup.next().scheduleRepeatedTask(initialDelay: .minutes(4), delay: .minutes(3)) { _ in        
        app.logger.info("Releasing old connections to master")
        _ = app.db(.master).withConnection { conn -> EventLoopFuture<Void> in
            return conn.raw("WITH inactive_connections AS ( SELECT pid FROM pg_stat_activity WHERE pid <> pg_backend_pid( ) AND application_name !~ '(?:psql)|(?:pgAdmin.+)' AND datname = current_database() AND usename = current_user AND state in ('idle', 'idle in transaction', 'idle in transaction (aborted)', 'disabled') AND current_timestamp - state_change > interval '3 minutes' ) SELECT pg_terminate_backend(pid) FROM inactive_connections;").run()
        }
    }

    The SQL request that terminates all connections belonging to the current user and current DB that have been idle for more than 3 minutes:

    WITH inactive_connections AS (
    SELECT
        pid
    FROM 
        pg_stat_activity
    WHERE
        -- Exclude the thread owned connection (ie no auto-kill)
        pid <> pg_backend_pid( )
    AND
        -- Exclude known applications connections
        application_name !~ '(?:psql)|(?:pgAdmin.+)'
    AND
        -- Include connections to the same database the thread is connected to
        datname = current_database() 
    AND
        -- Include connections using the same thread username connection
        usename = current_user 
    AND
        -- Include inactive connections only
        state in ('idle', 'idle in transaction', 'idle in transaction (aborted)', 'disabled') 
    AND
        -- Include old connections (found with the state_change field)
        current_timestamp - state_change > interval '3 minutes' 
    )
    SELECT
    pg_terminate_backend(pid)
    FROM
    inactive_connections 

TL;DR Add this to configure.swift:

app.eventLoopGroup.next().scheduleRepeatedTask(initialDelay: .minutes(4), delay: .minutes(3)) { _ in        
        app.logger.info("Releasing old connections to master")
        _ = app.db(.master).withConnection { conn -> EventLoopFuture<Void> in
            return conn.raw("WITH inactive_connections AS ( SELECT pid FROM pg_stat_activity WHERE pid <> pg_backend_pid( ) AND application_name !~ '(?:psql)|(?:pgAdmin.+)' AND datname = current_database() AND usename = current_user AND state in ('idle', 'idle in transaction', 'idle in transaction (aborted)', 'disabled') AND current_timestamp - state_change > interval '3 minutes' ) SELECT pg_terminate_backend(pid) FROM inactive_connections;").run()
        }
    }
HashedViking commented 3 years ago

@mxcl @tikhop @tanner0101 @MadGeorge @schwmi After some digging I'm sure it's a Docker Swarm issue related to the Linux IPVS load balancer, which removes idle connections after 15 minutes. Here are some related links:

  1. thorough explanation by Mirantis
  2. related post on Stackoverflow

Haven't tried it yet, but my solution is similar to one of the suggestions by Mirantis: reconfigure or modify affected TCP client applications to close idle TCP sessions prior to the 15-minute timeout.
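
In Vapor 4 terms, a client-side variant of that idea could be a periodic trivial query so pooled connections never idle past the cutoff, modeled on the scheduleRepeatedTask snippet earlier in this thread. The interval and the cast to PostgresDatabase (via FluentPostgresDriver) are assumptions, and this sketch is untested:

import FluentPostgresDriver
import Vapor

// Sketch only: every 10 minutes, round-trip a no-op query on a pooled
// connection so it never sits idle long enough for IPVS to drop it.
app.eventLoopGroup.next().scheduleRepeatedTask(initialDelay: .minutes(1), delay: .minutes(10)) { _ in
    _ = app.db.withConnection { conn -> EventLoopFuture<Void> in
        guard let postgres = conn as? PostgresDatabase else {
            return conn.eventLoop.makeSucceededFuture(())
        }
        return postgres.simpleQuery("SELECT 1").map { _ in }
    }
}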

gwynne commented 3 years ago

EDIT: See the below comment from @rnantes for a much better solution specific to the source of the problem.

~Interesting explanation from Mirantis! That does sound like the cause of this issue.~

~At the moment there is no support in Fluent or its drivers for timing out idle database connections; for MySQL and Postgres - and Mongo as well, I believe - this can be configured on the database's side and is better done there. (SQLite connections do not represent network channels and thus do not time out in any event.)~

~Configuring MySQL's connection idle timeout globally:~

# Add to your server's `my.cnf`:
wait_timeout = 840 # 14 minutes

~Configuring MySQL's connection idle timeout on a per-connection basis:~

-- Run this query each time you request a `Database` from Fluent
SET @@session.wait_timeout = 840

~Configuring PostgreSQL's connection idle timeout and keepalive interval globally (the idle timeout will only apply to sessions currently in transactions):~

# Add this to your `postgresql.conf`:
idle_in_transaction_session_timeout = 14m
tcp_keepalives_interval = 14m

~Configuring PostgreSQL's per-connection idle timeout:~

# Add these lines to a `~/.psqlrc`:
idle_in_transaction_session_timeout = 14m
tcp_keepalives_interval = 14m
rnantes commented 3 years ago

Hi All,

I was running into this using Docker Swarm and I found discussion on the topic here: https://github.com/moby/moby/issues/31208

Apparently the problem occurs because Docker Swarm defaults to closing inactive connections after 15 minutes.

I found a solution to the problem from a linked PR to another project here: https://github.com/ITISFoundation/osparc-simcore/pull/1284

The solution that worked for me was to add a sysctls section to the postgres service section of docker-compose.yml file.

db:
  image: postgres:12.4
  sysctls:
    # NOTES: these values are needed here because docker swarm kills long running idle 
    # connections by default after 15 minutes see https://github.com/moby/moby/issues/31208
    # info about these values are here https://tldp.org/HOWTO/TCP-Keepalive-HOWTO/usingkeepalive.html
    - net.ipv4.tcp_keepalive_intvl=600
    - net.ipv4.tcp_keepalive_probes=9
    - net.ipv4.tcp_keepalive_time=600
HashedViking commented 3 years ago

Perfect, I can confirm that it works. It should be added to the Docker Swarm deploy section of the Vapor docs as a warning or a tip.

0xTim commented 3 years ago

Does anyone on this thread still have this issue or can we now close it?

tikhop commented 3 years ago

I'm good and I can confirm that it was a docker related issue — the same machine, but without docker works like a charm.

0xTim commented 3 years ago

OK, I'm going to close this. If anyone hits this or is still experiencing it, we can reopen.

mxcl commented 3 years ago

Just got this in prod. It has been a while since I last saw it, so hopefully that means it is mostly fixed.

Vapor 4.41.4
PostgresFluentDriver 2.1.2
PostgresNIO 1.5.1
PostgresKit 2.3.0

Not using Docker; running raw on Ubuntu 20.04 with a Swift 5.3 app compiled in release configuration.

I acknowledge that sometimes this is probably unavoidable, though I dislike thinking it can happen at all and thus would prefer as much mitigation as possible. If this happens enough and I have time I will certainly be debugging it.

finestructure commented 3 years ago

Just to add another data point to this issue: we encountered this exact problem and error message when switching from docker-compose to Docker Swarm, and setting the net.ipv4 parameters fixed it.