vapor / async-kit

Sugary extensions for the SwiftNIO library
MIT License

Hang in batch `Command` #104

Closed: finestructure closed this issue 1 year ago

finestructure commented 1 year ago

This is a follow-up of a Discord discussion, FYI @0xTim !


We're seeing the following error in one of our Vapor batch jobs every once in a while (a few times per month, maybe once a week - hard to quantify precisely):

[ ERROR ] Connection request timed out. This might indicate a connection deadlock in your application. If you're running long running requests, consider increasing your connection timeout. [component: server, database-id: psql]

When this message occurs, the job HANGS - i.e. it locks up and does not terminate, preventing all further processing.

In normal operation the job spins up and typically processes for 3-6 seconds, running a number of queries against the GitHub APIs (we're processing batches of 25 Swift packages, updating GitHub metadata in the Swift Package Index).

I've done a little research and there seem to be a couple of cases where this can happen:

1. too many db connections
2. an error handling problem in networking code

I suspect it's 2. Last I checked we're not exhausting our db connections, and our db utilization across our two envs is <15% and <10%.

I believe this issue first started happening after moving to async/await. This may be due to how we're launching async processes from a Command via our own AsyncCommand. (We previously used a semaphore-based implementation which also encountered hangs.)
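
For illustration, the bridging looks roughly like this (a minimal sketch, not our actual AsyncCommand implementation; AnalyzeCommand and runBatch are placeholder names):

    import Vapor

    // Sketch only: launching async work from a synchronous ConsoleKit Command
    // by completing an EventLoopPromise from a Task and waiting on the future.
    struct AnalyzeCommand: Command {
        struct Signature: CommandSignature {}
        let help = "Runs one analysis batch"

        func run(using context: CommandContext, signature: Signature) throws {
            let app = context.application
            let promise = app.eventLoopGroup.any().makePromise(of: Void.self)
            promise.completeWithTask {
                try await self.runBatch(on: app)   // placeholder for the real batch work
            }
            try promise.futureResult.wait()        // blocks the command's thread, not an event loop
        }

        func runBatch(on app: Application) async throws {
            // ... query GitHub, update the database, etc. ...
        }
    }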

Our logs show no unusual activity around the error, although we've not yet had it happen with DEBUG logging enabled.


We're also tracking this issue here: https://github.com/SwiftPackageIndex/SwiftPackageIndex-Server/issues/2227, and I'll be adding more observations as they happen there.

finestructure commented 1 year ago

Just happened with debug logs, attached.

One thing that I notice is that there's an exactly 10s gap between the last DEBUG log and the Connection request timed out error. That looks like a default timeout, and perhaps the length can give a hint as to which part of the system is timing out (network vs db connection, if that's even the right distinction; I don't know).

hang-prod-2023-04-27.logs.zip

finestructure commented 1 year ago

Ok, I actually have a local reproducer for this now. This happens because a synchronous part of our task runs too long: a git clone run in a shell. I can also force the issue more reliably by adding a sleep(11) (longer than the 10s timeout) to the synchronous part.

I'm unsure how to best deal with this though. This needs to be run as part of the task as a whole but it's not clear to me how to offload it to unblock the ELF and resume processing on the ELF once it's done.

I guess I could increase the timeout but to some degree that's just kicking the can down the road until the next bigger repo / network glitch comes along.

I'd like to avoid the hangs, as they require manual restarting of the job.

0xTim commented 1 year ago

@finestructure have you tried submitting the long-running job to Vapor's thread pool to offload it from the event loop? https://api.vapor.codes/vapor/documentation/vapor/application/threadpool

That should at least free up the underlying event loop and stop it hanging. If that doesn't work how are you shelling out to git clone?
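
Something along these lines, for instance (a sketch; runGitClone is a placeholder for the real blocking call, not an existing API):

    import Vapor

    // Placeholder for the real blocking shell-out.
    func runGitClone() { /* run `git clone` via Process, blocking */ }

    // Future-based: the blocking work runs on a thread-pool thread and the
    // event loop only sees the completion.
    func cloneOffEventLoop(app: Application, on eventLoop: any EventLoop) -> EventLoopFuture<Void> {
        app.threadPool.runIfActive(eventLoop: eventLoop) {
            runGitClone()
        }
    }

    // The same thing awaited from async/await code via .get().
    func cloneAsync(app: Application) async throws {
        try await app.threadPool.runIfActive(eventLoop: app.eventLoopGroup.any()) {
            runGitClone()
        }.get()
    }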

finestructure commented 1 year ago

Ah, that's exactly what I was going to try and figure out tomorrow - how would I offload this properly. Thanks a lot for the pointer!

finestructure commented 1 year ago

Note to future self re-reading this comment: This has been explained by Gwynne on Discord. The use of the transaction is indeed what triggers the error; the other code path only avoids it because it doesn't actually request a connection, due to the way Fluent deals with the connection pool.


So using application.threadPool doesn't change anything. I may be holding it wrong, but while constructing a standalone app to make it easier to discuss, I found that the issue seems to be related to db transactions.

If my workload is

    try await database.transaction { tx in
        print("sleeping")
        sleep(11)
    }

the connection deadlock error is triggered. If it's simply

                        print("sleeping")
                        sleep(11)

it isn't.

Note that there's no actual db activity involved at all. The example runs without a db.

I may have oversimplified the example and be triggering something that only looks like our issue but it certainly looks like what we're seeing in the full app (which is also using a transaction).

The reproducer (a small vapor app) is here: https://github.com/finestructure/debug-2227
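
For reference, the shape of the reproducer is roughly this (a sketch, not the actual contents of that repo):

    import Fluent
    import Foundation
    import Vapor

    // Sketch: blocking for longer than the pool timeout inside a transaction
    // closure triggers the "Connection request timed out" error; the same
    // sleep outside a transaction does not (see the snippets above).
    struct SleepCommand: AsyncCommand {
        struct Signature: CommandSignature {}
        let help = "Reproduces the connection request timeout"

        func run(using context: CommandContext, signature: Signature) async throws {
            let database = context.application.db
            try await database.transaction { tx in
                print("sleeping")
                sleep(11)   // longer than the default 10s connectionPoolTimeout
            }
        }
    }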

finestructure commented 1 year ago

> I may have oversimplified the example and be triggering something that only looks like our issue but it certainly looks like what we're seeing in the full app (which is also using a transaction).

I'm pretty sure now this is our core issue. Luckily it was easy to rearrange the code (in this instance) to pull the slow sync workload out of the transaction and do it up front before the db access. Once I do that, the deadlock errors also disappear in my local reproducer with our actual analysis code.
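
Roughly, the rearrangement looks like this (a sketch with placeholder types and helpers; the real analysis code is more involved):

    import Fluent
    import Vapor

    struct Metadata {}                                                    // placeholder result type
    func runSlowSyncAnalysis() -> Metadata { Metadata() }                 // placeholder for the blocking clone/analysis
    func save(_ metadata: Metadata, on db: any Database) async throws {}  // placeholder for the Fluent writes

    func processPackage(app: Application, db: any Database) async throws {
        // 1. Do the slow, blocking work first, off the event loop.
        let eventLoop = app.eventLoopGroup.any()
        let metadata = try await app.threadPool.runIfActive(eventLoop: eventLoop) {
            runSlowSyncAnalysis()
        }.get()

        // 2. Only then take a pooled connection, and hold it only briefly.
        try await db.transaction { tx in
            try await save(metadata, on: tx)
        }
    }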

finestructure commented 1 year ago

Looks like this issue is actually two separate issues in a trench coat. After working around the connection deadlock error, it (re-)uncovered an underlying crash that is likely the cause for the hang. I've added some more details in our tracking issue and will try to get a reproducer: https://github.com/SwiftPackageIndex/SwiftPackageIndex-Server/issues/2227#issuecomment-1541400276

finestructure commented 1 year ago

Update: the connection deadlock errors are also still happening. We've just had it on prod in conjunction with a hang. The earlier report was on dev where it hung without any connection deadlock errors. Could be that the fix I deployed yesterday just made these error messages less likely and they're unrelated to the hangs.

More details here: https://github.com/SwiftPackageIndex/SwiftPackageIndex-Server/issues/2227#issuecomment-1541576199

finestructure commented 1 year ago

One thing I don't think I mentioned yet: When the hangs happen with the connection deadlock error, the process is consuming CPU:

[Screenshot: CleanShot 2023-05-10 at 10 35 19@2x (CPU usage of the hung process)]

There's no apparent activity though: no logs, no forward progress.

When the hang happened earlier on dev without the connection deadlock message, it hung but the process' CPU was 0%.

gwynne commented 1 year ago

This is the effect of the connectionPoolTimeout parameter of FluentPostgresConfiguration (usually seen as app.databases.use(.postgres(...))), which defaults to 10 seconds.
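
For reference, the timeout is set where the database is configured; a sketch assuming a recent fluent-postgres-driver API (exact initializer labels vary by version):

    import Fluent
    import FluentPostgresDriver

    // In configure(_ app: Application). Sketch: raising the pool timeout
    // from its 10-second default.
    app.databases.use(
        .postgres(
            configuration: .init(
                hostname: "localhost",
                username: "vapor_username",
                password: "vapor_password",
                database: "vapor_database",
                tls: .disable
            ),
            connectionPoolTimeout: .seconds(30)   // default is .seconds(10)
        ),
        as: .psql
    )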

finestructure commented 1 year ago

Yes, and I managed to work around this particular part of the problem (the connection deadlock), I believe.

The real problem for us is that our batch job (the Vapor command) hangs and needs manual restarting. I have my doubts this is actually related to the connection deadlock now.

Also note the change in behaviour depending on whether the slow job is inside a db transaction or not (https://github.com/vapor/async-kit/issues/104). Feels like there's something amiss there 🤔

finestructure commented 1 year ago

The weird thing is that this timeout seems to be dependent on whether I'm using a db transaction or not (see the comment and the test project above: https://github.com/vapor/async-kit/issues/104).

I'll bump the parameter up for now to see if that helps with the frequency of the hangs. Thank you!

gwynne commented 1 year ago

Not sure why this is in console-kit; it belongs on async-kit, transferring there.

0xTim commented 1 year ago

@finestructure Is this still an issue? Have you tried with the latest async-kit? That had a load of improvements to connection pooling to stop these kinds of issues.

finestructure commented 1 year ago

Unfortunately, it is. We're on async-kit 1.18.0 and our analysis job on dev is hanging right now. I've started tracking just our prod occurrences of it here: https://github.com/SwiftPackageIndex/SwiftPackageIndex-Server/issues/2227#issuecomment-1675743222.

I'm unsure what else I could do to track this down, and it's infrequent enough that it's just easier to respond and restart the job when it happens than to dig into this further. Let me know if there's anything I can do to gather more info!

gwynne commented 1 year ago

My current best guess is that this is happening because the queues package doesn't actually interact with the fluent provider package, the result being that when you use context.application.db() with a QueueContext, you end up with a database that's sourced from the connection pool of whatever event loop the application happens to come up with at that moment, rather than the one the queue is actually running on. (Why this would matter, I haven't yet figured out; I mentioned this was a(n educated) guess, right? 😰)

@finestructure To test this theory,

  1. Add this extension to your code:

         extension Application {
             public func db(_ id: DatabaseID?, on eventLoop: any EventLoop) -> any Database {
                 self.databases
                     .database(
                         id,
                         logger: self.logger,
                         on: eventLoop,
                         history: self.fluent.history.historyEnabled ? self.fluent.history.history : nil,
                         pageSizeLimit: self.fluent.pagination.pageSizeLimit
                     )!
             }
         }

  2. In your Job (or AsyncJob), use context.application.db(.psql, on: context.eventLoop) anywhere you'd otherwise use context.application.db().

If that fixes it, I will look into adding appropriate support to the packages in question.

finestructure commented 1 year ago

Thanks for taking a look, @gwynne ! We're not using the queues package. This is a "vapor command" that we're running on a schedule (every ~20s). Its runtime is a few seconds.

  analyze:
    image: registry.gitlab.com/finestructure/swiftpackageindex:${VERSION}
    <<: *shared
    depends_on:
      - migrate
    entrypoint: ["/bin/bash"]
    command: ["-c", "--",
      "trap : TERM INT; while true; do ./Run analyze --env ${ENV} --limit ${ANALYZE_LIMIT:-25}; sleep ${ANALYZE_SLEEP:-20}; done"
    ]
    deploy:
      resources:
        limits:
          memory: 4GB
      restart_policy:
        max_attempts: 5
    networks:
      - backend

Would your theory still apply?

gwynne commented 1 year ago

@finestructure Even more so, actually, since by default a Command doesn't run on any event loop (and an AsyncCommand will be running on a Concurrency task). In either case, you can still use the extension I suggested, but at the entry point to your command, do:

    let eventLoop = context.application.eventLoopGroup.any()
    let db = context.application.db(.psql, on: eventLoop)

Then make sure to use db everywhere (and use eventLoop in any place you might need to refer to an event loop, if any). The goal is the same: Make sure you do everything that's related to an event loop (versus a Concurrency task context etc.) on the same event loop.
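
Concretely, the command's entry point might then look like this (a sketch assuming the extension above; runBatch is a placeholder for the real work):

    import Fluent
    import FluentPostgresDriver
    import Vapor

    struct AnalyzeCommand: AsyncCommand {
        struct Signature: CommandSignature {}
        let help = "Runs one analysis batch"

        func run(using context: CommandContext, signature: Signature) async throws {
            // Pin one event loop for the lifetime of the command...
            let eventLoop = context.application.eventLoopGroup.any()
            // ...and source the database from that same event loop's pool,
            // using the db(_:on:) extension suggested earlier.
            let db = context.application.db(.psql, on: eventLoop)

            try await runBatch(on: db, eventLoop: eventLoop)   // placeholder for the real batch work
        }

        func runBatch(on db: any Database, eventLoop: any EventLoop) async throws {
            // ... all Fluent calls go through `db`; anything that needs an
            // event loop uses the pinned `eventLoop` ...
        }
    }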

gwynne commented 1 year ago

This has been resolved by other work.