Just happened with debug logs, attached.
One thing that I notice is that there's an exactly 10s gap between the last `DEBUG` log and the `Connection request timed out` error. That looks like a default timeout, and perhaps the length can give a hint as to which part of the system is timing out (network vs db connection - if that's even a meaningful distinction, I don't know).
Ok, I actually have a local reproducer for this now. This happens because a synchronous part of our task is running too long: a `git clone` run in a shell. I can also force the issue more reliably by adding a `sleep(11)` (> the 10s timeout) to the synchronous part.
I'm unsure how to best deal with this, though. This needs to run as part of the task as a whole, but it's not clear to me how to offload it to unblock the event loop and resume processing once it's done.
I guess I could increase the timeout but to some degree that's just kicking the can down the road until the next bigger repo / network glitch comes along.
I'd like to avoid the hangs, as they require manual restarting of the job.
@finestructure have you tried submitting the long-running job to Vapor's thread pool to offload it from the event loop? https://api.vapor.codes/vapor/documentation/vapor/application/threadpool

That should at least free up the underlying event loop and stop it hanging. If that doesn't work, how are you shelling out to `git clone`?
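For reference, a minimal sketch of what that offloading could look like, assuming an async context; `runGitClone()` is a hypothetical stand-in for the blocking shell-out:

```swift
import Vapor

// Hypothetical stand-in for the blocking, synchronous work (the git clone).
func runGitClone() { /* shell out and wait */ }

// Sketch: run the blocking work on Vapor's thread pool. runIfActive
// executes the closure on a pool thread and completes its future back
// on the given event loop, so the loop itself stays unblocked.
func analyze(on app: Application) async throws {
    let eventLoop = app.eventLoopGroup.any()
    try await app.threadPool.runIfActive(eventLoop: eventLoop) {
        runGitClone()
    }.get()
}
```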
Ah, that's exactly what I was going to try and figure out tomorrow: how to offload this properly. Thanks a lot for the pointer!
Note to future self re-reading this comment: this has been explained by Gwynne on Discord. The use of the transaction is indeed what triggers the error; the other code path avoids it only because it never actually requests a connection, due to the way Fluent deals with the connection pool.
So using `application.threadPool` doesn't change anything. I may be holding it wrong, but while constructing a standalone app to make it easier to discuss, I found that the issue seems to be related to db transactions.
If my workload is

```swift
try await database.transaction { tx in
    print("sleeping")
    sleep(11)  // blocks synchronously for longer than the 10s pool timeout
}
```

the `connection deadlock` error is triggered. If it's simply

```swift
print("sleeping")
sleep(11)
```

it isn't.
Note that there's no actual db activity involved at all. The example runs without a db.
I may have oversimplified the example and be triggering something that only looks like our issue but it certainly looks like what we're seeing in the full app (which is also using a transaction).
The reproducer (a small Vapor app) is here: https://github.com/finestructure/debug-2227
I'm pretty sure now this is our core issue. Luckily it was easy to rearrange the code (in this instance) to pull the slow sync workload out of the transaction and do it up front before the db access. Once I do that, the deadlock errors also disappear in my local reproducer with our actual analysis code.
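Schematically, the rearrangement amounts to something like the sketch below; `slowSyncWork()` and `save(_:on:)` are placeholders, not our actual code:

```swift
// Before: the blocking work runs while the transaction holds a pooled
// connection, starving other waiters until the pool timeout fires.
try await database.transaction { tx in
    let result = slowSyncWork()      // e.g. shelling out to git clone
    try await save(result, on: tx)
}

// After: do the slow work up front and keep the transaction short.
let result = slowSyncWork()
try await database.transaction { tx in
    try await save(result, on: tx)
}
```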
Looks like this issue is actually two separate issues in a trench coat. After working around the `connection deadlock` error, it (re-)uncovered an underlying crash that is likely the cause of the hang. I've added some more details in our tracking issue and will try to get a reproducer: https://github.com/SwiftPackageIndex/SwiftPackageIndex-Server/issues/2227#issuecomment-1541400276
Update: the `connection deadlock` errors are also still happening. We've just had it on `prod` in conjunction with a hang. The earlier report was on `dev`, where it hung without any `connection deadlock` errors. Could be that the fix I deployed yesterday just made these error messages less likely and they're unrelated to the hangs.
More details here: https://github.com/SwiftPackageIndex/SwiftPackageIndex-Server/issues/2227#issuecomment-1541576199
One thing I don't think I mentioned yet: when the hangs happen with the `connection deadlock` error, the process is consuming CPU. There's no apparent activity though: no logs, no forward progress.

When the hang happened earlier on `dev` without the `connection deadlock` message, it hung but the process' CPU was 0%.
This is the effect of the `connectionPoolTimeout` parameter of `FluentPostgresConfiguration` (usually seen as `app.databases.use(.postgres(...))`), which defaults to 10 seconds.
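If it helps, here's a sketch of where that knob lives when configuring the database; the connection values are illustrative, and the exact initializer spelling depends on the FluentPostgresDriver version:

```swift
// Sketch: raising connectionPoolTimeout from its 10s default.
app.databases.use(.postgres(
    configuration: .init(
        hostname: "localhost",
        username: "vapor",
        password: "secret",
        database: "spi",
        tls: .disable
    ),
    connectionPoolTimeout: .seconds(30)
), as: .psql)
```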
Yes, and I managed to work around this particular part of the problem (the `connection deadlock`), I believe.

The real problem for us is that our batch job (the Vapor command) hangs and needs manual restarting. I have my doubts this is actually related to the `connection deadlock` now.
Also note the change in behaviour depending on whether the slow job runs within a db transaction or not (https://github.com/vapor/async-kit/issues/104). Feels like there's something amiss there 🤔
The weird thing is that this timeout seems to be dependent on whether I'm using a db transaction or not (see the comment and the test project above: https://github.com/vapor/async-kit/issues/104).
I'll bump the parameter up for now to see if that helps with the frequency of the hangs. Thank you!
Not sure why this is in console-kit; it belongs on async-kit, transferring there.
@finestructure Is this still an issue? Have you tried with the latest async-kit? That had a load of improvements to connection pooling to stop these kinds of issues.
Unfortunately, it is. We're on async-kit 1.18.0 and our analysis job on dev is hanging right now. I've started tracking just our prod occurrences of it here: https://github.com/SwiftPackageIndex/SwiftPackageIndex-Server/issues/2227#issuecomment-1675743222.
I'm unsure what else I could do to track this down - and it's infrequent enough that it's just easier to respond and restart the job when it happens than to dig into this further. Let me know if there's anything I can do to gather more info!
My current best guess is that this is happening because the `queues` package doesn't actually interact with the `fluent` provider package, the result being that when you use `context.application.db()` with a `QueueContext`, you end up with a database that's sourced from the connection pool of whatever event loop the application happens to come up with at that moment, rather than the one the queue is actually running on. (Why this would matter, I haven't yet figured out; I mentioned this was a(n educated) guess, right? 😰)
@finestructure To test this theory, add an extension along these lines:

```swift
extension Vapor.Application {
    /// Returns a Database bound to the given event loop instead of one
    /// picked arbitrarily by the application.
    public func db(_ id: DatabaseID?, on eventLoop: any EventLoop) -> any Database {
        self.databases
            .database(
                id,
                logger: self.logger,
                on: eventLoop,
                history: self.fluent.history.historyEnabled ? self.fluent.history.history : nil,
                pageSizeLimit: self.fluent.pagination.pageSizeLimit
            )!
    }
}
```

Then, in your `Job` (or `AsyncJob`), use `context.application.db(.psql, on: context.eventLoop)` anywhere you'd otherwise use `context.application.db()`. If that fixes it, I will look into adding appropriate support to the packages in question.
Thanks for taking a look, @gwynne! We're not using the `queues` package. This is a "vapor command" that we're running on a schedule (every ~20s). Its runtime is a few seconds.
```yaml
analyze:
  image: registry.gitlab.com/finestructure/swiftpackageindex:${VERSION}
  <<: *shared
  depends_on:
    - migrate
  entrypoint: ["/bin/bash"]
  command: ["-c", "--",
    "trap : TERM INT; while true; do ./Run analyze --env ${ENV} --limit ${ANALYZE_LIMIT:-25}; sleep ${ANALYZE_SLEEP:-20}; done"
  ]
  deploy:
    resources:
      limits:
        memory: 4GB
    restart_policy:
      max_attempts: 5
  networks:
    - backend
```
Would your theory still apply?
@finestructure Even more so, actually, since by default a `Command` doesn't run on any event loop (and an `AsyncCommand` will be running on a Concurrency task). In either case, you can still use the extension I suggested, but at the entry point to your command, do:

```swift
let eventLoop = context.application.eventLoopGroup.any()
let db = context.application.db(.psql, on: eventLoop)
```

Then make sure to use `db` everywhere (and use `eventLoop` in any place you might need to refer to an event loop, if any). The goal is the same: make sure you do everything that's related to an event loop (versus a Concurrency task context etc.) on the same event loop.
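Put together, the command's entry point might look roughly like this; `AnalyzeCommand` and its body are illustrative, and it relies on the `db(_:on:)` extension above:

```swift
import Vapor

// Sketch: a hypothetical AsyncCommand that pins all event-loop-bound
// work to a single loop via the db(_:on:) extension suggested above.
struct AnalyzeCommand: AsyncCommand {
    struct Signature: CommandSignature {}
    let help = "Run the analysis batch job"

    func run(using context: CommandContext, signature: Signature) async throws {
        // Pick one event loop up front and use it consistently.
        let eventLoop = context.application.eventLoopGroup.any()
        let db = context.application.db(.psql, on: eventLoop)

        // All database work goes through `db`, so every connection is
        // leased from the pool of that one event loop.
        try await db.transaction { tx in
            // ... batch processing against `tx` ...
        }
    }
}
```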
This has been resolved by other work.
This is a follow-up of a Discord discussion, FYI @0xTim!
We're seeing a `Connection request timed out` / `connection deadlock` error in one of our Vapor batch jobs every once in a while (a few times per month, maybe once a week - hard to quantify precisely).
When this message occurs, the job HANGS - i.e. it locks up and does not terminate, preventing all further processing.
In normal operation the job spins up and typically processes for 3-6 seconds, running a number of queries against the GitHub APIs (we're processing batches of 25 Swift packages, updating GitHub metadata in the Swift Package Index).
I've done a little research and there seem to be a couple of cases where this can happen:
1) too many db connections
2) an error handling problem in networking code
I suspect it's 2). Last I checked we're not exhausting our db connections, and our db utilization across our two envs is <15% and <10%.
I believe this issue first started happening after moving to async/await. This may be due to how we're launching async processes from a `Command` via our own `AsyncCommand`. (We previously used a semaphore-based implementation which also encountered hangs.)

Our logs show no unusual activity around the error, although we've not yet had it happen with `DEBUG` logging enabled.

We're also tracking this issue here: https://github.com/SwiftPackageIndex/SwiftPackageIndex-Server/issues/2227, and I'll be adding more observations as they happen there.