dkliban opened 1 month ago
When I look at this, here's the situation I see. The DB itself is not fully loaded (it's at roughly 33%), so the DB isn't the rate-limiting component here. Also, the API workers are timing out because they wait a really long time for an advisory lock.
So what that means to me is that I believe we're running into the architectural limit of task insertion into the db (or maybe also task handling?). We have 48 workers running in this system, which is a lot of workers, but we may even need more.
This is an interesting problem because we can't increase throughput or capacity by making more hardware resources available. I think this can only be solved algorithmically. The idea would be (somehow?) to make the acquisition of locks less contentious.
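For reference, one way to confirm from the database side that backends really are queued behind the advisory lock is to look at pg_stat_activity for sessions whose wait event is an advisory lock. A minimal sketch (the connection string is a placeholder, not our actual setup):

```python
# Sketch: list backends currently waiting on an advisory lock.
# The DSN below is a placeholder; point it at the actual Pulp database.
import psycopg2

conn = psycopg2.connect("dbname=pulp user=pulp host=localhost")
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT pid, state, wait_event_type, wait_event, query
        FROM pg_stat_activity
        WHERE wait_event_type = 'Lock' AND wait_event = 'advisory'
        """
    )
    for pid, state, wtype, wevent, query in cur.fetchall():
        print(pid, state, wtype, wevent, query[:80])
```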
Can you identify whether this is related to the unblocked_at change? Maybe we are seeing other table locks slowing the insertion down, and so the advisory lock (being a turnstile for ensuring monotonic pulp_created values at all cost) would be slowed down externally. Or maybe we really just hit the limit of that special bottleneck. Adding more resources is certainly not improving the situation here. A first idea (under the assumption that concurrent tasks rarely touch the same resources) could be to create some sort of bloom filter on the tasks' resources and spread the current single advisory lock into 8. Then only tasks having an overlap in the 8 resource-identifier buckets would need to wait on each other's locks; see the sketch below.
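A rough sketch of that bucketed-lock idea, just to make it concrete. Everything here is illustrative: the function name, the bucket count of 8, the hash choice, and the namespace key are made up and this is not how Pulp's task dispatch works today.

```python
# Illustrative sketch only: hash each resource identifier to one of 8 buckets
# and take a transaction-scoped advisory lock per bucket, so that two tasks
# only contend when their resources land in the same bucket.
import zlib

from django.db import connection, transaction

TASK_LOCK_BUCKETS = 8
LOCK_NAMESPACE = 0x70756C70  # arbitrary int4 namespace for the two-key lock form


def _bucket(resource: str) -> int:
    return zlib.crc32(resource.encode()) % TASK_LOCK_BUCKETS


def dispatch_with_bucketed_locks(resources, insert_task):
    # Sort the bucket ids so every dispatcher acquires locks in the same
    # order, which avoids deadlocks between concurrent dispatchers.
    buckets = sorted({_bucket(r) for r in resources})
    with transaction.atomic():
        with connection.cursor() as cursor:
            for bucket in buckets:
                # pg_advisory_xact_lock() blocks until acquired and is
                # released automatically when the transaction ends.
                cursor.execute(
                    "SELECT pg_advisory_xact_lock(%s, %s)",
                    [LOCK_NAMESPACE, bucket],
                )
        insert_task()
```

The obvious trade-off is that pulp_created values would then only be strictly ordered among tasks that share a bucket, rather than globally.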
OTOH, it might be worth rerunning the tests with the new indices we just added on the tasks table.
Thanks for the thoughtful comments.
Yes, let's rerun the tests once our installation is upgraded to that released version. Can you let us know what version that is once it is known?
It merged this week.
Here is another screenshot from the RDS management console.
I currently have 50 concurrent threads, each creating a remote, creating a repo, and syncing the repo. Here are the top 10 queries.
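For context, the load generator follows roughly this pattern; everything concrete in the sketch (base URL, credentials, feed URL, and the rpm-plugin endpoints) is a placeholder or assumption rather than the actual setup:

```python
# Sketch of the load pattern: 50 threads, each creating a remote, creating a
# repository, and triggering a sync. No error handling, for brevity.
from concurrent.futures import ThreadPoolExecutor

import requests

BASE = "https://pulp.example.com"            # placeholder
AUTH = ("admin", "password")                 # placeholder
FEED = "https://fixtures.example.com/rpm/"   # placeholder


def one_iteration(i: int) -> None:
    remote = requests.post(
        f"{BASE}/pulp/api/v3/remotes/rpm/rpm/",
        json={"name": f"remote-{i}", "url": FEED},
        auth=AUTH,
    ).json()
    repo = requests.post(
        f"{BASE}/pulp/api/v3/repositories/rpm/rpm/",
        json={"name": f"repo-{i}", "remote": remote["pulp_href"]},
        auth=AUTH,
    ).json()
    # Each sync call dispatches a task, so 50 of these in flight at once is
    # what exercises the task-insertion path (and its advisory lock).
    requests.post(f"{BASE}{repo['pulp_href']}sync/", json={}, auth=AUTH)


with ThreadPoolExecutor(max_workers=50) as pool:
    list(pool.map(one_iteration, range(50)))
```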
I have 24 workers running right now.
The green color represents CPU wait time. AWS is suggesting that the instance be upgraded to one with more CPU resources. I agree with their assessment.
Tell me, is this a reason to close this issue?
I opened this issue when I had 48 workers running. Right now I am using 24 workers to get around the advisory lock issue.
I believe that if I increase to 48 again, we will see this problem again. Let's keep the issue open at least until I try 48 workers again.
Version 3.52.0
Describe the bug
I have 10 API pods, each running 20 gunicorn workers. I am submitting a lot of sync tasks, and eventually some API workers time out and the following traceback is emitted:
Here is a screenshot of the db load: