dkliban opened 1 month ago
When I look at this, here's the situation I see. The DB itself is not fully loaded (it's at roughly 33%), so the DB isn't the rate-limiting component here. Also, the API workers are timing out because they wait a really long time for an advisory lock.
So what that means to me is that I believe we're running into the architectural limit of task insertion into the db (or maybe also task handling?). We have 48 workers running in this system, which is a lot of workers, but we may even need more.
This is an interesting problem because we can't increase throughput or capacity by making more hardware resources available. I think this can only be solved algorithmically. The idea would be (somehow?) to make the acquisition of locks less contentious.
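For reference, one way to confirm from the database side that backends really are queued behind the advisory lock is to look at pg_stat_activity for sessions whose wait event is an advisory lock. A minimal sketch (the connection string is a placeholder, not our actual setup):

```python
# Sketch: list backends currently waiting on an advisory lock.
# The DSN below is a placeholder; point it at the actual Pulp database.
import psycopg2

conn = psycopg2.connect("dbname=pulp user=pulp host=localhost")
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT pid, state, wait_event_type, wait_event, query
        FROM pg_stat_activity
        WHERE wait_event_type = 'Lock' AND wait_event = 'advisory'
        """
    )
    for pid, state, wtype, wevent, query in cur.fetchall():
        print(pid, state, wtype, wevent, query[:80])
```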
Can you identify whether this is related to the unblocked_at change? Maybe we are seeing other table locks slowing the insertion down, and so the advisory lock (being a turnstile for ensuring monotonic pulp_created values at all cost) would be slowed down externally. Or maybe we really just hit the limit of that special bottleneck. Adding more resources is certainly not improving the situation here. A first idea (under the assumption that concurrent tasks rarely touch the same resources) could be to create some sort of bloom filter on the tasks' resources and spread the current single advisory lock into 8. Then only tasks having an overlap in the 8 resource-identifier buckets would need to wait on each other's locks; see the sketch below.
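A rough sketch of that bucketed-lock idea, just to make it concrete. Everything here is illustrative: the function name, the bucket count of 8, the hash choice, and the namespace key are made up and this is not how Pulp's task dispatch works today.

```python
# Illustrative sketch only: hash each resource identifier to one of 8 buckets
# and take a transaction-scoped advisory lock per bucket, so that two tasks
# only contend when their resources land in the same bucket.
import zlib

from django.db import connection, transaction

TASK_LOCK_BUCKETS = 8
LOCK_NAMESPACE = 0x70756C70  # arbitrary int4 namespace for the two-key lock form


def _bucket(resource: str) -> int:
    return zlib.crc32(resource.encode()) % TASK_LOCK_BUCKETS


def dispatch_with_bucketed_locks(resources, insert_task):
    # Sort the bucket ids so every dispatcher acquires locks in the same
    # order, which avoids deadlocks between concurrent dispatchers.
    buckets = sorted({_bucket(r) for r in resources})
    with transaction.atomic():
        with connection.cursor() as cursor:
            for bucket in buckets:
                # pg_advisory_xact_lock() blocks until acquired and is
                # released automatically when the transaction ends.
                cursor.execute(
                    "SELECT pg_advisory_xact_lock(%s, %s)",
                    [LOCK_NAMESPACE, bucket],
                )
        insert_task()
```

The obvious trade-off is that pulp_created values would then only be strictly ordered among tasks that share a bucket, rather than globally.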
OTOH, it might be worth rerunning the tests with the new indices we just added on the tasks table.
Thanks for the thoughtful comments.
Yes, let's rerun the tests once our installation is upgraded to that released version. Can you let us know what version that is once it is known?
It merged this week.
Here is another screenshot from the RDS management console.
I currently have 50 concurrent threads, each creating a remote, creating a repo, and syncing the repo. Here are the top 10 queries.
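For context, the load generator follows roughly this pattern; everything concrete in the sketch (base URL, credentials, feed URL, and the rpm-plugin endpoints) is a placeholder or assumption rather than the actual setup:

```python
# Sketch of the load pattern: 50 threads, each creating a remote, creating a
# repository, and triggering a sync. No error handling, for brevity.
from concurrent.futures import ThreadPoolExecutor

import requests

BASE = "https://pulp.example.com"            # placeholder
AUTH = ("admin", "password")                 # placeholder
FEED = "https://fixtures.example.com/rpm/"   # placeholder


def one_iteration(i: int) -> None:
    remote = requests.post(
        f"{BASE}/pulp/api/v3/remotes/rpm/rpm/",
        json={"name": f"remote-{i}", "url": FEED},
        auth=AUTH,
    ).json()
    repo = requests.post(
        f"{BASE}/pulp/api/v3/repositories/rpm/rpm/",
        json={"name": f"repo-{i}", "remote": remote["pulp_href"]},
        auth=AUTH,
    ).json()
    # Each sync call dispatches a task, so 50 of these in flight at once is
    # what exercises the task-insertion path (and its advisory lock).
    requests.post(f"{BASE}{repo['pulp_href']}sync/", json={}, auth=AUTH)


with ThreadPoolExecutor(max_workers=50) as pool:
    list(pool.map(one_iteration, range(50)))
```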
I have 24 workers running right now.
The green color represents CPU wait time. AWS is suggesting that the instance be upgraded to one with more CPU resources. I agree with their assessment.
Tell me, is this a reason to close this issue?
I opened this issue when I had 48 workers running. Right now I am using 24 workers to get around the advisory lock issue.
I believe that if I increase to 48 again, we will see this problem again. Let's keep the issue open at least until I try 48 workers again.
Version 3.52.0
Describe the bug
I have 10 API pods, each running 20 gunicorn workers. I am submitting a lot of sync tasks, and eventually some API workers time out and the following traceback is emitted:
Here is a screenshot of the db load: