vipyrsec / dragonfly-mainframe

The mainframe for Dragonfly
https://docs.vipyrsec.com/dragonfly-mainframe/
MIT License

Investigate performance issues introduced by `9db3e05` #235

Closed · Robin5605 closed this issue 2 months ago

Robin5605 commented 3 months ago

This script runs successfully prior to https://github.com/vipyrsec/dragonfly-mainframe/commit/9db3e05e6a728abd48d460eae13cda3d2545c5cf:

import asyncio

import httpx

async def do_job(client: httpx.AsyncClient):
    # get a job from the queue
    res = await client.post("/jobs")
    j = res.json()[0]
    name, version = j["name"], j["version"]

    # submit results for that job
    payload = {"name": name, "version": version, "reason": "Package too large"}
    await client.put("/package", json=payload)

async def main():
    async with httpx.AsyncClient(base_url="http://localhost:8000") as client:
        # fire 100 concurrent job/submit cycles against the mainframe
        async with asyncio.TaskGroup() as tg:
            for _ in range(100):
                tg.create_task(do_job(client))

asyncio.run(main())

However, running it with https://github.com/vipyrsec/dragonfly-mainframe/commit/9db3e05e6a728abd48d460eae13cda3d2545c5cf checked out results in SQLAlchemy throwing a "QueuePool limit of size 5 overflow 17 reached, connection timed out, timeout 30.00" error.

jonathan-d-zhang commented 3 months ago

We get a deadlock because FastAPI schedules dependency cleanup in the thread pool shared by endpoint functions. The reason we deadlock: we have a limited number of database connections, c, and a limited number of workers, w, with w > c. When w requests come in, all workers start working on them; c of those workers run queries, while the rest block waiting for connections. When the first c requests finish, they have to wait for their dependencies to be cleaned up, which adds another c tasks to the queue just to close the connections. At that point every worker is sleeping, either waiting for a cleanup task to run or waiting for a connection, so the queued cleanup tasks never execute. Therefore, deadlock.
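
For reference, the pattern being described looks roughly like this (a minimal sketch; the engine URL, pool size, table, and endpoint body are assumptions, not the project's actual code). A sync generator dependency's teardown runs in the same thread pool FastAPI uses for sync endpoints:

from fastapi import Depends, FastAPI
from sqlalchemy import create_engine
from sqlalchemy.orm import Session, sessionmaker

# assumed connection URL and pool size; the real configuration differs
engine = create_engine("postgresql://localhost/dragonfly", pool_size=5)
SessionLocal = sessionmaker(bind=engine)
app = FastAPI()

def get_db():
    session = SessionLocal()
    try:
        yield session    # the endpoint runs while the session is open
    finally:
        session.close()  # teardown: dispatched to the shared thread pool

@app.post("/jobs")
def get_jobs(session: Session = Depends(get_db)):
    # sync endpoint: runs in the same thread pool as the teardown above
    ...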

This was discussed in this issue: https://github.com/tiangolo/fastapi/discussions/6628, leading to a PR: https://github.com/tiangolo/fastapi/pull/5122, released in version 0.82.0 (https://fastapi.tiangolo.com/release-notes/#0820). However, we are already on version 0.110.0.

As a stop-gap solution, we can simply close sessions manually ourselves, either with context managers or by calling .close() on them, as sketched below.
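
One shape the stop-gap could take (a sketch with a hypothetical table and query, reusing SessionLocal and app from the sketch above; not the project's actual handlers):

from sqlalchemy import text

@app.post("/jobs")
def get_jobs():
    session = SessionLocal()  # no Depends(): we own the session's lifecycle
    try:
        rows = session.execute(text("SELECT name, version FROM scans")).all()
        return [{"name": name, "version": version} for name, version in rows]
    finally:
        session.close()       # closed inline, never queued as a cleanup task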

Robin5605 commented 3 months ago

Further investigation in the Discord server turned up a peculiar issue relating to deadlocks. A handful of upstream discussions, issues, and PRs relate to this:

These being merged and closed implies that the issue was fixed; however, that does not seem to be the case, at least for us. The problem, quite intuitive in retrospect, boils down to the following: some number of requests come in (say 50), greater than the number of internal thread pool workers (say 40 worker threads). All 40 workers start handling requests, but only some of them can actually run queries (say we have 5 database connections); the other 35 workers block waiting for a connection to become available. Once the first 5 requests finish their queries, they schedule 5 tasks to close their sessions. However, since every worker in the thread pool is busy, either waiting for its cleanup task to run or waiting for a database connection, the cleanup tasks can never run. Therefore, we deadlock.
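
To make the failure mode concrete, here is a small self-contained simulation of the mechanism using the same numbers (40 workers, 5 connections, 50 requests). The timeouts exist only so the demo terminates instead of hanging forever:

import concurrent.futures
import threading

WORKERS, CONNECTIONS, REQUESTS = 40, 5, 50
pool = concurrent.futures.ThreadPoolExecutor(max_workers=WORKERS)
connections = threading.Semaphore(CONNECTIONS)  # stand-in for the DB pool

def close_connection():
    connections.release()  # "cleanup": return the connection to the pool

def handle_request(i: int) -> str:
    if not connections.acquire(timeout=10):
        return f"request {i}: gave up waiting for a connection"
    # schedule cleanup on the SAME pool and wait for it, mimicking how
    # dependency teardown was dispatched to the shared thread pool
    pool.submit(close_connection).result(timeout=10)
    return f"request {i}: done"

futures = [pool.submit(handle_request, i) for i in range(REQUESTS)]
done, stuck = concurrent.futures.wait(futures, timeout=5)
print(f"{len(stuck)} of {REQUESTS} requests still blocked after 5s: deadlock")
pool.shutdown(wait=False, cancel_futures=True)  # unstick the demo and exit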

The most straightforward solution is to stop relying on FastAPI's dependency injection system for database connections, and instead create sessions ourselves where we need them and close them at the end of the path operation. This ensures we never queue a separate session-closing task that could deadlock us.
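
A sketch of that direction (hypothetical endpoint body and table, reusing the engine setup from the earlier sketches; the real queries differ):

from sqlalchemy import text

@app.put("/package")
def update_package(payload: dict):
    # the session is created and closed within the path operation itself,
    # so FastAPI never schedules a separate cleanup task for it
    with SessionLocal() as session, session.begin():
        session.execute(
            text("UPDATE scans SET reason = :reason "
                 "WHERE name = :name AND version = :version"),
            {"name": payload["name"], "version": payload["version"],
             "reason": payload["reason"]},
        )
    return payload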