kevinvalk opened this issue 1 year ago
Thanks, will definitely look into that. For clarification: when you say degradation, in comparison to what? Smaller results, sync resolvers, or earlier versions of GraphQL-core or Python?
Compared with a sync resolver.
The field `fullName` is connected to `def resolve_fullname`, and `asyncFullName` to `async def async_resolve_fullname`. There is a 4-7x performance difference (sync is faster) between those two resolvers, with the only difference being the `async` method.
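For anyone trying to reproduce this, here is a minimal sketch of the kind of resolver pair being compared; the field and resolver names match the comment above, but the schema itself is made up:

```python
import asyncio

from graphql import (
    GraphQLField, GraphQLObjectType, GraphQLSchema, GraphQLString, graphql,
)

def resolve_fullname(obj, info):
    return "John Doe"

async def async_resolve_fullname(obj, info):
    # Nothing is awaited here, but the coroutine still has to go through asyncio.
    return "John Doe"

QueryType = GraphQLObjectType(
    "Query",
    {
        "fullName": GraphQLField(GraphQLString, resolve=resolve_fullname),
        "asyncFullName": GraphQLField(GraphQLString, resolve=async_resolve_fullname),
    },
)
schema = GraphQLSchema(query=QueryType)

async def main():
    print(await graphql(schema, "{ fullName asyncFullName }"))

asyncio.run(main())
```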
Ok, thanks for clarification. I think #142 and #189 are related issues.
Just to clarify, on #189 we didn't use async resolvers.
Did some performance tests and it seems `asyncio.gather` is HORRIBLY slow for CPU-bound futures when you have many:
https://perfpy.com/240
So based on these results I created the following monkey patch that gives us in our current production code 2x speed improvement:
```python
from importlib import import_module

execute = import_module(".execution.execute", "graphql")

async def serial_gather(*futures):
    return [await future for future in futures]

execute.gather = serial_gather
```
NOTE: These tests are based on a CPU-bound "resolver", so any optimizations have to be checked against a more normal workload, maybe with a mix of CPU-bound and IO-bound resolvers.
That said, as @JCatrielLopez indicated, there are also slowdowns in sync code, so I think there is a lot of performance left on the table. @Cito, how could we tackle investigations into increasing the performance of the graphql-python library?
@kevinvalk that's a really good find and it would definitely help the cases with simple async resolvers. However, it's worth pointing out that as soon as your resolvers are actually doing anything that might take some time (e.g. HTTP requests or file access), using the serial gather method will slow things down because the functions aren't running in parallel. I've written a benchmark to show that (the async function sleeps for 50 milliseconds on every 1000th iteration): https://perfpy.com/242
There might be an argument for allowing developers to configure how async results are gathered, but why are you using async functions for purely synchronous work anyway @kevinvalk? Async functions will always have an overhead compared to sync functions.
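To make the trade-off concrete, here is a small self-contained sketch (along the lines of the perfpy benchmark, but not identical to it) where the serial gather loses badly because the coroutines actually wait on something:

```python
import asyncio
import time

async def fetch(i):
    await asyncio.sleep(0.05)  # stand-in for an HTTP request or file access
    return i

async def serial_gather(*futures):
    return [await future for future in futures]

async def main():
    coros = [fetch(i) for i in range(20)]
    start = time.perf_counter()
    await asyncio.gather(*coros)   # runs concurrently, roughly 0.05s total
    print("gather:", time.perf_counter() - start)

    coros = [fetch(i) for i in range(20)]
    start = time.perf_counter()
    await serial_gather(*coros)    # runs one after another, roughly 1s total
    print("serial:", time.perf_counter() - start)

asyncio.run(main())
```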
Thank you all for your input. @kevinvalk The way that `gather` works in Python is that it creates an async task for each of the passed coroutines. If the overhead of running `asyncio.create_task` is high compared to the runtime of the coroutine itself, in this case creating a single random value with `random.randint`, then of course you will get very poor performance. Normally you trade the overhead of creating a task against the advantage of running code concurrently, but in this case, since the code is pure CPU code that is blocking the event loop, you do not even gain anything from it, as @jkimbo pointed out. In other words, the benchmark just shows how much slower `asyncio.create_task` is compared to `random.randint`.
However, I think we all agree that there is room for improvement in supporting different strategies, maybe in the form of optimization hints for the execution algorithm. At least, it should be possible to specify whether a serial or parallel execution strategy is used. Currently, only mutations are resolved serially, and you have no way to force queries to be executed serially (just like in GraphQL.js).
Note that GraphQL-core already detects when all (sub)field resolvers are synchronous and does not try to `gather` anything in this case. As @jkimbo wrote, using sync resolvers is therefore the simplest solution in such cases.
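A stripped-down illustration of that overhead, for anyone who wants to measure it locally without the perfpy setup (numbers will of course vary):

```python
import asyncio
import random
import time

async def make_value():
    return random.randint(0, 100)

async def main():
    n = 100_000

    start = time.perf_counter()
    await asyncio.gather(*(make_value() for _ in range(n)))  # one Task per coroutine
    print("gather:      ", time.perf_counter() - start)

    start = time.perf_counter()
    [await make_value() for _ in range(n)]                   # no extra tasks created
    print("serial await:", time.perf_counter() - start)

asyncio.run(main())
```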
@Cito One optimisation that we could make is to never use `asyncio.gather` when there is only one coroutine, since it wouldn't provide any benefit. It would help the example in this issue a lot, but I'm not sure how helpful it would be for most real use cases. What do you think?
@jkimbo Yes, that's a simple thing I can implement. The additional check is probably cheap compared to the overhead of running `gather`.
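For reference, a rough sketch of the kind of guard being discussed; this is not the actual graphql-core code, just the idea:

```python
import asyncio
from typing import Any, Awaitable, List

async def get_results(awaitables: List[Awaitable[Any]]) -> List[Any]:
    if len(awaitables) == 1:
        # A single awaitable gains nothing from being wrapped in a task.
        return [await awaitables[0]]
    return await asyncio.gather(*awaitables)
```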
Hey there, I've noticed something related to this when using GraphQL middlewares with async execution:
@Cito I monkey patched your change into our product and it (as you also expected) does not do anything :'(
The problem is not as synthetic as it may seem: we are using https://github.com/syrusakbary/aiodataloader to solve the GraphQL N+1 problem (and to get much more efficient database queries). This means that only occasionally do we have an actual database lookup (IO bound); all other times we have direct cache lookups (CPU bound) that just take the objects from the cache. However, all resolvers have to be async to get this to work.
So I was wondering if maybe my approach is wrong or if I am missing something else. Any ideas are more than welcome!
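For context, this is roughly the access pattern, assuming the standard aiodataloader API; the fake in-memory "database" stands in for the real IO-bound lookup:

```python
import asyncio
from aiodataloader import DataLoader

FAKE_DB = {1: {"id": 1, "name": "Alice"}, 2: {"id": 2, "name": "Bob"}}

class UserLoader(DataLoader):
    async def batch_load_fn(self, keys):
        # One batched (IO-bound) lookup for all requested keys...
        return [FAKE_DB.get(key) for key in keys]

async def main():
    loader = UserLoader()
    # ...but every resolver still has to await load(), even when the value
    # comes straight from the loader's cache (a CPU-bound dict lookup).
    first = await loader.load(1)
    again = await loader.load(1)  # cached, no second batch_load_fn call
    print(first, again)

asyncio.run(main())
```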
EDIT: After posting this I used py-spy (a great profiling tool I recently discovered) to investigate and quite quickly found another usage of `gather` in https://github.com/syrusakbary/aiodataloader. By replacing that with a serial await as well, my requests are now around 500ms (see https://github.com/syrusakbary/aiodataloader/issues/40)!
Note that I still need the serial await in GraphQL-core as well; if I undo the monkey patch I go from 500ms to 2s per request. The combined monkey patch now looks like this:
```python
from importlib import import_module
from typing import Any, Awaitable

execute = import_module(".execution.execute", "graphql")
aiodataloader = import_module("aiodataloader")

async def serial_gather(*futures: Awaitable[Any]):
    return [await future for future in futures]

aiodataloader.gather = execute.gather = serial_gather  # type: ignore
```
@kevinvalk thanks for the feedback. In fact, if you're using a lot of caching and the work starts to become CPU bound, then the price for async can be heavy and diminish the advantage of caching. See also "Wall vs CPU time, or the cost of asyncio Tasks" and "CPU Bound code runs 3x-5x slower with asyncio".
I currently do not have a good idea to solve or mitigate this problem and, honestly speaking, I don't want to spend too much time on optimizations. The main goal of GraphQL-core is to be a faithful Python port of the reference implementation GraphQL.js and to stay up to date with its latest version in terms of functionality, not to be particularly fast. If performance is needed, then there is only so much you can do with pure Python.
Btw, there is also the open issue #166 - I wonder if using anyio with trio would help, but I fear it may become even slower through the use of anyio as another layer.
We had exactly the same problem and found out that most of the time is spent in garbage collection (`gc_collect_main`).
If I run the above program from @kevinvalk with py-spy and the native option (`py-spy top --native -- python demo.py`) I get the following output.
```
Total Samples 11100
GIL: 100.00%, Active: 100.00%, Threads: 1

  %Own    %Total   OwnTime   TotalTime   Function (filename)
  0.00%    0.00%    44.14s     45.11s    gc_collect_main (libpython3.10.so.1.0)
  0.00%    0.00%     6.00s     53.87s    execute_field (graphql/execution/execute.py)
  0.00%    0.00%     5.42s     53.87s    execute_fields (graphql/execution/execute.py)
  0.00%    0.00%     3.50s     53.95s    complete_value (graphql/execution/execute.py)
  0.00%    0.00%     2.88s      2.88s    is_awaitable (graphql/pyutils/is_awaitable.py)
 11.00%  100.00%     2.57s     111.0s    _run_once (asyncio/base_events.py)
  0.00%    0.00%     2.47s      2.72s    get_argument_values (graphql/execution/values.py)
  0.00%    0.00%     2.44s      2.61s    __init__ (asyncio/events.py)
  7.00%   40.00%     2.34s     107.0s    _run (asyncio/events.py)
  0.00%    0.00%     2.30s      2.44s    <lambda> (<string>)
  0.00%    0.00%     2.29s     19.51s    gather (asyncio/tasks.py)
  0.00%    0.00%     2.25s      3.45s    build_resolve_info (graphql/execution/execute.py)
  0.00%    0.00%     1.99s     10.52s    _call_soon (asyncio/base_events.py)
  0.00%    0.00%     1.98s     10.87s    get_results (graphql/execution/execute.py)
  0.00%    0.00%     1.89s      1.89s    getpid (libc-2.31.so)
  0.00%    0.00%     1.79s     52.85s    complete_list_value (graphql/execution/execute.py)
  0.00%    0.00%     1.26s     11.94s    call_soon (asyncio/base_events.py)
  0.00%    0.00%     1.16s      1.18s    get_field_def (graphql/execution/execute.py)
```
If I disable garbage collection, the program runs in half the time.
```python
if __name__ == "__main__":
    import gc

    gc.disable()
    run(main, is_profile=False)
```
```
Total Samples 6700
GIL: 100.00%, Active: 100.00%, Threads: 1

  %Own    %Total   OwnTime   TotalTime   Function (filename)
  0.00%    0.00%     5.86s     34.10s    execute_fields (graphql/execution/execute.py)
  0.00%    0.00%     5.80s     34.08s    execute_field (graphql/execution/execute.py)
  0.00%    0.00%     3.61s     34.58s    complete_value (graphql/execution/execute.py)
  0.00%    0.00%     2.80s      2.80s    is_awaitable (graphql/pyutils/is_awaitable.py)
  5.00%  100.00%     2.66s     66.96s    _run_once (asyncio/base_events.py)
  0.00%    0.00%     2.65s     13.25s    gather (asyncio/tasks.py)
  4.00%   50.00%     2.38s     63.09s    _run (asyncio/events.py)
  0.00%    0.00%     2.32s      3.55s    build_resolve_info (graphql/execution/execute.py)
  0.00%    0.00%     2.17s      2.32s    __init__ (asyncio/events.py)
  0.00%    0.00%     2.16s      2.26s    <lambda> (<string>)
  0.00%    0.00%     2.15s     10.13s    get_results (graphql/execution/execute.py)
  0.00%    0.00%     2.13s      2.28s    get_argument_values (graphql/execution/values.py)
  0.00%    0.00%     2.08s      4.40s    _call_soon (asyncio/base_events.py)
  0.00%    0.00%     1.83s     33.48s    complete_list_value (graphql/execution/execute.py)
  0.00%    0.00%     1.55s      1.55s    getpid (libc-2.31.so)
  0.00%    0.00%     1.28s      5.85s    call_soon (asyncio/base_events.py)
  0.00%   20.00%     1.27s     57.62s    task_step_impl (_asynciomodule.c)
  0.00%    0.00%     1.27s      1.30s    get_field_def (graphql/execution/execute.py)
```
As @Cito already mentioned, `asyncio.create_task` creates a lot of task objects, which are stored in a WeakSet and later have to be collected by the Python garbage collector.
Perhaps this information helps to find some solution to this problem.
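A small sketch (mine, not from the thread above) that makes the effect visible by counting collector runs while gathering many trivial coroutines:

```python
import asyncio
import gc

async def noop():
    return 1

async def main():
    gc.collect()
    before = [gen["collections"] for gen in gc.get_stats()]
    # gather() wraps every coroutine in a Task; the tasks are tracked in a
    # WeakSet and eventually swept by the garbage collector.
    await asyncio.gather(*(noop() for _ in range(100_000)))
    after = [gen["collections"] for gen in gc.get_stats()]
    print("collections per generation:", [a - b for a, b in zip(after, before)])

asyncio.run(main())
```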
Thanks for the valuable input @QSHolzner.
During development of our product (Starlette + Ariadne) I noticed significant performance degradation when GraphQL responses got significantly long (1000+ entities in a list). I started profiling and drilling into the issue and pinpointed it to `async` resolvers. Whenever a resolver is async and is called a lot (100,000 times) you can see significant 4x-7x slowdowns, even if there is nothing async about it. The question I ended up with: is this a limitation of Python asyncio, or of how the results are gathered for async fields in GraphQL `execute`?
Any insight/help is greatly appreciated, as we really need more performance and changing to sync is not really an option (nor is rewriting it in another language).
Versions: