kevinvalk opened this issue 1 year ago
Thanks, will definitely look into that. For clarification: when you say degradation, in comparison to what? Smaller results, sync resolvers, or earlier versions of GraphQL-core or Python?
Compared with a sync resolver.
The field `fullName` is connected to `def resolve_fullname`, and `asyncFullName` to `async def async_resolve_fullname`. There is a 4-7x performance difference (sync is faster) between those two resolvers, with the only difference being the `async` method.
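For anyone trying to reproduce this, here is a minimal sketch of the kind of resolver pair being compared; the field and resolver names match the comment above, but the schema itself is made up:

```python
import asyncio

from graphql import (
    GraphQLField, GraphQLObjectType, GraphQLSchema, GraphQLString, graphql,
)

def resolve_fullname(obj, info):
    return "John Doe"

async def async_resolve_fullname(obj, info):
    # Nothing is awaited here, but the coroutine still has to go through asyncio.
    return "John Doe"

QueryType = GraphQLObjectType(
    "Query",
    {
        "fullName": GraphQLField(GraphQLString, resolve=resolve_fullname),
        "asyncFullName": GraphQLField(GraphQLString, resolve=async_resolve_fullname),
    },
)
schema = GraphQLSchema(query=QueryType)

async def main():
    print(await graphql(schema, "{ fullName asyncFullName }"))

asyncio.run(main())
```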
Ok, thanks for clarification. I think #142 and #189 are related issues.
Just to clarify, on #189 we didn't use async resolvers.
Did some performance tests and it seems `asyncio.gather` is HORRIBLY slow for CPU-bound futures when you have many:
https://perfpy.com/240
So based on these results I created the following monkey patch that gives us in our current production code 2x speed improvement:
```python
from importlib import import_module

execute = import_module(".execution.execute", "graphql")

async def serial_gather(*futures):
    return [await future for future in futures]

execute.gather = serial_gather
```
NOTE: These tests are based on a CPU-bound "resolver", so any optimizations have to be checked against a more normal workload, maybe with a mix of CPU-bound and IO-bound resolvers.
That said, as @JCatrielLopez indicated, there are also slowdowns in sync code, so I think there is a lot of performance left on the table. @Cito, how could we tackle investigations into increasing the performance of the graphql-python library?
@kevinvalk that's a really good find and it would definitely help the cases with simple async resolvers. However, it's worth pointing out that as soon as your resolvers are actually doing anything that might take some time (e.g. HTTP requests or file access), using the serial gather method will slow things down because the functions aren't running in parallel. I've written a benchmark to show that (the async function sleeps for 50 milliseconds on every 1000th iteration): https://perfpy.com/242
There might be an argument for allowing developers to configure how async results are gathered, but why are you using async functions for purely synchronous work anyway @kevinvalk? Async functions will always have an overhead compared to sync functions.
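To make the trade-off concrete, here is a small self-contained sketch (along the lines of the perfpy benchmark, but not identical to it) where the serial gather loses badly because the coroutines actually wait on something:

```python
import asyncio
import time

async def fetch(i):
    await asyncio.sleep(0.05)  # stand-in for an HTTP request or file access
    return i

async def serial_gather(*futures):
    return [await future for future in futures]

async def main():
    coros = [fetch(i) for i in range(20)]
    start = time.perf_counter()
    await asyncio.gather(*coros)   # runs concurrently, roughly 0.05s total
    print("gather:", time.perf_counter() - start)

    coros = [fetch(i) for i in range(20)]
    start = time.perf_counter()
    await serial_gather(*coros)    # runs one after another, roughly 1s total
    print("serial:", time.perf_counter() - start)

asyncio.run(main())
```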
Thank you all for your input. @kevinvalk The way that `gather` works in Python is that it creates an async task for each of the passed coroutines. If the overhead of running `asyncio.create_task` is high compared to the runtime of the coroutine itself, in this case creating a single random value with `random.randint`, then of course you will get very poor performance. Normally you trade the overhead of creating a task against the advantage of running code concurrently, but in this case, since the code is pure CPU code that is blocking the event loop, you do not even gain anything from it, as @jkimbo pointed out. In other words, the benchmark just shows how much slower `asyncio.create_task` is compared to `random.randint`.
However, I think we all agree that there is room for improvement in supporting different strategies, maybe in the form of optimization hints for the execution algorithm. At least, it should be possible to specify whether a serial or parallel execution strategy is used. Currently, only mutations are resolved serially, and you have no way to force queries to be executed serially (just like in GraphQL.js).
Note that GraphQL-core already detects when all (sub)field resolvers are synchronous and does not try to `gather` anything in this case. As @jkimbo wrote, using sync resolvers is therefore the simplest solution in such cases.
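A stripped-down illustration of that overhead, for anyone who wants to measure it locally without the perfpy setup (numbers will of course vary):

```python
import asyncio
import random
import time

async def make_value():
    return random.randint(0, 100)

async def main():
    n = 100_000

    start = time.perf_counter()
    await asyncio.gather(*(make_value() for _ in range(n)))  # one Task per coroutine
    print("gather:      ", time.perf_counter() - start)

    start = time.perf_counter()
    [await make_value() for _ in range(n)]                   # no extra tasks created
    print("serial await:", time.perf_counter() - start)

asyncio.run(main())
```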
@Cito One optimisation that we could make is to never use `asyncio.gather` when there is only one coroutine, since it wouldn't provide any benefit. It would help the example in this issue a lot, but I'm not sure how helpful it would be for most real use cases. What do you think?
@jkimbo Yes, that's a simple thing I can implement. The additional check is probably cheap compared to the overhead of running `gather`.
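For reference, a rough sketch of the kind of guard being discussed; this is not the actual graphql-core code, just the idea:

```python
import asyncio
from typing import Any, Awaitable, List

async def get_results(awaitables: List[Awaitable[Any]]) -> List[Any]:
    if len(awaitables) == 1:
        # A single awaitable gains nothing from being wrapped in a task.
        return [await awaitables[0]]
    return await asyncio.gather(*awaitables)
```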
Hey there, I've noticed something related to this when using GraphQL middlewares with async execution:
@Cito I monkey patched your change into our product and it (as you also expected) does not do anything :'(
The problem is not as synthetic as it may seem: we are using https://github.com/syrusakbary/aiodataloader to solve the GraphQL N+1 problem (and to get much more efficient database queries). This means that only occasionally do we have an actual database lookup (IO bound); all other times we have direct cache lookups (CPU bound) that just take the objects from the cache. However, all resolvers have to be async to get this to work.
So I was wondering if maybe my approach is wrong or if I am missing something else. Any ideas are more than welcome!
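For context, this is roughly the access pattern, assuming the standard aiodataloader API; the fake in-memory "database" stands in for the real IO-bound lookup:

```python
import asyncio
from aiodataloader import DataLoader

FAKE_DB = {1: {"id": 1, "name": "Alice"}, 2: {"id": 2, "name": "Bob"}}

class UserLoader(DataLoader):
    async def batch_load_fn(self, keys):
        # One batched (IO-bound) lookup for all requested keys...
        return [FAKE_DB.get(key) for key in keys]

async def main():
    loader = UserLoader()
    # ...but every resolver still has to await load(), even when the value
    # comes straight from the loader's cache (a CPU-bound dict lookup).
    first = await loader.load(1)
    again = await loader.load(1)  # cached, no second batch_load_fn call
    print(first, again)

asyncio.run(main())
```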
EDIT: After posting this I used py-spy (a great profiling tool I recently discovered) to investigate and quite quickly found another usage of `gather` in https://github.com/syrusakbary/aiodataloader. By replacing that with a serial await as well, my requests are now around 500ms (see https://github.com/syrusakbary/aiodataloader/issues/40)!
Note that I still need the serial await in GraphQL-core as well; if I undo the monkey patch I go from 500ms to 2s per request. The combined monkey patch now looks like this:
```python
from importlib import import_module
from typing import Any, Awaitable

execute = import_module(".execution.execute", "graphql")
aiodataloader = import_module("aiodataloader")

async def serial_gather(*futures: Awaitable[Any]):
    return [await future for future in futures]

aiodataloader.gather = execute.gather = serial_gather  # type: ignore
```
@kevinvalk thanks for the feedback. In fact, if you're using a lot of caching and the work starts to become CPU bound, then the price for async can be heavy and diminish the advantage of caching. See also "Wall vs CPU time, or the cost of asyncio Tasks" and "CPU Bound code runs 3x-5x slower with asyncio".
I currently do not have a good idea to solve or mitigate this problem and, honestly speaking, I don't want to spend too much time on optimizations. The main goal of GraphQL-core is to be a faithful Python port of the reference implementation GraphQL.js and to stay up to date with its latest version in terms of functionality, not to be particularly fast. If performance is needed, then there is only so much you can do with pure Python.
Btw, there is also the open issue #166 - I wonder if using anyio with trio would help, but I fear it may become even slower through the use of anyio as another layer.
We had exactly the same problem and found out that most of the time is spent in garbage collection (`gc_collect_main`).
If I run the above program from @kevinvalk with py-spy and the native option (`py-spy top --native -- python demo.py`) I get the following output.
```
Total Samples 11100
GIL: 100.00%, Active: 100.00%, Threads: 1

  %Own    %Total   OwnTime   TotalTime   Function (filename)
  0.00%    0.00%    44.14s     45.11s    gc_collect_main (libpython3.10.so.1.0)
  0.00%    0.00%     6.00s     53.87s    execute_field (graphql/execution/execute.py)
  0.00%    0.00%     5.42s     53.87s    execute_fields (graphql/execution/execute.py)
  0.00%    0.00%     3.50s     53.95s    complete_value (graphql/execution/execute.py)
  0.00%    0.00%     2.88s      2.88s    is_awaitable (graphql/pyutils/is_awaitable.py)
 11.00%  100.00%     2.57s     111.0s    _run_once (asyncio/base_events.py)
  0.00%    0.00%     2.47s      2.72s    get_argument_values (graphql/execution/values.py)
  0.00%    0.00%     2.44s      2.61s    __init__ (asyncio/events.py)
  7.00%   40.00%     2.34s     107.0s    _run (asyncio/events.py)
  0.00%    0.00%     2.30s      2.44s    <lambda> (<string>)
  0.00%    0.00%     2.29s     19.51s    gather (asyncio/tasks.py)
  0.00%    0.00%     2.25s      3.45s    build_resolve_info (graphql/execution/execute.py)
  0.00%    0.00%     1.99s     10.52s    _call_soon (asyncio/base_events.py)
  0.00%    0.00%     1.98s     10.87s    get_results (graphql/execution/execute.py)
  0.00%    0.00%     1.89s      1.89s    getpid (libc-2.31.so)
  0.00%    0.00%     1.79s     52.85s    complete_list_value (graphql/execution/execute.py)
  0.00%    0.00%     1.26s     11.94s    call_soon (asyncio/base_events.py)
  0.00%    0.00%     1.16s      1.18s    get_field_def (graphql/execution/execute.py)
```
If I disable garbage collection, the program runs in half the time.
```python
if __name__ == "__main__":
    import gc

    gc.disable()
    run(main, is_profile=False)
```
```
Total Samples 6700
GIL: 100.00%, Active: 100.00%, Threads: 1

  %Own    %Total   OwnTime   TotalTime   Function (filename)
  0.00%    0.00%     5.86s     34.10s    execute_fields (graphql/execution/execute.py)
  0.00%    0.00%     5.80s     34.08s    execute_field (graphql/execution/execute.py)
  0.00%    0.00%     3.61s     34.58s    complete_value (graphql/execution/execute.py)
  0.00%    0.00%     2.80s      2.80s    is_awaitable (graphql/pyutils/is_awaitable.py)
  5.00%  100.00%     2.66s     66.96s    _run_once (asyncio/base_events.py)
  0.00%    0.00%     2.65s     13.25s    gather (asyncio/tasks.py)
  4.00%   50.00%     2.38s     63.09s    _run (asyncio/events.py)
  0.00%    0.00%     2.32s      3.55s    build_resolve_info (graphql/execution/execute.py)
  0.00%    0.00%     2.17s      2.32s    __init__ (asyncio/events.py)
  0.00%    0.00%     2.16s      2.26s    <lambda> (<string>)
  0.00%    0.00%     2.15s     10.13s    get_results (graphql/execution/execute.py)
  0.00%    0.00%     2.13s      2.28s    get_argument_values (graphql/execution/values.py)
  0.00%    0.00%     2.08s      4.40s    _call_soon (asyncio/base_events.py)
  0.00%    0.00%     1.83s     33.48s    complete_list_value (graphql/execution/execute.py)
  0.00%    0.00%     1.55s      1.55s    getpid (libc-2.31.so)
  0.00%    0.00%     1.28s      5.85s    call_soon (asyncio/base_events.py)
  0.00%   20.00%     1.27s     57.62s    task_step_impl (_asynciomodule.c)
  0.00%    0.00%     1.27s      1.30s    get_field_def (graphql/execution/execute.py)
```
As @Cito already mentioned, `asyncio.create_task` creates a lot of task objects, which are stored in a WeakSet and later have to be collected by the Python garbage collector.
Perhaps this information helps to find some solution to this problem.
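A small sketch (mine, not from the thread above) that makes the effect visible by counting collector runs while gathering many trivial coroutines:

```python
import asyncio
import gc

async def noop():
    return 1

async def main():
    gc.collect()
    before = [gen["collections"] for gen in gc.get_stats()]
    # gather() wraps every coroutine in a Task; the tasks are tracked in a
    # WeakSet and eventually swept by the garbage collector.
    await asyncio.gather(*(noop() for _ in range(100_000)))
    after = [gen["collections"] for gen in gc.get_stats()]
    print("collections per generation:", [a - b for a, b in zip(after, before)])

asyncio.run(main())
```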
Thanks for the valuable input @QSHolzner.
During development of our product (Starlette + Ariadne) I noticed significant performance degradation when GraphQL responses got significantly long (1000+ entities in a list). I started profiling and drilling into the issue and pinpointed it to `async` resolvers. Whenever a resolver is async and is called a lot (100,000 times) you can see significant 4x-7x slowdowns, even if there is nothing async about it. The question I ended up with: is this a limitation of Python asyncio, or of how the results are gathered for async fields in GraphQL `execute`?
Any insight/help is greatly appreciated, as we really need more performance and changing to sync is not really an option (nor is rewriting it in another language).
Versions: