Closed ktosiek closed 4 years ago
So far the focus has been on compatibility, correctness, and being up to date with GraphQL.js.
The next step will be to improve performance. There is probably a lot of room for improvement. I'd like to keep the code simple and very close to the GraphQL.js implementation though, in order to make it easier to stay in sync with changes over there, and not move us into a very performant, but inscrutable and unmaintainable dead end.
Maybe we can kind of inline some of the isawaitables() or split some of the code paths for sync/async to avoid such checks.
But before we start the optimization, we need good benchmarks tools. Your script would be a starting point. My idea is to port the existing benchmarks from GraphQL.js and maybe add some more Python specific ones if needed (see issue #3). I think it makes sense t to work on that first, and at least have some basic performance checks before changing the code.
It's on my list, just need some free time to work on this. Feel free to send contributions as PRs as well.
Update: We now have nice benchmarks and can continue working on performance.
I'd like to keep the code simple and very close to the GraphQL.js implementation though, in order to make it easier to stay in sync with changes over there, and not move us into a very performant, but inscrutable and unmaintainable dead end.
Nice job so far on this project, thanks for all your efforts. I worry that keeping the implementation similar between two very different languages and interpreters is going to keep this project from being scalable. We're using this for a new project, and are considering returning to DRF because for relatively small calls we're struggling to get latencies below 300ms (profiling shows nearly all of the time is spent inside graphql-core). PyPy reduces this by about half, but that means backporting a few repos to 3.6 which I did to test out, but am not keen to maintain especially with 3.8 just around the corner.
If it really makes and big difference necessary we certainly can implement some parts that are crucial regarding performance differently. But actually JS and Python are not that different, and as I wrote above, I believe there is still a lot of room for improvement while keeping the implementation similar. Anybody who wants to work on this: Feel free to send PRs and suggestions.
While I said JS and Python are not that different, there is one difference that plays a role here, namely that in JS you can await any expression, but in Python you can await only awaitables. That's why we need so many isawaitable calls and have nested async functions which is not very efficient. Maybe we can find a better solution.
Thanks @Cito indeed my point about Python vs. JS was that Python has been sync from day one with async bolted on whereas JS has been the other way around. As @ktosiek has done a very good job demonstrating, the isawaitable calls are very expensive and are only necessary due to the differences you mention.
@ktosiek Which versions of Python have you tried this on? I found huge improvements from Python 3.6 to Python 3.8 (it seems like 3.8 isawaitable()
has almost 300% the performance of 3.6).
Thanks for the heads up, I'll try on 3.8 too. I've run my experiments on 3.6 and 3.7.
Are you sure your 3.6 was compiled with PGO? It made a huge difference for me when I've installed 3.7 from pyenv (which disabled PGO to get better compilation times).
śr., 20 lis 2019, 23:03 użytkownik Loris Zinsou notifications@github.com napisał:
@ktosiek https://github.com/ktosiek Which versions of Python have you tried this on? I found huge improvements from Python 3.6 to Python 3.8 (it seems like 3.8 isawaitable() has almost 300% the performance of 3.6).
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/graphql-python/graphql-core-next/issues/54?email_source=notifications&email_token=AACD37NQ3ESXMGIUB7SSCLDQUWXZVA5CNFSM4ITOIKP2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEEVOFSY#issuecomment-556458699, or unsubscribe https://github.com/notifications/unsubscribe-auth/AACD37MRKKD3SBJ2XREOPSDQUWXZVANCNFSM4ITOIKPQ .
@Hellzed wow, I've confirmed your tests. There's a big difference between 3.6 and 3.7, but 3.8 is even faster!
I've looked at the test_execute_introspection_query
benchmark, and it went from ~756ms on 3.6 to ~265ms on 3.8. This is so huge isawaitable()
might not be the bottleneck anymore, I'll need new flamegraphs now :-)
version | time |
---|---|
Python 3.8.0 (default, Nov 21 2019, 20:31:38) | 370 nsec per loop |
Python 3.7.5rc1 (default, Oct 8 2019, 16:47:45) | 444 nsec per loop |
Python 3.6.9 (default, Nov 21 2019, 21:00:04) | 1.07 usec per loop |
The benchmark:
$ python3 -VV
Python 3.6.9 (default, Nov 21 2019, 21:00:04)
[GCC 9.2.1 20191008]
$ python3 -mtimeit -s $'from inspect import isawaitable \nclass A: pass \na = A()' 'isawaitable(a)'
1000000 loops, best of 3: 1.07 usec per loop
@ktosiek I used stock Ubuntu 19.10 python packages, so I'm not sure which compile flags were used.
If you intend to make a GraphQL service built on graphql-core-3 faster, I would recommend to target the parsing and validation steps, and make sure your clients always use variables for any changing value in the query.
This means instead of using graphql.graphql
, you should use graphql.parse
, graphql.validate
then only graphql.execute
separately, and introduce a caching step (with Redis or memcache for example) in between. This would also allow you to leverage libgraphqlparser if you're comfortable with CFFI or cTypes.
@Hellzed this issue started as a result of digging into cases like https://github.com/graphql-python/graphene/issues/268 - with many objects returned by one API call. Caching the parsed query would help, but the bulk of the time is spent in execution.
@ktosiek I actually wondered about the use of isawaitable()
when I saw it comes from the inspect
module, and remembered some old advice: "don't use inspect, it's slow". Never needed to verify that before.
How does Python 3.8 fare when running both Graphene v2 and Graphene v3 benchmarks? Does it just make both faster, or does it bring v3 closer to v2?
Caching the parsing and validation definitely helps, but the bulk of the time (especially for larger objects) is spent in execution. The best approach would be to introduce either an AOT or JIT compiler like https://github.com/zalando-incubator/graphql-jit that generated runtime code for optimal execution.
I've just run this simple benchmark against latest v2 and v3 branches and we've still had a performance regression of nearly 30% (18.5 sec vs 24.3 sec)
@qeternity Can confirm it's currently still 30% even with Python 3.8.
I can also confirm that the main problem are still the isawaitable
calls, particularly in the complete_value_catching_error
method. And in the isawaitable
calls, the isinstance
check for collections.abc.Awaitable
is the biggest time waster.
When I replaced that isinstance
check in the inspect
module with a simple hasattr(object, '__await__')
, the benchmark ran faster than with the v2 branch.
Maybe we can live with such a "simplified" isawaitable method? It's currently unclear to me, why and when it does not suffice to simply check for the __await__
attribute so that an instance check against Abc.Awaitable would be necessary. Any Python gurus here who can shed light on this, or who has some time to discuss this on the Python mailing list and report back?
Note: We must keep in mind that the above benchmark is only testing the path where all resolvers are synchronous. If we start to optimize, we must also measure with a benchmark with asynchronous code in order to not deteriorate the performance for that case.
Another idea: Maybe we should add an isawaitable
parameter to the execute
function. By default we could use the simplified, faster isawaitable check mentioned above, and maybe it could even be set to a function that always returns False when execute
is called by the graphql_sync
function to make that path even faster.
There's also this which I opened some time ago only to fall on deaf ears: https://github.com/syrusakbary/promise/issues/82
@qeternity This issue tracker only covers GraphQL-core v3. And please understand that these are all projects that various people work on for free in their spare time. You can't expect people to work on issues - they may have currently other priorities in life. The only way to move forward is to contribute by sending PRs, become a maintainer of stalled projects and motivating other people.
@Cito thanks, I absolutely don't expect anything. I have spent an incredible amount of time profiling and benchmarking this. With respect to scope of issue tracker, the OP benchmark is graphene, not graphql-core. I wanted to make sure we were all aware of performance regression in libraries that graphene depends on so as not to chase ghosts in this issue tracker.
@qeternity Graphene is only a wrapper around the GraphQL-core types. The query execution and the heavy lifting happens in GraphQL-core, so that's in fact the right place to optimize execution performance. The promise library mentioned above is only used in GraphQL-core v2 (Graphene v2) which is considered deprecated, that's why you don't see much activity there any more.
@Cito yep, I'm aware. But if this delta between graphene v2 and v3 is closing (which is ultimately what we're benchmarking here) due to regressions in graphene v2 (vis-a-vis Promises regressions) instead of improvements in v3, I presumed that was relevant to our discussion. But your point is well taken.
@qeternity Ah ok now I understand your point. You were actually saying you observed a v2 regression in the same benchmark we are discussing here, and since we are comparing with v2 as our base line, we should be aware of that. That's in fact relevant to our discussion.
It's interesting that you observed an update of the promises lib to affect this (synchronous) benchmark. I had a quick look - the main thing that has been changed in the promises lib update is something with caching, fixing an out-of-memory error. Maybe it's just slower because of the additional garbage collection. Benchmarking is not easy. This benchmark only measures the time in this one special case, not "the" performance in general.
I have now implemented the ideas mentioned above - this improves the performance considerably. The benchmark with v3.1.0b2 is now running faster than with v2. Please let me know how it works for you and whether we can close this issue.
@Cito Just had a look over the commit...nice, looks great! Not had a chance to run it yet, but will sit down with it tonight. I already had put aside some weekend quarantine time to start implementing some more aggressive hotpaths.
I can confirm it runs faster than v2, tested on Python 3.7 and 3.8. I think this issue is solved. Thank you, this will help a lot with migrating to v3!
Just to add on here - when using async
, the performance gains appear to be quite large due to the elimination of promises. This example highlights the performance gains pretty as it is returning a large of number of results (5x iterations).
Here is graphl-core/graphene 2:
ncalls tottime percall cumtime percall filename:lineno(function)
210 0.407 0.002 49.608 0.236 /usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/lib/python3.7/asyncio/base_events.py:1675(_run_once)
180304 0.208 0.000 46.979 0.000 /usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/lib/python3.7/asyncio/events.py:86(_run)
180304 0.329 0.000 46.771 0.000 {method 'run' of 'Context' objects}
180060 0.097 0.000 25.413 0.000 /Users/ewhauser/.venvs/drone_web/lib/python3.7/site-packages/promise/promise.py:478(resolve)
315125/180065 0.564 0.000 25.316 0.000 /Users/ewhauser/.venvs/drone_web/lib/python3.7/site-packages/promise/promise.py:189(_resolve_callback)
90035 0.084 0.000 25.052 0.000 /Users/ewhauser/.venvs/drone_web/lib/python3.7/site-packages/promise/promise.py:841(handle_future_result)
405130/180050 0.625 0.000 24.412 0.000 /Users/ewhauser/.venvs/drone_web/lib/python3.7/site-packages/promise/promise.py:229(_fulfill)
180045/90030 0.107 0.000 24.067 0.000 /Users/ewhauser/.venvs/drone_web/lib/python3.7/site-packages/promise/async_.py:54(settle_promises)
180045/90030 0.219 0.000 24.012 0.000 /Users/ewhauser/.venvs/drone_web/lib/python3.7/site-packages/promise/async_.py:36(_async_settle_promise)
180050/90035 0.115 0.000 23.858 0.000 /Users/ewhauser/.venvs/drone_web/lib/python3.7/site-packages/promise/async_.py:127(queue_tick)
90035 0.045 0.000 23.766 0.000 /Users/ewhauser/.venvs/drone_web/lib/python3.7/site-packages/promise/schedulers/immediate.py:9(call)
90035 0.175 0.000 23.722 0.000 /Users/ewhauser/.venvs/drone_web/lib/python3.7/site-packages/promise/async_.py:119(drain_queues)
180070 0.661 0.000 23.516 0.000 /Users/ewhauser/.venvs/drone_web/lib/python3.7/site-packages/promise/async_.py:71(drain_queue)
315105/180045 0.403 0.000 22.259 0.000 /Users/ewhauser/.venvs/drone_web/lib/python3.7/site-packages/promise/promise.py:457(_settle_promises)
315105/180045 0.177 0.000 21.887 0.000 /Users/ewhauser/.venvs/drone_web/lib/python3.7/site-packages/promise/promise.py:334(_settle_promise0)
315110/180050 0.473 0.000 21.781 0.000 /Users/ewhauser/.venvs/drone_web/lib/python3.7/site-packages/promise/promise.py:306(_settle_promise)
135065/105040 0.162 0.000 21.180 0.000 /Users/ewhauser/.venvs/drone_web/lib/python3.7/site-packages/promise/promise.py:345(_settle_promise_from_handler)
135065 0.111 0.000 19.625 0.000 /Users/ewhauser/.venvs/drone_web/lib/python3.7/site-packages/promise/promise.py:84(try_catch)
90025 0.085 0.000 19.430 0.000 /Users/ewhauser/.venvs/drone_web/lib/python3.7/site-packages/graphql/execution/executor.py:531(<lambda>)
225065/90030 0.863 0.000 19.345 0.000 /Users/ewhauser/.venvs/drone_web/lib/python3.7/site-packages/graphql/execution/executor.py:500(complete_value)
45010 0.061 0.000 17.934 0.000 /Users/ewhauser/.venvs/drone_web/lib/python3.7/site-packages/graphql/execution/executor.py:700(complete_object_value)
45015 0.265 0.000 17.727 0.000 /Users/ewhauser/.venvs/drone_web/lib/python3.7/site-packages/graphql/execution/executor.py:230(execute_fields)
10 0.000 0.000 17.025 1.703
... user code ...
vs 3:
ncalls tottime percall cumtime percall filename:lineno(function)
175 0.049 0.000 15.466 0.088 /usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/lib/python3.7/asyncio/base_events.py:1675(_run_once)
270 0.000 0.000 14.621 0.054 /usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/lib/python3.7/asyncio/events.py:86(_run)
270 0.001 0.000 14.621 0.054 {method 'run' of 'Context' objects}
... user code ...
45015/10 0.141 0.000 2.268 0.227 /Users/ewhauser/.venvs/drone_web/lib/python3.7/site-packages/graphql/execution/execute.py:361(execute_fields)
90025/15 0.223 0.000 2.268 0.151 /Users/ewhauser/.venvs/drone_web/lib/python3.7/site-packages/graphql/execution/execute.py:522(resolve_field)
10 0.000 0.000 2.268 0.227 /Users/ewhauser/.venvs/drone_web/lib/python3.7/site-packages/graphql/execution/execute.py:630(await_completed)
10 0.000 0.000 2.268 0.227 /Users/ewhauser/.venvs/drone_web/lib/python3.7/site-packages/graphql/execution/execute.py:615(await_result)
135040/5 0.260 0.000 2.267 0.453 /Users/ewhauser/.venvs/drone_web/lib/python3.7/site-packages/graphql/execution/execute.py:660(complete_value)
45010/5 0.024 0.000 2.267 0.453 /Users/ewhauser/.venvs/drone_web/lib/python3.7/site-packages/graphql/execution/execute.py:899(complete_object_value)
45010/5 0.045 0.000 2.267 0.453 /Users/ewhauser/.venvs/drone_web/lib/python3.7/site-packages/graphql/execution/execute.py:934(collect_and_execute_subfields)
105025/15 0.094 0.000 2.267 0.151 /Users/ewhauser/.venvs/drone_web/lib/python3.7/site-packages/graphql/execution/execute.py:599(complete_value_catching_error)
5 0.024 0.005 2.266 0.453 /Users/ewhauser/.venvs/drone_web/lib/python3.7/site-packages/graphql/execution/execute.py:744(complete_list_value)
It does appear that document validation is 2x slower in the newer version of core, but I'll open a separate issue. Nice work!
@ewhauser Thanks for the feedback. Yes, it would be good to open a new issue for performance improvements in validation. Maybe you can run a profiler and identify some bottlenecks already.
I've run a very simplistic benchmark, just returning a long list of single-field objects. It seems graphql-core-next is 2.5x slower than graphql-core: https://gist.github.com/ktosiek/849e8c7de8852c2df1df5af8ac193287
Looking at flamegraphs, I see isawaitable is used a lot, and it's a pretty slow function. Would it be possible to pass raw results around more? It seems resolve_field_value_or_error and complete_value_catching_error are the main offenders here.