graphql-python / graphql-core

A Python 3.6+ port of the GraphQL.js reference implementation of GraphQL.
MIT License
512 stars 136 forks source link

Performance of isawaitable #54

Closed ktosiek closed 4 years ago

ktosiek commented 5 years ago

I've run a very simplistic benchmark, just returning a long list of single-field objects. It seems graphql-core-next is 2.5x slower than graphql-core: https://gist.github.com/ktosiek/849e8c7de8852c2df1df5af8ac193287

Looking at flamegraphs, I see isawaitable is used a lot, and it's a pretty slow function. Would it be possible to pass raw results around more? It seems resolve_field_value_or_error and complete_value_catching_error are the main offenders here.

Cito commented 5 years ago

So far the focus has been on compatibility, correctness, and being up to date with GraphQL.js.

The next step will be to improve performance. There is probably a lot of room for improvement. I'd like to keep the code simple and very close to the GraphQL.js implementation though, in order to make it easier to stay in sync with changes over there, and not move us into a very performant, but inscrutable and unmaintainable dead end.

Maybe we can kind of inline some of the isawaitables() or split some of the code paths for sync/async to avoid such checks.

But before we start the optimization, we need good benchmarks tools. Your script would be a starting point. My idea is to port the existing benchmarks from GraphQL.js and maybe add some more Python specific ones if needed (see issue #3). I think it makes sense t to work on that first, and at least have some basic performance checks before changing the code.

It's on my list, just need some free time to work on this. Feel free to send contributions as PRs as well.

Cito commented 5 years ago

Update: We now have nice benchmarks and can continue working on performance.

qeternity commented 5 years ago

I'd like to keep the code simple and very close to the GraphQL.js implementation though, in order to make it easier to stay in sync with changes over there, and not move us into a very performant, but inscrutable and unmaintainable dead end.

Nice job so far on this project, thanks for all your efforts. I worry that keeping the implementation similar between two very different languages and interpreters is going to keep this project from being scalable. We're using this for a new project, and are considering returning to DRF because for relatively small calls we're struggling to get latencies below 300ms (profiling shows nearly all of the time is spent inside graphql-core). PyPy reduces this by about half, but that means backporting a few repos to 3.6 which I did to test out, but am not keen to maintain especially with 3.8 just around the corner.

Cito commented 5 years ago

If it really makes and big difference necessary we certainly can implement some parts that are crucial regarding performance differently. But actually JS and Python are not that different, and as I wrote above, I believe there is still a lot of room for improvement while keeping the implementation similar. Anybody who wants to work on this: Feel free to send PRs and suggestions.

Cito commented 5 years ago

While I said JS and Python are not that different, there is one difference that plays a role here, namely that in JS you can await any expression, but in Python you can await only awaitables. That's why we need so many isawaitable calls and have nested async functions which is not very efficient. Maybe we can find a better solution.

qeternity commented 5 years ago

Thanks @Cito indeed my point about Python vs. JS was that Python has been sync from day one with async bolted on whereas JS has been the other way around. As @ktosiek has done a very good job demonstrating, the isawaitable calls are very expensive and are only necessary due to the differences you mention.

Hellzed commented 4 years ago

@ktosiek Which versions of Python have you tried this on? I found huge improvements from Python 3.6 to Python 3.8 (it seems like 3.8 isawaitable() has almost 300% the performance of 3.6).

ktosiek commented 4 years ago

Thanks for the heads up, I'll try on 3.8 too. I've run my experiments on 3.6 and 3.7.

Are you sure your 3.6 was compiled with PGO? It made a huge difference for me when I've installed 3.7 from pyenv (which disabled PGO to get better compilation times).

śr., 20 lis 2019, 23:03 użytkownik Loris Zinsou notifications@github.com napisał:

@ktosiek https://github.com/ktosiek Which versions of Python have you tried this on? I found huge improvements from Python 3.6 to Python 3.8 (it seems like 3.8 isawaitable() has almost 300% the performance of 3.6).

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/graphql-python/graphql-core-next/issues/54?email_source=notifications&email_token=AACD37NQ3ESXMGIUB7SSCLDQUWXZVA5CNFSM4ITOIKP2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEEVOFSY#issuecomment-556458699, or unsubscribe https://github.com/notifications/unsubscribe-auth/AACD37MRKKD3SBJ2XREOPSDQUWXZVANCNFSM4ITOIKPQ .

ktosiek commented 4 years ago

@Hellzed wow, I've confirmed your tests. There's a big difference between 3.6 and 3.7, but 3.8 is even faster!

I've looked at the test_execute_introspection_query benchmark, and it went from ~756ms on 3.6 to ~265ms on 3.8. This is so huge isawaitable() might not be the bottleneck anymore, I'll need new flamegraphs now :-)

Benchmark results:

version time
Python 3.8.0 (default, Nov 21 2019, 20:31:38) 370 nsec per loop
Python 3.7.5rc1 (default, Oct 8 2019, 16:47:45) 444 nsec per loop
Python 3.6.9 (default, Nov 21 2019, 21:00:04) 1.07 usec per loop

The benchmark:

$ python3 -VV
Python 3.6.9 (default, Nov 21 2019, 21:00:04) 
[GCC 9.2.1 20191008]
$ python3 -mtimeit -s $'from inspect import isawaitable \nclass A: pass \na = A()' 'isawaitable(a)'
1000000 loops, best of 3: 1.07 usec per loop
Hellzed commented 4 years ago

@ktosiek I used stock Ubuntu 19.10 python packages, so I'm not sure which compile flags were used.

If you intend to make a GraphQL service built on graphql-core-3 faster, I would recommend to target the parsing and validation steps, and make sure your clients always use variables for any changing value in the query.

This means instead of using graphql.graphql, you should use graphql.parse, graphql.validate then only graphql.execute separately, and introduce a caching step (with Redis or memcache for example) in between. This would also allow you to leverage libgraphqlparser if you're comfortable with CFFI or cTypes.

ktosiek commented 4 years ago

@Hellzed this issue started as a result of digging into cases like https://github.com/graphql-python/graphene/issues/268 - with many objects returned by one API call. Caching the parsed query would help, but the bulk of the time is spent in execution.

Hellzed commented 4 years ago

@ktosiek I actually wondered about the use of isawaitable() when I saw it comes from the inspect module, and remembered some old advice: "don't use inspect, it's slow". Never needed to verify that before.

How does Python 3.8 fare when running both Graphene v2 and Graphene v3 benchmarks? Does it just make both faster, or does it bring v3 closer to v2?

ewhauser commented 4 years ago

Caching the parsing and validation definitely helps, but the bulk of the time (especially for larger objects) is spent in execution. The best approach would be to introduce either an AOT or JIT compiler like https://github.com/zalando-incubator/graphql-jit that generated runtime code for optimal execution.

qeternity commented 4 years ago

I've just run this simple benchmark against latest v2 and v3 branches and we've still had a performance regression of nearly 30% (18.5 sec vs 24.3 sec)

Cito commented 4 years ago

@qeternity Can confirm it's currently still 30% even with Python 3.8.

I can also confirm that the main problem are still the isawaitable calls, particularly in the complete_value_catching_error method. And in the isawaitable calls, the isinstance check for collections.abc.Awaitable is the biggest time waster.

When I replaced that isinstance check in the inspect module with a simple hasattr(object, '__await__'), the benchmark ran faster than with the v2 branch.

Maybe we can live with such a "simplified" isawaitable method? It's currently unclear to me, why and when it does not suffice to simply check for the __await__ attribute so that an instance check against Abc.Awaitable would be necessary. Any Python gurus here who can shed light on this, or who has some time to discuss this on the Python mailing list and report back?

Note: We must keep in mind that the above benchmark is only testing the path where all resolvers are synchronous. If we start to optimize, we must also measure with a benchmark with asynchronous code in order to not deteriorate the performance for that case.

Cito commented 4 years ago

Another idea: Maybe we should add an isawaitable parameter to the execute function. By default we could use the simplified, faster isawaitable check mentioned above, and maybe it could even be set to a function that always returns False when execute is called by the graphql_sync function to make that path even faster.

qeternity commented 4 years ago

There's also this which I opened some time ago only to fall on deaf ears: https://github.com/syrusakbary/promise/issues/82

Cito commented 4 years ago

@qeternity This issue tracker only covers GraphQL-core v3. And please understand that these are all projects that various people work on for free in their spare time. You can't expect people to work on issues - they may have currently other priorities in life. The only way to move forward is to contribute by sending PRs, become a maintainer of stalled projects and motivating other people.

qeternity commented 4 years ago

@Cito thanks, I absolutely don't expect anything. I have spent an incredible amount of time profiling and benchmarking this. With respect to scope of issue tracker, the OP benchmark is graphene, not graphql-core. I wanted to make sure we were all aware of performance regression in libraries that graphene depends on so as not to chase ghosts in this issue tracker.

Cito commented 4 years ago

@qeternity Graphene is only a wrapper around the GraphQL-core types. The query execution and the heavy lifting happens in GraphQL-core, so that's in fact the right place to optimize execution performance. The promise library mentioned above is only used in GraphQL-core v2 (Graphene v2) which is considered deprecated, that's why you don't see much activity there any more.

qeternity commented 4 years ago

@Cito yep, I'm aware. But if this delta between graphene v2 and v3 is closing (which is ultimately what we're benchmarking here) due to regressions in graphene v2 (vis-a-vis Promises regressions) instead of improvements in v3, I presumed that was relevant to our discussion. But your point is well taken.

Cito commented 4 years ago

@qeternity Ah ok now I understand your point. You were actually saying you observed a v2 regression in the same benchmark we are discussing here, and since we are comparing with v2 as our base line, we should be aware of that. That's in fact relevant to our discussion.

It's interesting that you observed an update of the promises lib to affect this (synchronous) benchmark. I had a quick look - the main thing that has been changed in the promises lib update is something with caching, fixing an out-of-memory error. Maybe it's just slower because of the additional garbage collection. Benchmarking is not easy. This benchmark only measures the time in this one special case, not "the" performance in general.

Cito commented 4 years ago

I have now implemented the ideas mentioned above - this improves the performance considerably. The benchmark with v3.1.0b2 is now running faster than with v2. Please let me know how it works for you and whether we can close this issue.

qeternity commented 4 years ago

@Cito Just had a look over the commit...nice, looks great! Not had a chance to run it yet, but will sit down with it tonight. I already had put aside some weekend quarantine time to start implementing some more aggressive hotpaths.

ktosiek commented 4 years ago

I can confirm it runs faster than v2, tested on Python 3.7 and 3.8. I think this issue is solved. Thank you, this will help a lot with migrating to v3!

ewhauser commented 4 years ago

Just to add on here - when using async, the performance gains appear to be quite large due to the elimination of promises. This example highlights the performance gains pretty as it is returning a large of number of results (5x iterations).

Here is graphl-core/graphene 2:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      210    0.407    0.002   49.608    0.236 /usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/lib/python3.7/asyncio/base_events.py:1675(_run_once)
   180304    0.208    0.000   46.979    0.000 /usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/lib/python3.7/asyncio/events.py:86(_run)
   180304    0.329    0.000   46.771    0.000 {method 'run' of 'Context' objects}
   180060    0.097    0.000   25.413    0.000 /Users/ewhauser/.venvs/drone_web/lib/python3.7/site-packages/promise/promise.py:478(resolve)
315125/180065    0.564    0.000   25.316    0.000 /Users/ewhauser/.venvs/drone_web/lib/python3.7/site-packages/promise/promise.py:189(_resolve_callback)
    90035    0.084    0.000   25.052    0.000 /Users/ewhauser/.venvs/drone_web/lib/python3.7/site-packages/promise/promise.py:841(handle_future_result)
405130/180050    0.625    0.000   24.412    0.000 /Users/ewhauser/.venvs/drone_web/lib/python3.7/site-packages/promise/promise.py:229(_fulfill)
180045/90030    0.107    0.000   24.067    0.000 /Users/ewhauser/.venvs/drone_web/lib/python3.7/site-packages/promise/async_.py:54(settle_promises)
180045/90030    0.219    0.000   24.012    0.000 /Users/ewhauser/.venvs/drone_web/lib/python3.7/site-packages/promise/async_.py:36(_async_settle_promise)
180050/90035    0.115    0.000   23.858    0.000 /Users/ewhauser/.venvs/drone_web/lib/python3.7/site-packages/promise/async_.py:127(queue_tick)
    90035    0.045    0.000   23.766    0.000 /Users/ewhauser/.venvs/drone_web/lib/python3.7/site-packages/promise/schedulers/immediate.py:9(call)
    90035    0.175    0.000   23.722    0.000 /Users/ewhauser/.venvs/drone_web/lib/python3.7/site-packages/promise/async_.py:119(drain_queues)
   180070    0.661    0.000   23.516    0.000 /Users/ewhauser/.venvs/drone_web/lib/python3.7/site-packages/promise/async_.py:71(drain_queue)
315105/180045    0.403    0.000   22.259    0.000 /Users/ewhauser/.venvs/drone_web/lib/python3.7/site-packages/promise/promise.py:457(_settle_promises)
315105/180045    0.177    0.000   21.887    0.000 /Users/ewhauser/.venvs/drone_web/lib/python3.7/site-packages/promise/promise.py:334(_settle_promise0)
315110/180050    0.473    0.000   21.781    0.000 /Users/ewhauser/.venvs/drone_web/lib/python3.7/site-packages/promise/promise.py:306(_settle_promise)
135065/105040    0.162    0.000   21.180    0.000 /Users/ewhauser/.venvs/drone_web/lib/python3.7/site-packages/promise/promise.py:345(_settle_promise_from_handler)
   135065    0.111    0.000   19.625    0.000 /Users/ewhauser/.venvs/drone_web/lib/python3.7/site-packages/promise/promise.py:84(try_catch)
    90025    0.085    0.000   19.430    0.000 /Users/ewhauser/.venvs/drone_web/lib/python3.7/site-packages/graphql/execution/executor.py:531(<lambda>)
225065/90030    0.863    0.000   19.345    0.000 /Users/ewhauser/.venvs/drone_web/lib/python3.7/site-packages/graphql/execution/executor.py:500(complete_value)
    45010    0.061    0.000   17.934    0.000 /Users/ewhauser/.venvs/drone_web/lib/python3.7/site-packages/graphql/execution/executor.py:700(complete_object_value)
    45015    0.265    0.000   17.727    0.000 /Users/ewhauser/.venvs/drone_web/lib/python3.7/site-packages/graphql/execution/executor.py:230(execute_fields)
       10    0.000    0.000   17.025    1.703 
... user code ...

vs 3:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      175    0.049    0.000   15.466    0.088 /usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/lib/python3.7/asyncio/base_events.py:1675(_run_once)
      270    0.000    0.000   14.621    0.054 /usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/lib/python3.7/asyncio/events.py:86(_run)
      270    0.001    0.000   14.621    0.054 {method 'run' of 'Context' objects}
... user code ...
 45015/10    0.141    0.000    2.268    0.227 /Users/ewhauser/.venvs/drone_web/lib/python3.7/site-packages/graphql/execution/execute.py:361(execute_fields)
 90025/15    0.223    0.000    2.268    0.151 /Users/ewhauser/.venvs/drone_web/lib/python3.7/site-packages/graphql/execution/execute.py:522(resolve_field)
       10    0.000    0.000    2.268    0.227 /Users/ewhauser/.venvs/drone_web/lib/python3.7/site-packages/graphql/execution/execute.py:630(await_completed)
       10    0.000    0.000    2.268    0.227 /Users/ewhauser/.venvs/drone_web/lib/python3.7/site-packages/graphql/execution/execute.py:615(await_result)
 135040/5    0.260    0.000    2.267    0.453 /Users/ewhauser/.venvs/drone_web/lib/python3.7/site-packages/graphql/execution/execute.py:660(complete_value)
  45010/5    0.024    0.000    2.267    0.453 /Users/ewhauser/.venvs/drone_web/lib/python3.7/site-packages/graphql/execution/execute.py:899(complete_object_value)
  45010/5    0.045    0.000    2.267    0.453 /Users/ewhauser/.venvs/drone_web/lib/python3.7/site-packages/graphql/execution/execute.py:934(collect_and_execute_subfields)
105025/15    0.094    0.000    2.267    0.151 /Users/ewhauser/.venvs/drone_web/lib/python3.7/site-packages/graphql/execution/execute.py:599(complete_value_catching_error)
        5    0.024    0.005    2.266    0.453 /Users/ewhauser/.venvs/drone_web/lib/python3.7/site-packages/graphql/execution/execute.py:744(complete_list_value)

It does appear that document validation is 2x slower in the newer version of core, but I'll open a separate issue. Nice work!

Cito commented 4 years ago

@ewhauser Thanks for the feedback. Yes, it would be good to open a new issue for performance improvements in validation. Maybe you can run a profiler and identify some bottlenecks already.