Besides the two network hops (websockets, zmq), we deserialise, reserialise, deserialise the message, and do a fair bit of manipulation in Python and Javascript. So I'm not sure whether there's one big cause of the slowness you see or dozens of small ones.
The code I'm sending to be executed is literally `x=10+25**2`. When I run it 1000 times locally, the measurement is still on the same order of magnitude as the error in the timing mechanism itself, i.e. roughly 10^-5 ms per run.
I think that's deceptive. Python doesn't do much optimisation, but one thing it does do is constant folding, so the bytecode for your code is equivalent to `x=635`.
```
# This roughly matches what you see
In [12]: %%timeit
    ...: x=10+25**2
12.8 ns ± 0.0428 ns per loop (mean ± std. dev. of 7 runs, 100000000 loops each)

# If we prevent it from doing constant folding, it's an order of magnitude slower
In [13]: a = 25

In [14]: %%timeit
    ...: x=10+a**2
220 ns ± 1.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

# Dynamic execution loses us another 2 orders of magnitude
In [16]: %%timeit
    ...: exec('x=10+a**2')
11.6 µs ± 44.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

# And IPython's machinery is another order of magnitude over that.
# This still doesn't include the serialisation and TCP steps.
In [18]: %%timeit
    ...: get_ipython().run_cell('x=10+a**2')
339 µs ± 2.37 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
I guess we could use an echo kernel to do performance measurement on jupyter_client itself: start the kernel with `python -m cProfile [-o output_file] echokernel.py` and dig into the output file to see what the bottleneck is.
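If it helps, here's a minimal sketch of digging through such a profile with the standard-library `pstats` module (the `echokernel.prof` filename is just a stand-in for whatever gets passed to `-o`):

```python
import pstats

# Load the profile written by `python -m cProfile -o echokernel.prof echokernel.py`
stats = pstats.Stats("echokernel.prof")

# Show the 20 most expensive call sites by cumulative time
stats.strip_dirs().sort_stats("cumulative").print_stats(20)

# Drill into anything serialisation-related, e.g. jupyter_client's Session methods
stats.print_callers("serialize")
```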
We could also look at https://github.com/QuantStack/xeus and build a Python kernel on top of it via the Python C API, which should be faster. Though we should first make sure that the bottleneck really is jupyter_client and not IPython itself.
One sentence of my original question was somewhat imprecise: it was taking 8 ms to get results back from the client, not from the kernel directly as I had said. The issue is now updated to reflect that.
Secondly, while I get that the performance is probably death by a thousand cuts, the question of whether the code itself takes ~0 ms vs ~0.4 ms is still dwarfed by the 6.5 ms that remains unaccounted for.
Presently, I'm in the process of building some more direct interfacing with the kernels to see what happens when I manage the zmq traffic myself. To this end, is there any good class that I can use that just manages the startup and shutdown of the kernels, but isn't so tightly integrated with the client functionality? Ideally I would like something that just started a kernel and returned to me the url and port to point the zmq sockets at.
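For what it's worth, jupyter_client's plain `KernelManager` may already be close to this: it only handles the kernel's lifecycle and connection info, and opens no channels unless you explicitly ask it for a client. A rough sketch, assuming the `python3` kernelspec is installed:

```python
from jupyter_client import KernelManager

# Only manages the kernel's lifecycle; no channels are opened here.
km = KernelManager(kernel_name="python3")  # transport="ipc" can be set instead of the default "tcp"
km.start_kernel()

# Everything needed to point raw zmq sockets at the kernel:
info = km.get_connection_info()
print(info["transport"], info["ip"])
print("shell:", info["shell_port"], "iopub:", info["iopub_port"])

# ... drive the zmq sockets yourself here ...

km.shutdown_kernel()
```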
Update: I have gotten it down to 5.2 ms per hit over ipc, using an optimized client (that's about halfway done). Some of this may boil down to just overhead in string processing etc. and/or the IPython kernel itself, but the jury is still out on that, and I'll be digging around to see if I can do any better.
Edit: more results. Running against an echo kernel (https://github.com/jupyter/echo_kernel) got the runtime down to 4.37 ms, so roughly 0.9 ms is happening inside the IPython kernel, which is acceptable I guess. That would mean 1 ms is unachievable here, but we could certainly shoot for 2 ms (which is still a respectable 500 Hz).
More results: I ran timers at various points in the client process to see what was going on. The most telling metric by far was to log the time I sent each message and then, when processing the corresponding `execute_reply`, compute the time-of-flight through the kernel.
Here are the results, averaged over 50 runs against the echo kernel mentioned above:
Total Runtime | Serialization | Deserialization | Time-of-flight |
---|---|---|---|
5.0646391 ms | 0.2116923 ms | 0.3838787 ms | 4.2513698 ms |
There were some other values I recorded in order to get a more complete picture, but they were implementation specific and thus wouldn't be of much use to you guys. I guess this is mostly an IPython problem / IPython kernel wrapper problem?
PS: you might be wondering why this runtime is longer than the previously quoted value; it's just that I started waiting on the 'idle' status from the kernel so that I could collect the stdout and stderr streams. The echo kernel didn't magically get slower.
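To put the serialization/deserialization columns above in some context, here is a rough standalone sketch of timing jupyter_client's `Session.serialize`/`Session.deserialize` on a representative `execute_request`; the message content and the signing key are made up for illustration:

```python
import time
from jupyter_client.session import Session

session = Session(key=b"not-a-real-key")  # hypothetical signing key
content = {
    "code": "x = random()\nprint(x)",
    "silent": False,
    "store_history": True,
    "user_expressions": {},
    "allow_stdin": False,
}
N = 1000

# Serialization cost of one execute_request
msg = session.msg("execute_request", content=content)
start = time.perf_counter()
for _ in range(N):
    wire = session.serialize(msg)
print("serialize:   %.3f ms/msg" % ((time.perf_counter() - start) / N * 1000))

# Deserialization cost; the Session rejects a repeated signature, so build N
# distinct messages up front and unpack each one exactly once.
wires = [session.serialize(session.msg("execute_request", content=content)) for _ in range(N)]
start = time.perf_counter()
for wire in wires:
    idents, parts = session.feed_identities(wire)
    session.deserialize(parts)
print("deserialize: %.3f ms/msg" % ((time.perf_counter() - start) / N * 1000))
```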
cc @SylvainCorlay, who was apparently able, using https://github.com/QuantStack/xeus, to reduce the overhead of the Python implementation of the protocol in some cases from 1 s to negligible.
@Carreau @takluyver
I also did some testing on the blocking client in the current repo. This led me to find a bug in my async test script that left quite a bit of performance on the table (it was issuing commands only after the previous one had finished, despite using my async client class). So now I have updated performance numbers across several kernels and the two clients. Times here are in milliseconds, measured by running a thousand test calls against an appropriately initialized kernel. The warmup run for Python 3 looked like
```python
from random import random
x = random()
print(x)
```
and the actual test run sent instructions of
```python
x = random()
print(x)
```
This was to avoid the constant-folding issue from before, since I don't know of any language that can usefully cache the results of an RNG.
Kernel | Async (ms) | Blocking (ms) | Percent Improvement |
---|---|---|---|
octave | 7.429 | 8.414 | 12% |
python3 | 4.121 | 5.072 | 19% |
echo | 3.140 | 4.310 | 27% |
julia-0.6 | 2.226 | 2.189 | -2% |
Please note that the Julia numbers are using a patch that I have in place, but the PR for which hasn't come through yet.
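For anyone wanting to reproduce ballpark numbers like these with the stock blocking client rather than my own clients, a sketch along these lines should work (the `python3` kernelspec and the timeout values are assumptions):

```python
import time
from jupyter_client import KernelManager

km = KernelManager(kernel_name="python3")
km.start_kernel()
kc = km.client()              # a BlockingKernelClient
kc.start_channels()
kc.wait_for_ready(timeout=30)

# Warmup, so later runs only pay for execution + transport
kc.execute("from random import random\nx = random()\nprint(x)")
kc.get_shell_msg(timeout=5)

N = 1000
start = time.perf_counter()
for _ in range(N):
    kc.execute("x = random()\nprint(x)")
    kc.get_shell_msg(timeout=5)  # wait for the execute_reply
print("%.3f ms per round trip" % ((time.perf_counter() - start) / N * 1000))

kc.stop_channels()
km.shutdown_kernel()
```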
So that brings me to my final question here. Would you guys be interested in an asyncio based client? I would have to do a little bit of asking, but I could probably make it happen. Of course, if it isn't really where this repo/project is going right now, that's fine too.
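To make the idea concrete, here is a minimal sketch of what using an asyncio-based client could look like; it assumes an `AsyncKernelManager`/`AsyncKernelClient` pair of the kind later jupyter_client releases eventually shipped, so treat the exact API as illustrative rather than something that exists in this repo today:

```python
import asyncio
from jupyter_client import AsyncKernelManager  # shipped in later jupyter_client releases

async def main():
    km = AsyncKernelManager(kernel_name="python3")
    await km.start_kernel()
    kc = km.client()                 # an AsyncKernelClient
    kc.start_channels()
    await kc.wait_for_ready(timeout=30)

    for _ in range(10):
        kc.execute("x = 1 + 1")      # returns a msg_id immediately
        reply = await kc.get_shell_msg(timeout=5)
        assert reply["content"]["status"] == "ok"

    kc.stop_channels()
    await km.shutdown_kernel()

asyncio.run(main())
```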
Update: Methodology comments
Thanks!
I'd be interested in an asyncio based client.
@JMurph2015 is websocket compression enabled in your configuration?
@SylvainCorlay Sorry for being out of the loop on this project for so long; my fall was really busy! I will try to whip up another PoC, since most of my materials stayed with my summer internship, but I think I can get the bugs ironed out w.r.t. an asyncio client.
And I'm not sure whether it was; I'll have to set up a new test environment and rerun the numbers.
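(For reference when rechecking: with the classic Notebook server, websocket compression is, as far as I know, driven by Tornado's permessage-deflate support and can be toggled from `jupyter_notebook_config.py` roughly as below; verify the setting name against your notebook version.)

```python
# jupyter_notebook_config.py (classic Notebook server)
# Tornado leaves websocket compression off when this option is None (the default);
# an empty dict enables permessage-deflate with Tornado's defaults.
c.NotebookApp.tornado_settings = {
    "websocket_compression_options": {},
}
```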
Hi! This issue is being dropped here after discussion over on this thread: https://github.com/jupyter/kernel_gateway/issues/255
Intro
Basically the client interface to kernels is demonstrably adding a non-trivial amount of latency to every call made against a Jupyter notebook. Now one might ask "who cares about 8ms of latency?" and I would say anyone who would theoretically like to hit the Jupyter backend more than ~100 times a second. Reducing the latency to ~1ms should be achievable with the minimal overhead associated with ZMQ and the really minimal amount of post processing actually happening to any particular message.
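(For a sense of the ZMQ baseline assumed here: a bare pyzmq REQ/REP ping-pong on localhost, e.g. the sketch below, typically measures well under a millisecond per round trip on ordinary hardware; the port number is arbitrary.)

```python
import time
import zmq

ctx = zmq.Context()
server = ctx.socket(zmq.REP)
server.bind("tcp://127.0.0.1:5555")   # arbitrary port
client = ctx.socket(zmq.REQ)
client.connect("tcp://127.0.0.1:5555")

N = 1000
start = time.perf_counter()
for _ in range(N):
    client.send(b"ping")
    server.recv()
    server.send(b"pong")
    client.recv()
print("%.4f ms per round trip" % ((time.perf_counter() - start) / N * 1000))

client.close()
server.close()
ctx.term()
```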
What I've Seen So Far
This all means that I have ~7 ms (give or take 0.4 ms of latency due to TCP and WebSockets) that is unaccounted for. It would take a reasonably large amount of code to cause 7 ms of delay (for example, a simple prime sieve can find several thousand primes in that sort of time).
Actual Question
If anyone has a good idea of what's taking 58x the time it takes me to dispatch a call just to get the messages back, that would be awesome (it would be even more awesome if you could point me in the direction of changing that).