Open maartenbreddels opened 5 years ago
Thanks @maartenbreddels for this very detailed post.
Using xeus-python
Using the async version, but now hardcoding it to use the xeus-python (xpython) kernel.
Serial execution (`$ ab -n 10 -c 1 http://localhost:8866/`) gives 2.8 requests/second.
Parallel execution (`$ ab -n 10 -c 10 http://localhost:8866/`) gives 13.19 requests/second.
Extreme parallel execution (`$ ab -n 100 -c 50 http://localhost:8866/`) gives 16.43 requests/second. A significant >2x speedup!
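For anyone rerunning these benchmarks, the throughput figure can be scraped from `ab`'s report programmatically. A minimal sketch; the helper function is my own, but the `Requests per second:` line format is what ApacheBench actually prints:

```python
import re

def requests_per_second(ab_output: str) -> float:
    """Extract the mean throughput from ApacheBench (ab) report text."""
    match = re.search(r"Requests per second:\s+([\d.]+)", ab_output)
    if match is None:
        raise ValueError("no 'Requests per second' line found")
    return float(match.group(1))

# Illustrative fragment of an ab report:
sample = """
Concurrency Level:      10
Time taken for tests:   4.587 seconds
Requests per second:    2.18 [#/sec] (mean)
"""
print(requests_per_second(sample))  # 2.18
```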
This may also be important information for people scheduling many notebook runs (for use cases such as papermill). cc @mseal @willingc @ellisonbg
Spawning ipykernel instances has a high cost compared to xeus-python.
@JohanMabille regarding the flavor of xserver that we use for the xeus-python kernel, we may also like to use another model for the scheduling of notebooks compared to the interactive use for performance.
Glad the parallel jupyter_client code was all in and worked :sweat:
Spawning ipykernel instances has a high cost compared to xeus-python.
That doesn't surprise me greatly. Having those base benchmarks is useful though, as I didn't have any measurements yet around those launch costs.
Looking at the heatmap, it appears that traitlets' session fetches are taking most of the removable time in jupyter_client, which makes sense to me. I do think the traitlets code in that code path is a bit overkill for what it needs. Could be worth exploring something lighter weight for jupyter_client, or doing some profiled optimizations of the session config loading there later on.
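For context on what "lighter weight" could look like: the per-message work that `Session` does is mostly serialization plus the HMAC signature that the Jupyter wire protocol requires. The signature itself needs none of the traitlets machinery. A hand-rolled sketch of the protocol's signing scheme (this is not jupyter_client's actual API, just the HMAC-SHA256-over-frames scheme the messaging spec defines):

```python
import hashlib
import hmac

def sign(key: bytes, parts: list[bytes]) -> bytes:
    """HMAC-SHA256 signature over the serialized message frames,
    hex-encoded, as the Jupyter wire protocol expects."""
    h = hmac.new(key, digestmod=hashlib.sha256)
    for part in parts:
        h.update(part)
    return h.hexdigest().encode("ascii")

# header, parent_header, metadata, content as JSON bytes:
frames = [b"{}", b"{}", b"{}", b'{"code": "1 + 1"}']
signature = sign(b"secret-key", frames)
print(len(signature))  # 64 hex characters
```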
Would be interesting to know if that helps with papermill pipelines.
Would be interesting to know if that helps with papermill pipelines.
It definitely would, though in the scheduled papermill use cases I've seen in production systems, the launch time is trivial compared to the code execution times, so I haven't focused on it much. Where this would matter a lot more is lambda or papermill-on-REST setups, where launch time and dependency sizes matter a lot more.
Just some comments from a simple developer who can see the power of Jupyter and Voila as a means of developing applications in Python in general.
I see Voila as something not only for data-science apps, but for developing apps in Python in general. When I look at the Jupyter ecosystem, a tremendous amount of infrastructure has already been built, like widgets and layouts, that is far beyond what Streamlit and Dash can do.
But there is so much friction in using it (with Voila). One thing is the kernel starting time. That is OK for "advanced users" in an enterprise, but if I wanted to develop an application in Voila for the general web, the kernel starting time is too slow. The users would probably give up long before it loads.
So I'm crossing my fingers here.
Thanks for the good work so far.
Ref #363, #374, #395, #403
Machine/OS
Current state
current master Voila started with:
(`md.ipynb` is a notebook with 1 markdown cell, so we don't talk at all about executing code)
Serial execution (`$ ab -n 10 -c 1 http://localhost:8866/`) gives 1.24 requests/second.
Parallel execution (`$ ab -n 10 -c 10 http://localhost:8866/`) gives 2.18 requests/second. CPU usage is not always 100%.
Extreme parallel execution (`ab -n 100 -c 50 http://localhost:8866/`) can push it up to 2.5 requests/second.
Threaded branch
#403
Serial execution (`$ ab -n 10 -c 1 http://localhost:8866/`) gives 1.44 requests/second.
Parallel execution (`$ ab -n 10 -c 10 http://localhost:8866/`) gives 4.42 requests/second. CPU usage of the voila process is high (meaning voila is part of the bottleneck).
Extreme parallel execution (`ab -n 100 -c 50 http://localhost:8866/`) can push it up to 5.37 requests/second.
Async branch+tweaks
A hacked version of voila, where I ripped out the execution of cells, cache the loading of notebook files from disk, and reuse the same VoilaExporter so the jinja templates are not recompiled for each request. Furthermore, I've avoided using nbconvert for making the connection to the client and used parts of #374 for this. I'll try to open separate PRs for this.
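The notebook-caching tweak can be approximated with a tiny read-through cache. A generic sketch (the function name and cache size are my own, not voila's actual implementation):

```python
from functools import lru_cache
from pathlib import Path

@lru_cache(maxsize=32)
def load_notebook_source(path: str) -> str:
    """Read a notebook file once; later requests for the same
    path are served from memory instead of hitting the disk."""
    return Path(path).read_text(encoding="utf-8")
```

A real deployment would also want invalidation when the file changes on disk, e.g. by keying the cache on `(path, mtime)` instead of the path alone.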
Serial execution (`$ ab -n 10 -c 1 http://localhost:8866/`) gives 1.45 requests/second.
Parallel execution (`$ ab -n 10 -c 10 http://localhost:8866/`) gives 5.52 requests/second.
Extreme parallel execution (`ab -n 100 -c 50 http://localhost:8866/`) gives 7 requests/second; voila takes 30% CPU, but the system is at 800%/100% (all cores). Most of the bottleneck thus seems to be the kernel starting here.
Using xeus-python
Using the async version, but now hardcoding it to use the xeus-python (xpython) kernel.
Serial execution (`$ ab -n 10 -c 1 http://localhost:8866/`) gives 2.8 requests/second.
Parallel execution (`$ ab -n 10 -c 10 http://localhost:8866/`) gives 13.19 requests/second.
Extreme parallel execution (`$ ab -n 100 -c 50 http://localhost:8866/`) gives 16.43 requests/second. A significant >2x speedup!
Only starting a kernel
Taking voila/tornado/jinja out of the equation, I explored how much time is spent on actually starting the kernel.
jupyter_client + async
Based on the async PR I started 200 kernels (xeus-python), at a rate of 16 kernel starts/second. Using py-spy and (GNU) time gave the following result: a lot of time is spent throughout different components of jupyter_client; there is no single part that takes up most of the time.
CPU usage for the benchmark program is 70% (~0.5-1 core out of 4 is used). Quite some time seems to be spent in jupyter_client overhead; for instance, Session is a configurable, which is created in write_configuration_file (below figure, on the left).
jupyter_kernel_mgmt + async
The same 200 kernels (xeus-python) are started with jupyter_kernel_mgmt (an alternative to a subset of jupyter_client), which is also hacked a bit to get more async communication, to see how far we can push it. This leads to 24 kernel starts/second.
The benchmark/client program uses 17% CPU, significantly less than with jupyter_client (time reports 31%, since that also includes startup time, but looking at htop it is 17% while starting the kernels). The flame graph shows that most of the time is spent on launching the kernel itself, so there is not much room for improvement (and even if there is, this part does not seem to be a bottleneck).
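Independent of which kernel-management library does the launching, the shape of such a benchmark is a bounded fan-out of async starts. A sketch with a dummy coroutine standing in for the real, library-specific kernel launch (all names here are mine, for illustration):

```python
import asyncio
import time

async def start_kernel_stub(i: int) -> int:
    # Stand-in for the real async kernel launch (jupyter_client
    # or jupyter_kernel_mgmt); here it just sleeps briefly.
    await asyncio.sleep(0.01)
    return i

async def start_many(n: int, limit: int) -> float:
    sem = asyncio.Semaphore(limit)  # cap concurrent launches

    async def bounded(i: int) -> int:
        async with sem:
            return await start_kernel_stub(i)

    t0 = time.perf_counter()
    await asyncio.gather(*(bounded(i) for i in range(n)))
    elapsed = time.perf_counter() - t0
    return n / elapsed  # kernel starts per second

rate = asyncio.run(start_many(200, limit=20))
print(f"{rate:.0f} starts/second")
```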
(PS: I cannot seem to run this benchmark with ipython; I get lots of `zmq.error.ZMQError: Address already in use` errors.)
Conclusions
#403, and possibly a full async version (requires Python >= 3.6), would speed up voila quite a bit.
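Putting the extreme-parallel numbers from this post side by side (all figures copied from the measurements above):

```python
# Extreme-parallel throughput (requests/second) per configuration:
results = {
    "master": 2.5,
    "threaded (#403)": 5.37,
    "async + tweaks": 7.0,
    "async + xeus-python": 16.43,
}

baseline = results["master"]
for name, rps in results.items():
    print(f"{name}: {rps} req/s ({rps / baseline:.1f}x master)")

# xeus-python vs the async branch itself, i.e. the ">2x speedup":
print(f"{results['async + xeus-python'] / results['async + tweaks']:.1f}x")  # 2.3x
```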
Uncertain
Benchmarking
Side note
I've been exploring a bit how we can improve the performance of voila; during that exploration I can at least say this about profilers: