Open maartenbreddels opened 5 years ago
Thanks @maartenbreddels for this very detailed post.
Using xeus-python
Using the async version, but now hardcoding it to use the xeus-python (xpython) kernel.
Serial execution (`$ ab -n 10 -c 1 http://localhost:8866/`) gives 2.8 requests/second.
Parallel execution (`$ ab -n 10 -c 10 http://localhost:8866/`) gives 13.19 requests/second.
Extreme parallel execution (`$ ab -n 100 -c 50 http://localhost:8866/`) gives 16.43 requests/second. A significant >2x speedup!
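For anyone rerunning these benchmarks, the throughput figure can be scraped from `ab`'s report programmatically. A minimal sketch; the helper function is my own, but the `Requests per second:` line format is what ApacheBench actually prints:

```python
import re

def requests_per_second(ab_output: str) -> float:
    """Extract the mean throughput from ApacheBench (ab) report text."""
    match = re.search(r"Requests per second:\s+([\d.]+)", ab_output)
    if match is None:
        raise ValueError("no 'Requests per second' line found")
    return float(match.group(1))

# Illustrative fragment of an ab report:
sample = """
Concurrency Level:      10
Time taken for tests:   4.587 seconds
Requests per second:    2.18 [#/sec] (mean)
"""
print(requests_per_second(sample))  # 2.18
```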
This may also be important information for people scheduling many notebook runs (for use cases such as papermill). cc @mseal @willingc @ellisonbg
Spawning ipykernel instances has a high cost compared to xeus-python.
@JohanMabille regarding the flavor of xserver that we use for the xeus-python kernel, we may also like to use another model for the scheduling of notebooks compared to the interactive use for performance.
Glad the parallel jupyter_client code was all in and worked :sweat:
Spawning ipykernel instances has a high cost compared to xeus-python.
That doesn't surprise me greatly. Having those base benchmarks is useful though, as I didn't have any measurements yet around those launch costs.
Looking at the heatmap, it appears that traitlets' session fetches are taking most of the removable time in jupyter_client, which makes sense to me. I do think the traitlets code in that code path is a bit overkill for what it needs. Could be worth exploring something lighter weight for jupyter_client, or doing some profiled optimizations of the session config loading there later on.
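For context on what "lighter weight" could look like: the per-message work that `Session` does is mostly serialization plus the HMAC signature that the Jupyter wire protocol requires. The signature itself needs none of the traitlets machinery. A hand-rolled sketch of the protocol's signing scheme (this is not jupyter_client's actual API, just the HMAC-SHA256-over-frames scheme the messaging spec defines):

```python
import hashlib
import hmac

def sign(key: bytes, parts: list[bytes]) -> bytes:
    """HMAC-SHA256 signature over the serialized message frames,
    hex-encoded, as the Jupyter wire protocol expects."""
    h = hmac.new(key, digestmod=hashlib.sha256)
    for part in parts:
        h.update(part)
    return h.hexdigest().encode("ascii")

# header, parent_header, metadata, content as JSON bytes:
frames = [b"{}", b"{}", b"{}", b'{"code": "1 + 1"}']
signature = sign(b"secret-key", frames)
print(len(signature))  # 64 hex characters
```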
Would be interesting to know if that helps with papermill pipelines.
Would be interesting to know if that helps with papermill pipelines.
It definitely would, though in the scheduled papermill use cases I've seen in production systems, the launch time is trivial compared to the code execution times, so I haven't focused on it much. Where this would matter a lot more is lambda or papermill-on-REST setups, where launch time and dependency sizes matter a lot more.
Just some comments from a simple developer who can see the power of Jupyter and Voila as a means of developing applications in Python in general.
I see Voila as something not only for data-science apps, but for developing apps in Python in general. When I look at the Jupyter ecosystem, a tremendous amount of infrastructure has already been built, like widgets and layouts, that is far beyond what Streamlit and Dash can do.
But there is so much friction in using it (with Voila). One thing is the kernel starting time. That is OK for "advanced users" in an enterprise, but if I wanted to develop an application in Voila for the general web, the kernel starting time is too slow. The users would probably give up long before it loads.
So I'm crossing my fingers here.
Thanks for the good work so far.
Ref #363, #374, #395, #403
Machine/OS
Current state
current master Voila started with:
(`md.ipynb` is a notebook with 1 markdown cell, so we don't talk at all about executing code)
Serial execution (`$ ab -n 10 -c 1 http://localhost:8866/`) gives 1.24 requests/second.
Parallel execution (`$ ab -n 10 -c 10 http://localhost:8866/`) gives 2.18 requests/second. CPU usage is not always 100%.
Extreme parallel execution (`ab -n 100 -c 50 http://localhost:8866/`) can push it up to 2.5 requests/second.
Threaded branch
#403
Serial execution (`$ ab -n 10 -c 1 http://localhost:8866/`) gives 1.44 requests/second.
Parallel execution (`$ ab -n 10 -c 10 http://localhost:8866/`) gives 4.42 requests/second. CPU usage of the voila process is high (meaning voila is part of the bottleneck).
Extreme parallel execution (`ab -n 100 -c 50 http://localhost:8866/`) can push it up to 5.37 requests/second.
Async branch+tweaks
A hacked version of voila, where I ripped out the execution of cells, cache the loading of notebook files from disk, and reuse the same VoilaExporter so the jinja templates are not recompiled for each request. Furthermore, I've avoided using nbconvert for making the connection to the client and used parts of #374 for this. I'll try to open separate PRs for this.
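The notebook-caching tweak can be approximated with a tiny read-through cache. A generic sketch (the function name and cache size are my own, not voila's actual implementation):

```python
from functools import lru_cache
from pathlib import Path

@lru_cache(maxsize=32)
def load_notebook_source(path: str) -> str:
    """Read a notebook file once; later requests for the same
    path are served from memory instead of hitting the disk."""
    return Path(path).read_text(encoding="utf-8")
```

A real deployment would also want invalidation when the file changes on disk, e.g. by keying the cache on `(path, mtime)` instead of the path alone.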
Serial execution (`$ ab -n 10 -c 1 http://localhost:8866/`) gives 1.45 requests/second.
Parallel execution (`$ ab -n 10 -c 10 http://localhost:8866/`) gives 5.52 requests/second.
Extreme parallel execution (`ab -n 100 -c 50 http://localhost:8866/`) gives 7 requests/second; voila takes 30% CPU, but the system is at 800%/100% (all cores). Most of the bottleneck thus seems to be the kernel starting here.
Using xeus-python
Using the async version, but now hardcoding it to use the xeus-python (xpython) kernel.
Serial execution (`$ ab -n 10 -c 1 http://localhost:8866/`) gives 2.8 requests/second.
Parallel execution (`$ ab -n 10 -c 10 http://localhost:8866/`) gives 13.19 requests/second.
Extreme parallel execution (`$ ab -n 100 -c 50 http://localhost:8866/`) gives 16.43 requests/second. A significant >2x speedup!
Only starting a kernel
Taking voila/tornado/jinja out of the equation, I explored how much time is spent on actually starting the kernel.
jupyter_client + async
Based on the async PR I started 200 kernels (xeus-python), at a rate of 16 kernel starts/second. Using py-spy and (GNU) time gave the following result: a lot of time is spent throughout different components of jupyter_client; there is no single part that takes up most of the time.
CPU usage for the benchmark program is 70% (~0.5-1 core out of 4 is used). Quite some time seems to be spent in jupyter_client overhead; for instance, Session is a configurable, which is created in write_configuration_file (below figure, on the left).
jupyter_kernel_mgmt + async
The same 200 kernels (xeus-python) are started with jupyter_kernel_mgmt (an alternative to a subset of jupyter_client), which is also hacked a bit to get more async communication, to see how far we can push it. This leads to 24 kernel starts/second.
The benchmark/client program uses 17% CPU, significantly less than with jupyter_client (time reports 31%, since that also includes startup time, but looking at htop it is 17% while starting the kernels). The flame graph shows that most of the time is spent on launching the kernel itself, so there is not much room for improvement (and even if there is, this part does not seem to be a bottleneck).
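Independent of which kernel-management library does the launching, the shape of such a benchmark is a bounded fan-out of async starts. A sketch with a dummy coroutine standing in for the real, library-specific kernel launch (all names here are mine, for illustration):

```python
import asyncio
import time

async def start_kernel_stub(i: int) -> int:
    # Stand-in for the real async kernel launch (jupyter_client
    # or jupyter_kernel_mgmt); here it just sleeps briefly.
    await asyncio.sleep(0.01)
    return i

async def start_many(n: int, limit: int) -> float:
    sem = asyncio.Semaphore(limit)  # cap concurrent launches

    async def bounded(i: int) -> int:
        async with sem:
            return await start_kernel_stub(i)

    t0 = time.perf_counter()
    await asyncio.gather(*(bounded(i) for i in range(n)))
    elapsed = time.perf_counter() - t0
    return n / elapsed  # kernel starts per second

rate = asyncio.run(start_many(200, limit=20))
print(f"{rate:.0f} starts/second")
```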
(PS: I cannot seem to run this benchmark with ipython; I get lots of `zmq.error.ZMQError: Address already in use` errors.)
Conclusions
#403, and possibly a full async version (requires Python >= 3.6), would speed up voila quite a bit.
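Putting the extreme-parallel numbers from this post side by side (all figures copied from the measurements above):

```python
# Extreme-parallel throughput (requests/second) per configuration:
results = {
    "master": 2.5,
    "threaded (#403)": 5.37,
    "async + tweaks": 7.0,
    "async + xeus-python": 16.43,
}

baseline = results["master"]
for name, rps in results.items():
    print(f"{name}: {rps} req/s ({rps / baseline:.1f}x master)")

# xeus-python vs the async branch itself, i.e. the ">2x speedup":
print(f"{results['async + xeus-python'] / results['async + tweaks']:.1f}x")  # 2.3x
```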
Uncertain
Benchmarking
Side note
I've been exploring a bit how we can improve the performance of voila; during that exploration I can at least say this about profilers: