holoviz / panel

Panel: The powerful data exploration & web app framework for Python
https://panel.holoviz.org
BSD 3-Clause "New" or "Revised" License

Panel performance degradation while running load tests #6502

Open armaaar opened 7 months ago

armaaar commented 7 months ago

When we tried to measure the performance of Panel apps using k6, we noticed a huge degradation in performance with only 10 users using the app simultaneously, even after enabling deferred loading and refactoring pages into classes.

We created a "Hello, World!" Panel page in isolation and ran our load tests against it, which show a huge performance degradation the more the Panel server is hit. We suspect that a large part of it comes from how Panel handles sessions: apparently session destruction (and maybe initialization as well) is not threaded and blocks the whole server.

Here is a repository that demonstrates the performance degradation of a Panel app being hit by 10 users simultaneously: https://github.com/armaaar/panel-performance


Complete, minimal, self-contained example code that reproduces the issue

Please refer to our test repository: https://github.com/armaaar/panel-performance. The README documents installation, how to run the tests, and the results.
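
A page like the one we tested is only a few lines; a minimal sketch along those lines (the repository has the actual code, which may differ):

    import panel as pn

    pn.extension()

    # Minimal "Hello, World!" page; the Panel server runs this code to
    # create a fresh session for every visiting user.
    pn.pane.Markdown("# Hello, World!").servable()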

Screenshots or screencasts of the bug in action

Everything can be found in the repo readme: https://github.com/armaaar/panel-performance/blob/master/README.md

MarcSkovMadsen commented 7 months ago

My experience is that Panel can handle an initial load of 2-3 users per second per process.

It's not been a problem for my use cases so far, but it also means that there are use cases I currently stay away from.

philippjfr commented 7 months ago

Thanks for the super helpful load testing framework. I have actually been working on moving session creation to a thread and am aiming to have a version of that in Panel 1.5.0.

philippjfr commented 7 months ago

One quick question after reviewing your document: was the admin page always enabled when you ran the load tests? I'm asking because the admin dashboard itself (enabled by passing --admin to panel serve) may be adding non-negligible overhead. After disabling it, here are the timings I get with the latest Panel:

     browser_data_received...............................: 1.8 GB  6.4 MB/s
     browser_data_sent...................................: 5.7 MB  20 kB/s
     browser_http_req_duration...........................: avg=385.23ms min=2.03ms   med=245.18ms max=11.74s   p(90)=584.01ms p(95)=736.06ms
     browser_http_req_failed.............................: 0.00%   ✓ 0        ✗ 14336
     browser_web_vital_cls...............................: avg=0        min=0        med=0        max=0        p(90)=0        p(95)=0
   ✗ browser_web_vital_fcp...............................: avg=1.54s    min=227.1ms  med=920.1ms  max=9.37s    p(90)=4.33s    p(95)=5.2s
     ✗ { scenario:load }.................................: avg=1.55s    min=227.1ms  med=925ms    max=9.37s    p(90)=4.34s    p(95)=5.2s
     ✓ { scenario:single_load }..........................: avg=880.29ms min=827.2ms  med=873.45ms max=961.5ms  p(90)=915.15ms p(95)=938.32ms
     ✓ { scenario:single }...............................: avg=334.69ms min=334.69ms med=334.69ms max=334.69ms p(90)=334.69ms p(95)=334.69ms
   ✗ browser_web_vital_lcp...............................: avg=1.67s    min=321.19ms med=1.08s    max=9.38s    p(90)=4.35s    p(95)=5.23s
     ✗ { scenario:load }.................................: avg=1.69s    min=321.19ms med=1.08s    max=9.38s    p(90)=4.36s    p(95)=5.23s
     ✓ { scenario:single_load }..........................: avg=1s       min=957.09ms med=988.3ms  max=1.13s    p(90)=1.03s    p(95)=1.08s
     ✓ { scenario:single }...............................: avg=453.89ms min=453.89ms med=453.89ms max=453.89ms p(90)=453.89ms p(95)=453.89ms
     browser_web_vital_ttfb..............................: avg=724.23ms min=43.19ms  med=327.04ms max=7.94s    p(90)=2.56s    p(95)=3.78s
   ✓ checks..............................................: 99.83%  ✓ 1184     ✗ 2
     ✓ { check:page_loaded }.............................: 99.83%  ✓ 592      ✗ 1
     ✓ { check:server_responded }........................: 99.83%  ✓ 592      ✗ 1
     ✓ { scenario:load,check:page_loaded }...............: 99.82%  ✓ 581      ✗ 1
     ✓ { scenario:load,check:server_responded }..........: 99.82%  ✓ 581      ✗ 1
     ✓ { scenario:load }.................................: 99.82%  ✓ 1162     ✗ 2
     ✓ { scenario:single_load,check:page_loaded }........: 100.00% ✓ 10       ✗ 0
     ✓ { scenario:single_load,check:server_responded }...: 100.00% ✓ 10       ✗ 0
     ✓ { scenario:single_load }..........................: 100.00% ✓ 20       ✗ 0
     ✓ { scenario:single,check:page_loaded }.............: 100.00% ✓ 1        ✗ 0
     ✓ { scenario:single,check:server_responded }........: 100.00% ✓ 1        ✗ 0
     ✓ { scenario:single }...............................: 100.00% ✓ 2        ✗ 0
     data_received.......................................: 0 B     0 B/s
     data_sent...........................................: 0 B     0 B/s
     iteration_duration..................................: avg=3.81s    min=542.74ms med=2.54s    max=29.5s    p(90)=6.52s    p(95)=7.09s
     iterations..........................................: 593     2.124812/s
     vus.................................................: 5       min=0      max=10
     vus_max.............................................: 10      min=10     max=10

The load timings in particular are significantly better, though of course still nowhere near where we want them to be:

     ✗ { scenario:load }.................................: avg=1.55s    min=227.1ms  med=925ms    max=9.37s    p(90)=4.34s    p(95)=5.2s
philippjfr commented 7 months ago

Okay, my investigation turned up multiple things:

  1. The codepath you were using (i.e. passing a function to serve) bypasses a number of optimizations that we do implement for file-based applications, e.g. around session cleanup, and it also does not perform profiling or track the actual session creation timings (see the sketch after this list).
  2. It is relatively easy to make session creation threaded; how much that buys you, and how robustly it works, is something I'm going to have to continue evaluating.
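
For reference, a minimal sketch of the two codepaths; pn.serve and panel serve are the public API, but the exact setup in the test repo may differ:

    import panel as pn

    def app():
        # Trivial page comparable to the test repo's app
        return pn.pane.Markdown("# Hello, World!")

    # Function-based codepath: the one the load tests above exercised.
    # This bypasses the file-based optimizations mentioned in point 1.
    pn.serve(app, port=5006, show=False)

    # File-based codepath: put `...servable()` in app.py and run
    #     panel serve app.py
    # which goes through the optimized session-cleanup and profiling paths.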
philippjfr commented 7 months ago

One more note about threading: the actual bottleneck likely isn't even the session creation itself (particularly for an app as simple as the one we are profiling), but rather that all websocket connections are currently owned by a single thread. When the server has to send data to multiple sessions simultaneously, all the messages queue up to be sent to the frontend, so even if sessions are created in parallel, none of the messages will actually be sent out as long as the main thread is locked up.

philippjfr commented 7 months ago

The solution will likely be to maintain a completely independent thread pool that grabs events ready to be sent off a queue and dispatches them.
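
As a generic sketch of that pattern (not Panel's actual implementation): a shared queue drained by a small, independent pool of dispatcher threads:

    import queue
    import threading

    # Shared outbox of (connection, message) pairs ready to be sent.
    outbox = queue.Queue()

    def dispatcher():
        # Each dispatcher thread pulls messages off the shared queue and
        # sends them, so outgoing writes no longer serialize on the thread
        # that owns the sessions.
        while True:
            ws, msg = outbox.get()
            try:
                # Hypothetical send call; a real Tornado-based server would
                # have to schedule this on the connection's event loop.
                ws.write_message(msg)
            finally:
                outbox.task_done()

    # Completely independent pool of writer threads.
    for _ in range(4):
        threading.Thread(target=dispatcher, daemon=True).start()

    # Server code then enqueues instead of writing directly:
    # outbox.put((websocket, payload))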

MarcSkovMadsen commented 7 months ago

Try looking at Gradio. They are probably the data app framework that has had to scale the most, due to their usage on Hugging Face.

My understanding is that every event is sent to a queue that is shared across all users, which enables scaling.

MarcSkovMadsen commented 7 months ago

You can also look at Streamlit: they run every user in a separate thread. It solves some problems but creates others, from what I've heard.

MarcSkovMadsen commented 7 months ago

The main problem as I see it is that our "template" is quite heavy and custom.

Streamlit uses the same minimal HTML template to start the websocket, and the rest is sent afterwards. It means they can respond to many more initial HTTP requests.

philippjfr commented 7 months ago

Their server architectures are very different, so I don't think there's much we can steal from them. I have a pretty good idea of how we can scale applications better: moving session creation and destruction to a thread is one part of it, but the bigger part will be moving the websocket writes to a thread.

The main problem as I see it is that our "template" is quite heavy and custom.

This isn't really true, or rather isn't what's causing the problem. The problem is that we create a session (i.e. run the user code) before the initial template is sent. You can already work around this by enabling config.reuse_sessions or setting --reuse-sessions.
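
For reference, the workaround looks like this (a minimal sketch):

    import panel as pn

    # Equivalent to passing --reuse-sessions to `panel serve`: the server
    # reuses a previously created session for the initial render instead of
    # running the user code before sending the template.
    pn.config.reuse_sessions = True

    pn.pane.Markdown("# Hello, World!").servable()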

MarcSkovMadsen commented 7 months ago

I agree the server architectures are very different and that you cannot copy code. Still, I believe there is a good chance of finding ideas there and of understanding what scaling requires in 2024.

cdeil commented 7 months ago

@armaaar - thanks for putting this together!

@philippjfr or @MarcSkovMadsen - is there a recommended way to define "page loaded" in such load tests for Panel?

I wanted to try with the latest panel==1.4.0rc2, but it seems the same selector doesn't work anymore?

https://github.com/armaaar/panel-performance/blob/405aca725cf83f4d1e9722457d3b60657610670d/k6.js#L83

ERRO[0053] Uncaught (in promise) GoError: waiting for selector ".pn-loading.pn-arc": waiting for selector ".pn-loading.pn-arc" did not result in any nodes
single  at github.com/grafana/xk6-browser/browser.mapPage.func13 (native)
single_lat file:///Users/cdeil/code/hc/panel-performance/k6.js:91:29(30)  executor=constant-vus scenario=load

If you think k6 is useful for Panel users, it would be great if you could add a mention to the docs with a best practice for defining "page loaded" in a robust way: https://panel.holoviz.org/how_to/test/loadtests.html


One more thought on this:

browser_data_received...............................: 1.8 GB  6.4 MB/s

I think this load test is very atypical: for normal Panel apps a few MB of JS/CSS are loaded once and then cached by the browser, but here k6 possibly starts with a clean browser cache every time, so to a certain extent it just puts load on Panel as a static file server?

cdeil commented 7 months ago

If I go to https://panel-gallery-dev.holoviz.dsp.anaconda.com/portfolio_analyzer I see page load times varying between 1 second and 10 or 20 seconds.

Also, I see 8 MB of resources and 2.3 MB transferred, but with the browser cache enabled only 7 kB is transferred.

Is that normal behaviour, and is this the correct interpretation of the Chrome network tab load timings and download sizes?

[Screenshot: Chrome network tab, 2024-03-16 21:32:49]
philippjfr commented 7 months ago

That server is severely under-provisioned right now, unfortunately; we're working on it, but for now those numbers may be right. I also had to update the k6 script and will make a PR with an improved benchmark on the latest Panel tomorrow or Monday.

philippjfr commented 7 months ago

Okay, I tracked down the main issue that was plaguing the particular test case above. The problem was that during session cleanup Bokeh was calling gc.collect() for each individual session, which quickly adds up when called 20x consecutively. I had already worked around a separate gc.collect() call by scheduling the garbage collection 5 seconds after a session was destroyed and ensuring it would only run once, but had not realized Bokeh was performing one more, completely separate GC cycle. I have now unified that into a single gc.collect() which is debounced to run 5 seconds after any session is destroyed, i.e. if another session is destroyed within those 5 seconds, the collection is postponed by another 5 seconds.
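
A minimal sketch of that debouncing logic (not the actual Panel/Bokeh code): every session teardown re-arms a 5-second timer, and gc.collect() only runs once the timer fires without being reset:

    import gc
    import threading

    _gc_timer = None
    _gc_lock = threading.Lock()

    def schedule_gc(delay=5.0):
        """Call on every session destruction to debounce gc.collect()."""
        global _gc_timer
        with _gc_lock:
            # Cancel any pending collection so repeated teardowns keep
            # postponing the (expensive) GC cycle...
            if _gc_timer is not None:
                _gc_timer.cancel()
            # ...and re-arm it to fire `delay` seconds after the last one.
            _gc_timer = threading.Timer(delay, gc.collect)
            _gc_timer.daemon = True
            _gc_timer.start()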

While not perfect this did make a big difference to the timing:

     browser_data_received...............................: 2.3 GB  9.0 MB/s
     browser_data_sent...................................: 7.3 MB  29 kB/s
     browser_http_req_duration...........................: avg=199.74ms min=1.69ms   med=105.07ms max=2.74s    p(90)=473.84ms p(95)=594.83ms
     browser_http_req_failed.............................: 0.00%   ✓ 0       ✗ 18289
     browser_web_vital_cls...............................: avg=0        min=0        med=0        max=0        p(90)=0        p(95)=0
   ✗ browser_web_vital_fcp...............................: avg=775.63ms min=505.79ms med=650.7ms  max=3.15s    p(90)=1.14s    p(95)=1.5s
     ✗ { scenario:load }.................................: avg=773.3ms  min=505.79ms med=644.5ms  max=3.15s    p(90)=1.14s    p(95)=1.5s
     ✓ { scenario:single_load }..........................: avg=965.38ms min=908.5ms  med=956.44ms max=1.09s    p(90)=994.71ms p(95)=1.04s
     ✓ { scenario:single }...............................: avg=594ms    min=594ms    med=594ms    max=594ms    p(90)=594ms    p(95)=594ms
   ✓ browser_web_vital_lcp...............................: avg=1.18s    min=880.5ms  med=1.04s    max=3.17s    p(90)=1.68s    p(95)=1.96s
     ✓ { scenario:load }.................................: avg=1.17s    min=880.5ms  med=1.03s    max=3.17s    p(90)=1.68s    p(95)=1.96s
     ✓ { scenario:single_load }..........................: avg=1.57s    min=1.45s    med=1.55s    max=1.68s    p(90)=1.67s    p(95)=1.67s
     ✓ { scenario:single }...............................: avg=994.89ms min=994.89ms med=994.89ms max=994.89ms p(90)=994.89ms p(95)=994.89ms
     browser_web_vital_ttfb..............................: avg=173.9ms  min=39.5ms   med=111.6ms  max=1.33s    p(90)=352.05ms p(95)=506.59ms
   ✓ checks..............................................: 100.00% ✓ 1498    ✗ 0
     ✓ { check:page_loaded }.............................: 100.00% ✓ 749     ✗ 0
     ✓ { check:server_responded }........................: 100.00% ✓ 749     ✗ 0
     ✓ { scenario:load,check:page_loaded }...............: 100.00% ✓ 738     ✗ 0
     ✓ { scenario:load,check:server_responded }..........: 100.00% ✓ 738     ✗ 0
     ✓ { scenario:load }.................................: 100.00% ✓ 1476    ✗ 0
     ✓ { scenario:single_load,check:page_loaded }........: 100.00% ✓ 10      ✗ 0
     ✓ { scenario:single_load,check:server_responded }...: 100.00% ✓ 10      ✗ 0
     ✓ { scenario:single_load }..........................: 100.00% ✓ 20      ✗ 0
     ✓ { scenario:single,check:page_loaded }.............: 100.00% ✓ 1       ✗ 0
     ✓ { scenario:single,check:server_responded }........: 100.00% ✓ 1       ✗ 0
     ✓ { scenario:single }...............................: 100.00% ✓ 2       ✗ 0
     data_received.......................................: 0 B     0 B/s
     data_sent...........................................: 0 B     0 B/s
     iteration_duration..................................: avg=1.91s    min=1.1s     med=1.5s     max=7.05s    p(90)=3.14s    p(95)=4.19s
     iterations..........................................: 749     2.99838/s
     vus.................................................: 5       min=0     max=10
     vus_max.............................................: 20      min=20    max=20

The average loading time is now well below 1 second and even the p95 is "only" 1.5 seconds.