holoviz / panel

Panel: The powerful data exploration & web app framework for Python
https://panel.holoviz.org
BSD 3-Clause "New" or "Revised" License
4.78k stars 518 forks source link

Panel server blocking for more than a minute several times during 5 min load test. #3105

Open MarcSkovMadsen opened 2 years ago

MarcSkovMadsen commented 2 years ago

I'm trying to measure the performance of Pythons data app frameworks. I'm using Locust to measure the performance of the initial page load. @jbednar pointed out it would be interesting here https://twitter.com/JamesABednar/status/1481319080598683648?s=20.

As you can see below. There is a very, very big difference between the medium and the max time it takes to get the initial page response. The server is sometimes not responding for more than a minute.

image

https://user-images.githubusercontent.com/42288570/149672116-b922d775-3d45-4ac5-83fe-2f4df8febcd0.mp4

(video has been speeded up x4 to be able to upload it)

In the video you can see the first time the server is blocked is at ~13s/ 52s.

At around 58s/ 3mins 52s an exception is raised.

Discussion

My hypothesis is that the process is not performing at its max. I did see any indications that its constrained by the 1 core or memory available. Instead there is some code problem. Is it trying to push a message through a web socket and waiting for a response that never comes?

I did similar tests on Streamlit and Dash. They don't exhibit this behavior. Streamlit has a very different approach. It just sends the same small javascript to initiate the web socket - independent of the application. It's a factor 100 faster than Panel (1ms in some cases). But then the page load to the user takes some time. In my preliminary experiments I believe Panel actually renders to the user faster.

The video test was running on my windows laptop. But I also did similar experiments running locust and panel in a linux docker container. See my repo https://github.com/MarcSkovMadsen/data-app-performance

Reproducible Example

requirements.txt

panel==0.12.6
plotly==5.5.0
locust==2.5.1

slider_plot_panel.py

import plotly.express as px

def plot(count):
    x = [str(i) for i in range(0, count)]
    y = list(range(0, count))
    return px.bar(x=x, y=y)

import panel as pn

pn.extension("plotly", sizing_mode="stretch_width", template="fast")

count = pn.widgets.IntSlider(name="Count", start=1, end=50, value=10).servable(area="sidebar")
pn.panel(pn.bind(plot, count)).servable(title="Panel - Slider with Plot Performance Test")

page_load_panel.py

from locust import HttpUser, task

class HelloWorldUser(HttpUser):
    @task
    def hello_world(self):
        self.client.get("/slider_plot_panel")
locust -f page_load_panel.py --host http://localhost:5006 --users 10 & panel serve slider_plot_panel.py

Stack trace (from video)

$ locust -f page_load_panel.py --host http://localhost:5006 --users 10 & panel serve slider_plot_panel.py  
[1] 1946
[2022-01-16 18:48:59,301] PC81978/INFO/locust.main: Starting web interface at http://0.0.0.0:8089 (accepting connections from all network interfaces)
[2022-01-16 18:48:59,315] PC81978/INFO/locust.main: Starting Locust 2.5.1
2022-01-16 18:49:00,513 Starting Bokeh server version 2.4.2 (running on Tornado 6.1)
2022-01-16 18:49:00,515 User authentication hooks NOT provided (default user enabled)
2022-01-16 18:49:00,521 Bokeh app running at: http://localhost:5006/slider_plot_panel
2022-01-16 18:49:00,521 Starting Bokeh server with process id: 28868
2022-01-16 18:49:04,451 WebSocket connection opened
2022-01-16 18:49:04,452 ServerConnection created
2022-01-16 18:49:29,413 WebSocket connection closed: code=1001, reason=None
2022-01-16 18:49:29,841 WebSocket connection opened
2022-01-16 18:49:29,841 ServerConnection created
2022-01-16 18:49:31,686 WebSocket connection closed: code=1001, reason=None
2022-01-16 18:49:31,940 WebSocket connection opened
2022-01-16 18:49:31,941 ServerConnection created
[2022-01-16 18:49:34,616] PC81978/INFO/locust.runners: Ramping to 10 users at a rate of 1.00 per second
[2022-01-16 18:49:43,670] PC81978/INFO/locust.runners: All users spawned: {"HelloWorldUser": 10} (10 total 
users)
2022-01-16 18:51:18,789 WebSocket connection closed: code=1001, reason=None
2022-01-16 18:51:19,321 WebSocket connection opened
2022-01-16 18:51:19,669 ServerConnection created
2022-01-16 18:53:23,613 WebSocket connection opened
2022-01-16 18:53:23,986 Failed sending message as connection was closed
2022-01-16 18:53:23,986 Failed sending message as connection was closed
2022-01-16 18:53:23,987 ServerConnection created
2022-01-16 18:53:24,621 Failed sending message as connection was closed
2022-01-16 18:53:25,891 Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOMainLoop object at 0x000001CB2008F5E0>>, <Task finished name='Task-1370' coro=<_needs_document_lock.<locals>._needs_document_lock_wrapper() done, defined at c:\repos\private\data-app-performance\.venv\lib\site-packages\bokeh\server\session.py:77> exception=WebSocketClosedError()>)   
Traceback (most recent call last):
  File "c:\repos\private\data-app-performance\.venv\lib\site-packages\tornado\ioloop.py", line 741, in _run_callback
    ret = callback()
  File "c:\repos\private\data-app-performance\.venv\lib\site-packages\tornado\ioloop.py", line 765, in _discard_future_result
    future.result()
  File "c:\repos\private\data-app-performance\.venv\lib\site-packages\bokeh\server\session.py", line 97, in _needs_document_lock_wrapper
    result = await result
  File "c:\repos\private\data-app-performance\.venv\lib\site-packages\tornado\gen.py", line 216, in wrapper
    result = ctx_run(func, *args, **kwargs)
  File "c:\repos\private\data-app-performance\.venv\lib\site-packages\panel\reactive.py", line 279, in _change_coroutine
    self._change_event(doc)
  File "c:\repos\private\data-app-performance\.venv\lib\site-packages\panel\reactive.py", line 289, in _change_event
    self._process_events(events)
  File "c:\repos\private\data-app-performance\.venv\lib\site-packages\panel\reactive.py", line 258, in _process_events
    state.busy = True
  File "c:\repos\private\data-app-performance\.venv\lib\site-packages\param\parameterized.py", line 353, in _f
    return f(self, obj, val)
  File "c:\repos\private\data-app-performance\.venv\lib\site-packages\param\parameterized.py", line 1228, in __set__
    obj.param._call_watcher(watcher, event)
  File "c:\repos\private\data-app-performance\.venv\lib\site-packages\param\parameterized.py", line 2018, in _call_watcher
    self_._execute_watcher(watcher, (event,))
  File "c:\repos\private\data-app-performance\.venv\lib\site-packages\param\parameterized.py", line 2000, in _execute_watcher
    watcher.fn(*args, **kwargs)
  File "c:\repos\private\data-app-performance\.venv\lib\site-packages\param\parameterized.py", line 649, in caller
    return function()
  File "c:\repos\private\data-app-performance\.venv\lib\site-packages\param\parameterized.py", line 387, in _depends
    return func(*args, **kw)
  File "c:\repos\private\data-app-performance\.venv\lib\site-packages\panel\io\state.py", line 119, in _update_busy
    indicator.value = self.busy
  File "c:\repos\private\data-app-performance\.venv\lib\site-packages\param\parameterized.py", line 351, in _f
    instance_param.__set__(obj, val)
  File "c:\repos\private\data-app-performance\.venv\lib\site-packages\param\parameterized.py", line 353, in _f
    return f(self, obj, val)
  File "c:\repos\private\data-app-performance\.venv\lib\site-packages\param\parameterized.py", line 1228, in __set__
    obj.param._call_watcher(watcher, event)
  File "c:\repos\private\data-app-performance\.venv\lib\site-packages\param\parameterized.py", line 2018, in _call_watcher
    self_._execute_watcher(watcher, (event,))
  File "c:\repos\private\data-app-performance\.venv\lib\site-packages\param\parameterized.py", line 2000, in _execute_watcher
    watcher.fn(*args, **kwargs)
  File "c:\repos\private\data-app-performance\.venv\lib\site-packages\panel\reactive.py", line 253, in _param_change
    self._apply_update(events, msg, model, ref)
  File "c:\repos\private\data-app-performance\.venv\lib\site-packages\panel\reactive.py", line 207, in _apply_update
    self._update_model(events, msg, root, model, doc, comm)
  File "C:\Program Files\Python39\lib\contextlib.py", line 124, in __exit__
    next(self.gen)
  File "c:\repos\private\data-app-performance\.venv\lib\site-packages\panel\io\server.py", line 452, in unlocked
    WebSocketHandler.write_message(socket, msg.header_json)
  File "c:\repos\private\data-app-performance\.venv\lib\site-packages\tornado\websocket.py", line 337, in write_message
    raise WebSocketClosedError()
tornado.websocket.WebSocketClosedError
c:\repos\private\data-app-performance\.venv\lib\site-packages\_plotly_utils\utils.py:324: RuntimeWarning:

coroutine 'WSHandler.send_message' was never awaited

2022-01-16 18:53:26,511 Failed sending message as connection was closed
2022-01-16 18:53:28,420 Failed sending message as connection was closed
2022-01-16 18:53:30,355 Failed sending message as connection was closed
2022-01-16 18:53:32,299 Failed sending message as connection was closed
2022-01-16 18:53:34,088 Failed sending message as connection was closed
2022-01-16 18:53:34,442 WebSocket connection closed: code=None, reason=None
MarcSkovMadsen commented 2 years ago

I can see the behavior is also there if the test is just clicking "refresh" in the browser as fast as it feel possible.

https://user-images.githubusercontent.com/42288570/149675554-5a4fd35b-2cf6-4e73-93ab-bc592351a1d2.mp4

You can dig into the details of these tests here @jbednar and @philippjfr

philippjfr commented 2 years ago

Thanks @MarcSkovMadsen! I've observed something similar in production but thought it was down to that specific application. This is probably the highest priority issue on the bug tracker so I'll report back asap.

philippjfr commented 2 years ago

Okay, figured out what the issue is here. What happens is that after a session is shut down it gets cleaned up, if there's several hundred sessions being created then cleaning them all up happens in sequence. One session cleanup takes about 80 ms on my machine

2022-01-17 15:34:27,913 Deleting 1 modules for document <bokeh.document.document.Document object at 0x7fdb3fc55670>
2022-01-17 15:34:27,958 Session 'AY0W4744u58xUmaFhOK6mLfiMxtpGyL8cEyGn73LfPDH' was successfully discarded
2022-01-17 15:34:27,995 Discarding session 'eEoaDv54FqVVTjUg2G45HU0n4jRKkVuUfp5Enit6ZyRs' last in use 30073.157223000002 milliseconds ago

For 500 sessions that takes about 40 seconds during which no new requests are processed. I'm not hugely worried about this since it's outside the general scope of the usage patterns I would expect from a Panel application. Nonetheless it would be nice if we could get Bokeh to prioritize requests to process new sessions over cleaning up old sessions. I will also look into what it will take to speed up creation of a new session.

philippjfr commented 2 years ago

I did similar tests on Streamlit and Dash. They don't exhibit this behavior. Streamlit has a very different approach. It just sends the same small javascript to initiate the web socket - independent of the application.

This is indeed substantially different to Bokeh, which creates a user session immediately on the first request and sends back an HTML page containing all the required JS/CSS for the page and the JS to open the websocket. So when the websocket request arrives the session will already be warmed up. IMO although this means the initial request isn't as fast as it could be this approach is still preferable overall since the page can load resources at the same time as the websocket connection opens ensuring that those two things can happen in parallel.

philippjfr commented 2 years ago

There's a relatively simple improvement we can make in Bokeh which should speed up session cleanup by a factor of about 5-10x. Currently each time a Document is destroyed Bokeh will call gc.collect() which is quite expensive. When many sessions are cleaned up in quick succession this is particularly superfluous so instead we should only gc.collect() after all expired sessions have been discarded. For 500 sessions this will still result in a non-responsive server for 5-10 seconds but that's definitely preferable over the current 40-50 second wait.

MarcSkovMadsen commented 2 years ago

What triggers the "clean up". What I saw was that I could create 100-300 requests/ session. But then suddenly the "clean up" was triggered and everything was on hold.

Just brainstorming.

philippjfr commented 2 years ago

Good question. I think N should be some multiple of the number of users that can simultaneously interact with the app. While streamlit and dash can serve 10k sessions quickly they will face the same problem when they actually connect. The server will have to provide all the initial state to those sessions and that's where I think they'll hit similar limits. So I'd love to extend the actual test here to try to open the websocket or make the HTTP request dash makes.

There was some discussion a long time ago allowing the initial request in bokeh to be serviced without creating a session. That would close the gap for the number of requests Panel can handle in your current test.

As for what triggers the cleanup, it is triggered by a timeout. Every N seconds a background task collects all sessions without an active websocket connection and if they've reached some expiration time they are destroyed.

philippjfr commented 2 years ago

I'd say the number of users that can use an individual server and have a good experience depends a lot on the app and it's usage patterns. For panel that number probably ranges from 1-50 for apps requiring at least some Python callbacks. The new --num-threads option can hopefully increase that number a bit. Beyond that you'll need --num-procs or horizontal scaling with a reverse proxy.

So I think we should be able to create 100 simultaneous sessions in fairly quick succession and without major slowdowns during cleanup.

Being able to serve the initial request without creating a new session would also be nice and shouldn't be too difficult. It could then still warm up the session in the background so it is available as soon as the websocket request arrives.