jupyterhub / helm-chart

A store of Helm chart tarballs for deploying JupyterHub and BinderHub on a Kubernetes cluster
https://jupyterhub.github.io/helm-chart/

Stress test 5000 concurrent users #46

Closed (yuvipanda closed this 6 years ago)

yuvipanda commented 7 years ago

I want to be able to simulate 5000 concurrent users, doing some amount of notebook activity (cell execution) and getting results back in some reasonable time frame.

This will require that we test & tune:

  1. JupyterHub
  2. Proxy implementation
  3. Spawner implementation
  4. Kubernetes itself

Ideally, we want Kubernetes itself to be our bottleneck, with our components introducing no additional delays.

This issue will document and track efforts to achieve this, as well as to define what 'this' is.

yuvipanda commented 7 years ago

So far, we've:

  1. Switched to using pycurl inside the hub, to make communication with the proxy service easier
  2. Made polls orders of magnitude faster in kubespawner with https://github.com/jupyterhub/kubespawner/pull/63
  3. Started the process of making the connection pool's maxsize configurable in the official kubernetes client: https://github.com/kubernetes-client/python-base/pull/18
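
As a side note, the pycurl switch in (1) boils down to tornado configuration along these lines (a minimal sketch; the max_clients value is just an illustration, not what we actually run):

    # Use the libcurl-based HTTP client and allow more simultaneous requests.
    from tornado.httpclient import AsyncHTTPClient

    AsyncHTTPClient.configure(
        "tornado.curl_httpclient.CurlAsyncHTTPClient",
        max_clients=64,  # illustrative; tornado's default is 10
    )
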
yuvipanda commented 7 years ago

A big problem we ran into was that we were constantly hit by redirect loops. @minrk and I dug through this (yay teamwork? I made some comments and went to sleep, and he had solved it by the time I woke up!), and https://github.com/jupyterhub/jupyterhub/pull/1221 seems to get rid of them all!!!

When I ran a 2000 user test, about 150 failed to start. Some of them were because of this curl error:

    [E 2017-07-14 20:09:46.147 JupyterHub ioloop:638] Exception in callback functools.partial(<function wrap.<locals>.null_wrapper at 0x7f8cfbdc1ea0>, <tornado.concurrent.Future object at 0x7f8c30a87048>)
    Traceback (most recent call last):
      File "/usr/local/lib/python3.5/dist-packages/tornado/ioloop.py", line 605, in _run_callback
        ret = callback()
      File "/usr/local/lib/python3.5/dist-packages/tornado/stack_context.py", line 277, in null_wrapper
        return fn(*args, **kwargs)
      File "/usr/local/lib/python3.5/dist-packages/tornado/ioloop.py", line 626, in _discard_future_result
        future.result()
      File "/usr/local/lib/python3.5/dist-packages/jupyterhub/app.py", line 1382, in update_last_activity
        routes = yield self.proxy.get_all_routes()
      File "/usr/local/lib/python3.5/dist-packages/jupyterhub/proxy.py", line 556, in get_all_routes
        resp = yield self.api_request('', client=client)
    tornado.curl_httpclient.CurlError: HTTP 599: Operation timed out after 20553 milliseconds with 163693 bytes received

This could partially be fixed by tuning pycurl. However, it failed a 20s timeout, which is pretty generous. I think the proxy is the next bottleneck to solve, and a k8s ingress-based Proxy implementation would help here.

The spawn times increased linearly over the course of the test, eventually going over the 300s timeout we have for image starts. We should take a profiler to the hub process (not sure how to profile tornado code, though) and figure out what it is that takes a linearly increasing amount of time.

yuvipanda commented 7 years ago

Ran another 5000-user test, and I'm now convinced that the next step is a proxy implementation that does not need to make a network request for each get_all_routes() call.
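
Roughly, what I have in mind is a Proxy class whose get_all_routes() answers from a locally maintained route table. A sketch (the method names come from jupyterhub.proxy.Proxy; everything else here is hypothetical):

    # Hypothetical sketch: keep routes in memory (e.g. maintained alongside the
    # k8s Ingress objects) so get_all_routes() never makes a network round-trip.
    from jupyterhub.proxy import Proxy

    class CachedRouteProxy(Proxy):
        def __init__(self, **kwargs):
            super().__init__(**kwargs)
            self._routes = {}  # routespec -> {'routespec', 'target', 'data'}

        async def add_route(self, routespec, target, data):
            # a real implementation would also create/update the Ingress object
            self._routes[routespec] = {
                'routespec': routespec, 'target': target, 'data': data,
            }

        async def delete_route(self, routespec):
            # a real implementation would also delete the Ingress object
            self._routes.pop(routespec, None)

        async def get_all_routes(self):
            # no network request: answer straight from the local cache
            return dict(self._routes)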

minrk commented 7 years ago

I suspect that the slowness of get_all_routes may be a bit misleading, because lots of actions share the same IOLoop (and even the same httpclient) while waiting for Spawners to start, etc. My guess is that it's all tied up in the busyness of the Hub IOLoop. Plus, get_all_routes only happens once every five minutes, so it seems a bit odd for it to be a major contributing factor.

When I run get_all_routes on its own when things are otherwise idle, it takes 50ms to fetch 20k routes, so it's unlikely that the fetch is actually what's taking that time. My guess is that it's something like:

  1. start timer
  2. initiate fetch
  3. yield to other tasks
  4. check timer (all those other tasks ate my 20 seconds!)

Under load, presumably CHP would take longer to respond to the request, but 20s seems like a lot compared to 50ms.
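
A toy illustration of that effect (plain asyncio, not the Hub's actual code): the "fetch" only needs ~50ms, but the wall-clock timer around it also counts everything else that runs while the coroutine is suspended:

    # The fetch itself takes 50ms, but the timer reports ~2s because the loop
    # was busy with other (blocking) work while the fetch coroutine was parked.
    import asyncio
    import time

    async def fake_fetch():
        await asyncio.sleep(0.05)   # stands in for the 50ms route fetch

    async def busy_work():
        time.sleep(2)               # stands in for other callbacks hogging the loop

    async def main():
        loop_hog = asyncio.ensure_future(busy_work())
        start = time.perf_counter()
        await fake_fetch()          # the timer only wraps the fetch...
        print("measured fetch time: %.2fs" % (time.perf_counter() - start))
        await loop_hog

    asyncio.get_event_loop().run_until_complete(main())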

I think it's important to separate what we're testing as concurrent:

  1. concurrent spawns
  2. concurrent active users (talking only to their server, not the Hub)

In particular, JupyterHub is based on the idea that login/spawn is a relatively infrequent event compared to actually talking to their server, so while stress testing all users concurrently spawning is useful as a worst-case scenario, it doesn't give an answer to the question "how many users can I support" because it is not a realistic load of concurrent users.

I'd love a load test that separated the number of people trying to spawn from the number of people concurrently trying to use their running notebook servers (opening & saving, running code, etc.), e.g. being able to ask "how many concurrent spawns can we handle?" and "how many concurrently active users can we support?" independently.

yuvipanda commented 7 years ago

+1 on load testing a matrix of factors!

We should also identify what 'performance' means. I think it's:

  1. Total time to get from login to a running notebook in the browser (+ failure counts)
  2. Time taken to start a kernel and execute a reference notebook (measured repeatedly, in a loop, over the lifetime of the test)

Anything else, @minrk? If we nail these down as the things we want to measure, then we can form the test matrix properly...

willingc commented 7 years ago

This is awesome @yuvipanda

yuvipanda commented 7 years ago

I think the ideal way for us to stress test is to define a function that draws a 'current active users' graph: spikes will get more users logging in, slumps will have users dropping out, and steady states will do nothing. This will let us generate varying kinds of load just by tweaking the shape of this graph (which you can do interactively).
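
Something like this shape, maybe (all names here are hypothetical, just to make the idea concrete):

    # Hypothetical sketch: drive the number of simulated active users so it
    # tracks an arbitrary target curve over time.
    import math
    import time

    def target_active_users(t):
        # the "graph": ramp up to 2000 users with a slow wobble on top
        return int(min(t * 10, 2000) + 200 * math.sin(t / 60))

    def run_load(duration, start_user, stop_user, poll_interval=5):
        """start_user()/stop_user() log a simulated user in/out."""
        active = []
        t0 = time.time()
        while time.time() - t0 < duration:
            want = max(target_active_users(time.time() - t0), 0)
            while len(active) < want:
                active.append(start_user())   # spike: more users logging in
            while len(active) > want:
                stop_user(active.pop())       # slump: users dropping out
            time.sleep(poll_interval)         # steady state: do nothing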

minrk commented 7 years ago

That looks like a good starting point.

Some 'performance' metrics:

Hub:

  1. login + spawn time
  2. HTTP roundtrip (GET /hub/home)
  3. API roundtrip (GET /hub/api/users)

Singleuser (spawned and ready to go):

  1. full-go (open notebook, start kernel, run all, save)
  2. websocket roundtrip (single message)
  3. HTTP roundtrip (GET /api/contents)
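
For the Hub HTTP/API roundtrips, the measurement itself can be as simple as this sketch (HUB_URL and API_TOKEN are placeholders):

    # Time a couple of Hub roundtrips; HUB_URL and API_TOKEN are placeholders.
    import time
    import requests

    HUB_URL = "http://hub.example.org"
    API_TOKEN = "..."

    def timed_get(url, **kwargs):
        start = time.perf_counter()
        resp = requests.get(url, **kwargs)
        resp.raise_for_status()
        return time.perf_counter() - start

    print("GET /hub/home:      %.3fs" % timed_get(HUB_URL + "/hub/home"))
    print("GET /hub/api/users: %.3fs" % timed_get(
        HUB_URL + "/hub/api/users",
        headers={"Authorization": "token " + API_TOKEN},
    ))
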
yuvipanda commented 7 years ago

So I did more stuff around this in the last couple of days.

  1. I switched to a kubernetes ingress provider for proxying. This is great, but still a bit racy. Hopefully we'll tackle all of that.
  2. I did a performance profile (using pyflame), and it looks like we're spending a good chunk of CPU on auth-related methods. There are ways to optimize there.
  3. I'm currently trying out pypy to see if it gives us a boost!
yuvipanda commented 7 years ago

I also filed https://github.com/jupyterhub/jupyterhub/issues/1253 - that's the most CPU-intensive hotspot we've found. Plus, there are lots of well-tested libraries for this that we should use.

yuvipanda commented 7 years ago

Filed https://github.com/kubernetes-client/python-base/issues/23 for issues with kubernetes api-client.

Update: It's most likely a PyPy bug: https://bitbucket.org/pypy/pypy/issues/2615/this-looks-like-a-thread-safety-bug

yuvipanda commented 7 years ago

We can offload TLS to a kubectl proxy instance running as a sidecar container, providing an HTTP endpoint bound to localhost. I've filed https://github.com/kubernetes-incubator/client-python/issues/303 to figure out how to do this.
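
On the hub side, that would look roughly like this (assuming kubectl proxy's default port 8001; the sidecar wiring itself is what the issue above is about):

    # Sketch: point the official kubernetes client at a plain-HTTP kubectl proxy
    # sidecar on localhost instead of doing TLS itself.
    from kubernetes import client

    configuration = client.Configuration()
    configuration.host = "http://localhost:8001"  # kubectl proxy's default port
    api = client.CoreV1Api(client.ApiClient(configuration))

    # e.g. the same kind of call kubespawner makes to poll pods
    pods = api.list_namespaced_pod(namespace="jupyterhub")  # namespace is illustrative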

yuvipanda commented 7 years ago

@minrk and I tracked down the SQLAlchemy issues to https://github.com/jupyterhub/jupyterhub/pull/1257 - and it has improved performance considerably!

hashlib is the next step!

yuvipanda commented 7 years ago

More notes:

  1. We have lots of scaling work to do that depends on the number of pending spawns.
  2. We're now using threads that do IO, and our main thread depends on dicts updated by those threads. We should figure out how to profile GIL contention.
  3. Try to reduce the amount of busy looping that we do in the proxy & spawner, and implement exponential backoff instead of 'gen.sleep(1)'.
  4. Figure out if we really can stop salting our hashes when they come from UUIDs - or at least stop salting the specific hashes that come from UUIDs (see the sketch after this list). The prefix-based scheme might be part of our problem.
  5. Move all sqlalchemy operations to a single external thread (not a threadpool probably)
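
To make point 4 concrete, the thinking is roughly this (a sketch, not what JupyterHub currently does): tokens we mint ourselves from uuid4 already carry plenty of entropy, so a single unsalted hash is cheap to compute and still safe to store:

    # Sketch (not the current JupyterHub scheme): uuid4-derived tokens have
    # ~122 bits of entropy, so an unsalted single-round sha256 is enough for
    # storage/lookup and far cheaper than a salted, stretched hash.
    import hashlib
    import uuid

    def new_token():
        return uuid.uuid4().hex

    def hash_token(token):
        return hashlib.sha256(token.encode('utf8')).hexdigest()

    token = new_token()
    stored = hash_token(token)            # store only the hash
    assert hash_token(token) == stored    # lookup is a single cheap hash
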
minrk commented 7 years ago

Move all sqlalchemy operations to a single external thread (not a threadpool probably)

This is probably the hardest one, because we have to make sure that we bundle db operations within a single synchronous function that we pass to a thread, whereas right now we only have to ensure that we commit in between yields.
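
A sketch of the shape that would take (plain asyncio/concurrent.futures here; the helper and the User model are placeholders, not JupyterHub's actual objects):

    # Sketch: funnel all SQLAlchemy work through one dedicated thread. Each
    # logical operation is bundled into a single synchronous function
    # (query + mutate + commit) and submitted to a one-worker executor.
    import asyncio
    from concurrent.futures import ThreadPoolExecutor

    db_executor = ThreadPoolExecutor(max_workers=1)  # effectively one db thread

    async def run_db(func, *args):
        """Run one bundled, synchronous db operation on the db thread."""
        loop = asyncio.get_event_loop()
        return await loop.run_in_executor(db_executor, func, *args)

    # usage: the whole read-modify-commit happens in one synchronous function,
    # instead of committing between yields on the main loop
    def _touch_user(session, name, when):
        user = session.query(User).filter_by(name=name).one()  # User: placeholder model
        user.last_activity = when
        session.commit()

    # await run_db(_touch_user, session, 'alice', now)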

yuvipanda commented 7 years ago

Additional point: make sure that each tornado tick isn't doing too much work. We want the work to be spread out as evenly as possible, so we don't get ticks that are too long followed by ticks that are too short. I'm implementing some jitter in the exponential backoff algorithm for this now...
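
Roughly like this (a generic sketch, not the actual kubespawner/jupyterhub code):

    # Generic sketch: exponential backoff with full jitter, as a replacement
    # for polling with a fixed gen.sleep(1).
    import random
    from tornado import gen

    @gen.coroutine
    def wait_for(check, start_wait=0.2, max_wait=10, scale=2):
        """Poll check() until it returns True, backing off with jitter."""
        wait = start_wait
        while not check():
            # randomize each wait so many pending spawns don't all wake up
            # in the same ioloop tick
            yield gen.sleep(random.uniform(0, wait))
            wait = min(wait * scale, max_wait)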

yuvipanda commented 6 years ago

We fixed a ton of stuff doing this! YAY!

nnbaokhang commented 4 years ago

Sorry, but how do we run the stress test? I don't think I see any tutorial yet. Correct me if I'm wrong.