kubeflow / notebooks

Kubeflow Notebooks lets you run web-based development environments on your Kubernetes cluster by running them inside Pods.
Apache License 2.0

Possible multi-threading issue w/Jupyter Web App (kubernetes-client/python) #114

Open sylus opened 4 years ago

sylus commented 4 years ago

/kind bug

What steps did you take and what happened:

Hello amazing kubeflow community.

Apologies in advance if I struggle to explain this issue properly but here is my attempt.

It seems that the longer the Jupyter Web App Flask application has been running, and the more frequently it is queried, the more likely subsequent API calls are to fail. This has left people unable to submit their workloads or to have certain form fields populated.

Often the API queries first go into a pending state and then fail quite a few minutes later.

We also notice env-info taking a long time, but there is a corresponding, separate issue for that, and that call always seems to return.

The following picture shows that we are unable to open the configurations dropdown, which is populated from the PodDefaults, and that two API calls are pending that will ultimately fail. (A log with the stack trace from after the failure occurs is also attached.)

Attachments: jupyter-web-app (screenshot), jupyter-web-app.txt (log)

In the stack trace we get the following messages, which come from the kubernetes-client/python library:

2020-08-11 00:02:39,869 | kubeflow_jupyter.default.app | ERROR | Exception on /api/namespaces/USERNAME/poddefaults [GET]
...
File "/usr/local/lib/python3.6/dist-packages/kubernetes/client/api/authorization_v1_api.py", line 389, in create_subject_access_review
...
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='10.0.0.1', port=443): Read timed out. (read timeout=None)
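
As a minimal sketch (assuming in-cluster config; this is not the web app's actual code), the call from the trace could be given an explicit per-call timeout so a hung read fails fast instead of pinning a Flask worker thread:

from kubernetes import client, config

config.load_incluster_config()  # assumes the app runs inside the cluster

authz = client.AuthorizationV1Api()
review = client.V1SubjectAccessReview(
    spec=client.V1SubjectAccessReviewSpec(
        user="USERNAME",  # placeholder, matching the redacted log above
        resource_attributes=client.V1ResourceAttributes(
            group="kubeflow.org",   # PodDefault API group
            resource="poddefaults",
            verb="list",
            namespace="USERNAME",
        ),
    )
)

# _request_timeout is a (connect, read) tuple in seconds; with it set, a
# stuck read raises promptly instead of blocking with "read timeout=None".
authz.create_subject_access_review(review, _request_timeout=(5, 30))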

What did you expect to happen:

All of the API calls to succeed quickly, without going into a pending state and eventually timing out.

Anything else you would like to add:

We have approximately 125 users in our Kubeflow environment and populate a PodDefault for each user. All of the API calls are fast and work for roughly an hour (sometimes less) after we restart the Jupyter Web App:

kubectl rollout restart deploy -n kubeflow jupyter-web-app-deployment

It should also be noted that for the first two months of operating our Kubeflow cluster we didn't seem to experience any of these issues, which leads us to suspect this might be a scaling issue, though we absolutely don't rule out something we may have done ourselves.

As a further debugging step we have made some slight adjustments to the Jupyter Web App (Flask) application, such as running it behind gunicorn (with minor adjustments in our fork), but the problem still arises (though with a noticeable improvement in latency).

https://github.com/StatCan/kubeflow/commit/d2a2f5fa33812ea524a5754a2c0226057549573f
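
The gunicorn settings we are experimenting with look roughly like the following sketch (illustrative values, not necessarily what the commit above uses):

# gunicorn.conf.py — illustrative sketch only
bind = "0.0.0.0:5000"  # assumed port; the app's real bind address may differ
workers = 3            # separate processes, so one stuck call can't stall everything
threads = 2            # threads per worker for I/O-bound Kubernetes calls
timeout = 60           # recycle a worker stuck on a hung read instead of letting requests pile up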

Finally, we have started rewriting all of the Jupyter API requests in a Go backend to see whether we get the same problems. So far we have no issues whatsoever with it. We also set the Go backend up to consume a stream from the API server rather than making API calls all the time, since it keeps a local cache of everything; the only API call it makes per request is the one checking your authentication. (A sketch of this pattern follows the link below.)

https://github.com/StatCan/jupyter-apis (Go Backend)
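
For illustration only, the stream-plus-cache pattern described above would look roughly like this in Python (the actual backend is Go using client-go; this sketch comes from neither repo):

from kubernetes import client, config, watch

config.load_incluster_config()
api = client.CustomObjectsApi()

cache = {}  # name -> PodDefault object, kept current by the event stream
w = watch.Watch()
for event in w.stream(
    api.list_namespaced_custom_object,
    group="kubeflow.org",  # PodDefault CRD group
    version="v1alpha1",    # assumed CRD version
    namespace="USERNAME",  # placeholder namespace
    plural="poddefaults",
):
    obj = event["object"]
    name = obj["metadata"]["name"]
    if event["type"] == "DELETED":
        cache.pop(name, None)
    else:  # ADDED / MODIFIED
        cache[name] = obj
    # request handlers read from `cache` instead of querying the API server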

Environment:

issue-label-bot[bot] commented 4 years ago

Issue-Label Bot is automatically applying the labels:

Label: area/jupyter (probability 0.89)


issue-label-bot[bot] commented 4 years ago

Issue Label Bot is not confident enough to auto-label this issue. See dashboard for more details.

DanielSCon40 commented 4 years ago

We experienced this behavior as well, with a smaller number of profiles (<15).

kimwnasptd commented 4 years ago

Thanks for the detailed report @sylus!

We should indeed move to gunicorn for serving the app instead of the Flask server.

What about increasing the number of replicas the jupyter web app has?
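
For example, assuming the standard deployment name, something like:

kubectl scale deploy -n kubeflow jupyter-web-app-deployment --replicas=3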

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

sylus commented 3 years ago

/remove-lifecycle stale

yanniszark commented 3 years ago

/lifecycle frozen

davidspek commented 3 years ago

@kimwnasptd Would it be worth looking into moving the notebooks API to the Go version created over at https://github.com/StatCan/jupyter-apis after the 1.3 release, once the testing is hopefully all up and running?

kimwnasptd commented 1 year ago

Hey everyone, I've made some progress on this and have a PR that implements caching in the backend: https://github.com/kubeflow/kubeflow/pull/7080. This should initially help with the load on the backend, since it will no longer need to perform a request to K8s for every call and will have its own cache.

A next step will be to extend the frontends to keep polling, but with proper pagination against the backend.

For example, now that the backend always has the full list of objects in memory, it can answer page requests such as: give me the 3rd page, where each page has 20 items.
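
A minimal sketch of that paging logic over the in-memory list (illustrative only, not the code in the PR):

def get_page(items, page, page_size=20):
    # 1-indexed paging: page=3 with page_size=20 returns items 41..60
    start = (page - 1) * page_size
    return items[start:start + page_size]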

andreyvelich commented 2 weeks ago

/transfer notebooks