Open sylus opened 4 years ago
Issue-Label Bot is automatically applying the labels:
Label | Probability |
---|---|
area/jupyter | 0.89 |
Issue Label Bot is not confident enough to auto-label this issue. See dashboard for more details.
We experienced this behavior also with a smaller number (<15) of profiles.
Thanks for the detailed report @sylus!
We should indeed move to `gunicorn` for serving the app instead of the Flask server.
What about increasing the number of replicas the jupyter web app has?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/remove-lifecycle stale
/lifecycle frozen
@kimwnasptd Is it an idea to look into possibly moving the notebooks API to the Go version created over at https://github.com/StatCan/jupyter-apis after release 1.3 when the testing is hopefully all running?
Hey everyone, I've made some progress on this and have a PR that implements caching in the backends: https://github.com/kubeflow/kubeflow/pull/7080. This should initially help with the load on the backend, since it will no longer need to perform requests to K8s and will have its own cache.
The next step will be to extend the frontends so that they keep polling, but with proper pagination against the backend.
For example, now that the backend always has the full list of objects in memory, it can answer page requests such as: I want the 3rd page, where each page has 20 items.
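As a rough illustration of the page request described above, here is a minimal sketch of slicing an in-memory list into pages. The `Page` type, the `paginate` helper, and the `cached_notebooks` list are hypothetical names made up for this example; they are not taken from the PR.

```python
from dataclasses import dataclass
from math import ceil
from typing import Generic, List, Sequence, TypeVar

T = TypeVar("T")


@dataclass
class Page(Generic[T]):
    """One page of results plus the metadata a frontend needs to render a pager."""
    items: List[T]
    page: int
    page_size: int
    total_items: int
    total_pages: int


def paginate(items: Sequence[T], page: int, page_size: int) -> Page[T]:
    """Return the 1-indexed `page` of `items`, with `page_size` items per page."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be >= 1")
    total = len(items)
    start = (page - 1) * page_size
    return Page(
        items=list(items[start:start + page_size]),
        page=page,
        page_size=page_size,
        total_items=total,
        total_pages=ceil(total / page_size),
    )


# Hypothetical stand-in for the backend's cached objects.
cached_notebooks = [f"notebook-{i}" for i in range(1, 51)]

# "I want the 3rd page where each page has 20 items."
third_page = paginate(cached_notebooks, page=3, page_size=20)
```

Because the slice is taken from memory, answering a page request like this would not require any call to the K8s API server. A page past the end simply comes back with an empty `items` list.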
/transfer notebooks
/kind bug
What steps did you take and what happened:
Hello amazing kubeflow community.
Apologies in advance if I struggle to explain this issue properly but here is my attempt.
It seems that the longer the Jupyter Web App Flask application runs, and the more frequently it is queried, the more likely subsequent API calls are to fail. As a result, people have been unable to submit their workloads or have certain fields populated.
The API queries that often go into pending first and then fail several minutes later are:
We do notice `env-info` also taking a long time, but there is a corresponding, separate issue for that, and this call always seems to return.
The following picture shows that we are unable to open the configurations drop-down, which is populated from the PodDefaults, and that two API calls are pending that will ultimately fail. (A subsequent log with the stack trace from when the failure happens is also attached.)
jupyter-web-app.txt
In the stack trace we get the following messages, which come from the `kubernetes-client/python` Python library.
What did you expect to happen:
All of the API calls should succeed quickly rather than sit in pending and eventually time out.
Anything else you would like to add:
We have approximately 125 users in our kubeflow environment and do populate a PodDefault for each user. All of the API calls are fast and work for roughly an hour (sometimes less) whenever we restart the Jupyter Web App.
It should also be noted that for the first 2 months of operations of our kubeflow cluster we didn't seem to experience any of these issues which leads us to assume this might be a scaling issue? Though we absolutely don't rule out something we may have done.
As a further debugging step, we made some slight adjustments to the Jupyter Web App (Flask) application, such as running it behind gunicorn (with minor adjustments in our fork), but the problem still seems to arise (though with a noticeable improvement in latency).
https://github.com/StatCan/kubeflow/commit/d2a2f5fa33812ea524a5754a2c0226057549573f
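For readers unfamiliar with that setup, a gunicorn configuration file is plain Python. The sketch below is only an illustration of the kind of settings involved, not the contents of the commit above; the port and every value are assumptions chosen for the example.

```python
# gunicorn.conf.py
# Illustrative values only; none of these are taken from the linked commit.
import multiprocessing

# Address and port to listen on (the port is an assumption).
bind = "0.0.0.0:5000"

# Several worker processes, so one slow request cannot block the whole app.
workers = multiprocessing.cpu_count() * 2 + 1

# Threaded workers let each process serve several requests concurrently.
worker_class = "gthread"
threads = 4

# Seconds before an unresponsive worker is killed and restarted.
timeout = 60
```

Such a file would be passed with `gunicorn -c gunicorn.conf.py main:app`, where `main:app` is a hypothetical module and Flask app name.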
Finally, we have started to rewrite all of the Jupyter API requests in a Go backend to see whether we get the same problems. So far we have had no issues whatsoever with the Go backend. We also set it up so that it uses a stream from the API server rather than making API calls all the time, since it keeps a local cache of everything. The only API call it makes on requests is checking your authentication.
https://github.com/StatCan/jupyter-apis (Go Backend)
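The stream-plus-local-cache idea is not specific to Go. Below is a minimal Python sketch of a cache kept up to date from watch-style events. The `WatchCache` class and the sample events are hypothetical and simplified; this is not a port of the Go backend. It assumes each event is a dict with a `type` (`ADDED`, `MODIFIED` or `DELETED`) and an `object` carrying `metadata.namespace` and `metadata.name`.

```python
import threading
from typing import Dict, Iterable, List, Tuple


class WatchCache:
    """In-memory view of cluster objects, updated from a stream of watch events."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._objects: Dict[Tuple[str, str], dict] = {}

    def apply(self, event: dict) -> None:
        """Apply one ADDED / MODIFIED / DELETED event to the cache."""
        obj = event["object"]
        meta = obj["metadata"]
        key = (meta["namespace"], meta["name"])
        with self._lock:
            if event["type"] == "DELETED":
                self._objects.pop(key, None)
            elif event["type"] in ("ADDED", "MODIFIED"):
                self._objects[key] = obj

    def list_namespace(self, namespace: str) -> List[dict]:
        """Answer a list request from memory, without calling the API server."""
        with self._lock:
            return [o for (ns, _), o in self._objects.items() if ns == namespace]


def consume(cache: WatchCache, events: Iterable[dict]) -> None:
    """Feed events into the cache. In a real deployment the events might come
    from a long-lived watch on the API server (an assumption of this sketch)."""
    for event in events:
        cache.apply(event)


def _pod_default(namespace: str, name: str, desc: str) -> dict:
    # Hypothetical, heavily trimmed PodDefault-like object for the demo.
    return {"metadata": {"namespace": namespace, "name": name}, "spec": {"desc": desc}}


cache = WatchCache()
consume(cache, [
    {"type": "ADDED", "object": _pod_default("alice", "pd-1", "v1")},
    {"type": "ADDED", "object": _pod_default("bob", "pd-2", "v1")},
    {"type": "MODIFIED", "object": _pod_default("alice", "pd-1", "v2")},
    {"type": "DELETED", "object": _pod_default("bob", "pd-2", "v1")},
])
```

With this design, request handlers read only from `cache.list_namespace(...)`, so the number of calls to the API server no longer grows with the number of users polling the UI.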
Environment:
kfctl version (use `kfctl version`): 1.0.1
Kubernetes platform (e.g. `minikube`): AKS
Kubernetes version (use `kubectl version`): 1.15.10
OS (e.g. from `/etc/os-release`): 16.04