@iameskild @rsignell-usgs
What does the ESIPFed cluster use for the dask-gateway pod? Is it the same as default? Have they seen this issue before?
Thanks for the heads-up on this issue, @dharhas. We have not seen this issue on the ESIP Nebari deployment, but we also haven't had 50 people all try to launch a cluster at the same time. I thought I remembered someone (the Berkeley Jupyter team?) testing with ~1000 users, all with Dask clusters on Dask Gateway, though. Perhaps I'm mistaken?
Does this ring a bell, @yuvipanda?
The configuration for the Nebari deployment for ESIP is:
node_groups:
  general:
    instance: m5.2xlarge
    min_nodes: 1
    max_nodes: 1
  user:
    instance: m5.2xlarge
    min_nodes: 1
    max_nodes: 100
  worker:
    instance: m5.2xlarge
    min_nodes: 1
    max_nodes: 450
I'm pretty sure Dask Gateway can handle the load. I assume we have a bad configuration/undersized pod or similar. @iameskild said that the pod kept rebooting.
Also, as an FYI, that wasn't the only error message. At various times, retrieving options, getting a client, etc. all failed; i.e., the API was either erroring out or unresponsive as folks were hitting it.
The `worker` node group is only for the dask-worker and dask-scheduler, and our deployment uses roughly the same instance type (though on GCP) as the one @rsignell-usgs shared above.
I think we should try to recreate the issue and then see what happens when we increase the dask-gateway resource limits. The default dask-gateway pod resources:
Wow, that seems pretty tiny.
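For reference, what we would be aiming for on the gateway pod is a more generous Kubernetes resources block along these lines. The values below are made up purely for illustration (they are not the actual Nebari defaults), and the exact place to set this in Nebari's Helm/Terraform templates still needs to be confirmed:

```yaml
# Illustrative only -- not the actual defaults, just the shape of the
# override we would experiment with on the dask-gateway pod.
resources:
  requests:
    cpu: "1"
    memory: 2Gi
  limits:
    cpu: "2"
    memory: 4Gi
```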
My initial thoughts as to why this is happening: dask-gateway should not require significant resources. All dask-gateway is responsible for:
Currently, the "check available conda environments" step crawls through the filesystem to find the available environments. We should be using the conda-store server API to get the available environments instead, which would reduce the load on the server.
An example of this can be seen in the cdsdashboards code https://github.com/nebari-dev/nebari/blob/develop/nebari/template/stages/07-kubernetes-services/modules/kubernetes/services/jupyterhub/files/jupyterhub/02-spawner.py#L55
In that sense, we would be updating the inner logic of Dask-Gateway to retrieve the environments via requests to the conda-store API. What would be the endpoint? Is it public?
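As a rough sketch of the direction, the lookup could become a single HTTP request instead of a filesystem crawl. This assumes conda-store exposes a `GET /api/v1/environment/` listing endpoint with token-based auth; the exact in-cluster URL, auth scheme, and response shape are placeholders that need to be confirmed against the conda-store API docs:

```python
import requests

# Hypothetical in-cluster service URL and API token -- placeholders to confirm.
CONDA_STORE_URL = "http://conda-store-server:5000"
CONDA_STORE_TOKEN = "..."


def list_conda_environments():
    """List environments via the conda-store REST API instead of crawling the filesystem."""
    resp = requests.get(
        f"{CONDA_STORE_URL}/api/v1/environment/",
        # Auth scheme is an assumption; verify against the conda-store docs.
        headers={"Authorization": f"token {CONDA_STORE_TOKEN}"},
        timeout=10,
    )
    resp.raise_for_status()
    # Assumed response shape: {"data": [{"name": ..., "namespace": {"name": ...}}, ...]}
    return [
        f"{env['namespace']['name']}/{env['name']}"
        for env in resp.json().get("data", [])
    ]
```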
Just to note, this is a super high priority, and we need this by 10th May.
During the recent PyCon Nebari tutorial, we had 30+ people trying to connect to the Dask-Gateway cluster at the same time. Some users were able to connect and others were not. Those who were not able to connect ran into the following error message:
Based on the CPU/memory limits the Dask-Gateway pod has by default, this might be a resource limitation. Another possibility is that the Dask-Gateway API was simply overwhelmed. We need to investigate further to isolate the root cause of this issue.
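One way to help isolate it is a small script against the Kubernetes API that dumps the gateway pod's restart counts and configured resources while we rerun the load test. This is only a sketch: it assumes the `kubernetes` Python client, a `dev` namespace, and an `app.kubernetes.io/name=dask-gateway` label, all of which should be adjusted to match the actual deployment:

```python
from kubernetes import client, config

# Load credentials from the local kubeconfig
# (use config.load_incluster_config() when running inside the cluster).
config.load_kube_config()
v1 = client.CoreV1Api()

# Namespace and label selector are assumptions -- adjust to the deployment.
pods = v1.list_namespaced_pod(
    namespace="dev",
    label_selector="app.kubernetes.io/name=dask-gateway",
)

for pod in pods.items:
    # Restart counts tell us whether the pod is being OOM-killed or crashing under load.
    for status in pod.status.container_statuses or []:
        print(f"{pod.metadata.name}/{status.name}: restarts={status.restart_count}")
    # Configured requests/limits tell us what the pod is actually allowed to use.
    for container in pod.spec.containers:
        res = container.resources
        print(
            f"{pod.metadata.name}/{container.name}: "
            f"requests={res.requests} limits={res.limits}"
        )
```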