@iameskild @rsignell-usgs
What does the ESIPFed cluster use for the dask-gateway pod? Is it the same as default? Have they seen this issue before?
Thanks for the heads-up on this issue, @dharhas. We have not seen this issue on the ESIP Nebari deployment, but we also haven't had 50 people all try to launch a cluster at the same time. I thought I remembered someone (the Berkeley Jupyter team?) testing with ~1000 users, all with Dask clusters on Dask Gateway, though. Perhaps I'm mistaken?
Does this ring a bell, @yuvipanda?
The configuration for the Nebari deployment for ESIP is:
node_groups:
  general:
    instance: m5.2xlarge
    min_nodes: 1
    max_nodes: 1
  user:
    instance: m5.2xlarge
    min_nodes: 1
    max_nodes: 100
  worker:
    instance: m5.2xlarge
    min_nodes: 1
    max_nodes: 450
I'm pretty sure Dask Gateway can handle the load. I assume we have a bad configuration/undersized pod or similar. @iameskild said that the pod kept rebooting.
Also, as an FYI, that wasn't the only error message. At various times, retrieving options, getting a client, etc. all failed; i.e., the API was either erroring out or unresponsive as folks were hitting it.
The `worker` node group is only for the dask-worker and dask-scheduler, and our deployment uses roughly the same instance type (though on GCP) as the one @rsignell-usgs shared above.
I think we should try to recreate the issue and then see what happens when we increase the dask-gateway resource limits. The default dask-gateway pod resources:
Wow, that seems pretty tiny.
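For reference, what we would be aiming for on the gateway pod is a more generous Kubernetes resources block along these lines. The values below are made up purely for illustration (they are not the actual Nebari defaults), and the exact place to set this in Nebari's Helm/Terraform templates still needs to be confirmed:

```yaml
# Illustrative only -- not the actual defaults, just the shape of the
# override we would experiment with on the dask-gateway pod.
resources:
  requests:
    cpu: "1"
    memory: 2Gi
  limits:
    cpu: "2"
    memory: 4Gi
```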
My initial thoughts as to why this is happening: dask-gateway should not require significant resources. All dask-gateway is responsible for:
Currently, the "check available conda environments" step crawls through the filesystem to find the available environments. We should be using the conda-store server API to get the available environments instead, which would reduce the load on the server.
An example of this can be seen in the cdsdashboards code https://github.com/nebari-dev/nebari/blob/develop/nebari/template/stages/07-kubernetes-services/modules/kubernetes/services/jupyterhub/files/jupyterhub/02-spawner.py#L55
In that sense, we would be updating the inner logic of Dask-Gateway to retrieve the environments via requests to the conda-store API. What would be the endpoint? Is it public?
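As a rough sketch of the direction, the lookup could become a single HTTP request instead of a filesystem crawl. This assumes conda-store exposes a `GET /api/v1/environment/` listing endpoint with token-based auth; the exact in-cluster URL, auth scheme, and response shape are placeholders that need to be confirmed against the conda-store API docs:

```python
import requests

# Hypothetical in-cluster service URL and API token -- placeholders to confirm.
CONDA_STORE_URL = "http://conda-store-server:5000"
CONDA_STORE_TOKEN = "..."


def list_conda_environments():
    """List environments via the conda-store REST API instead of crawling the filesystem."""
    resp = requests.get(
        f"{CONDA_STORE_URL}/api/v1/environment/",
        # Auth scheme is an assumption; verify against the conda-store docs.
        headers={"Authorization": f"token {CONDA_STORE_TOKEN}"},
        timeout=10,
    )
    resp.raise_for_status()
    # Assumed response shape: {"data": [{"name": ..., "namespace": {"name": ...}}, ...]}
    return [
        f"{env['namespace']['name']}/{env['name']}"
        for env in resp.json().get("data", [])
    ]
```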
Just to note, this is a super high priority, and we need this by 10th May.
During the recent PyCon Nebari tutorial, we had 30+ people trying to connect to the Dask-Gateway cluster at the same time. Some users were able to connect and others were not. Those who were not able to connect ran into the following error message:
Based on the CPU/memory limits the Dask-Gateway pod has by default, this might be a resource limitation. Another possibility is that the Dask-Gateway API was simply overwhelmed. We need to investigate further to isolate the root cause of this issue.
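One way to help isolate it is a small script against the Kubernetes API that dumps the gateway pod's restart counts and configured resources while we rerun the load test. This is only a sketch: it assumes the `kubernetes` Python client, a `dev` namespace, and an `app.kubernetes.io/name=dask-gateway` label, all of which should be adjusted to match the actual deployment:

```python
from kubernetes import client, config

# Load credentials from the local kubeconfig
# (use config.load_incluster_config() when running inside the cluster).
config.load_kube_config()
v1 = client.CoreV1Api()

# Namespace and label selector are assumptions -- adjust to the deployment.
pods = v1.list_namespaced_pod(
    namespace="dev",
    label_selector="app.kubernetes.io/name=dask-gateway",
)

for pod in pods.items:
    # Restart counts tell us whether the pod is being OOM-killed or crashing under load.
    for status in pod.status.container_statuses or []:
        print(f"{pod.metadata.name}/{status.name}: restarts={status.restart_count}")
    # Configured requests/limits tell us what the pod is actually allowed to use.
    for container in pod.spec.containers:
        res = container.resources
        print(
            f"{pod.metadata.name}/{container.name}: "
            f"requests={res.requests} limits={res.limits}"
        )
```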