nebari-dev / nebari

🪴 Nebari - your open source data science platform
https://nebari.dev
BSD 3-Clause "New" or "Revised" License

[BUG] - Nodes don't scale down on GKE and AKS #2507

Open Adam-D-Lewis opened 4 months ago

Adam-D-Lewis commented 4 months ago

Describe the bug

I noticed that GKE won't autoscale all nodes down to 0 in some cases. I saw that the nodeSelector on the metrics-server deployment and the event-exporter-gke replicaset only has

nodeSelector:
  kubernetes.io/os: linux

meaning those pods can be scheduled on any node, which prevents those nodes from scaling down.

Options to fix this might be:

  1. Disable metrics collection - https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/container_cluster#enable_components
  2. Set a taint on user and worker nodes (and any custom node groups created) to force the metrics-server pod to run on the general node group (see the sketch after this list)
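
As a rough sketch of option 2, the taint would live on the nodes in the user/worker node groups, so that pods like metrics-server, which only set the kubernetes.io/os: linux nodeSelector and carry no matching toleration, could no longer land there. The dedicated=user key/value pair below is only an assumed example, not a settled name:

apiVersion: v1
kind: Node
metadata:
  name: user-node-example          # hypothetical node name
spec:
  taints:
    - key: dedicated               # example key/value; any agreed-upon pair works
      value: user
      effect: NoSchedule           # pods without a matching toleration won't be scheduled here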

I don't think AWS has metrics-server enabled by default, so it seems reasonable to disable it here as well.

Expected behavior

Nodes should autoscale down.

OS and architecture in which you are running Nebari

Linux x86-64

How to Reproduce the problem?

see above

Command output

No response

Versions and dependencies used.

No response

Compute environment

GCP

Integrations

No response

Anything else?

No response

Adam-D-Lewis commented 4 months ago

While I don't think this is the issue, it occurs to me that the other nodes might be scaling up because we have more pods than CPU/memory available on the general node.

viniciusdc commented 4 months ago

While I don't think this is the issue, it occurs to me that the other nodes might be scaling up because we have more pods than CPU/memory available on the general node.

That's a good point, we really need to check out those taints.

viniciusdc commented 4 months ago

As an overall change, I think your two points seem reasonable (for all providers). For AWS specifically, I think metrics collection is a service that you need to enable if you want to use it, and it costs extra to keep running. I also agree with disabling it in that case, or making it optional.

Adam-D-Lewis commented 4 months ago

Also, I think the GKE-deployed kube-dns replicaset has the same issue. I think the solution is to put taints on the user and worker nodes.

Adam-D-Lewis commented 3 months ago

I also saw the metrics-server and JupyterHub's user-scheduler cause the same problem on AKS.

Adam-D-Lewis commented 2 months ago

The solution I propose is to add a taints section to each node group class. You could then specify a taint on the user node group via something like the following:

  node_groups:
    user:
      instance: Standard_D4_v3
      taints:
        - dedicated=user:NoSchedule

Then we make sure the corresponding toleration is added to the JupyterHub user pod so that those pods can still run on the user node group. This should also work for pods started via argo-jupyter-scheduler.
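
A minimal sketch of the matching toleration on the user pod spec, assuming the dedicated=user:NoSchedule taint from the config above (how it gets injected, e.g. through the spawner configuration or Helm values, is left open):

tolerations:
  - key: dedicated          # must match the taint key on the user node group
    operator: Equal
    value: user
    effect: NoSchedule      # allows the user pod onto nodes tainted dedicated=user:NoSchedule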

This would not be supported for local deployments, since local deployments only deploy a single-node cluster at the moment. For existing deployments, it wouldn't affect the node group itself, but we would still apply the specified toleration to the JupyterLab user pod.