Open Adam-D-Lewis opened 4 months ago
While I don't think this is the issue, it occurs to me that the other nodes might be scaling up because we have more pods than CPU/memory available on the general node.
That's a good point, we really need to check out those taints.
As an overall change, I think your 2 points seem reasonable (for all providers). For AWS specifically, I think metrics-server is a service you need to enable if you want to use it, and it costs extra to keep running. I also agree with disabling it in that case, or making it optional.
Also, I think the GKE-deployed kube-dns replicaset has the same issue. I think the solution is to put taints on the user nodes and worker nodes.
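At the Kubernetes level, that would mean placing a taint like the following on the user (and worker) nodes. This is a generic illustration of the mechanism, not Nebari's current node configuration; the dedicated=user key/value pair is just an example:

# Node spec fragment: pods without a matching toleration
# will not be scheduled onto this node.
spec:
  taints:
    - key: dedicated
      value: user
      effect: NoSchedule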
I also saw the metrics server and jupyterhub's user scheduler cause the same problem on AKS.
The solution I propose is to add a taints section to each node group class. That way you could specify a taint on the user node group via something like the following:
node_groups:
  user:
    instance: Standard_D4_v3
    taints:
      - dedicated=user:NoSchedule
Then we would make sure the corresponding toleration is added to the JupyterHub user pod so that those pods can run on the user node group (see the sketch below). This should also work for pods started via argo-jupyter-scheduler.
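As a rough sketch (not final Nebari config), the matching toleration could be added to the single-user pods through the JupyterHub chart's singleuser.extraTolerations setting; exactly where Nebari would wire this in is an open implementation detail:

# Sketch: toleration matching the dedicated=user:NoSchedule taint above,
# added to JupyterHub single-user pods via the Zero to JupyterHub values.
singleuser:
  extraTolerations:
    - key: dedicated
      operator: Equal
      value: user
      effect: NoSchedule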
This would not be supported for local deployments since local deployments only deploy a single-node cluster at the moment. For existing deployments, it wouldn't affect the node group, but we would apply the specified toleration to the JupyterLab user pod.
Describe the bug
I noticed that GKE won't autoscale all nodes down to 0 in some cases. I saw that the metrics-server deployment and the event-exporter-gke replicaset only have a generic nodeSelector, meaning they can be scheduled on any of the nodes, which prevents those nodes from scaling down.
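For illustration (not copied verbatim from the cluster), a nodeSelector along the following lines only pins the operating system, so nothing restricts the pod to the general node pool:

# Illustrative pod spec fragment: an OS-only selector lets the scheduler
# place the pod on the user or worker node pools as well.
nodeSelector:
  kubernetes.io/os: linux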
Options to fix this might be
I think AWS doesn't have metrics-server enabled by default, so it seems reasonable to disable it.
Expected behavior
Nodes should autoscale down to 0 when idle.
OS and architecture in which you are running Nebari
Linux x86-64
How to Reproduce the problem?
see above
Command output
No response
Versions and dependencies used.
No response
Compute environment
GCP
Integrations
No response
Anything else?
No response