Open rsignell-usgs opened 1 year ago
Whoa! I was able to fix this myself using the same approach @iameskild and I used yesterday to fix the general nodegroup. I used the AWS console to edit the network settings on the autoscaling group for each instance, setting them to use only us-west-2b to match the volumes (instead of both us-west-2a and us-west-2b). Then I rebooted the instances, and presto, Dask Gateway is working again!
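For the record, I believe the same console change could also be scripted with the AWS CLI; this is only a sketch, and the autoscaling group name and subnet ID below are hypothetical placeholders you'd need to look up for your own deployment:

```bash
# Sketch only: pin a node group's autoscaling group to the single subnet
# (us-west-2b) that matches the EBS volumes. Group name and subnet ID are
# placeholders, not the real values for this deployment.
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name eks-general-xxxxxxxx \
  --vpc-zone-identifier "subnet-0123456789abcdef0"
```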
@iameskild this convinces me that a single zone should be the default for new inits. But for backwards compatibility we cannot make this the default, since it would trigger a redeploy of the EKS cluster.
Thanks for the update @rsignell-usgs!
@costrouc I'm surprised that this was an issue for Rich. The general node group appeared to be in both subnets (just as the worker node group is in the above screenshot), and therefore the volumes should have no trouble mounting so long as they are in one of the two zones (which they were).
The reason I didn't set the single zone as the default is exactly the one you stated: it would trigger a redeploy of the entire cluster.
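If we do make single-zone the default for new inits, I'd expect it to be pinned in nebari-config.yaml under the AWS provider section; roughly the sketch below (the exact key name is from memory, so treat it as unverified):

```yaml
# Sketch (unverified): pin the deployment to a single availability zone so
# nodes and EBS volumes always land in the same zone.
amazon_web_services:
  region: us-west-2
  availability_zones:
    - us-west-2b
```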
@costrouc, @aktech, @viniciusdc & @marcelovilla, the ESIP Nebari deployment is having this issue again, with the Dask scheduler hanging in the "pending" state:
with the same YAML status message as before:
I tried reducing the subnets in the autoscaling group down to a single AZ (the same AZ the instances and volumes are in) and rebooting the "Nebari1" and "Nebari3" instances, but I'm still getting a hanging scheduler.
I'm guessing there are some Kube pods I need to kill/restart as well?
Hi @rsignell, just to rule that possibility out, can you also print the log from the dask-controller pod?
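Something like this should grab it (assuming the controller keeps the default Nebari deployment name and runs in the dev namespace; adjust if yours differ):

```bash
# Tail the Dask Gateway controller logs. Namespace "dev" and the deployment
# name are the Nebari defaults, not verified against this cluster.
kubectl -n dev logs deploy/nebari-daskgateway-controller --tail=100
```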
Logs(dev/nebari-daskgateway-controller-5bc8c57dc4-r5nb5:nebari-daskgateway-controller)[tail]
[I 2023-12-28 20:42:29.890 KubeController] Reconciling cluster dev.25b265c425864501a7d23e0a9e3d1928
[I 2023-12-28 20:42:29.891 KubeController] Finished reconciling cluster dev.25b265c425864501a7d23e0a9e3d19
[I 2023-12-28 20:48:32.627 KubeController] Removing 1 expired cluster records
[I 2023-12-28 20:58:32.643 KubeController] Removing 2 expired cluster records
[I 2023-12-28 22:18:32.681 KubeController] Removing 2 expired cluster records
[I 2023-12-29 17:28:33.067 KubeController] Removing 2 expired cluster records
[I 2023-12-29 17:38:33.089 KubeController] Removing 1 expired cluster records
[I 2023-12-29 20:06:19.582 KubeController] Reconciling cluster dev.0c9ddceb31864dd789b698dd97161b1b
[I 2023-12-29 20:06:19.675 KubeController] Creating new credentials for cluster dev.0c9ddceb31864dd789b698
[I 2023-12-29 20:06:19.690 KubeController] Creating scheduler pod for cluster dev.0c9ddceb31864dd789b698dd
[I 2023-12-29 20:06:19.725 KubeController] Finished reconciling cluster dev.0c9ddceb31864dd789b698dd97161b
[I 2023-12-29 20:06:19.725 KubeController] Reconciling cluster dev.0c9ddceb31864dd789b698dd97161b1b
[I 2023-12-29 20:06:19.725 KubeController] Finished reconciling cluster dev.0c9ddceb31864dd789b698dd97161b
@dharhas the ESIP Nebari deployment has been broken on this all week (e.g. Dask clusters do not start). Might we get some help to get back on track?
Okay, we seem to be back in business on the ESIP Nebari deployment! @aktech had an idea to try modifying the existing config:
node_groups:
  general:
    instance: m5.2xlarge
    min_nodes: 1
    max_nodes: 1
  user:
    instance: m5.xlarge
    min_nodes: 1
    max_nodes: 100
  worker:
    instance: m5.xlarge
    min_nodes: 0 <=======
    max_nodes: 450
to temporarily specify the worker min_nodes: 1 and force the worker autoscaler to start.
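Concretely, the temporary edit amounts to one line in the worker block of nebari-config.yaml:

```yaml
  worker:
    instance: m5.xlarge
    min_nodes: 1   # temporarily raised from 0 to force a worker node to start
    max_nodes: 450
```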
That unfortunately failed, with the GitHub Actions error:
Error: error updating EKS Node Group (nebari-bamboo-dev:worker) config: InvalidRequestException: Nodegroup health has issues other than [ AsgInstanceLaunchFailures, InstanceLimitExceeded, InsufficientFreeAddresses, ClusterUnreachable, Ec2LaunchTemplateVersionMismatch ]
BUT the concept worked -- I was able to go into the AWS console, search "ec2", then "autoscaling", navigate to the "eks-worker-xxx" group, and edit the "Group Details", changing "desired capacity" from 0 to 1.
That allowed the Dask scheduler to start, which then fired up the dask cluster!
I then stopped the cluster and started it again, and the autoscaler properly scaled back to 0 and then back up again.
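For the record, the same console tweak could presumably be scripted with the AWS CLI; this is a sketch, with the worker group name abbreviated just as above:

```bash
# Sketch: raise the worker autoscaling group's desired capacity from 0 to 1
# so the scheduler pod has a node to land on. Group name is a placeholder.
aws autoscaling set-desired-capacity \
  --auto-scaling-group-name eks-worker-xxx \
  --desired-capacity 1
```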
Does this make sense to folks?
Folks, the ESIP Nebari deployment is running, but I can't upgrade or make any changes to the system. I tried upgrading to 2024.6.1 just now, and the deployment failed because the EKS cluster has a health issue:
[terraform]:
[terraform]: Plan: 1 to add, 1 to change, 0 to destroy.
[terraform]:
[terraform]: Changes to Outputs:
[terraform]: ~ kubernetes_credentials = (sensitive value)
[terraform]: local_file.kubeconfig[0]: Creating...
[terraform]: local_file.kubeconfig[0]: Creation complete after 0s [id=abed5559bddb06f549cdffb5e62032d77d7a584f]
[terraform]: module.kubernetes.aws_eks_node_group.main[0]: Modifying... [id=nebari-bamboo-dev:general]
[terraform]: ╷
[terraform]: │ Error: updating EKS Node Group (nebari-bamboo-dev:general) config: operation error EKS: UpdateNodegroupConfig, https response error StatusCode: 400, RequestID: dfe29008-680d-43cb-99d1-9933f430aedb, InvalidRequestException: Nodegroup health has issues other than [ AsgInstanceLaunchFailures, InstanceLimitExceeded, InsufficientFreeAddresses, ClusterUnreachable, Ec2LaunchTemplateVersionMismatch ]
[terraform]: ╵
The health issue is because the general nodegroup only has one subnet specified, but it expected two:

But we needed to force the general nodegroup to use only the one subnet, because when it had two, the node could land in a different subnet than the volumes and things would not work.
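For reference, the health details Terraform is complaining about can be pulled with the AWS CLI (cluster and nodegroup names taken from the error above):

```bash
# Show the reported health issues for the general node group of the
# nebari-bamboo-dev EKS cluster.
aws eks describe-nodegroup \
  --cluster-name nebari-bamboo-dev \
  --nodegroup-name general \
  --query 'nodegroup.health'
```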
What should be my next steps here?
Any ideas, @marcelovilla @viniciusdc?
Hey @rsignell,
You can try the following:
We're working on fixing this from our side, but these steps should allow you to run the upgrade without issues.
Okay I'll try that and report back!
Finally got around to trying this! I went to the autoscaling group, and then the networking section, and added the subnet in us-west-2b. I also noticed while in the console that the max of the autoscaling group was set to 5, so I set it to 1 (I have it set to 1 in the nebari-config.yaml I was deploying as well).
The deployment went fine, and the general node is still on us-west-2a, same as the EBS volumes. Since I've pinned the general node group to 1, does that mean I don't need to delete the us-west-2b subnet from the autoscaling group, since there is no way the general node will start there (unless it gets killed -- does it ever get killed, though?)
@rsignell sorry for the late reply; I was out the last two weeks.
I'd leave the us-west-2b subnet as it is for the time being, because otherwise you'll get a health check error for the node group, which will result in errors when upgrading the k8s version. That being said, until we fix the underlying issue, there's still the possibility that the node will try to re-spawn in another available subnet after upgrading the k8s version.
I did this and got lucky -- landed in the right subnet!
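In case it's useful to anyone else, here's a sketch of how to double-check which zone the nodes and volumes ended up in (standard kubectl/AWS CLI calls, nothing Nebari-specific):

```bash
# Show each node's availability zone via the standard topology label.
kubectl get nodes -L topology.kubernetes.io/zone

# List EBS volumes and their availability zones for comparison.
aws ec2 describe-volumes \
  --query 'Volumes[].{Id:VolumeId,AZ:AvailabilityZone}' --output table
```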
Describe the bug
When I try to start a Dask Gateway Cluster on our Nebari v2023.1.1 deployment, the Dask scheduler pod remains pending:
If I look at the dask-scheduler pod YAML in k9s, it says:
Could this be the availability zone problem again?
What can I do to fix or troubleshoot?
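In case it helps with troubleshooting, this is roughly what I've been running to see why the pod stays pending (the scheduler pod name is a placeholder, and I'm assuming the Dask pods live in the dev namespace as on our deployment):

```bash
# List the Dask scheduler pods and inspect why one is stuck in Pending.
kubectl -n dev get pods | grep dask-scheduler
kubectl -n dev describe pod <dask-scheduler-pod-name>   # placeholder name

# Recent events often show the scheduling failure (e.g. volume node affinity).
kubectl -n dev get events --sort-by=.lastTimestamp | tail -n 20
```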
Expected behavior
dask cluster starts
OS and architecture in which you are running Nebari
Linux, AWS
How to Reproduce the problem?
start a cluster on https://nebari.esipfed.org (@iameskild and @viniciusdc have access)
Command output
No response
Versions and dependencies used.
nebari 2023.1.1
Compute environment
AWS
Integrations
No response
Anything else?
No response