nebari-dev / nebari

🪴 Nebari - your open source data science platform
https://nebari.dev
BSD 3-Clause "New" or "Revised" License

[BUG] - Dask Scheduler pod stays pending #1683

Open rsignell-usgs opened 1 year ago

rsignell-usgs commented 1 year ago

Describe the bug

When I try to start a Dask Gateway Cluster on our Nebari v2023.1.1 deployment, the Dask scheduler pod remains pending:

[screenshot: dask-scheduler pod stuck in Pending, 2023-03-18]

If I look at the dask-scheduler pod YAML in k9s, it says:

[screenshot: dask-scheduler pod YAML status in k9s, 2023-03-18]

Could this be the availability zone problem again?

What can I do to fix or troubleshoot?
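(For what it's worth, the scheduling reason usually shows up in the pod events; something like the following should surface it -- the pod name is a placeholder and the dev namespace is an assumption for this deployment:)

$ kubectl describe pod <dask-scheduler-pod-name> -n dev
$ kubectl get events -n dev --sort-by=.lastTimestamp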

Expected behavior

dask cluster starts

OS and architecture in which you are running Nebari

Linux, AWS

How to Reproduce the problem?

start a cluster on https://nebari.esipfed.org (@iameskild and @viniciusdc have access)

Command output

No response

Versions and dependencies used.

nebari 2023.1.1

$ kubectl version --short
Flag --short has been deprecated, and will be removed in the future. The --short output will become the default.
Client Version: v1.25.3
Kustomize Version: v4.5.7
Server Version: v1.23.14-eks-ffeb93d
$ conda --version
conda 22.11.1

Compute environment

AWS

Integrations

No response

Anything else?

No response

rsignell-usgs commented 1 year ago

Whoa! I was able to fix this myself using the same approach @iameskild and I used yesterday to fix the general nodegroup. I used the AWS console to edit the network settings on the autoscaling group for each instance, setting them to only us-west-2b to match the volumes (instead of us-west-2a and us-west-2b). Then I rebooted the instances, and presto, Dask Gateway is working again!
[screenshots: autoscaling group network settings and running Dask Gateway, 2023-03-18]
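(For reference, the same restriction can presumably be applied from the CLI instead of the console -- the autoscaling group name and subnet ID below are placeholders, and --vpc-zone-identifier replaces the ASG's whole subnet list:)

$ aws autoscaling update-auto-scaling-group \
    --auto-scaling-group-name <eks-nodegroup-asg-name> \
    --vpc-zone-identifier "<subnet-id-in-us-west-2b>"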

costrouc commented 1 year ago

@iameskild this convinces me that a single zone should be the default for new inits. But for backwards compatibility we cannot make this the default, since it would trigger a redeploy of the EKS cluster.

iameskild commented 1 year ago

Thanks for the update @rsignell-usgs!

@costrouc I'm surprised that this was an issue for Rich. The general node group appeared to be in both subnets (just as the worker node group is in the above screenshot), and therefore the volumes should have no trouble mounting as long as they are in one of the two zones (which they were).

The reason I didn't set the single zone as the default is exactly the reason you stated: it would trigger a redeploy of the entire cluster.

rsignell commented 11 months ago

@costrouc, @aktech, @viniciusdc & @marcelovilla, the ESIP Nebari deployment is having this issue again, with the Dask scheduler hanging in the "pending" state:

[screenshot: dask-scheduler pod in Pending state]

with the same YAML status message as before:

[screenshot: pod YAML status message]

I tried reducing the subnets in the autoscaling group down to a single AZ (the same AZ the instances and volumes are in) and rebooting the "Nebari1" and "Nebari3" instances, but I'm still getting a hanging scheduler.
[screenshot, 2023-12-28]

I'm guessing there are some Kube pods I need to kill/restart as well?
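(If a restart turns out to be needed, bouncing the gateway controller deployment is probably the place to start -- the deployment name and namespace here are taken from the k9s log pane further down:)

$ kubectl rollout restart deployment/nebari-daskgateway-controller -n dev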

viniciusdc commented 11 months ago

Hi @rsignell, just to rule out that possibility, can you also print the log from the dask-gateway controller pod?
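(Something like this should grab it, assuming the controller runs as the nebari-daskgateway-controller deployment in the dev namespace, as in the k9s pane below:)

$ kubectl logs -n dev deploy/nebari-daskgateway-controller --tail=100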

rsignell commented 11 months ago

Logs (dev/nebari-daskgateway-controller-5bc8c57dc4-r5nb5:nebari-daskgateway-controller) [tail]:

[I 2023-12-28 20:42:29.890 KubeController] Reconciling cluster dev.25b265c425864501a7d23e0a9e3d1928
[I 2023-12-28 20:42:29.891 KubeController] Finished reconciling cluster dev.25b265c425864501a7d23e0a9e3d19
[I 2023-12-28 20:48:32.627 KubeController] Removing 1 expired cluster records
[I 2023-12-28 20:58:32.643 KubeController] Removing 2 expired cluster records
[I 2023-12-28 22:18:32.681 KubeController] Removing 2 expired cluster records
[I 2023-12-29 17:28:33.067 KubeController] Removing 2 expired cluster records
[I 2023-12-29 17:38:33.089 KubeController] Removing 1 expired cluster records
[I 2023-12-29 20:06:19.582 KubeController] Reconciling cluster dev.0c9ddceb31864dd789b698dd97161b1b
[I 2023-12-29 20:06:19.675 KubeController] Creating new credentials for cluster dev.0c9ddceb31864dd789b698
[I 2023-12-29 20:06:19.690 KubeController] Creating scheduler pod for cluster dev.0c9ddceb31864dd789b698dd
[I 2023-12-29 20:06:19.725 KubeController] Finished reconciling cluster dev.0c9ddceb31864dd789b698dd97161b
[I 2023-12-29 20:06:19.725 KubeController] Reconciling cluster dev.0c9ddceb31864dd789b698dd97161b1b
[I 2023-12-29 20:06:19.725 KubeController] Finished reconciling cluster dev.0c9ddceb31864dd789b698dd97161b

rsignell commented 10 months ago

@dharhas the ESIP Nebari deployment has been broken by this all week (e.g. Dask clusters do not start) 😞 Might we get some help to get back on track?

rsignell commented 10 months ago

Okay, we seem to be back in business on the ESIP Nebari deployment! @aktech had an idea to try modifying the existing config:

node_groups:
    general:
      instance: m5.2xlarge
      min_nodes: 1
      max_nodes: 1
    user:
      instance: m5.xlarge
      min_nodes: 1
      max_nodes: 100
    worker:
      instance: m5.xlarge
      min_nodes: 0    <=======
      max_nodes: 450

to temporarily set the worker min_nodes: 1 to force the worker autoscaler to start.

That unfortunately failed, with the GitHub Actions error:

Error: error updating EKS Node Group (nebari-bamboo-dev:worker) config: InvalidRequestException: Nodegroup health has issues other than [ AsgInstanceLaunchFailures, InstanceLimitExceeded, InsufficientFreeAddresses, ClusterUnreachable, Ec2LaunchTemplateVersionMismatch ]

BUT the concept worked -- I was able to go into the AWS console, search for "ec2", then "autoscaling", then go to the "eks-worker-xxx" group and edit the "Group Details", changing "desired capacity" from 0 to 1.

That allowed the Dask scheduler to start, which then fired up the dask cluster!

I then stopped the cluster and started it again, and the autoscaler properly scaled back to 0 and then back up again.
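(For the record, the same one-off bump can presumably also be done from the CLI rather than the console; the group name below is a placeholder for the "eks-worker-xxx" group shown there:)

$ aws autoscaling set-desired-capacity \
    --auto-scaling-group-name <eks-worker-asg-name> \
    --desired-capacity 1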

Does this make sense to folks?

rsignell commented 4 months ago

Folks, the ESIP Nebari deployment is running, but I can't upgrade or make any changes to the system. I tried upgrading to 2024.6.1 just now, and the deployment failed because the EKS cluster has a health issue:

[terraform]:
[terraform]: Plan: 1 to add, 1 to change, 0 to destroy.
[terraform]:
[terraform]: Changes to Outputs:
[terraform]:   ~ kubernetes_credentials  = (sensitive value)
[terraform]: local_file.kubeconfig[0]: Creating...
[terraform]: local_file.kubeconfig[0]: Creation complete after 0s [id=abed5559bddb06f549cdffb5e62032d77d7a584f]
[terraform]: module.kubernetes.aws_eks_node_group.main[0]: Modifying... [id=nebari-bamboo-dev:general]
[terraform]: ╷
[terraform]: │ Error: updating EKS Node Group (nebari-bamboo-dev:general) config: operation error EKS: UpdateNodegroupConfig, https response error StatusCode: 400, RequestID: dfe29008-680d-43cb-99d1-9933f430aedb, InvalidRequestException: Nodegroup health has issues other than [ AsgInstanceLaunchFailures, InstanceLimitExceeded, InsufficientFreeAddresses, ClusterUnreachable, Ec2LaunchTemplateVersionMismatch ]
[terraform]: │

The health issue is because the general nodegroup only has one subnet specified, but two are expected:
[screenshot: EKS node group health issue]
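(The same health report should be retrievable from the CLI, using the cluster and node group names from the Terraform error above:)

$ aws eks describe-nodegroup \
    --cluster-name nebari-bamboo-dev \
    --nodegroup-name general \
    --query 'nodegroup.health.issues'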

But we needed to force the general nodegroup to use only the one subnet because when it had two, the node could come up in a different subnet than the volumes, and things were not working.

What should be my next steps here?

Adam-D-Lewis commented 4 months ago

Any ideas, @marcelovilla @viniciusdc?

marcelovilla commented 4 months ago

Hey @rsignell,

You can try the following:

  1. Add the expected subnet to the Autoscaling group of the node group with the failing health check.
  2. Upgrade and re-deploy
  3. (Optionally) remove the subnet added in (1) so you don't run into the issue of the general node spawning in a different AZ than the volumes when scaling that node group or when upgrading your k8s version.

We're working on fixing this on our side, but these steps should allow you to run the upgrade without issues.

rsignell commented 4 months ago

Okay I'll try that and report back!

rsignell commented 4 months ago

Finally got around to trying this! I went to the autoscaling group, then to the networking section, and added the subnet in us-west-2b. I also noticed while in the console that the max of the autoscaling group was set to 5, so I set it to 1 (I have it set to 1 in the nebari-config.yaml I was deploying as well).

The deployment went fine, and the general node is still in us-west-2a, same as the EBS volumes. Since I've pinned the general node group to 1, does that mean I don't need to delete the us-west-2b subnet from the autoscaling group, since there is no way the general node will start there (unless it gets killed -- does it ever get killed, though?)

marcelovilla commented 4 months ago

@rsignell sorry for the late reply; I was out the last two weeks.

I'd leave the us-west-2b subnet as it is for the time being, because otherwise you'll get a health check error for the node group, which will result in errors when upgrading the k8s version. That being said, until we fix the underlying issue, there's still the possibility that the node will try to re-spawn in another available subnet after upgrading the k8s version.

rsignell commented 4 months ago

I did this and got lucky -- landed in the right subnet!