Closed ajsalow closed 2 years ago
I think this is just us not deduping the taints you provide. Can you try removing this:
taints:
- nvidia.com/gpu:NoSchedule
kOps already adds this.
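To illustrate the suspected cause, here is a minimal Python sketch of merging user-supplied taints with an automatically added one, with and without deduplication. This is not the actual kOps code (kOps is written in Go); the function names merge_taints_naive and merge_taints_deduped are hypothetical and exist only for this example.

```python
# The taint that, per this thread, kOps adds on its own for GPU instance groups.
GPU_TAINT = "nvidia.com/gpu:NoSchedule"


def merge_taints_naive(user_taints, auto_taints):
    """Hypothetical merge that appends without checking for duplicates."""
    return list(user_taints) + list(auto_taints)


def merge_taints_deduped(user_taints, auto_taints):
    """Hypothetical merge that keeps only the first occurrence of each taint."""
    seen = set()
    merged = []
    for taint in list(user_taints) + list(auto_taints):
        if taint not in seen:
            seen.add(taint)
            merged.append(taint)
    return merged


if __name__ == "__main__":
    user = [GPU_TAINT]  # taint also set by hand in the instance group
    print(merge_taints_naive(user, [GPU_TAINT]))    # taint listed twice
    print(merge_taints_deduped(user, [GPU_TAINT]))  # taint listed once
```

If the merge behaves like the naive version, a taint set by hand ends up listed twice, which would match the duplicated entry reported in this issue.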
I'm experiencing the same issue. The only way I can resolve it is by deleting my GPU node IG and remaking it every time I want to make edits.
Is it the same error from kubelet? And can you confirm that you have not set the above taint in the manifest directly?
Yes. I normally spin up my cluster to do some dev work, and switch it off afterwards. (By off I mean draining my nodes and setting minSize and maxSize to zero). Every time I spin up my cluster I need to recreate my gpu-nodes ig, otherwise I run into the same problem. The taint gets added after every edit/save and the node won't join the cluster. Usually it looks something like this:
taints:
- nvidia.com/gpu:NoSchedule
And once edited and saved:
taints:
- nvidia.com/gpu:NoSchedule
- nvidia.com/gpu:NoSchedule
Try removing both entries
I've tried that, but then the taint is re-added after the save, unfortunately.
I have an idea what is happening here ... /assign
I'm experiencing the same issue here. Do we have any progress on this?
You are using kops 1.23.0 right? Can you open a new issue?
Sure, please see this: https://github.com/kubernetes/kops/issues/13468
/kind bug
1. What kops version are you running? The command kops version will display this information. Version 1.22.3 (git-241bfeba5931838fd32f2260aff41dd89a585fba)
2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running or provide the Kubernetes version specified as a kops flag.
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.0", GitCommit:"cb303e613a121a29364f75cc67d3d580833a7479", GitTreeState:"clean", BuildDate:"2021-04-08T16:31:21Z", GoVersion:"go1.16.1", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.4", GitCommit:"3cce4a82b44f032d0cd1a1790e6d2f5a55d20aae", GitTreeState:"clean", BuildDate:"2021-08-11T18:10:22Z", GoVersion:"go1.16.7", Compiler:"gc", Platform:"linux/amd64"}
3. What cloud provider are you using? AWS gov cloud.
4. What commands did you run? What is the simplest way to reproduce this issue? Create gpu instance group following https://github.com/kubernetes/kops/blob/master/docs/gpu.md.
I had to change the ami because the ami in the gpu.md is not available in gov cloud.
5. What happened after the commands executed? Node will not join cluster. Journalctl logs from node below. One thing I find odd is it keeps adding the taint nvidia.com/gpu:NoSchedule multiple times.
6. What did you expect to happen? Node to join cluster.
7. Please provide your cluster manifest. Execute kops get --name my.example.com -o yaml to display your cluster manifest. You may want to remove your cluster name and other sensitive information.