kubernetes / kops

Kubernetes Operations (kOps) - Production Grade k8s Installation, Upgrades and Management
https://kops.sigs.k8s.io/
Apache License 2.0

GPU AWS Gov cloud, node will not join cluster. #13116

Closed · ajsalow closed this issue 2 years ago

ajsalow commented 2 years ago

/kind bug

1. What kops version are you running? The command kops version will display this information.

Version 1.22.3 (git-241bfeba5931838fd32f2260aff41dd89a585fba)

2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running or provide the Kubernetes version specified as a kops flag.

Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.0", GitCommit:"cb303e613a121a29364f75cc67d3d580833a7479", GitTreeState:"clean", BuildDate:"2021-04-08T16:31:21Z", GoVersion:"go1.16.1", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.4", GitCommit:"3cce4a82b44f032d0cd1a1790e6d2f5a55d20aae", GitTreeState:"clean", BuildDate:"2021-08-11T18:10:22Z", GoVersion:"go1.16.7", Compiler:"gc", Platform:"linux/amd64"}

3. What cloud provider are you using? AWS GovCloud.

4. What commands did you run? What is the simplest way to reproduce this issue? Create a GPU instance group following https://github.com/kubernetes/kops/blob/master/docs/gpu.md.

I had to change the AMI because the one referenced in gpu.md is not available in GovCloud.
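
As a sanity check, whether a replacement AMI ID actually exists in the GovCloud partition can be verified with the AWS CLI (a minimal sketch; it assumes credentials for the aws-us-gov partition and uses the region and AMI ID from the spec below):

aws ec2 describe-images --region us-gov-east-1 --image-ids ami-0826991b05be28c33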

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2022-01-17T21:19:21Z"
  generation: 1
  labels:
    kops.k8s.io/cluster: mycluster.k8s.local
  name: gpu-nodes
spec:
  image: ami-0826991b05be28c33
  machineType: g4dn.xlarge
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/gpu: "1"
    kops.k8s.io/instancegroup: gpu-nodes
  role: Node
  subnets:
  - us-gov-east-1a
  - us-gov-east-1b
  - us-gov-east-1c
  taints:
  - nvidia.com/gpu:NoSchedule
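
For completeness, the typical sequence for applying such an instance group looks like this (a sketch; gpu-nodes.yaml is a hypothetical filename holding the spec above, and the state store bucket is taken from the cluster manifest further down):

export KOPS_STATE_STORE=s3://mycluster-state-store-12345
kops create -f gpu-nodes.yaml
kops update cluster --name mycluster.k8s.local --yes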

5. What happened after the commands executed? The node will not join the cluster; journalctl logs from the node are below. One thing I find odd is that it keeps adding the taint nvidia.com/gpu:NoSchedule multiple times, and node registration in fact fails on a duplicate-taint validation error (see the note after the logs).

Jan 17 21:25:28 ip-172-21-45-245 kubelet[18968]: I0117 21:25:28.066881   18968 flags.go:59] FLAG: --v="2"
Jan 17 21:25:28 ip-172-21-45-245 kubelet[18968]: I0117 21:25:28.066887   18968 flags.go:59] FLAG: --version="false"
Jan 17 21:25:28 ip-172-21-45-245 kubelet[18968]: I0117 21:25:28.066895   18968 flags.go:59] FLAG: --vmodule=""
Jan 17 21:25:28 ip-172-21-45-245 kubelet[18968]: I0117 21:25:28.066900   18968 flags.go:59] FLAG: --volume-plugin-dir="/usr/libexec/kubernetes/kubelet-plugins/volume/exec/"
Jan 17 21:25:28 ip-172-21-45-245 kubelet[18968]: I0117 21:25:28.066907   18968 flags.go:59] FLAG: --volume-stats-agg-period="1m0s"
Jan 17 21:25:28 ip-172-21-45-245 kubelet[18968]: I0117 21:25:28.066990   18968 feature_gate.go:243] feature gates: &{map[]}
Jan 17 21:25:28 ip-172-21-45-245 kubelet[18968]: I0117 21:25:28.067108   18968 feature_gate.go:243] feature gates: &{map[]}
Jan 17 21:25:28 ip-172-21-45-245 nodeup[1346]: I0117 21:25:28.198538    1346 executor.go:111] Tasks: 87 done / 87 total; 0 can run
Jan 17 21:25:28 ip-172-21-45-245 nodeup[1346]: I0117 21:25:28.198564    1346 context.go:91] deleting temp dir: "/tmp/deploy546807099"
Jan 17 21:25:28 ip-172-21-45-245 nodeup[1346]: success
Jan 17 21:25:28 ip-172-21-45-245 systemd[1]: kops-configuration.service: Succeeded.
Jan 17 21:25:28 ip-172-21-45-245 systemd[1]: Finished Run kOps bootstrap (nodeup).
Jan 17 21:25:28 ip-172-21-45-245 cloud-init[1287]: I0117 21:25:28.205210    1309 service.go:362] Enabling service "kops-configuration.service"
Jan 17 21:25:28 ip-172-21-45-245 systemd[1]: Started Kubernetes systemd probe.
Jan 17 21:25:28 ip-172-21-45-245 kubelet[18968]: I0117 21:25:28.210993   18968 mount_linux.go:206] Detected OS with systemd
Jan 17 21:25:28 ip-172-21-45-245 kubelet[18968]: I0117 21:25:28.211107   18968 server.go:440] "Kubelet version" kubeletVersion="v1.21.4"
Jan 17 21:25:28 ip-172-21-45-245 kubelet[18968]: I0117 21:25:28.211173   18968 feature_gate.go:243] feature gates: &{map[]}
Jan 17 21:25:28 ip-172-21-45-245 kubelet[18968]: I0117 21:25:28.211264   18968 feature_gate.go:243] feature gates: &{map[]}
Jan 17 21:25:28 ip-172-21-45-245 kubelet[18968]: W0117 21:25:28.211352   18968 plugins.go:105] WARNING: aws built-in cloud provider is now deprecated. The AWS provider is deprecated and will be removed in a future release
Jan 17 21:25:28 ip-172-21-45-245 kubelet[18968]: I0117 21:25:28.211471   18968 aws.go:1261] Building AWS cloudprovider
Jan 17 21:25:28 ip-172-21-45-245 kubelet[18968]: I0117 21:25:28.211518   18968 aws.go:1221] Zone not specified in configuration file; querying AWS metadata service
Jan 17 21:25:28 ip-172-21-45-245 systemd[1]: Reloading.
Jan 17 21:25:28 ip-172-21-45-245 kubelet[18968]: I0117 21:25:28.336739   18968 tags.go:79] AWS cloud filtering on ClusterID: mycluster.k8s.local
Jan 17 21:25:28 ip-172-21-45-245 kubelet[18968]: I0117 21:25:28.336789   18968 server.go:552] "Successfully initialized cloud provider" cloudProvider="aws" cloudConfigFile="/etc/kubernetes/in-tree-cloud.config"
Jan 17 21:25:28 ip-172-21-45-245 kubelet[18968]: I0117 21:25:28.336813   18968 server.go:989] "Cloud provider determined current node" nodeName="ip-172-21-45-245.us-gov-east-1.compute.internal"
Jan 17 21:25:28 ip-172-21-45-245 kubelet[18968]: I0117 21:25:28.356924   18968 dynamic_cafile_content.go:129] Loaded a new CA Bundle and Verifier for "client-ca-bundle::/srv/kubernetes/ca.crt"
Jan 17 21:25:28 ip-172-21-45-245 kubelet[18968]: I0117 21:25:28.357072   18968 dynamic_cafile_content.go:167] Starting client-ca-bundle::/srv/kubernetes/ca.crt
Jan 17 21:25:28 ip-172-21-45-245 kubelet[18968]: I0117 21:25:28.357300   18968 manager.go:165] cAdvisor running in container: "/sys/fs/cgroup/cpu,cpuacct/system.slice/kubelet.service"
Jan 17 21:25:28 ip-172-21-45-245 cloud-init[1287]: I0117 21:25:28.399338    1309 executor.go:111] Tasks: 2 done / 2 total; 0 can run
Jan 17 21:25:28 ip-172-21-45-245 cloud-init[1287]: I0117 21:25:28.399361    1309 context.go:91] deleting temp dir: "/tmp/deploy168227081"
Jan 17 21:25:28 ip-172-21-45-245 systemd[1]: run-r96354adab40a45d3ad40d90eccad3207.scope: Succeeded.
Jan 17 21:25:28 ip-172-21-45-245 cloud-init[1287]: service installed== nodeup node config done ==
Jan 17 21:25:28 ip-172-21-45-245 cloud-init[19037]: #############################################################
Jan 17 21:25:28 ip-172-21-45-245 cloud-init[19038]: -----BEGIN SSH HOST KEY FINGERPRINTS-----
Jan 17 21:25:28 ip-172-21-45-245 cloud-init[19047]: -----END SSH HOST KEY FINGERPRINTS-----
Jan 17 21:25:28 ip-172-21-45-245 cloud-init[19048]: #############################################################
Jan 17 21:25:28 ip-172-21-45-245 cloud-init[1287]: Cloud-init v. 21.4-0ubuntu1~20.04.1 running 'modules:final' at Mon, 17 Jan 2022 21:22:50 +0000. Up 41.01 seconds.
Jan 17 21:25:28 ip-172-21-45-245 cloud-init[1287]: Cloud-init v. 21.4-0ubuntu1~20.04.1 finished at Mon, 17 Jan 2022 21:25:28 +0000. Datasource DataSourceEc2Local.  Up 199.46 seconds
Jan 17 21:25:28 ip-172-21-45-245 systemd[1]: Finished Execute cloud user/final scripts.
Jan 17 21:25:28 ip-172-21-45-245 systemd[1]: Reached target Cloud-init target.
Jan 17 21:25:28 ip-172-21-45-245 systemd[1]: Startup finished in 4.564s (kernel) + 3min 14.948s (userspace) = 3min 19.512s.
Jan 17 21:25:29 ip-172-21-45-245 protokube[18735]: I0117 21:25:29.320928   18735 peer.go:111] OnGossip &KVState{Records:map[string]*KVStateRecord{dns/local/A/api.internal.mycluster.k8s.local: &KVStateRecord{Data:[49 55 50 46 50 49 46 49 48 55 46 49 54 54 44 49 55 50 46 50 49 46 52 54 46 49 54 56 44 49 55 50 46 50 49 46 57 52 46 52],Tombstone:false,Version:1642434593,},dns/local/A/etcd-a.internal.mycluster.k8s.local: &KVStateRecord{Data:[],Tombstone:true,Version:1642434476,},dns/local/A/etcd-b.internal.mycluster.k8s.local: &KVStateRecord{Data:[],Tombstone:true,Version:1642434476,},dns/local/A/etcd-c.internal.mycluster.k8s.local: &KVStateRecord{Data:[],Tombstone:true,Version:1642434476,},dns/local/A/etcd-events-a.internal.mycluster.k8s.local: &KVStateRecord{Data:[],Tombstone:true,Version:1642434476,},dns/local/A/etcd-events-b.internal.mycluster.k8s.local: &KVStateRecord{Data:[],Tombstone:true,Version:1642434476,},dns/local/A/etcd-events-c.internal.mycluster.k8s.local: &KVStateRecord{Data:[],Tombstone:true,Version:1642434476,},dns/local/A/kops-controller.internal.mycluster.k8s.local: &KVStateRecord{Data:[49 55 50 46 50 49 46 49 48 55 46 49 54 54 44 49 55 50 46 50 49 46 52 54 46 49 54 56 44 49 55 50 46 50 49 46 57 52 46 52],Tombstone:false,Version:1642434538,},dns/local/NS/local: &KVStateRecord{Data:[103 111 115 115 105 112],Tombstone:false,Version:1642454715,},},} => delta empty
Jan 17 21:25:30 ip-172-21-45-245 protokube[18735]: I0117 21:25:30.486640   18735 dns.go:47] DNSView unchanged: 2
Jan 17 21:25:33 ip-172-21-45-245 kubelet[18968]: I0117 21:25:33.361015   18968 fs.go:131] Filesystem UUIDs: map[c3ad9070-ac05-4268-a221-2a018f8fed53:/dev/nvme0n1p1]
Jan 17 21:25:33 ip-172-21-45-245 kubelet[18968]: I0117 21:25:33.361043   18968 fs.go:132] Filesystem partitions: map[/dev/root:{mountpoint:/ major:259 minor:2 fsType:ext4 blockSize:0} /dev/shm:{mountpoint:/dev/shm major:0 minor:24 fsType:tmpfs blockSize:0} /run:{mountpoint:/run major:0 minor:26 fsType:tmpfs blockSize:0} /run/lock:{mountpoint:/run/lock major:0 minor:27 fsType:tmpfs blockSize:0} /run/snapd/ns:{mountpoint:/run/snapd/ns major:0 minor:26 fsType:tmpfs blockSize:0} /run/user/1000:{mountpoint:/run/user/1000 major:0 minor:49 fsType:tmpfs blockSize:0} /sys/fs/cgroup:{mountpoint:/sys/fs/cgroup major:0 minor:28 fsType:tmpfs blockSize:0}]
Jan 17 21:25:33 ip-172-21-45-245 kernel: nvidia: loading out-of-tree module taints kernel.
Jan 17 21:25:33 ip-172-21-45-245 kernel: nvidia: module license 'NVIDIA' taints kernel.
Jan 17 21:25:33 ip-172-21-45-245 kernel: Disabling lock debugging due to kernel taint
Jan 17 21:25:33 ip-172-21-45-245 kernel: nvidia: module verification failed: signature and/or required key missing - tainting kernel
Jan 17 21:25:33 ip-172-21-45-245 kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 239
Jan 17 21:25:33 ip-172-21-45-245 kernel:
Jan 17 21:25:33 ip-172-21-45-245 kernel: PCI Interrupt Link [LNKB] enabled at IRQ 10
Jan 17 21:25:33 ip-172-21-45-245 kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module  460.106.00  Tue Sep 28 12:05:58 UTC 2021
Jan 17 21:25:34 ip-172-21-45-245 kernel: nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  460.106.00  Tue Sep 28 11:57:18 UTC 2021
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.206822   18968 nvidia.go:116] NVML initialized. Number of NVIDIA devices: 1
Jan 17 21:25:34 ip-172-21-45-245 kernel: [drm] [nvidia-drm] [GPU ID 0x0000001e] Loading driver
Jan 17 21:25:34 ip-172-21-45-245 kernel: [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:00:1e.0 on minor 0
Jan 17 21:25:34 ip-172-21-45-245 kernel: nvidia_uvm: module uses symbols from proprietary module nvidia, inheriting taint.
Jan 17 21:25:34 ip-172-21-45-245 kernel: nvidia-uvm: Loaded the UVM driver, major device number 234.
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.233290   18968 manager.go:213] Machine: {Timestamp:2022-01-17 21:25:34.23301869 +0000 UTC m=+6.249135639 NumCores:4 NumPhysicalCores:2 NumSockets:1 CpuFrequency:2499998 MemoryCapacity:16471494656 MemoryByType:map[] NVMInfo:{MemoryModeCapacity:0 AppDirectModeCapacity:0 AvgPowerBudget:0} HugePages:[{PageSize:1048576 NumPages:0} {PageSize:2048 NumPages:0}] MachineID:ec282e9543b0517ea7844f3881c86787 SystemUUID:ec282e95-43b0-517e-a784-4f3881c86787 BootID:b0c81aee-017f-4c3a-a936-ff9a96ecbe11 Filesystems:[{Device:/dev/shm DeviceMajor:0 DeviceMinor:24 Capacity:8235745280 Type:vfs Inodes:2010680 HasInodes:true} {Device:/run DeviceMajor:0 DeviceMinor:26 Capacity:1647153152 Type:vfs Inodes:2010680 HasInodes:true} {Device:/run/lock DeviceMajor:0 DeviceMinor:27 Capacity:5242880 Type:vfs Inodes:2010680 HasInodes:true} {Device:/sys/fs/cgroup DeviceMajor:0 DeviceMinor:28 Capacity:8235745280 Type:vfs Inodes:2010680 HasInodes:true} {Device:/run/snapd/ns DeviceMajor:0 DeviceMinor:26 Capacity:1647153152 Type:vfs Inodes:2010680 HasInodes:true} {Device:/run/user/1000 DeviceMajor:0 DeviceMinor:49 Capacity:1647149056 Type:vfs Inodes:2010680 HasInodes:true} {Device:/dev/root DeviceMajor:259 DeviceMinor:2 Capacity:133167038464 Type:vfs Inodes:16384000 HasInodes:true}] DiskMap:map[259:0:{Name:nvme1n1 Major:259 Minor:0 Size:125000000000 Scheduler:none} 259:1:{Name:nvme0n1 Major:259 Minor:1 Size:137438953472 Scheduler:none}] NetworkDevices:[{Name:ens5 MacAddress:06:61:4d:da:ff:8c Speed:0 Mtu:9001}] Topology:[{Id:0 Memory:16471494656 HugePages:[{PageSize:1048576 NumPages:0} {PageSize:2048 NumPages:0}] Cores:[{Id:0 Threads:[0 2] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:1048576 Type:Unified Level:2}] SocketID:0} {Id:1 Threads:[1 3] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:1048576 Type:Unified Level:2}] SocketID:0}] Caches:[{Size:37486592 Type:Unified Level:3}]}] CloudProvider:Unknown InstanceType:Unknown InstanceID:None}
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.233447   18968 manager_no_libpfm.go:28] cAdvisor is build without cgo and/or libpfm support. Perf event counters are not available.
Jan 17 21:25:34 ip-172-21-45-245 systemd[1]: Starting NVIDIA Persistence Daemon...
Jan 17 21:25:34 ip-172-21-45-245 nvidia-persistenced[19078]: Verbose syslog connection opened
Jan 17 21:25:34 ip-172-21-45-245 nvidia-persistenced[19078]: Now running with user ID 115 and group ID 119
Jan 17 21:25:34 ip-172-21-45-245 nvidia-persistenced[19078]: Started (19078)
Jan 17 21:25:34 ip-172-21-45-245 nvidia-persistenced[19078]: device 0000:00:1e.0 - registered
Jan 17 21:25:34 ip-172-21-45-245 nvidia-persistenced[19078]: Local RPC services initialized
Jan 17 21:25:34 ip-172-21-45-245 systemd[1]: Started NVIDIA Persistence Daemon.
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.250993   18968 manager.go:229] Version: {KernelVersion:5.11.0-1025-aws ContainerOsVersion:Ubuntu 20.04.3 LTS DockerVersion:Unknown DockerAPIVersion:Unknown CadvisorVersion: CadvisorRevision:}
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.251321   18968 container_manager_linux.go:278] "Container manager verified user specified cgroup-root exists" cgroupRoot=[]
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.251426   18968 container_manager_linux.go:283] "Creating Container Manager object based on Node Config" nodeConfig={RuntimeCgroupsName: SystemCgroupsName: KubeletCgroupsName: ContainerRuntime:remote CgroupsPerQOS:true CgroupRoot:/ CgroupDriver:systemd KubeletRootDir:/var/lib/kubelet ProtectKernelDefaults:false NodeAllocatableConfig:{KubeReservedCgroupName: SystemReservedCgroupName: ReservedSystemCPUs: EnforceNodeAllocatable:map[pods:{}] KubeReserved:map[] SystemReserved:map[] HardEvictionThresholds:[{Signal:memory.available Operator:LessThan Value:{Quantity:100Mi Percentage:0} GracePeriod:0s MinReclaim:<nil>} {Signal:nodefs.available Operator:LessThan Value:{Quantity:<nil> Percentage:0.1} GracePeriod:0s MinReclaim:<nil>} {Signal:nodefs.inodesFree Operator:LessThan Value:{Quantity:<nil> Percentage:0.05} GracePeriod:0s MinReclaim:<nil>} {Signal:imagefs.available Operator:LessThan Value:{Quantity:<nil> Percentage:0.1} GracePeriod:0s MinReclaim:<nil>} {Signal:imagefs.inodesFree Operator:LessThan Value:{Quantity:<nil> Percentage:0.05} GracePeriod:0s MinReclaim:<nil>}]} QOSReserved:map[] ExperimentalCPUManagerPolicy:none ExperimentalTopologyManagerScope:container ExperimentalCPUManagerReconcilePeriod:10s ExperimentalMemoryManagerPolicy:None ExperimentalMemoryManagerReservedMemory:[] ExperimentalPodPidsLimit:-1 EnforceCPULimits:true CPUCFSQuotaPeriod:100ms ExperimentalTopologyManagerPolicy:none}
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.251451   18968 topology_manager.go:120] "Creating topology manager with policy per scope" topologyPolicyName="none" topologyScopeName="container"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.251469   18968 container_manager_linux.go:314] "Initializing Topology Manager" policy="none" scope="container"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.251479   18968 container_manager_linux.go:319] "Creating device plugin manager" devicePluginEnabled=true
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.251489   18968 manager.go:136] "Creating Device Plugin manager" path="/var/lib/kubelet/device-plugins/kubelet.sock"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.251623   18968 remote_runtime.go:62] parsed scheme: ""
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.251633   18968 remote_runtime.go:62] scheme "" not registered, fallback to default scheme
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.251676   18968 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{/run/containerd/containerd.sock  <nil> 0 <nil>}] <nil> <nil>}
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.251687   18968 clientconn.go:948] ClientConn switching balancer to "pick_first"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.251762   18968 remote_image.go:50] parsed scheme: ""
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.251768   18968 remote_image.go:50] scheme "" not registered, fallback to default scheme
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.251777   18968 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{/run/containerd/containerd.sock  <nil> 0 <nil>}] <nil> <nil>}
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.251783   18968 clientconn.go:948] ClientConn switching balancer to "pick_first"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.251809   18968 server.go:989] "Cloud provider determined current node" nodeName="ip-172-21-45-245.us-gov-east-1.compute.internal"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.251834   18968 server.go:1131] "Using root directory" path="/var/lib/kubelet"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.251897   18968 kubelet.go:404] "Attempting to sync node with API server"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.251907   18968 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc000586f70, {CONNECTING <nil>}
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.251916   18968 kubelet.go:272] "Adding static pod path" path="/etc/kubernetes/manifests"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.251945   18968 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc000587100, {CONNECTING <nil>}
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.251971   18968 file.go:68] "Watching path" path="/etc/kubernetes/manifests"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.251991   18968 kubelet.go:283] "Adding apiserver pod source"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.252029   18968 apiserver.go:42] "Waiting for node sync before watching apiserver pods"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.252427   18968 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc000586f70, {READY <nil>}
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.252976   18968 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc000587100, {READY <nil>}
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.253532   18968 kuberuntime_manager.go:222] "Container runtime initialized" containerRuntime="containerd" version="v1.4.12" apiVersion="v1alpha2"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: W0117 21:25:34.263938   18968 probe.go:268] Flexvolume plugin directory at /usr/libexec/kubernetes/kubelet-plugins/volume/exec/ does not exist. Recreating.
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.264135   18968 plugins.go:639] Loaded volume plugin "kubernetes.io/azure-file"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.264153   18968 plugins.go:639] Loaded volume plugin "kubernetes.io/vsphere-volume"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.264163   18968 plugins.go:639] Loaded volume plugin "kubernetes.io/aws-ebs"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.264172   18968 plugins.go:639] Loaded volume plugin "kubernetes.io/gce-pd"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.264181   18968 plugins.go:639] Loaded volume plugin "kubernetes.io/cinder"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.264190   18968 plugins.go:639] Loaded volume plugin "kubernetes.io/azure-disk"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.264210   18968 plugins.go:639] Loaded volume plugin "kubernetes.io/empty-dir"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.264220   18968 plugins.go:639] Loaded volume plugin "kubernetes.io/git-repo"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.264239   18968 plugins.go:639] Loaded volume plugin "kubernetes.io/host-path"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.264248   18968 plugins.go:639] Loaded volume plugin "kubernetes.io/nfs"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.264257   18968 plugins.go:639] Loaded volume plugin "kubernetes.io/secret"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.264273   18968 plugins.go:639] Loaded volume plugin "kubernetes.io/iscsi"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.264283   18968 plugins.go:639] Loaded volume plugin "kubernetes.io/glusterfs"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.264302   18968 plugins.go:639] Loaded volume plugin "kubernetes.io/rbd"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.264311   18968 plugins.go:639] Loaded volume plugin "kubernetes.io/quobyte"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.264320   18968 plugins.go:639] Loaded volume plugin "kubernetes.io/cephfs"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.264330   18968 plugins.go:639] Loaded volume plugin "kubernetes.io/downward-api"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.264339   18968 plugins.go:639] Loaded volume plugin "kubernetes.io/fc"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.264348   18968 plugins.go:639] Loaded volume plugin "kubernetes.io/flocker"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.264358   18968 plugins.go:639] Loaded volume plugin "kubernetes.io/configmap"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.264368   18968 plugins.go:639] Loaded volume plugin "kubernetes.io/projected"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.264390   18968 plugins.go:639] Loaded volume plugin "kubernetes.io/portworx-volume"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.264405   18968 plugins.go:639] Loaded volume plugin "kubernetes.io/scaleio"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.264415   18968 plugins.go:639] Loaded volume plugin "kubernetes.io/local-volume"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.264425   18968 plugins.go:639] Loaded volume plugin "kubernetes.io/storageos"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.264451   18968 plugins.go:639] Loaded volume plugin "kubernetes.io/csi"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.264740   18968 server.go:1190] "Started kubelet"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: E0117 21:25:34.266051   18968 cri_stats_provider.go:369] "Failed to get the info of the filesystem with mountpoint" err="unable to find data in memory cache" mountpoint="/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: E0117 21:25:34.266083   18968 kubelet.go:1306] "Image garbage collection failed once. Stats initialization may not have completed yet" err="invalid capacity 0 on image filesystem"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.266301   18968 server.go:175] "Starting to listen read-only" address="0.0.0.0" port=10255
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.267806   18968 server.go:149] "Starting to listen" address="0.0.0.0" port=10250
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.270160   18968 server.go:405] "Adding debug handlers to kubelet server"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.271548   18968 fs_resource_analyzer.go:67] "Starting FS ResourceAnalyzer"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.273986   18968 volume_manager.go:269] "The desired_state_of_world populator starts"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.274094   18968 volume_manager.go:271] "Starting Kubelet Volume Manager"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.274635   18968 desired_state_of_world_populator.go:141] "Desired state populator starts to run"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: E0117 21:25:34.275326   18968 kubelet.go:2211] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.275355   18968 factory.go:55] Registering systemd factory
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.275571   18968 client.go:86] parsed scheme: "unix"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.275585   18968 client.go:86] scheme "unix" not registered, fallback to default scheme
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.275606   18968 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{unix:///run/containerd/containerd.sock  <nil> 0 <nil>}] <nil> <nil>}
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.275613   18968 clientconn.go:948] ClientConn switching balancer to "pick_first"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.275698   18968 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc00126ea30, {CONNECTING <nil>}
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.277024   18968 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc00126ea30, {READY <nil>}
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.277479   18968 factory.go:137] Registering containerd factory
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.277632   18968 factory.go:101] Registering Raw factory
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.277713   18968 manager.go:1203] Started watching for new ooms in manager
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.278249   18968 manager.go:301] Starting recovery of all containers
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.287629   18968 manager.go:306] Recovery completed
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.302713   18968 nodeinfomanager.go:401] Failed to publish CSINode: nodes "ip-172-21-45-245.us-gov-east-1.compute.internal" not found
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: E0117 21:25:34.307898   18968 nodelease.go:49] "Failed to get node when trying to set owner ref to the node lease" err="nodes \"ip-172-21-45-245.us-gov-east-1.compute.internal\" not found" node="ip-172-21-45-245.us-gov-east-1.compute.internal"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.325253   18968 nodeinfomanager.go:401] Failed to publish CSINode: nodes "ip-172-21-45-245.us-gov-east-1.compute.internal" not found
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.343385   18968 kubelet_network_linux.go:56] "Initialized protocol iptables rules." protocol=IPv4
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.351273   18968 kubelet_node_status.go:362] "Setting node annotation to enable volume controller attach/detach"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.351305   18968 kubelet_node_status.go:410] "Adding label from cloud provider" labelKey="beta.kubernetes.io/instance-type" labelValue="g4dn.xlarge"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.351318   18968 kubelet_node_status.go:412] "Adding node label from cloud provider" labelKey="node.kubernetes.io/instance-type" labelValue="g4dn.xlarge"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.351330   18968 kubelet_node_status.go:423] "Adding node label from cloud provider" labelKey="failure-domain.beta.kubernetes.io/zone" labelValue="us-gov-east-1a"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.351342   18968 kubelet_node_status.go:425] "Adding node label from cloud provider" labelKey="topology.kubernetes.io/zone" labelValue="us-gov-east-1a"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.351356   18968 kubelet_node_status.go:429] "Adding node label from cloud provider" labelKey="failure-domain.beta.kubernetes.io/region" labelValue="us-gov-east-1"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.351367   18968 kubelet_node_status.go:431] "Adding node label from cloud provider" labelKey="topology.kubernetes.io/region" labelValue="us-gov-east-1"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.370854   18968 kubelet_network_linux.go:56] "Initialized protocol iptables rules." protocol=IPv6
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.370872   18968 status_manager.go:157] "Starting to sync pod status with apiserver"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.370883   18968 kubelet.go:1846] "Starting kubelet main sync loop"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: E0117 21:25:34.370918   18968 kubelet.go:1870] "Skipping pod synchronization" err="[container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.374680   18968 kubelet_node_status.go:362] "Setting node annotation to enable volume controller attach/detach"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.374708   18968 kubelet_node_status.go:410] "Adding label from cloud provider" labelKey="beta.kubernetes.io/instance-type" labelValue="g4dn.xlarge"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.374721   18968 kubelet_node_status.go:412] "Adding node label from cloud provider" labelKey="node.kubernetes.io/instance-type" labelValue="g4dn.xlarge"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.374732   18968 kubelet_node_status.go:423] "Adding node label from cloud provider" labelKey="failure-domain.beta.kubernetes.io/zone" labelValue="us-gov-east-1a"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.374743   18968 kubelet_node_status.go:425] "Adding node label from cloud provider" labelKey="topology.kubernetes.io/zone" labelValue="us-gov-east-1a"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.374756   18968 kubelet_node_status.go:429] "Adding node label from cloud provider" labelKey="failure-domain.beta.kubernetes.io/region" labelValue="us-gov-east-1"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.374767   18968 kubelet_node_status.go:431] "Adding node label from cloud provider" labelKey="topology.kubernetes.io/region" labelValue="us-gov-east-1"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: E0117 21:25:34.374690   18968 kubelet.go:2291] "Error getting node" err="node \"ip-172-21-45-245.us-gov-east-1.compute.internal\" not found"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.457208   18968 nodeinfomanager.go:401] Failed to publish CSINode: nodes "ip-172-21-45-245.us-gov-east-1.compute.internal" not found
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: E0117 21:25:34.471640   18968 kubelet.go:1870] "Skipping pod synchronization" err="container runtime status check may not have completed yet"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: E0117 21:25:34.475766   18968 kubelet.go:2291] "Error getting node" err="node \"ip-172-21-45-245.us-gov-east-1.compute.internal\" not found"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: E0117 21:25:34.576020   18968 kubelet.go:2291] "Error getting node" err="node \"ip-172-21-45-245.us-gov-east-1.compute.internal\" not found"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.606480   18968 kubelet_node_status.go:554] "Recording event message for node" node="ip-172-21-45-245.us-gov-east-1.compute.internal" event="NodeHasSufficientMemory"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.606519   18968 kubelet_node_status.go:554] "Recording event message for node" node="ip-172-21-45-245.us-gov-east-1.compute.internal" event="NodeHasSufficientMemory"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.606532   18968 kubelet_node_status.go:554] "Recording event message for node" node="ip-172-21-45-245.us-gov-east-1.compute.internal" event="NodeHasNoDiskPressure"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.606537   18968 kubelet_node_status.go:554] "Recording event message for node" node="ip-172-21-45-245.us-gov-east-1.compute.internal" event="NodeHasNoDiskPressure"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.606548   18968 kubelet_node_status.go:554] "Recording event message for node" node="ip-172-21-45-245.us-gov-east-1.compute.internal" event="NodeHasSufficientPID"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.606549   18968 kubelet_node_status.go:554] "Recording event message for node" node="ip-172-21-45-245.us-gov-east-1.compute.internal" event="NodeHasSufficientPID"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.606576   18968 kubelet_node_status.go:71] "Attempting to register node" node="ip-172-21-45-245.us-gov-east-1.compute.internal"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.607077   18968 cpu_manager.go:199] "Starting CPU manager" policy="none"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.607091   18968 cpu_manager.go:200] "Reconciling" reconcilePeriod="10s"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.607104   18968 state_mem.go:36] "Initialized new in-memory state store"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.611178   18968 policy_none.go:44] "None policy: Start"
Jan 17 21:25:34 ip-172-21-45-245 systemd[1]: Created slice libcontainer container kubepods.slice.
Jan 17 21:25:34 ip-172-21-45-245 systemd[1]: Created slice libcontainer container kubepods-burstable.slice.
Jan 17 21:25:34 ip-172-21-45-245 systemd[1]: Created slice libcontainer container kubepods-besteffort.slice.
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.646892   18968 manager.go:242] "Starting Device Plugin manager"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.647047   18968 manager.go:600] "Failed to retrieve checkpoint" checkpoint="kubelet_internal_checkpoint" err="checkpoint is not found"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.647214   18968 manager.go:284] "Serving device plugin registration server on socket" path="/var/lib/kubelet/device-plugins/kubelet.sock"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.647357   18968 plugin_watcher.go:52] "Plugin Watcher Start" path="/var/lib/kubelet/plugins_registry"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.647495   18968 plugin_manager.go:112] "The desired_state_of_world populator (plugin watcher) starts"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.647577   18968 plugin_manager.go:114] "Starting Kubelet Plugin Manager"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: W0117 21:25:34.647773   18968 watcher.go:95] Error while processing event ("/sys/fs/cgroup/devices/kubepods.slice/kubepods-besteffort.slice": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/devices/kubepods.slice/kubepods-besteffort.slice: no such file or directory
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: E0117 21:25:34.648976   18968 eviction_manager.go:255] "Eviction manager: failed to get summary stats" err="failed to get node info: node \"ip-172-21-45-245.us-gov-east-1.compute.internal\" not found"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: E0117 21:25:34.654911   18968 kubelet_node_status.go:93] "Unable to register node with API server" err="Node \"ip-172-21-45-245.us-gov-east-1.compute.internal\" is invalid: metadata.taints[1]: Duplicate value: core.Taint{Key:\"nvidia.com/gpu\", Value:\"\", Effect:\"NoSchedule\", TimeAdded:(*v1.Time)(nil)}: taints must be unique by key and effect pair" node="ip-172-21-45-245.us-gov-east-1.compute.internal"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.671997   18968 kubelet.go:1932] "SyncLoop ADD" source="file" pods=[kube-system/kube-proxy-ip-172-21-45-245.us-gov-east-1.compute.internal]
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.672031   18968 topology_manager.go:187] "Topology Admit Handler"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.672050   18968 kubelet_node_status.go:362] "Setting node annotation to enable volume controller attach/detach"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.672060   18968 kubelet_node_status.go:410] "Adding label from cloud provider" labelKey="beta.kubernetes.io/instance-type" labelValue="g4dn.xlarge"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.672072   18968 kubelet_node_status.go:412] "Adding node label from cloud provider" labelKey="node.kubernetes.io/instance-type" labelValue="g4dn.xlarge"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.672083   18968 kubelet_node_status.go:423] "Adding node label from cloud provider" labelKey="failure-domain.beta.kubernetes.io/zone" labelValue="us-gov-east-1a"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.672091   18968 kubelet_node_status.go:425] "Adding node label from cloud provider" labelKey="topology.kubernetes.io/zone" labelValue="us-gov-east-1a"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.672100   18968 kubelet_node_status.go:429] "Adding node label from cloud provider" labelKey="failure-domain.beta.kubernetes.io/region" labelValue="us-gov-east-1"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.672107   18968 kubelet_node_status.go:431] "Adding node label from cloud provider" labelKey="topology.kubernetes.io/region" labelValue="us-gov-east-1"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.672646   18968 kubelet_node_status.go:554] "Recording event message for node" node="ip-172-21-45-245.us-gov-east-1.compute.internal" event="NodeHasSufficientMemory"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.672671   18968 kubelet_node_status.go:554] "Recording event message for node" node="ip-172-21-45-245.us-gov-east-1.compute.internal" event="NodeHasNoDiskPressure"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.672684   18968 kubelet_node_status.go:554] "Recording event message for node" node="ip-172-21-45-245.us-gov-east-1.compute.internal" event="NodeHasSufficientPID"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.672795   18968 kubelet_node_status.go:362] "Setting node annotation to enable volume controller attach/detach"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.672806   18968 kubelet_node_status.go:410] "Adding label from cloud provider" labelKey="beta.kubernetes.io/instance-type" labelValue="g4dn.xlarge"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.672817   18968 kubelet_node_status.go:412] "Adding node label from cloud provider" labelKey="node.kubernetes.io/instance-type" labelValue="g4dn.xlarge"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.672826   18968 kubelet_node_status.go:423] "Adding node label from cloud provider" labelKey="failure-domain.beta.kubernetes.io/zone" labelValue="us-gov-east-1a"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.672835   18968 kubelet_node_status.go:425] "Adding node label from cloud provider" labelKey="topology.kubernetes.io/zone" labelValue="us-gov-east-1a"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.672844   18968 kubelet_node_status.go:429] "Adding node label from cloud provider" labelKey="failure-domain.beta.kubernetes.io/region" labelValue="us-gov-east-1"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.672851   18968 kubelet_node_status.go:431] "Adding node label from cloud provider" labelKey="topology.kubernetes.io/region" labelValue="us-gov-east-1"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.673340   18968 kubelet_node_status.go:554] "Recording event message for node" node="ip-172-21-45-245.us-gov-east-1.compute.internal" event="NodeHasSufficientMemory"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.673358   18968 kubelet_node_status.go:554] "Recording event message for node" node="ip-172-21-45-245.us-gov-east-1.compute.internal" event="NodeHasNoDiskPressure"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.673367   18968 kubelet_node_status.go:554] "Recording event message for node" node="ip-172-21-45-245.us-gov-east-1.compute.internal" event="NodeHasSufficientPID"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: E0117 21:25:34.676112   18968 kubelet.go:2291] "Error getting node" err="node \"ip-172-21-45-245.us-gov-east-1.compute.internal\" not found"
Jan 17 21:25:34 ip-172-21-45-245 systemd[1]: Created slice libcontainer container kubepods-burstable-pod7e2fad9ed403dd6771d1b49b8a8e1412.slice.
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.678400   18968 reconciler.go:224] "operationExecutor.VerifyControllerAttachedVolume started for volume \"iptableslock\" (UniqueName: \"kubernetes.io/host-path/7e2fad9ed403dd6771d1b49b8a8e1412-iptableslock\") pod \"kube-proxy-ip-172-21-45-245.us-gov-east-1.compute.internal\" (UID: \"7e2fad9ed403dd6771d1b49b8a8e1412\") "
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.678440   18968 reconciler.go:224] "operationExecutor.VerifyControllerAttachedVolume started for volume \"logfile\" (UniqueName: \"kubernetes.io/host-path/7e2fad9ed403dd6771d1b49b8a8e1412-logfile\") pod \"kube-proxy-ip-172-21-45-245.us-gov-east-1.compute.internal\" (UID: \"7e2fad9ed403dd6771d1b49b8a8e1412\") "
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.678472   18968 reconciler.go:224] "operationExecutor.VerifyControllerAttachedVolume started for volume \"kubeconfig\" (UniqueName: \"kubernetes.io/host-path/7e2fad9ed403dd6771d1b49b8a8e1412-kubeconfig\") pod \"kube-proxy-ip-172-21-45-245.us-gov-east-1.compute.internal\" (UID: \"7e2fad9ed403dd6771d1b49b8a8e1412\") "
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.678520   18968 reconciler.go:224] "operationExecutor.VerifyControllerAttachedVolume started for volume \"modules\" (UniqueName: \"kubernetes.io/host-path/7e2fad9ed403dd6771d1b49b8a8e1412-modules\") pod \"kube-proxy-ip-172-21-45-245.us-gov-east-1.compute.internal\" (UID: \"7e2fad9ed403dd6771d1b49b8a8e1412\") "
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.678555   18968 reconciler.go:224] "operationExecutor.VerifyControllerAttachedVolume started for volume \"ssl-certs-hosts\" (UniqueName: \"kubernetes.io/host-path/7e2fad9ed403dd6771d1b49b8a8e1412-ssl-certs-hosts\") pod \"kube-proxy-ip-172-21-45-245.us-gov-east-1.compute.internal\" (UID: \"7e2fad9ed403dd6771d1b49b8a8e1412\") "
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.678589   18968 reconciler.go:224] "operationExecutor.VerifyControllerAttachedVolume started for volume \"etchosts\" (UniqueName: \"kubernetes.io/host-path/7e2fad9ed403dd6771d1b49b8a8e1412-etchosts\") pod \"kube-proxy-ip-172-21-45-245.us-gov-east-1.compute.internal\" (UID: \"7e2fad9ed403dd6771d1b49b8a8e1412\") "
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: E0117 21:25:34.776975   18968 kubelet.go:2291] "Error getting node" err="node \"ip-172-21-45-245.us-gov-east-1.compute.internal\" not found"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.779117   18968 reconciler.go:269] "operationExecutor.MountVolume started for volume \"ssl-certs-hosts\" (UniqueName: \"kubernetes.io/host-path/7e2fad9ed403dd6771d1b49b8a8e1412-ssl-certs-hosts\") pod \"kube-proxy-ip-172-21-45-245.us-gov-east-1.compute.internal\" (UID: \"7e2fad9ed403dd6771d1b49b8a8e1412\") "
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.779163   18968 reconciler.go:269] "operationExecutor.MountVolume started for volume \"etchosts\" (UniqueName: \"kubernetes.io/host-path/7e2fad9ed403dd6771d1b49b8a8e1412-etchosts\") pod \"kube-proxy-ip-172-21-45-245.us-gov-east-1.compute.internal\" (UID: \"7e2fad9ed403dd6771d1b49b8a8e1412\") "
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.779199   18968 reconciler.go:269] "operationExecutor.MountVolume started for volume \"iptableslock\" (UniqueName: \"kubernetes.io/host-path/7e2fad9ed403dd6771d1b49b8a8e1412-iptableslock\") pod \"kube-proxy-ip-172-21-45-245.us-gov-east-1.compute.internal\" (UID: \"7e2fad9ed403dd6771d1b49b8a8e1412\") "
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.779231   18968 operation_generator.go:698] MountVolume.SetUp succeeded for volume "ssl-certs-hosts" (UniqueName: "kubernetes.io/host-path/7e2fad9ed403dd6771d1b49b8a8e1412-ssl-certs-hosts") pod "kube-proxy-ip-172-21-45-245.us-gov-east-1.compute.internal" (UID: "7e2fad9ed403dd6771d1b49b8a8e1412")
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.779233   18968 reconciler.go:269] "operationExecutor.MountVolume started for volume \"logfile\" (UniqueName: \"kubernetes.io/host-path/7e2fad9ed403dd6771d1b49b8a8e1412-logfile\") pod \"kube-proxy-ip-172-21-45-245.us-gov-east-1.compute.internal\" (UID: \"7e2fad9ed403dd6771d1b49b8a8e1412\") "
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.779261   18968 operation_generator.go:698] MountVolume.SetUp succeeded for volume "logfile" (UniqueName: "kubernetes.io/host-path/7e2fad9ed403dd6771d1b49b8a8e1412-logfile") pod "kube-proxy-ip-172-21-45-245.us-gov-east-1.compute.internal" (UID: "7e2fad9ed403dd6771d1b49b8a8e1412")
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.779299   18968 reconciler.go:269] "operationExecutor.MountVolume started for volume \"kubeconfig\" (UniqueName: \"kubernetes.io/host-path/7e2fad9ed403dd6771d1b49b8a8e1412-kubeconfig\") pod \"kube-proxy-ip-172-21-45-245.us-gov-east-1.compute.internal\" (UID: \"7e2fad9ed403dd6771d1b49b8a8e1412\") "
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.779308   18968 operation_generator.go:698] MountVolume.SetUp succeeded for volume "iptableslock" (UniqueName: "kubernetes.io/host-path/7e2fad9ed403dd6771d1b49b8a8e1412-iptableslock") pod "kube-proxy-ip-172-21-45-245.us-gov-east-1.compute.internal" (UID: "7e2fad9ed403dd6771d1b49b8a8e1412")
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.779338   18968 operation_generator.go:698] MountVolume.SetUp succeeded for volume "kubeconfig" (UniqueName: "kubernetes.io/host-path/7e2fad9ed403dd6771d1b49b8a8e1412-kubeconfig") pod "kube-proxy-ip-172-21-45-245.us-gov-east-1.compute.internal" (UID: "7e2fad9ed403dd6771d1b49b8a8e1412")
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.779352   18968 operation_generator.go:698] MountVolume.SetUp succeeded for volume "etchosts" (UniqueName: "kubernetes.io/host-path/7e2fad9ed403dd6771d1b49b8a8e1412-etchosts") pod "kube-proxy-ip-172-21-45-245.us-gov-east-1.compute.internal" (UID: "7e2fad9ed403dd6771d1b49b8a8e1412")
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.779357   18968 reconciler.go:269] "operationExecutor.MountVolume started for volume \"modules\" (UniqueName: \"kubernetes.io/host-path/7e2fad9ed403dd6771d1b49b8a8e1412-modules\") pod \"kube-proxy-ip-172-21-45-245.us-gov-east-1.compute.internal\" (UID: \"7e2fad9ed403dd6771d1b49b8a8e1412\") "
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.779398   18968 operation_generator.go:698] MountVolume.SetUp succeeded for volume "modules" (UniqueName: "kubernetes.io/host-path/7e2fad9ed403dd6771d1b49b8a8e1412-modules") pod "kube-proxy-ip-172-21-45-245.us-gov-east-1.compute.internal" (UID: "7e2fad9ed403dd6771d1b49b8a8e1412")
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.855337   18968 kubelet_node_status.go:362] "Setting node annotation to enable volume controller attach/detach"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.855361   18968 kubelet_node_status.go:410] "Adding label from cloud provider" labelKey="beta.kubernetes.io/instance-type" labelValue="g4dn.xlarge"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.855374   18968 kubelet_node_status.go:412] "Adding node label from cloud provider" labelKey="node.kubernetes.io/instance-type" labelValue="g4dn.xlarge"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.855386   18968 kubelet_node_status.go:423] "Adding node label from cloud provider" labelKey="failure-domain.beta.kubernetes.io/zone" labelValue="us-gov-east-1a"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.855397   18968 kubelet_node_status.go:425] "Adding node label from cloud provider" labelKey="topology.kubernetes.io/zone" labelValue="us-gov-east-1a"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.855409   18968 kubelet_node_status.go:429] "Adding node label from cloud provider" labelKey="failure-domain.beta.kubernetes.io/region" labelValue="us-gov-east-1"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.855420   18968 kubelet_node_status.go:431] "Adding node label from cloud provider" labelKey="topology.kubernetes.io/region" labelValue="us-gov-east-1"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.855949   18968 kubelet_node_status.go:554] "Recording event message for node" node="ip-172-21-45-245.us-gov-east-1.compute.internal" event="NodeHasSufficientMemory"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.856021   18968 kubelet_node_status.go:554] "Recording event message for node" node="ip-172-21-45-245.us-gov-east-1.compute.internal" event="NodeHasNoDiskPressure"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.856073   18968 kubelet_node_status.go:554] "Recording event message for node" node="ip-172-21-45-245.us-gov-east-1.compute.internal" event="NodeHasSufficientPID"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: I0117 21:25:34.856102   18968 kubelet_node_status.go:71] "Attempting to register node" node="ip-172-21-45-245.us-gov-east-1.compute.internal"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: E0117 21:25:34.877146   18968 kubelet.go:2291] "Error getting node" err="node \"ip-172-21-45-245.us-gov-east-1.compute.internal\" not found"
Jan 17 21:25:34 ip-172-21-45-245 kubelet[18968]: E0117 21:25:34.977594   18968 kubelet.go:2291] "Error getting node" err="node \"ip-172-21-45-245.us-gov-east-1.compute.internal\" not found"
Jan 17 21:25:35 ip-172-21-45-245 kubelet[18968]: I0117 21:25:35.008866   18968 kuberuntime_manager.go:460] "No sandbox for pod can be found. Need to start a new one" pod="kube-system/kube-proxy-ip-172-21-45-245.us-gov-east-1.compute.internal"
Jan 17 21:25:35 ip-172-21-45-245 containerd[18745]: time="2022-01-17T21:25:35.009197230Z" level=info msg="RunPodsandbox for &PodSandboxMetadata{Name:kube-proxy-ip-172-21-45-245.us-gov-east-1.compute.internal,Uid:7e2fad9ed403dd6771d1b49b8a8e1412,Namespace:kube-system,Attempt:0,}"
Jan 17 21:25:35 ip-172-21-45-245 kubelet[18968]: E0117 21:25:35.078524   18968 kubelet.go:2291] "Error getting node" err="node \"ip-172-21-45-245.us-gov-east-1.compute.internal\" not found"
Jan 17 21:25:35 ip-172-21-45-245 kubelet[18968]: E0117 21:25:35.178968   18968 kubelet.go:2291] "Error getting node" err="node \"ip-172-21-45-245.us-gov-east-1.compute.internal\" not found"
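
The key failure is the kubelet_node_status.go line above: registration is rejected because the Node object carries two identical nvidia.com/gpu:NoSchedule taints ("taints must be unique by key and effect pair"). My guess is that kOps already applies this taint to GPU instance groups when containerd.nvidiaGPU.enabled is set (as it is in the cluster spec below), so the explicit spec.taints entry duplicates it. A sketch of the instance group with the manual taint dropped, assuming that automatic behavior:

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: mycluster.k8s.local
  name: gpu-nodes
spec:
  image: ami-0826991b05be28c33
  machineType: g4dn.xlarge
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/gpu: "1"
    kops.k8s.io/instancegroup: gpu-nodes
  role: Node
  subnets:
  - us-gov-east-1a
  - us-gov-east-1b
  - us-gov-east-1c
  # no taints: section here; the assumption is that kOps adds nvidia.com/gpu:NoSchedule on its own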

6. What did you expect to happen? The node to join the cluster.

7. Please provide your cluster manifest. Execute kops get --name my.example.com -o yaml to display your cluster manifest. You may want to remove your cluster name and other sensitive information.

apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: "2021-09-23T20:52:50Z"
  generation: 9
  name: mycluster.k8s.local
spec:
  additionalPolicies:
    master: |
      [
        {
          "Effect":"Allow",
          "Action":[
            "ssm:DescribeAssociation",
            "ssm:GetDeployablePatchSnapshotForInstance",
            "ssm:GetDocument",
            "ssm:DescribeDocument",
            "ssm:GetManifest",
            "ssm:GetParameters",
            "ssm:ListAssociations",
            "ssm:ListInstanceAssociations",
            "ssm:PutInventory",
            "ssm:PutComplianceItems",
            "ssm:PutConfigurePackageResult",
            "ssm:UpdateAssociationStatus",
            "ssm:UpdateInstanceAssociationStatus",
            "ssm:UpdateInstanceInformation"
          ],
          "Resource":[
            "*"
          ]
        },
        {
          "Effect":"Allow",
          "Action":[
            "ssmmessages:CreateControlChannel",
            "ssmmessages:CreateDataChannel",
            "ssmmessages:OpenControlChannel",
            "ssmmessages:OpenDataChannel"
          ],
          "Resource":[
            "*"
          ]
        },
        {
          "Effect":"Allow",
          "Action":[
            "ec2messages:AcknowledgeMessage",
            "ec2messages:DeleteMessage",
            "ec2messages:FailMessage",
            "ec2messages:GetEndpoint",
            "ec2messages:GetMessages",
            "ec2messages:SendReply"
          ],
          "Resource":[
            "*"
          ]
        },
        {
          "Effect":"Allow",
          "Action":[
            "cloudwatch:PutMetricData"
          ],
          "Resource":[
            "*"
          ]
        },
        {
          "Effect":"Allow",
          "Action":[
            "ec2:DescribeInstanceStatus"
          ],
          "Resource":[
            "*"
          ]
        },
        {
          "Effect":"Allow",
          "Action":[
            "ds:CreateComputer",
            "ds:DescribeDirectories"
          ],
          "Resource":[
            "*"
          ]
        },
        {
          "Effect":"Allow",
          "Action":[
            "logs:CreateLogGroup",
            "logs:CreateLogStream",
            "logs:DescribeLogGroups",
            "logs:DescribeLogStreams",
            "logs:PutLogEvents"
          ],
          "Resource":[
            "*"
          ]
        },
        {
          "Effect":"Allow",
          "Action":[
            "s3:GetBucketLocation",
            "s3:PutObject",
            "s3:GetObject",
            "s3:GetEncryptionConfiguration",
            "s3:AbortMultipartUpload",
            "s3:ListMultipartUploadParts",
            "s3:ListBucket",
            "s3:ListBucketMultipartUploads"
          ],
          "Resource":[
            "*"
          ]
        }
      ]
    node: |
      [
        {
          "Effect":"Allow",
          "Action":[
            "ssm:DescribeAssociation",
            "ssm:GetDeployablePatchSnapshotForInstance",
            "ssm:GetDocument",
            "ssm:DescribeDocument",
            "ssm:GetManifest",
            "ssm:GetParameters",
            "ssm:ListAssociations",
            "ssm:ListInstanceAssociations",
            "ssm:PutInventory",
            "ssm:PutComplianceItems",
            "ssm:PutConfigurePackageResult",
            "ssm:UpdateAssociationStatus",
            "ssm:UpdateInstanceAssociationStatus",
            "ssm:UpdateInstanceInformation"
          ],
          "Resource":[
            "*"
          ]
        },
        {
          "Effect":"Allow",
          "Action":[
            "ssmmessages:CreateControlChannel",
            "ssmmessages:CreateDataChannel",
            "ssmmessages:OpenControlChannel",
            "ssmmessages:OpenDataChannel"
          ],
          "Resource":[
            "*"
          ]
        },
        {
          "Effect":"Allow",
          "Action":[
            "ec2messages:AcknowledgeMessage",
            "ec2messages:DeleteMessage",
            "ec2messages:FailMessage",
            "ec2messages:GetEndpoint",
            "ec2messages:GetMessages",
            "ec2messages:SendReply"
          ],
          "Resource":[
            "*"
          ]
        },
        {
          "Effect":"Allow",
          "Action":[
            "cloudwatch:PutMetricData"
          ],
          "Resource":[
            "*"
          ]
        },
        {
          "Effect":"Allow",
          "Action":[
            "ec2:DescribeInstanceStatus"
          ],
          "Resource":[
            "*"
          ]
        },
        {
          "Effect":"Allow",
          "Action":[
            "ds:CreateComputer",
            "ds:DescribeDirectories"
          ],
          "Resource":[
            "*"
          ]
        },
        {
          "Effect":"Allow",
          "Action":[
            "logs:CreateLogGroup",
            "logs:CreateLogStream",
            "logs:DescribeLogGroups",
            "logs:DescribeLogStreams",
            "logs:PutLogEvents"
          ],
          "Resource":[
            "*"
          ]
        },
        {
          "Effect":"Allow",
          "Action":[
            "s3:GetBucketLocation",
            "s3:PutObject",
            "s3:GetObject",
            "s3:GetEncryptionConfiguration",
            "s3:AbortMultipartUpload",
            "s3:ListMultipartUploadParts",
            "s3:ListBucket",
            "s3:ListBucketMultipartUploads"
          ],
          "Resource":[
            "*"
          ]
        }
      ]
  api:
    loadBalancer:
      class: Classic
      type: Internal
  authorization:
    rbac: {}
  channel: stable
  cloudProvider: aws
  configBase: s3://mycluster-state-store-12345/mycluster.k8s.local
  containerd:
    nvidiaGPU:
      enabled: true
  etcdClusters:
  - cpuRequest: 200m
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: master-us-gov-east-1a
      name: a
    - encryptedVolume: true
      instanceGroup: master-us-gov-east-1b
      name: b
    - encryptedVolume: true
      instanceGroup: master-us-gov-east-1c
      name: c
    memoryRequest: 100Mi
    name: main
  - cpuRequest: 100m
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: master-us-gov-east-1a
      name: a
    - encryptedVolume: true
      instanceGroup: master-us-gov-east-1b
      name: b
    - encryptedVolume: true
      instanceGroup: master-us-gov-east-1c
      name: c
    memoryRequest: 100Mi
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubelet:
    anonymousAuth: false
  kubernetesApiAccess:
  - 0.0.0.0/0
  kubernetesVersion: 1.21.4
  masterInternalName: api.internal.mycluster.k8s.local
  masterPublicName: api.mycluster.k8s.local
  networkCIDR: 172.21.0.0/16
  networking:
    amazonvpc:
      imageName: 151742754352.dkr.ecr.us-gov-east-1.amazonaws.com/amazon-k8s-cni:v1.6.1
      initImageName: 151742754352.dkr.ecr.us-gov-east-1.amazonaws.com/amazon-k8s-cni-init:v1.7.10
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - 0.0.0.0/0
  subnets:
  - cidr: 172.21.32.0/19
    name: us-gov-east-1a
    type: Private
    zone: us-gov-east-1a
  - cidr: 172.21.64.0/19
    name: us-gov-east-1b
    type: Private
    zone: us-gov-east-1b
  - cidr: 172.21.96.0/19
    name: us-gov-east-1c
    type: Private
    zone: us-gov-east-1c
  - cidr: 172.21.0.0/22
    name: utility-us-gov-east-1a
    type: Utility
    zone: us-gov-east-1a
  - cidr: 172.21.4.0/22
    name: utility-us-gov-east-1b
    type: Utility
    zone: us-gov-east-1b
  - cidr: 172.21.8.0/22
    name: utility-us-gov-east-1c
    type: Utility
    zone: us-gov-east-1c
  topology:
    dns:
      type: Private
    masters: private
    nodes: private

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2022-01-17T21:19:21Z"
  generation: 2
  labels:
    kops.k8s.io/cluster: mycluster.k8s.local
  name: gpu-nodes
spec:
  image: ami-0826991b05be28c33
  machineType: g4dn.xlarge
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/gpu: "1"
    kops.k8s.io/instancegroup: gpu-nodes
  role: Node
  subnets:
  - us-gov-east-1a
  - us-gov-east-1b
  - us-gov-east-1c
  taints:
  - nvidia.com/gpu:NoSchedule

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2021-09-23T20:52:50Z"
  generation: 1
  labels:
    kops.k8s.io/cluster: mycluster.k8s.local
  name: master-us-gov-east-1a
spec:
  image: ami-05937974
  machineType: t3.large
  maxSize: 1
  minSize: 1
  nodeLabels:
    hostname: mdap-node-1
    kops.k8s.io/instancegroup: master-us-gov-east-1a
  role: Master
  subnets:
  - us-gov-east-1a

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2021-09-23T20:52:50Z"
  generation: 2
  labels:
    kops.k8s.io/cluster: mycluster.k8s.local
  name: master-us-gov-east-1b
spec:
  image: ami-05937974
  machineType: t3.large
  maxSize: 1
  minSize: 1
  nodeLabels:
    hostname: mdap-node-2
    kops.k8s.io/instancegroup: master-us-gov-east-1b
    tier: relational-storage
  role: Master
  subnets:
  - us-gov-east-1b

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2021-09-23T20:52:50Z"
  generation: 2
  labels:
    kops.k8s.io/cluster: mycluster.k8s.local
  name: master-us-gov-east-1c
spec:
  image: ami-05937974
  machineType: t3.large
  maxSize: 1
  minSize: 1
  nodeLabels:
    hostname: mdap-node-3
    kops.k8s.io/instancegroup: master-us-gov-east-1c
    tier: relational-storage
  role: Master
  subnets:
  - us-gov-east-1c

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2021-09-23T20:52:50Z"
  generation: 3
  labels:
    kops.k8s.io/cluster: mycluster.k8s.local
  name: nodes-us-gov-east-1a
spec:
  image: ami-05937974
  machineType: t3.large
  maxSize: 1
  minSize: 1
  nodeLabels:
    hostname: mdap-node-4
    kops.k8s.io/instancegroup: nodes-us-gov-east-1a
    tier: relational-storage
  role: Node
  subnets:
  - us-gov-east-1a

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2021-10-13T22:42:39Z"
  labels:
    kops.k8s.io/cluster: mycluster.k8s.local
  name: nodes-us-gov-east-1a-2
spec:
  image: ami-05937974
  machineType: t3.large
  maxSize: 1
  minSize: 1
  nodeLabels:
    hostname: mdap-node-5
    kops.k8s.io/instancegroup: nodes-us-gov-east-1a-2
  role: Node
  subnets:
  - us-gov-east-1a

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2021-09-23T20:52:50Z"
  generation: 4
  labels:
    kops.k8s.io/cluster: mycluster.k8s.local
  name: nodes-us-gov-east-1b
spec:
  image: ami-05937974
  machineType: t3.large
  maxSize: 1
  minSize: 1
  nodeLabels:
    hostname: mdap-node-6
    kops.k8s.io/instancegroup: nodes-us-gov-east-1b
    tier: block-storage
    tier.mapping: mapping
  role: Node
  subnets:
  - us-gov-east-1b

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2021-10-13T22:43:30Z"
  generation: 1
  labels:
    kops.k8s.io/cluster: mycluster.k8s.local
  name: nodes-us-gov-east-1b-2
spec:
  image: ami-05937974
  machineType: t3.large
  maxSize: 1
  minSize: 1
  nodeLabels:
    hostname: mdap-node-7
    kops.k8s.io/instancegroup: nodes-us-gov-east-1b-2
    tier: block-storage
  role: Node
  subnets:
  - us-gov-east-1b

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2021-09-23T20:52:50Z"
  generation: 4
  labels:
    kops.k8s.io/cluster: mycluster.k8s.local
  name: nodes-us-gov-east-1c
spec:
  image: ami-05937974
  machineType: t3.large
  maxSize: 1
  minSize: 1
  nodeLabels:
    hostname: mdap-node-8
    kops.k8s.io/instancegroup: nodes-us-gov-east-1c
    tier: block-storage
  role: Node
  subnets:
  - us-gov-east-1c

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2021-10-13T22:44:18Z"
  generation: 1
  labels:
    kops.k8s.io/cluster: mycluster.k8s.local
  name: nodes-us-gov-east-1c-2
spec:
  image: ami-05937974
  machineType: t3.large
  maxSize: 1
  minSize: 1
  nodeLabels:
    hostname: mdap-node-9
    kops.k8s.io/instancegroup: nodes-us-gov-east-1c-2
    tier: block-storage
  role: Node
  subnets:
  - us-gov-east-1c

olemarkus commented 2 years ago

I think this is just us not deduping the taint you provide. Can you try removing:

  taints:
  - nvidia.com/gpu:NoSchedule

kOps already adds this.
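
For reference, a minimal sketch of the reporter's gpu-nodes InstanceGroup with the taints block removed (same values as the manifest above; per the comment above, kOps applies the nvidia.com/gpu:NoSchedule taint itself when containerd.nvidiaGPU.enabled is set):

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: mycluster.k8s.local
  name: gpu-nodes
spec:
  image: ami-0826991b05be28c33
  machineType: g4dn.xlarge
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/gpu: "1"
    kops.k8s.io/instancegroup: gpu-nodes
  role: Node
  subnets:
  - us-gov-east-1a
  - us-gov-east-1b
  - us-gov-east-1c
  # no taints block: kOps adds nvidia.com/gpu:NoSchedule on its own for GPU instance groups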

tionichm commented 2 years ago

I'm experiencing the same issue. The only way I can resolve it is to delete my GPU node IG and recreate it every time I want to make edits.
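
A sketch of that recreate cycle, assuming the cluster name mycluster.k8s.local from the manifests above and a local file name of your choosing (exact flags may vary by kOps version):

# Keep a clean copy of the IG spec (without any taints block)
kops get ig gpu-nodes --name mycluster.k8s.local -o yaml > gpu-nodes.yaml

# Delete and recreate the instance group, then apply the change
kops delete ig gpu-nodes --name mycluster.k8s.local --yes
kops create -f gpu-nodes.yaml --name mycluster.k8s.local
kops update cluster --name mycluster.k8s.local --yes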

olemarkus commented 2 years ago

Do you see the same error from kubelet, and can you confirm that you have not set the above taint in the manifest directly?

tionichm commented 2 years ago

Do you see the same error from kubelet, and can you confirm that you have not set the above taint in the manifest directly?

Yes. I normally spin up my cluster to do some dev work and switch it off afterwards (by "off" I mean draining my nodes and setting minSize and maxSize to zero). Every time I spin the cluster back up I need to recreate my gpu-nodes IG, otherwise I run into the same problem: the taint gets added on every edit/save and the node won't join the cluster. Usually it looks something like this:

  taints:
  - nvidia.com/gpu:NoSchedule

And once edited and saved:

  taints:
  - nvidia.com/gpu:NoSchedule
  - nvidia.com/gpu:NoSchedule

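For context, the spin-down/spin-up cycle described above might look like the following (a sketch, assuming the cluster name mycluster.k8s.local; the node name is a placeholder):

# Spin down: drain the GPU node, then scale the group to zero
kubectl drain <gpu-node-name> --ignore-daemonsets --delete-emptydir-data
kops edit ig gpu-nodes --name mycluster.k8s.local    # set minSize: 0, maxSize: 0
kops update cluster --name mycluster.k8s.local --yes

# Spin up again later
kops edit ig gpu-nodes --name mycluster.k8s.local    # set minSize: 1, maxSize: 1
kops update cluster --name mycluster.k8s.local --yes
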
olemarkus commented 2 years ago

Try removing both entries
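
A sketch of what removing both entries looks like in practice, assuming the same cluster name:

# Open the IG spec in $EDITOR, delete both taint lines, then save
kops edit ig gpu-nodes --name mycluster.k8s.local
kops update cluster --name mycluster.k8s.local --yes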

tionichm commented 2 years ago

I've tried that, but then the taint is re-added after the save, unfortunately.

olemarkus commented 2 years ago

I have an idea what is happening here ... /assign

ptdtan commented 2 years ago

I'm experiencing the same issue here. Has there been any progress on this?

olemarkus commented 2 years ago

You are using kOps 1.23.0, right? Can you open a new issue?

ptdtan commented 2 years ago

Sure, please see this: https://github.com/kubernetes/kops/issues/13468