kubernetes / kops

Kubernetes Operations (kOps) - Production Grade k8s Installation, Upgrades and Management
https://kops.sigs.k8s.io/
Apache License 2.0

AmazonVPC CNI broken in Kops #16734

Open lukasmrtvy opened 3 months ago

lukasmrtvy commented 3 months ago

/kind bug

1. What kops version are you running? The command kops version will display this information.

1.29.2

2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running or provide the Kubernetes version specified as a kops flag.

1.29.7

3. What cloud provider are you using?

AWS

4. What commands did you run? What is the simplest way to reproduce this issue?

export AWS_ACCESS_KEY_ID=XXXX
export AWS_SECRET_ACCESS_KEY=XXXX
export AWS_REGION=eu-central-1
export KOPS_STATE_STORE=s3://example-kops-state-store

kops create -f kops.yaml
kops update cluster --name test.example.com --yes --admin

5. What happened after the commands executed?

The cluster is created, but the CNI (AmazonVPC) is broken: Pod->Pod and Pod->Service traffic fails with i/o timeout. I am running a mixed topology: control plane and workload nodes in private subnets, gateway nodes in a public subnet.

NAMESPACE     NAME                                            READY   STATUS             RESTARTS        AGE
default       test678aa                                       1/1     Running            0               6m7s
kube-system   aws-cloud-controller-manager-2sk5s              1/1     Running            0               11m
kube-system   aws-node-8jw57                                  2/2     Running            0               8m49s
kube-system   aws-node-nftng                                  2/2     Running            0               9m17s
kube-system   aws-node-phkdf                                  2/2     Running            0               11m
kube-system   aws-node-termination-handler-5b988d67cd-2hjlb   0/1     CrashLoopBackOff   6 (45s ago)     11m
kube-system   coredns-78ccb5b8c5-4rq4c                        1/1     Running            0               8m16s
kube-system   coredns-78ccb5b8c5-gmzx5                        0/1     Running            4 (60s ago)     11m
kube-system   coredns-autoscaler-55c99b49b7-pffqc             1/1     Running            0               11m
kube-system   ebs-csi-controller-65676964b6-7vx7d             5/6     CrashLoopBackOff   9 (15s ago)     11m
kube-system   ebs-csi-node-4ldz5                              3/3     Running            7 (2m51s ago)   10m
kube-system   ebs-csi-node-5rl4v                              2/3     CrashLoopBackOff   6 (50s ago)     9m17s
kube-system   ebs-csi-node-zddbk                              2/3     CrashLoopBackOff   6 (41s ago)     8m49s
kube-system   etcd-manager-events-i-0fe4d8007f51c493b         1/1     Running            0               10m
kube-system   etcd-manager-main-i-0fe4d8007f51c493b           1/1     Running            0               9m44s
kube-system   kops-controller-7cbpn                           1/1     Running            0               11m
kube-system   kube-apiserver-i-0fe4d8007f51c493b              2/2     Running            2 (11m ago)     10m
kube-system   kube-controller-manager-i-0fe4d8007f51c493b     1/1     Running            3 (11m ago)     10m
kube-system   kube-proxy-i-001e89332beaa4ab7                  1/1     Running            0               9m17s
kube-system   kube-proxy-i-0182e11c841a4f31b                  1/1     Running            0               8m49s
kube-system   kube-proxy-i-0fe4d8007f51c493b                  1/1     Running            0               10m
kube-system   kube-scheduler-i-0fe4d8007f51c493b              1/1     Running            0               10m

6. What did you expect to happen?

The CNI works: Pod->Pod and Pod->Service traffic succeeds.

7. Please provide your cluster manifest. Execute kops get --name my.example.com -o yaml to display your cluster manifest. You may want to remove your cluster name and other sensitive information.

apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  name: test.example.com
spec:
  api:
    loadBalancer:
      class: Network
      type: Public
  authorization:
    rbac: {}
  channel: stable
  cloudProvider: aws
  configBase: s3://example-kops-state-store/test.example.com
  etcdClusters:
  - cpuRequest: 200m
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: control-plane-eu-central-1a
      name: a
    manager:
      backupRetentionDays: 90
    memoryRequest: 100Mi
    name: main
  - cpuRequest: 100m
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: control-plane-eu-central-1a
      name: a
    manager:
      backupRetentionDays: 90
    memoryRequest: 100Mi
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
    useServiceAccountExternalPermissions: true
  kubelet:
    anonymousAuth: false
  kubernetesApiAccess:
  - 0.0.0.0/0
  - ::/0
  kubernetesVersion: 1.29.7
  networkCIDR: 172.20.0.0/16
  networking:
    amazonvpc: {}
  nonMasqueradeCIDR: 172.20.0.0/16
  serviceAccountIssuerDiscovery:
    discoveryStore: s3://example-kops-oidc-store/test.example.com/discovery/test.example.com
    enableAWSOIDCProvider: true
  sshAccess:
  - 0.0.0.0/0
  - ::/0
  subnets:
  - cidr: 172.20.0.0/19
    name: eu-central-1a-public
    type: Public
    zone: eu-central-1a
  - cidr: 172.20.32.0/19
    name: eu-central-1b-public
    type: Public
    zone: eu-central-1b
  - cidr: 172.20.64.0/19
    name: eu-central-1a-private
    type: Private
    zone: eu-central-1a
  - cidr: 172.20.96.0/19
    name: eu-central-1b-private
    type: Private
    zone: eu-central-1b
  - cidr: 172.20.128.0/19
    name: eu-central-1a-Utility
    type: Utility
    zone: eu-central-1a
  - cidr: 172.20.160.0/19
    name: eu-central-1b-Utility
    type: Utility
    zone: eu-central-1b
  topology:
    dns:
      type: None
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: test.example.com
  name: control-plane-eu-central-1a
spec:
  image: 137112412989/al2023-ami-2023.5.20240722.0-kernel-6.1-arm64
  machineType: t4g.medium
  maxSize: 1
  minSize: 1
  role: Master
  subnets:
  - eu-central-1a-private
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: test.example.com
  name: workload-eu-central-1a
spec:
  image: 137112412989/al2023-ami-2023.5.20240722.0-kernel-6.1-x86_64
  machineType: t3a.xlarge
  maxSize: 3
  minSize: 1
  role: Node
  subnets:
  - eu-central-1a-private
  nodeLabels:
    role: "workload"
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: test.example.com
  name: gateway-eu-central-1a
spec:
  image: 137112412989/al2023-ami-2023.5.20240722.0-kernel-6.1-x86_64
  machineType: t3a.xlarge
  maxSize: 3
  minSize: 1
  role: Node
  subnets:
  - eu-central-1a-public
  nodeLabels:
    role: "gateway"
  taints:
  - node.com/type=gateway:NoSchedule
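As a quick sanity check on the manifest above (my own sketch, not part of kops tooling), the six /19 subnet CIDRs can be verified to sit inside the 172.20.0.0/16 networkCIDR and to be mutually non-overlapping, which rules out a simple addressing mistake as the cause:

```python
import ipaddress

# CIDRs copied from the cluster manifest above
vpc = ipaddress.ip_network("172.20.0.0/16")  # networkCIDR
subnets = {
    "eu-central-1a-public":  "172.20.0.0/19",
    "eu-central-1b-public":  "172.20.32.0/19",
    "eu-central-1a-private": "172.20.64.0/19",
    "eu-central-1b-private": "172.20.96.0/19",
    "eu-central-1a-Utility": "172.20.128.0/19",
    "eu-central-1b-Utility": "172.20.160.0/19",
}

nets = {name: ipaddress.ip_network(cidr) for name, cidr in subnets.items()}

# Every subnet must fall inside the VPC CIDR.
for name, net in nets.items():
    assert net.subnet_of(vpc), f"{name} is outside {vpc}"

# No two subnets may overlap.
names = list(nets)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        assert not nets[a].overlaps(nets[b]), f"{a} overlaps {b}"

print("subnet layout OK")
```

The check passes for this manifest, so the i/o timeouts are not explained by overlapping or out-of-VPC subnet CIDRs.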

8. Please run the commands with most verbose logging by adding the -v 10 flag. Paste the logs into this report, or in a gist and provide the gist link here.

9. Anything else do we need to know?

k8s-triage-robot commented 3 weeks ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale