kubernetes / kops

Kubernetes Operations (kOps) - Production Grade k8s Installation, Upgrades and Management
https://kops.sigs.k8s.io/
Apache License 2.0

Unable to bootstrap an IPv6-only cluster #11746

Closed. day0ops closed this issue 3 years ago.

day0ops commented 3 years ago

/kind bug

1. What kops version are you running? The command kops version will display this information. 1.22.0-alpha.1 (compiled from the master branch)

2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running or provide the Kubernetes version specified as a kops flag. 1.21.1

3. What cloud provider are you using? AWS

4. What commands did you run? What is the simplest way to reproduce this issue? kops create -f kops.yaml where kops.yaml is given below (in 7).

5. What happened after the commands executed? Lots of pods stuck in CrashLoopBackOff:

NAME                                                                       READY   STATUS              RESTARTS   AGE   IP              NODE                                               NOMINATED NODE   READINESS GATES
calico-kube-controllers-78d6f96c7b-qchgl                                   0/1     ContainerCreating   0          17m   <none>          ip-172-20-43-104.ap-southeast-2.compute.internal   <none>           <none>
calico-node-7zwfq                                                          0/1     CrashLoopBackOff    7          15m   172.20.58.176   ip-172-20-58-176.ap-southeast-2.compute.internal   <none>           <none>
calico-node-f2qld                                                          0/1     CrashLoopBackOff    7          15m   172.20.52.209   ip-172-20-52-209.ap-southeast-2.compute.internal   <none>           <none>
calico-node-xc78w                                                          0/1     Running             8          17m   172.20.43.104   ip-172-20-43-104.ap-southeast-2.compute.internal   <none>           <none>
coredns-autoscaler-6f594f4c58-977p8                                        0/1     ContainerCreating   0          17m   <none>          ip-172-20-52-209.ap-southeast-2.compute.internal   <none>           <none>
coredns-f45c4bf76-6lpb7                                                    0/1     ContainerCreating   0          17m   <none>          ip-172-20-52-209.ap-southeast-2.compute.internal   <none>           <none>
dns-controller-5798dc5b54-zcznh                                            1/1     Running             0          17m   172.20.43.104   ip-172-20-43-104.ap-southeast-2.compute.internal   <none>           <none>
etcd-manager-events-ip-172-20-43-104.ap-southeast-2.compute.internal       1/1     Running             0          16m   172.20.43.104   ip-172-20-43-104.ap-southeast-2.compute.internal   <none>           <none>
etcd-manager-main-ip-172-20-43-104.ap-southeast-2.compute.internal         1/1     Running             0          16m   172.20.43.104   ip-172-20-43-104.ap-southeast-2.compute.internal   <none>           <none>
kops-controller-rdbrn                                                      1/1     Running             0          16m   172.20.43.104   ip-172-20-43-104.ap-southeast-2.compute.internal   <none>           <none>
kube-apiserver-ip-172-20-43-104.ap-southeast-2.compute.internal            2/2     Running             0          16m   172.20.43.104   ip-172-20-43-104.ap-southeast-2.compute.internal   <none>           <none>
kube-controller-manager-ip-172-20-43-104.ap-southeast-2.compute.internal   1/1     Running             0          16m   172.20.43.104   ip-172-20-43-104.ap-southeast-2.compute.internal   <none>           <none>
kube-proxy-ip-172-20-43-104.ap-southeast-2.compute.internal                0/1     CrashLoopBackOff    8          17m   172.20.43.104   ip-172-20-43-104.ap-southeast-2.compute.internal   <none>           <none>
kube-proxy-ip-172-20-52-209.ap-southeast-2.compute.internal                0/1     CrashLoopBackOff    7          15m   172.20.52.209   ip-172-20-52-209.ap-southeast-2.compute.internal   <none>           <none>
kube-proxy-ip-172-20-58-176.ap-southeast-2.compute.internal                0/1     CrashLoopBackOff    7          15m   172.20.58.176   ip-172-20-58-176.ap-southeast-2.compute.internal   <none>           <none>
kube-scheduler-ip-172-20-43-104.ap-southeast-2.compute.internal            1/1     Running             0          17m   172.20.43.104   ip-172-20-43-104.ap-southeast-2.compute.internal   <none>           <none>

6. What did you expect to happen? For all the workloads to get IPv6 addresses without any issues

7. Please provide your cluster manifest.

apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  name: test-kops.test.com
spec:
  api:
    dns: {}
  authorization:
    rbac: {}
  channel: stable
  cloudProvider: aws
  configBase: s3://test-kops-state-store/test-kops.test.com
  etcdClusters:
  - cpuRequest: 200m
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: master-ap-southeast-2a
      name: a
    memoryRequest: 100Mi
    name: main
  - cpuRequest: 100m
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: master-ap-southeast-2a
      name: a
    memoryRequest: 100Mi
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubelet:
    featureGates:
      IPv6DualStack: "false"
    anonymousAuth: false
  kubeAPIServer:
    featureGates:
      IPv6DualStack: "false"
    bindAddress: "::"
  kubeProxy:
    featureGates:
      IPv6DualStack: "false"
    bindAddress: "::"
  kubeControllerManager:
    featureGates:
      IPv6DualStack: "false"
    allocateNodeCIDRs: false
  kubernetesApiAccess:
  - 0.0.0.0/0
  - ::/0
  kubernetesVersion: 1.21.1
  masterPublicName: api.test-kops.test.com
  networking:
    calico: 
      ipv4Support: false
      ipv6Support: true
  nonMasqueradeCIDR: fd12:3456:789a::/48
  networkCIDR: 172.20.0.0/16
  sshAccess:
  - 0.0.0.0/0
  - ::/0
  kubeDNS:
    provider: CoreDNS
    upstreamNameservers:
    - 2620:119:35::35
    - 2620:119:53::53
  subnets:
  - cidr: 172.20.32.0/19
    name: test-kops-subnet
    type: Public
    zone: ap-southeast-2a
  topology:
    dns:
      type: Public
    masters: public
    nodes: public
  cloudConfig:
    awsEBSCSIDriver:
      enabled: false
      version: v1.0.0

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: test-kops.test.com
  name: master-ap-southeast-2a
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20210415
  machineType: t3.medium
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-ap-southeast-2a
  role: Master
  subnets:
  - test-kops-subnet
  sysctlParameters:
  - net.ipv6.conf.all.disable_ipv6=0
  - net.ipv6.conf.all.forwarding=1
  - net.ipv6.conf.default.forwarding=1

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: test-kops.test.com
  name: nodes-ap-southeast-2a
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20210415
  machineType: t3.medium
  maxSize: 2
  minSize: 2
  nodeLabels:
    kops.k8s.io/instancegroup: nodes-ap-southeast-2a
  role: Node
  subnets:
  - test-kops-subnet
  sysctlParameters:
  - net.ipv6.conf.all.disable_ipv6=0
  - net.ipv6.conf.all.forwarding=1
  - net.ipv6.conf.default.forwarding=1

8. More information. kube-proxy is failing with:

I0612 11:47:35.016647       1 flags.go:59] FLAG: --add-dir-header="false"
I0612 11:47:35.016830       1 flags.go:59] FLAG: --alsologtostderr="true"
I0612 11:47:35.016838       1 flags.go:59] FLAG: --bind-address="::"
I0612 11:47:35.016846       1 flags.go:59] FLAG: --bind-address-hard-fail="false"
I0612 11:47:35.016851       1 flags.go:59] FLAG: --boot-id-file="/proc/sys/kernel/random/boot_id"
I0612 11:47:35.016857       1 flags.go:59] FLAG: --cleanup="false"
I0612 11:47:35.016862       1 flags.go:59] FLAG: --cluster-cidr="fd12:3456:789a:1::/64"
I0612 11:47:35.016870       1 flags.go:59] FLAG: --config=""
I0612 11:47:35.016874       1 flags.go:59] FLAG: --config-sync-period="15m0s"
I0612 11:47:35.017194       1 flags.go:59] FLAG: --conntrack-max-per-core="131072"
I0612 11:47:35.017204       1 flags.go:59] FLAG: --conntrack-min="131072"
I0612 11:47:35.017209       1 flags.go:59] FLAG: --conntrack-tcp-timeout-close-wait="1h0m0s"
I0612 11:47:35.017215       1 flags.go:59] FLAG: --conntrack-tcp-timeout-established="24h0m0s"
I0612 11:47:35.017222       1 flags.go:59] FLAG: --detect-local-mode=""
I0612 11:47:35.017231       1 flags.go:59] FLAG: --feature-gates="IPv6DualStack=false"
I0612 11:47:35.017274       1 flags.go:59] FLAG: --healthz-bind-address="0.0.0.0:10256"
I0612 11:47:35.017283       1 flags.go:59] FLAG: --healthz-port="10256"
I0612 11:47:35.017288       1 flags.go:59] FLAG: --help="false"
I0612 11:47:35.017294       1 flags.go:59] FLAG: --hostname-override="ip-172-20-43-104.ap-southeast-2.compute.internal"
I0612 11:47:35.017300       1 flags.go:59] FLAG: --iptables-masquerade-bit="14"
I0612 11:47:35.017304       1 flags.go:59] FLAG: --iptables-min-sync-period="1s"
I0612 11:47:35.017310       1 flags.go:59] FLAG: --iptables-sync-period="30s"
I0612 11:47:35.017319       1 flags.go:59] FLAG: --ipvs-exclude-cidrs="[]"
I0612 11:47:35.017333       1 flags.go:59] FLAG: --ipvs-min-sync-period="0s"
I0612 11:47:35.017338       1 flags.go:59] FLAG: --ipvs-scheduler=""
I0612 11:47:35.017343       1 flags.go:59] FLAG: --ipvs-strict-arp="false"
I0612 11:47:35.017348       1 flags.go:59] FLAG: --ipvs-sync-period="30s"
I0612 11:47:35.017354       1 flags.go:59] FLAG: --ipvs-tcp-timeout="0s"
I0612 11:47:35.017358       1 flags.go:59] FLAG: --ipvs-tcpfin-timeout="0s"
I0612 11:47:35.017366       1 flags.go:59] FLAG: --ipvs-udp-timeout="0s"
I0612 11:47:35.017370       1 flags.go:59] FLAG: --kube-api-burst="10"
I0612 11:47:35.017375       1 flags.go:59] FLAG: --kube-api-content-type="application/vnd.kubernetes.protobuf"
I0612 11:47:35.017382       1 flags.go:59] FLAG: --kube-api-qps="5"
I0612 11:47:35.017401       1 flags.go:59] FLAG: --kubeconfig="/var/lib/kube-proxy/kubeconfig"
I0612 11:47:35.017406       1 flags.go:59] FLAG: --log-backtrace-at=":0"
I0612 11:47:35.017413       1 flags.go:59] FLAG: --log-dir=""
I0612 11:47:35.017419       1 flags.go:59] FLAG: --log-file="/var/log/kube-proxy.log"
I0612 11:47:35.017424       1 flags.go:59] FLAG: --log-file-max-size="1800"
I0612 11:47:35.017430       1 flags.go:59] FLAG: --log-flush-frequency="5s"
I0612 11:47:35.017435       1 flags.go:59] FLAG: --logtostderr="false"
I0612 11:47:35.017440       1 flags.go:59] FLAG: --machine-id-file="/etc/machine-id,/var/lib/dbus/machine-id"
I0612 11:47:35.017446       1 flags.go:59] FLAG: --masquerade-all="false"
I0612 11:47:35.017452       1 flags.go:59] FLAG: --master="https://127.0.0.1"
I0612 11:47:35.017457       1 flags.go:59] FLAG: --metrics-bind-address="127.0.0.1:10249"
I0612 11:47:35.017462       1 flags.go:59] FLAG: --metrics-port="10249"
I0612 11:47:35.017467       1 flags.go:59] FLAG: --nodeport-addresses="[]"
I0612 11:47:35.017477       1 flags.go:59] FLAG: --one-output="false"
I0612 11:47:35.017482       1 flags.go:59] FLAG: --oom-score-adj="-998"
I0612 11:47:35.017488       1 flags.go:59] FLAG: --profiling="false"
I0612 11:47:35.017493       1 flags.go:59] FLAG: --proxy-mode=""
I0612 11:47:35.017499       1 flags.go:59] FLAG: --proxy-port-range=""
I0612 11:47:35.017507       1 flags.go:59] FLAG: --show-hidden-metrics-for-version=""
I0612 11:47:35.017512       1 flags.go:59] FLAG: --skip-headers="false"
I0612 11:47:35.017521       1 flags.go:59] FLAG: --skip-log-headers="false"
I0612 11:47:35.017532       1 flags.go:59] FLAG: --stderrthreshold="2"
I0612 11:47:35.017537       1 flags.go:59] FLAG: --udp-timeout="250ms"
I0612 11:47:35.017542       1 flags.go:59] FLAG: --v="2"
I0612 11:47:35.017546       1 flags.go:59] FLAG: --version="false"
I0612 11:47:35.017554       1 flags.go:59] FLAG: --vmodule=""
I0612 11:47:35.017559       1 flags.go:59] FLAG: --write-config-to=""
W0612 11:47:35.017567       1 server.go:220] WARNING: all flags other than --config, --write-config-to, and --cleanup are deprecated. Please begin using a config file ASAP.
I0612 11:47:35.017629       1 feature_gate.go:243] feature gates: &{map[IPv6DualStack:false]}
I0612 11:47:35.017705       1 feature_gate.go:243] feature gates: &{map[IPv6DualStack:false]}
E0612 11:47:35.058934       1 node.go:161] Failed to retrieve node info: Get "https://127.0.0.1/api/v1/nodes/ip-172-20-43-104.ap-southeast-2.compute.internal": dial tcp 127.0.0.1:443: connect: connection refused
E0612 11:47:36.243461       1 node.go:161] Failed to retrieve node info: Get "https://127.0.0.1/api/v1/nodes/ip-172-20-43-104.ap-southeast-2.compute.internal": dial tcp 127.0.0.1:443: connect: connection refused
E0612 11:47:38.322018       1 node.go:161] Failed to retrieve node info: Get "https://127.0.0.1/api/v1/nodes/ip-172-20-43-104.ap-southeast-2.compute.internal": dial tcp 127.0.0.1:443: connect: connection refused
E0612 11:47:42.862434       1 node.go:161] Failed to retrieve node info: Get "https://127.0.0.1/api/v1/nodes/ip-172-20-43-104.ap-southeast-2.compute.internal": dial tcp 127.0.0.1:443: connect: connection refused
E0612 11:48:00.893863       1 node.go:161] Failed to retrieve node info: Get "https://127.0.0.1/api/v1/nodes/ip-172-20-43-104.ap-southeast-2.compute.internal": net/http: TLS handshake timeout
I0612 11:48:18.488633       1 node.go:172] Successfully retrieved node IP: 172.20.43.104
I0612 11:48:18.488773       1 server_others.go:140] Detected node IP 172.20.43.104
W0612 11:48:18.488822       1 server_others.go:598] Unknown proxy mode "", assuming iptables proxy
I0612 11:48:18.488953       1 server_others.go:177] DetectLocalMode: 'ClusterCIDR'
I0612 11:48:18.536760       1 server_others.go:208] kube-proxy running in single-stack IPv4 mode
I0612 11:48:18.536856       1 server_others.go:212] Using iptables Proxier.
F0612 11:48:18.536909       1 server.go:489] unable to create proxier: CIDR fd12:3456:789a:1::/64 has incorrect IP version: expect isIPv6=false

The Calico workloads are failing with:

2021-06-12 12:09:59.947 [INFO][10] startup/startup.go 390: Early log level set to info
2021-06-12 12:09:59.948 [INFO][10] startup/startup.go 406: Using NODENAME environment for node name ip-172-20-58-176.ap-southeast-2.compute.internal
2021-06-12 12:09:59.948 [INFO][10] startup/startup.go 418: Determined node name: ip-172-20-58-176.ap-southeast-2.compute.internal
2021-06-12 12:09:59.948 [INFO][10] startup/startup.go 103: Starting node ip-172-20-58-176.ap-southeast-2.compute.internal with version v3.19.1
2021-06-12 12:09:59.950 [INFO][10] startup/startup.go 450: Checking datastore connection
2021-06-12 12:09:59.950 [INFO][10] startup/startup.go 465: Hit error connecting to datastore - retry error=Get "https://[fd12:3456:789a::1]:443/api/v1/nodes/foo": dial tcp [fd12:3456:789a::1]:443: connect: network is unreachable
2021-06-12 12:10:00.951 [INFO][10] startup/startup.go 465: Hit error connecting to datastore - retry error=Get "https://[fd12:3456:789a::1]:443/api/v1/nodes/foo": dial tcp [fd12:3456:789a::1]:443: connect: network is unreachable
2021-06-12 12:10:01.951 [INFO][10] startup/startup.go 465: Hit error connecting to datastore - retry error=Get "https://[fd12:3456:789a::1]:443/api/v1/nodes/foo": dial tcp [fd12:3456:789a::1]:443: connect: network is unreachable
2021-06-12 12:10:02.952 [INFO][10] startup/startup.go 465: Hit error connecting to datastore - retry error=Get "https://[fd12:3456:789a::1]:443/api/v1/nodes/foo": dial tcp [fd12:3456:789a::1]:443: connect: network is unreachable

Nodes are all live:

NAME                                               STATUS   ROLES                  AGE   VERSION
ip-172-20-43-104.ap-southeast-2.compute.internal   Ready    control-plane,master   31m   v1.21.1
ip-172-20-52-209.ap-southeast-2.compute.internal   Ready    node                   30m   v1.21.1
ip-172-20-58-176.ap-southeast-2.compute.internal   Ready    node                   29m   v1.21.1
day0ops commented 3 years ago

I've also checked whether it's possible to configure an advertised address for the API server, but there's no reference to it in the cluster spec.

hakman commented 3 years ago

@kasunt-nixdev Try adding this; it should help:

spec:
  kubeAPIServer:
    bindAddress: "::"
  subnets:
  - type: Public
    ipv6CIDR: /64#1
day0ops commented 3 years ago

@hakman Thanks, but it still appears to be the same problem. AWS doesn't have any problem with the subnets AFAICT; it generates both an IPv4 and an IPv6 CIDR.

However, kube-proxy is still failing with:

I0612 21:10:32.121941       1 flags.go:59] FLAG: --add-dir-header="false"
I0612 21:10:32.122659       1 flags.go:59] FLAG: --alsologtostderr="true"
I0612 21:10:32.122674       1 flags.go:59] FLAG: --bind-address="::"
I0612 21:10:32.122683       1 flags.go:59] FLAG: --bind-address-hard-fail="false"
I0612 21:10:32.122727       1 flags.go:59] FLAG: --boot-id-file="/proc/sys/kernel/random/boot_id"
I0612 21:10:32.122733       1 flags.go:59] FLAG: --cleanup="false"
I0612 21:10:32.122744       1 flags.go:59] FLAG: --cluster-cidr="fd12:3456:789a:1::/64"
I0612 21:10:32.122752       1 flags.go:59] FLAG: --config=""
I0612 21:10:32.122757       1 flags.go:59] FLAG: --config-sync-period="15m0s"
I0612 21:10:32.122799       1 flags.go:59] FLAG: --conntrack-max-per-core="131072"
I0612 21:10:32.122806       1 flags.go:59] FLAG: --conntrack-min="131072"
I0612 21:10:32.122811       1 flags.go:59] FLAG: --conntrack-tcp-timeout-close-wait="1h0m0s"
I0612 21:10:32.122817       1 flags.go:59] FLAG: --conntrack-tcp-timeout-established="24h0m0s"
I0612 21:10:32.122826       1 flags.go:59] FLAG: --detect-local-mode=""
I0612 21:10:32.122833       1 flags.go:59] FLAG: --feature-gates="IPv6DualStack=false"
I0612 21:10:32.122869       1 flags.go:59] FLAG: --healthz-bind-address="0.0.0.0:10256"
I0612 21:10:32.122878       1 flags.go:59] FLAG: --healthz-port="10256"
I0612 21:10:32.122886       1 flags.go:59] FLAG: --help="false"
I0612 21:10:32.122892       1 flags.go:59] FLAG: --hostname-override="ip-172-20-55-23.ap-southeast-2.compute.internal"
I0612 21:10:32.122904       1 flags.go:59] FLAG: --iptables-masquerade-bit="14"
I0612 21:10:32.122909       1 flags.go:59] FLAG: --iptables-min-sync-period="1s"
I0612 21:10:32.122939       1 flags.go:59] FLAG: --iptables-sync-period="30s"
I0612 21:10:32.122947       1 flags.go:59] FLAG: --ipvs-exclude-cidrs="[]"
I0612 21:10:32.122956       1 flags.go:59] FLAG: --ipvs-min-sync-period="0s"
I0612 21:10:32.122961       1 flags.go:59] FLAG: --ipvs-scheduler=""
I0612 21:10:32.122966       1 flags.go:59] FLAG: --ipvs-strict-arp="false"
I0612 21:10:32.122975       1 flags.go:59] FLAG: --ipvs-sync-period="30s"
I0612 21:10:32.122980       1 flags.go:59] FLAG: --ipvs-tcp-timeout="0s"
I0612 21:10:32.122994       1 flags.go:59] FLAG: --ipvs-tcpfin-timeout="0s"
I0612 21:10:32.123013       1 flags.go:59] FLAG: --ipvs-udp-timeout="0s"
I0612 21:10:32.123020       1 flags.go:59] FLAG: --kube-api-burst="10"
I0612 21:10:32.123025       1 flags.go:59] FLAG: --kube-api-content-type="application/vnd.kubernetes.protobuf"
I0612 21:10:32.123034       1 flags.go:59] FLAG: --kube-api-qps="5"
I0612 21:10:32.123044       1 flags.go:59] FLAG: --kubeconfig="/var/lib/kube-proxy/kubeconfig"
I0612 21:10:32.123050       1 flags.go:59] FLAG: --log-backtrace-at=":0"
I0612 21:10:32.123081       1 flags.go:59] FLAG: --log-dir=""
I0612 21:10:32.123088       1 flags.go:59] FLAG: --log-file="/var/log/kube-proxy.log"
I0612 21:10:32.123093       1 flags.go:59] FLAG: --log-file-max-size="1800"
I0612 21:10:32.123098       1 flags.go:59] FLAG: --log-flush-frequency="5s"
I0612 21:10:32.123103       1 flags.go:59] FLAG: --logtostderr="false"
I0612 21:10:32.123112       1 flags.go:59] FLAG: --machine-id-file="/etc/machine-id,/var/lib/dbus/machine-id"
I0612 21:10:32.123119       1 flags.go:59] FLAG: --masquerade-all="false"
I0612 21:10:32.123123       1 flags.go:59] FLAG: --master="https://127.0.0.1"
I0612 21:10:32.123151       1 flags.go:59] FLAG: --metrics-bind-address="127.0.0.1:10249"
I0612 21:10:32.123160       1 flags.go:59] FLAG: --metrics-port="10249"
I0612 21:10:32.123166       1 flags.go:59] FLAG: --nodeport-addresses="[]"
I0612 21:10:32.123173       1 flags.go:59] FLAG: --one-output="false"
I0612 21:10:32.123183       1 flags.go:59] FLAG: --oom-score-adj="-998"
I0612 21:10:32.123188       1 flags.go:59] FLAG: --profiling="false"
I0612 21:10:32.123193       1 flags.go:59] FLAG: --proxy-mode=""
I0612 21:10:32.123230       1 flags.go:59] FLAG: --proxy-port-range=""
I0612 21:10:32.123238       1 flags.go:59] FLAG: --show-hidden-metrics-for-version=""
I0612 21:10:32.123242       1 flags.go:59] FLAG: --skip-headers="false"
I0612 21:10:32.123313       1 flags.go:59] FLAG: --skip-log-headers="false"
I0612 21:10:32.123324       1 flags.go:59] FLAG: --stderrthreshold="2"
I0612 21:10:32.123329       1 flags.go:59] FLAG: --udp-timeout="250ms"
I0612 21:10:32.123335       1 flags.go:59] FLAG: --v="2"
I0612 21:10:32.123365       1 flags.go:59] FLAG: --version="false"
I0612 21:10:32.123382       1 flags.go:59] FLAG: --vmodule=""
I0612 21:10:32.123390       1 flags.go:59] FLAG: --write-config-to=""
W0612 21:10:32.123397       1 server.go:220] WARNING: all flags other than --config, --write-config-to, and --cleanup are deprecated. Please begin using a config file ASAP.
I0612 21:10:32.123572       1 feature_gate.go:243] feature gates: &{map[IPv6DualStack:false]}
I0612 21:10:32.123838       1 feature_gate.go:243] feature gates: &{map[IPv6DualStack:false]}
I0612 21:10:32.158616       1 node.go:172] Successfully retrieved node IP: 172.20.55.23
I0612 21:10:32.158653       1 server_others.go:140] Detected node IP 172.20.55.23
W0612 21:10:32.158679       1 server_others.go:598] Unknown proxy mode "", assuming iptables proxy
I0612 21:10:32.158840       1 server_others.go:177] DetectLocalMode: 'ClusterCIDR'
I0612 21:10:32.177998       1 server_others.go:208] kube-proxy running in single-stack IPv4 mode
I0612 21:10:32.178089       1 server_others.go:212] Using iptables Proxier.
F0612 21:10:32.178129       1 server.go:489] unable to create proxier: CIDR fd12:3456:789a:1::/64 has incorrect IP version: expect isIPv6=false
day0ops commented 3 years ago

Looks like it works without the IPv6DualStack: "false" feature gate:

I0612 23:02:27.983244       1 server_others.go:206] kube-proxy running in dual-stack mode, IPv4-primary
I0612 23:02:27.983275       1 server_others.go:212] Using iptables Proxier.
I0612 23:02:27.983308       1 server_others.go:219] creating dualStackProxier for iptables.
W0612 23:02:27.983321       1 server_others.go:503] detect-local-mode set to ClusterCIDR, but no IPv4 cluster CIDR defined, defaulting to no-op detect-local for IPv4
I0612 23:02:27.983464       1 utils.go:375] Changed sysctl "net/ipv4/conf/all/route_localnet": 0 -> 1
I0612 23:02:27.983561       1 proxier.go:282] "using iptables mark for masquerade" ipFamily=IPv4 mark="0x00004000"
I0612 23:02:27.983640       1 proxier.go:330] "iptables sync params" ipFamily=IPv4 minSyncPeriod="1s" syncPeriod="30s" burstSyncs=2
I0612 23:02:27.983702       1 proxier.go:340] "iptables supports --random-fully" ipFamily=IPv4
I0612 23:02:27.983809       1 proxier.go:282] "using iptables mark for masquerade" ipFamily=IPv6 mark="0x00004000"
I0612 23:02:27.984007       1 proxier.go:330] "iptables sync params" ipFamily=IPv6 minSyncPeriod="1s" syncPeriod="30s" burstSyncs=2
I0612 23:02:27.984169       1 proxier.go:340] "iptables supports --random-fully" ipFamily=IPv6
I0612 23:02:27.984483       1 server.go:643] Version: v1.21.1
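
For reference, a minimal sketch of what the relevant spec sections look like with the explicit feature gate dropped (assuming nothing else changed from the manifest in the original report):

spec:
  kubelet:
    anonymousAuth: false
  kubeAPIServer:
    bindAddress: "::"
  kubeProxy:
    bindAddress: "::"
  kubeControllerManager:
    allocateNodeCIDRs: false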
day0ops commented 3 years ago

That only means compatibility right now with a control plane that isn't IPv6-only. Am I right in assuming this? For it to be IPv6-only, I guess the API server will have to advertise an IPv6 address?

hakman commented 3 years ago

Dual-stack is enabled only when you use --cluster-cidr="<IPv4CIDR>,<IPv6CIDR>". So your cluster should be IPv6-only. The feature gate is just there to enable some code paths; most likely it will be removed in k8s 1.23.
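
For example (the IPv4 CIDR below is only a placeholder, not something from this cluster):

# one IPv6 CIDR only: kube-proxy stays single-stack IPv6, as in this cluster
--cluster-cidr=fd12:3456:789a:1::/64
# both families, comma-separated: dual-stack
--cluster-cidr=<IPv4CIDR>,fd12:3456:789a:1::/64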

day0ops commented 3 years ago

Maybe I've misunderstood this statement in the 1.21 docs then: "Starting in 1.21, IPv4/IPv6 dual-stack defaults to enabled" (ref: https://kubernetes.io/docs/concepts/services-networking/dual-stack). So I thought it needed to be disabled explicitly.

There was also the fact that kube-proxy was failing with this error:

I0612 21:10:32.158840       1 server_others.go:177] DetectLocalMode: 'ClusterCIDR'
I0612 21:10:32.177998       1 server_others.go:208] kube-proxy running in single-stack IPv4 mode
I0612 21:10:32.178089       1 server_others.go:212] Using iptables Proxier.
F0612 21:10:32.178129       1 server.go:489] unable to create proxier: CIDR fd12:3456:789a:1::/64 has incorrect IP version: expect isIPv6=false
hakman commented 3 years ago

Feature gates are a way of introducing new features so that they are easy to disable or remove later if they never get finalized (see https://kubernetes.io/blog/2020/08/21/moving-forward-from-beta/). The feature itself has its own enable/disable mechanism, described here: https://kubernetes.io/docs/concepts/services-networking/dual-stack/#configure-ipv4-ipv6-dual-stack

In this case specifically, I don't know why you see that error when you set the feature gate to "false". IPv6-only mode should work without it, but I never tested that.

I hope it works as expected now.

day0ops commented 3 years ago

@hakman I appreciate your time with this. And yes, it is working as expected now.

hakman commented 3 years ago

👍 If you don't mind, can you share a little about the use case? Thanks!

day0ops commented 3 years ago

We are trying to build an IPv6-only cluster to test compatibility with Istio (in both single-cluster and multi-cluster topologies).

hakman commented 3 years ago

Are you planning on adding any other components to the cluster for NAT64/DNS64, or is that not a concern at the moment?

day0ops commented 3 years ago

No, this isn't a concern yet. I know I closed this issue, but there is another minor issue I'm running into with this setup, related to webhooks.

It looks like the API server isn't able to reach any webhooks:

W0613 23:30:11.389242       1 dispatcher.go:182] Failed calling webhook, failing closed webhook.cert-manager.io: failed calling webhook "webhook.cert-manager.io": Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": context deadline exceeded

If you have any ideas please let me know.

day0ops commented 3 years ago

I wonder if this is because the control plane has both IPv4 and IPv6 enabled, while the worker nodes only have IPv6 enabled in the CNI?
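
One way I could check that is to look at the addresses each node actually registers (the node name below is a placeholder):

kubectl get nodes -o wide
kubectl get node <node-name> -o jsonpath='{.status.addresses}'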

hakman commented 3 years ago

Maybe try describing the cert-manager-webhook service and its endpoints? Any IPv4 or no endpoints at all?
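
Something like this (just a sketch):

kubectl -n cert-manager describe service cert-manager-webhook
kubectl -n cert-manager get endpoints cert-manager-webhook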

day0ops commented 3 years ago

Here it is.

Name:              cert-manager-webhook
Namespace:         cert-manager
Labels:            app=webhook
                   app.kubernetes.io/component=webhook
                   app.kubernetes.io/instance=cert-manager
                   app.kubernetes.io/name=webhook
Annotations:       <none>
Selector:          app.kubernetes.io/component=webhook,app.kubernetes.io/instance=cert-manager,app.kubernetes.io/name=webhook
Type:              ClusterIP
IP Family Policy:  SingleStack
IP Families:       IPv6
IP:                fd12:3456:789a::1838
IPs:               fd12:3456:789a::1838
Port:              https  443/TCP
TargetPort:        10888/TCP
Endpoints:         [fd12:3456:789a:1:f451:ce23:2e7e:1397]:10888
Session Affinity:  None
Events:            <none>
day0ops commented 3 years ago

Strangely, dig AAAA cert-manager-webhook.cert-manager.svc +tcp results in NXDOMAIN with CoreDNS:

[INFO] [fd12:3456:789a:1:f6b:daa7:766f:8583]:50637 - 56850 "AAAA IN cert-manager-webhook.cert-manager.svc. tcp 78 false 65535" NXDOMAIN qr,rd,ra,ad 130 0.004804612s

But dig AAAA cert-manager-webhook.cert-manager.svc.cluster.local +tcp returns NOERROR:

[INFO] [fd12:3456:789a:1:f6b:daa7:766f:8583]:39779 - 5311 "AAAA IN cert-manager-webhook.cert-manager.svc.cluster.local. tcp 92 false 65535" NOERROR qr,aa,rd 148 0.000188346s

I couldn't see why the short alias would fail with CoreDNS even though I have included autopath @kubernetes.
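
One way to sanity-check that it is only the search path at play would be to let dig apply the pod's resolv.conf search list instead of querying the name verbatim:

# +search appends the search domains (e.g. svc.cluster.local) from /etc/resolv.conf
dig +search +tcp AAAA cert-manager-webhook.cert-manager.svc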

hakman commented 3 years ago

You can also add log to the Corefile and see all requests: https://kubernetes.io/docs/tasks/administer-cluster/dns-debugging-resolution/#are-dns-queries-being-received-processed

I see that cert-manager-webhook.cert-manager.svc. has a trailing ., which suggests it is being looked up as an absolute domain name.
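
A minimal sketch of what enabling the log plugin could look like; the surrounding plugin list is illustrative and should match whatever the existing coredns ConfigMap already contains (and since kOps manages this addon, manual edits may be reconciled away):

apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        log
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
        }
        forward . /etc/resolv.conf
        cache 30
    }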

day0ops commented 3 years ago

@hakman Thanks. Yup, I've been scratching my head on this for a while. I've found a more serious issue than the one above, and I think there might be a bigger network/CNI problem going on.

I've got a 1-master, 2-worker topology as in the OP above, but only one of the CoreDNS instances in the worker InstanceGroup responds. Any idea why this might be?

dig AAAA cert-manager-webhook.cert-manager.svc.cluster.local +tcp @fd12:3456:789a:1:f451:ce23:2e7e:1398
;; Connection to fd12:3456:789a:1:f451:ce23:2e7e:1398#53(fd12:3456:789a:1:f451:ce23:2e7e:1398) for cert-manager-webhook.cert-manager.svc.cluster.local failed: timed out.
;; Connection to fd12:3456:789a:1:f451:ce23:2e7e:1398#53(fd12:3456:789a:1:f451:ce23:2e7e:1398) for cert-manager-webhook.cert-manager.svc.cluster.local failed: timed out.

; <<>> DiG 9.11.6-P1 <<>> AAAA cert-manager-webhook.cert-manager.svc.cluster.local +tcp @fd12:3456:789a:1:f451:ce23:2e7e:1398
;; global options: +cmd
;; connection timed out; no servers could be reached
;; Connection to fd12:3456:789a:1:f451:ce23:2e7e:1398#53(fd12:3456:789a:1:f451:ce23:2e7e:1398) for cert-manager-webhook.cert-manager.svc.cluster.local failed: timed out.
command terminated with exit code 9
dig AAAA cert-manager-webhook.cert-manager.svc.cluster.local +tcp @fd12:3456:789a:1:f6b:daa7:766f:8596

; <<>> DiG 9.11.6-P1 <<>> AAAA cert-manager-webhook.cert-manager.svc.cluster.local +tcp @fd12:3456:789a:1:f6b:daa7:766f:8596
;; global options: +cmd
;; Got answer:
;; WARNING: .local is reserved for Multicast DNS
;; You are currently testing what happens when an mDNS query is leaked to DNS
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 33016
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: 39bf8f22503235fe (echoed)
;; QUESTION SECTION:
;cert-manager-webhook.cert-manager.svc.cluster.local. IN    AAAA

;; ANSWER SECTION:
cert-manager-webhook.cert-manager.svc.cluster.local. 30 IN AAAA fd12:3456:789a::1838

;; Query time: 2 msec
;; SERVER: fd12:3456:789a:1:f6b:daa7:766f:8596#53(fd12:3456:789a:1:f6b:daa7:766f:8596)
;; WHEN: Mon Jun 14 07:48:47 UTC 2021
;; MSG SIZE  rcvd: 171

But there are no issues as far as the nodes joining the cluster go:

ip-172-20-33-186.ap-southeast-2.compute.internal   Ready    node                   21h   v1.21.1  Ubuntu 20.04.2 LTS   5.4.0-1045-aws   containerd://1.4.6
ip-172-20-35-233.ap-southeast-2.compute.internal   Ready    control-plane,master   21h   v1.21.1   Ubuntu 20.04.2 LTS   5.4.0-1045-aws   containerd://1.4.6
ip-172-20-46-213.ap-southeast-2.compute.internal   Ready    node                   21h   v1.21.1   Ubuntu 20.04.2 LTS   5.4.0-1045-aws   containerd://1.4.6
hakman commented 3 years ago

All I can say is that I ran the k8s conformance tests with 5 nodes and they passed, with a few minor unrelated issues. I would suggest deleting the cluster and trying a fresh one to see if you can reproduce. If you can, I can give it a try too in the next few days.

day0ops commented 3 years ago

Cool thanks @hakman.

hakman commented 3 years ago

One thought: I believe this is also needed:

  networking:
    calico:
      awsSrcDstCheck: Disable
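
I believe that setting makes Calico disable the EC2 source/destination check on the instances so routed pod traffic isn't dropped; the manual per-instance equivalent (instance ID is a placeholder) would be roughly:

aws ec2 modify-instance-attribute --instance-id <instance-id> --no-source-dest-check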
day0ops commented 3 years ago

Ha, that's exactly what I ended up doing, and yes, it worked.

hakman commented 3 years ago

😄

day0ops commented 3 years ago

@hakman that still only fixes one of the problems, though. There's still a pending issue with the API server's requests to any webhook endpoints failing. My assumption here is that the control plane / master is on IPv4. Is there any way to force kubelet and the rest of the components to IPv6?

day0ops commented 3 years ago

I've tried using hostNetwork: true, but then I run into this issue instead, which I guess has something to do with the CIDR pool:

❯ k get svc -n cert-manager
NAME                   TYPE        CLUSTER-IP             EXTERNAL-IP   PORT(S)    AGE
cert-manager           ClusterIP   fd12:3456:789a::add    <none>        9402/TCP   9h
cert-manager-webhook   ClusterIP   fd12:3456:789a::d95d   <none>        443/TCP    9h
❯ k get ep -n cert-manager
NAME                   ENDPOINTS                                     AGE
cert-manager           [fd12:3456:789a:1:1e64:a71b:2a18:28ce]:9402   9h
cert-manager-webhook   <none>                                        9h
day0ops commented 3 years ago

@hakman I think we can close this. But I had to go down the route of using kubeadm to provision the cluster instead.

hakman commented 3 years ago

Sorry to hear that, but thanks for the feedback. Maybe you can try again when kOps 1.22 is "more ready".