kubernetes / kops

Kubernetes Operations (kOps) - Production Grade k8s Installation, Upgrades and Management
https://kops.sigs.k8s.io/
Apache License 2.0

Cilium pods running on worker nodes keep restarting due to timeout while retrieving initial list of objects from kvstore #14649

Closed: fbozic closed this issue 1 year ago

fbozic commented 1 year ago

/kind bug

I'm trying to set up a new cluster on GCE with cilium-etcd networking.

1. What kops version are you running? The command kops version, will display this information.

Client version: 1.25.2 (git-v1.25.2)

2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running or provide the Kubernetes version specified as a kops flag.

Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.4", GitCommit:"872a965c6c6526caa949f0c6ac028ef7aff3fb78", GitTreeState:"clean", BuildDate:"2022-11-09T13:36:36Z", GoVersion:"go1.19.3", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.4", GitCommit:"872a965c6c6526caa949f0c6ac028ef7aff3fb78", GitTreeState:"clean", BuildDate:"2022-11-09T13:29:58Z", GoVersion:"go1.19.3", Compiler:"gc", Platform:"linux/amd64"}

3. What cloud provider are you using?

gce

4. What commands did you run? What is the simplest way to reproduce this issue?

5. What happened after the commands executed?

After about 10 minutes the cluster becomes healthy (from the kops point of view), but the Cilium pods running on worker nodes keep restarting due to a timeout while retrieving the initial list of objects from the kvstore. Cilium pods running on master nodes do not experience the same issue.
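
For context, "healthy from the kops point of view" here means cluster validation passes; presumably something along the lines of (state store and cluster name as in the manifest below):

❯ kops validate cluster --name my-fake-name.k8s.local --state gs://my-fake-state-store --wait 10m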

6. What did you expect to happen?

Cilium pods running on worker nodes should not restart continuously.

7. Please provide your cluster manifest. Execute kops get --name my.example.com -o yaml to display your cluster manifest. You may want to remove your cluster name and other sensitive information.

apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: null
  name: my-fake-name.k8s.local
spec:
  api:
    loadBalancer:
      type: Public
  authorization:
    rbac: {}
  channel: stable
  cloudConfig: {}
  cloudProvider: gce
  configBase: gs://my-fake-state-store/my-fake-name.k8s.local
  etcdClusters:
    - cpuRequest: 200m
      etcdMembers:
        - instanceGroup: master-europe-west3-a
          name: a
        - instanceGroup: master-europe-west3-b
          name: b
        - instanceGroup: master-europe-west3-c
          name: c
      memoryRequest: 100Mi
      name: main
    - cpuRequest: 100m
      etcdMembers:
        - instanceGroup: master-europe-west3-a
          name: a
        - instanceGroup: master-europe-west3-b
          name: b
        - instanceGroup: master-europe-west3-c
          name: c
      memoryRequest: 100Mi
      name: events
    - cpuRequest: 100m
      etcdMembers:
        - instanceGroup: master-europe-west3-a
          name: a
        - instanceGroup: master-europe-west3-b
          name: b
        - instanceGroup: master-europe-west3-c
          name: c
      manager:
        env:
          - name: ETCD_AUTO_COMPACTION_MODE
            value: revision
          - name: ETCD_AUTO_COMPACTION_RETENTION
            value: "2500"
      memoryRequest: 100Mi
      name: cilium
  iam:
    allowContainerRegistry: true
    legacy: false
  kubeProxy:
    enabled: false
  kubelet:
    anonymousAuth: false
  kubernetesApiAccess:
    - 0.0.0.0/0
    - ::/0
  kubernetesVersion: 1.25.4
  masterPublicName: api.my-fake-name.k8s.local
  networkID: my-network
  networking:
    cilium:
      enableNodePort: true
      etcdManaged: true
  nonMasqueradeCIDR: 100.64.0.0/10
  project: my-project
  sshAccess:
    - 0.0.0.0/0
    - ::/0
  subnets:
    - cidr: 10.0.32.0/20
      name: europe-west3
      region: europe-west3
      type: Private
  topology:
    dns:
      type: Public
    masters: private
    nodes: private

---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: null
  labels:
    kops.k8s.io/cluster: my-fake-name.k8s.local
  name: master-europe-west3-a
spec:
  image: ubuntu-os-cloud/ubuntu-2004-focal-v20221018
  machineType: n1-standard-8
  maxSize: 1
  minSize: 1
  role: Master
  subnets:
    - europe-west3
  zones:
    - europe-west3-a

---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: null
  labels:
    kops.k8s.io/cluster: my-fake-name.k8s.local
  name: master-europe-west3-b
spec:
  image: ubuntu-os-cloud/ubuntu-2004-focal-v20221018
  machineType: n1-standard-8
  maxSize: 1
  minSize: 1
  role: Master
  subnets:
    - europe-west3
  zones:
    - europe-west3-b

---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: null
  labels:
    kops.k8s.io/cluster: my-fake-name.k8s.local
  name: master-europe-west3-c
spec:
  image: ubuntu-os-cloud/ubuntu-2004-focal-v20221018
  machineType: n1-standard-8
  maxSize: 1
  minSize: 1
  role: Master
  subnets:
    - europe-west3
  zones:
    - europe-west3-c

---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: null
  labels:
    kops.k8s.io/cluster: my-fake-name.k8s.local
  name: nodes-europe-west3
spec:
  image: ubuntu-os-cloud/ubuntu-2004-focal-v20221018
  machineType: n1-standard-8
  maxSize: 6
  minSize: 3
  role: Node
  subnets:
    - europe-west3
  zones:
    - europe-west3-a
    - europe-west3-b
    - europe-west3-c
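
The relevant pieces of this manifest are the third etcdClusters entry named cilium together with networking.cilium.etcdManaged: true, which make kops run a dedicated etcd cluster for the Cilium kvstore on the control-plane nodes (kube-proxy is disabled in favor of Cilium's kube-proxy replacement). A quick sanity check that those etcd-manager pods are up, assuming the usual kops etcd-manager-<cluster> pod naming:

❯ kubectl -n kube-system get pods -o wide | grep etcd-manager-cilium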

8. Please run the commands with most verbose logging by adding the -v 10 flag. Paste the logs into this report, or in a gist and provide the gist link here.

During cluster provisioning everything looks okay in the kops output, so I will not attach it. Cilium pod status after ~1 hour:

❯ kgp -n kube-system -l "k8s-app=cilium" -o wide
NAME           READY   STATUS    RESTARTS        AGE   IP          NODE                         NOMINATED NODE   READINESS GATES
cilium-6sbj5   1/1     Running   0               63m   10.0.32.5   master-europe-west3-a-ndrt   <none>           <none>
cilium-9pdwk   0/1     Running   12 (2m8s ago)   62m   10.0.32.2   nodes-europe-west3-1z9l      <none>           <none>
cilium-kcjzl   1/1     Running   0               63m   10.0.32.7   master-europe-west3-b-1dr4   <none>           <none>
cilium-qtsj2   0/1     Running   12 (116s ago)   62m   10.0.32.3   nodes-europe-west3-v4b8      <none>           <none>
cilium-sskvp   0/1     Running   12 (119s ago)   62m   10.0.32.4   nodes-europe-west3-9679      <none>           <none>
cilium-t8cxv   1/1     Running   0               63m   10.0.32.6   master-europe-west3-c-4zqx   <none>           <none>

Logs from cilium-9pdwk contain msg="Unable to initialize local node. Retrying..." error="timeout while retrieving initial list of objects from kvstore":

level=info msg="Started gops server" address="127.0.0.1:9890" subsys=daemon
level=info msg="  --agent-health-port='9879'" subsys=daemon
level=info msg="  --agent-labels=''" subsys=daemon
level=info msg="  --agent-not-ready-taint-key='node.cilium.io/agent-not-ready'" subsys=daemon
level=info msg="  --allocator-list-timeout='3m0s'" subsys=daemon
level=info msg="  --allow-icmp-frag-needed='true'" subsys=daemon
level=info msg="  --allow-localhost='auto'" subsys=daemon
level=info msg="  --annotate-k8s-node='true'" subsys=daemon
level=info msg="  --api-rate-limit=''" subsys=daemon
level=info msg="  --arping-refresh-period='30s'" subsys=daemon
level=info msg="  --auto-create-cilium-node-resource='true'" subsys=daemon
level=info msg="  --auto-direct-node-routes='false'" subsys=daemon
level=info msg="  --bgp-announce-lb-ip='false'" subsys=daemon
level=info msg="  --bgp-announce-pod-cidr='false'" subsys=daemon
level=info msg="  --bgp-config-path='/var/lib/cilium/bgp/config.yaml'" subsys=daemon
level=info msg="  --bpf-ct-global-any-max='262144'" subsys=daemon
level=info msg="  --bpf-ct-global-tcp-max='524288'" subsys=daemon
level=info msg="  --bpf-ct-timeout-regular-any='1m0s'" subsys=daemon
level=info msg="  --bpf-ct-timeout-regular-tcp='6h0m0s'" subsys=daemon
level=info msg="  --bpf-ct-timeout-regular-tcp-fin='10s'" subsys=daemon
level=info msg="  --bpf-ct-timeout-regular-tcp-syn='1m0s'" subsys=daemon
level=info msg="  --bpf-ct-timeout-service-any='1m0s'" subsys=daemon
level=info msg="  --bpf-ct-timeout-service-tcp='6h0m0s'" subsys=daemon
level=info msg="  --bpf-fragments-map-max='8192'" subsys=daemon
level=info msg="  --bpf-lb-acceleration='disabled'" subsys=daemon
level=info msg="  --bpf-lb-algorithm='random'" subsys=daemon
level=info msg="  --bpf-lb-bypass-fib-lookup='false'" subsys=daemon
level=info msg="  --bpf-lb-dev-ip-addr-inherit=''" subsys=daemon
level=info msg="  --bpf-lb-dsr-dispatch='opt'" subsys=daemon
level=info msg="  --bpf-lb-dsr-l4-xlate='frontend'" subsys=daemon
level=info msg="  --bpf-lb-external-clusterip='false'" subsys=daemon
level=info msg="  --bpf-lb-maglev-hash-seed='JLfvgnHc2kaSUFaI'" subsys=daemon
level=info msg="  --bpf-lb-maglev-table-size='16381'" subsys=daemon
level=info msg="  --bpf-lb-map-max='65536'" subsys=daemon
level=info msg="  --bpf-lb-mode='snat'" subsys=daemon
level=info msg="  --bpf-lb-preserve-world-id='false'" subsys=daemon
level=info msg="  --bpf-lb-rss-ipv4-src-cidr=''" subsys=daemon
level=info msg="  --bpf-lb-rss-ipv6-src-cidr=''" subsys=daemon
level=info msg="  --bpf-lb-sock-hostns-only='false'" subsys=daemon
level=info msg="  --bpf-map-dynamic-size-ratio='0'" subsys=daemon
level=info msg="  --bpf-nat-global-max='524288'" subsys=daemon
level=info msg="  --bpf-neigh-global-max='524288'" subsys=daemon
level=info msg="  --bpf-policy-map-max='16384'" subsys=daemon
level=info msg="  --bpf-root=''" subsys=daemon
level=info msg="  --bpf-sock-rev-map-max='262144'" subsys=daemon
level=info msg="  --bypass-ip-availability-upon-restore='false'" subsys=daemon
level=info msg="  --certificates-directory='/var/run/cilium/certs'" subsys=daemon
level=info msg="  --cflags=''" subsys=daemon
level=info msg="  --cgroup-root='/run/cilium/cgroupv2'" subsys=daemon
level=info msg="  --cluster-health-port='4240'" subsys=daemon
level=info msg="  --cluster-id='0'" subsys=daemon
level=info msg="  --cluster-name='default'" subsys=daemon
level=info msg="  --clustermesh-config='/var/lib/cilium/clustermesh/'" subsys=daemon
level=info msg="  --cmdref=''" subsys=daemon
level=info msg="  --config=''" subsys=daemon
level=info msg="  --config-dir='/tmp/cilium/config-map'" subsys=daemon
level=info msg="  --conntrack-gc-interval='0s'" subsys=daemon
level=info msg="  --crd-wait-timeout='5m0s'" subsys=daemon
level=info msg="  --datapath-mode='veth'" subsys=daemon
level=info msg="  --debug='false'" subsys=daemon
level=info msg="  --debug-verbose=''" subsys=daemon
level=info msg="  --derive-masquerade-ip-addr-from-device=''" subsys=daemon
level=info msg="  --devices=''" subsys=daemon
level=info msg="  --direct-routing-device=''" subsys=daemon
level=info msg="  --disable-cnp-status-updates='true'" subsys=daemon
level=info msg="  --disable-conntrack='false'" subsys=daemon
level=info msg="  --disable-endpoint-crd='false'" subsys=daemon
level=info msg="  --disable-envoy-version-check='false'" subsys=daemon
level=info msg="  --disable-iptables-feeder-rules=''" subsys=daemon
level=info msg="  --dns-max-ips-per-restored-rule='1000'" subsys=daemon
level=info msg="  --dnsproxy-concurrency-limit='0'" subsys=daemon
level=info msg="  --dnsproxy-concurrency-processing-grace-period='0s'" subsys=daemon
level=info msg="  --egress-masquerade-interfaces=''" subsys=daemon
level=info msg="  --egress-multi-home-ip-rule-compat='false'" subsys=daemon
level=info msg="  --enable-auto-protect-node-port-range='true'" subsys=daemon
level=info msg="  --enable-bandwidth-manager='false'" subsys=daemon
level=info msg="  --enable-bpf-clock-probe='false'" subsys=daemon
level=info msg="  --enable-bpf-masquerade='false'" subsys=daemon
level=info msg="  --enable-bpf-tproxy='false'" subsys=daemon
level=info msg="  --enable-cilium-endpoint-slice='false'" subsys=daemon
level=info msg="  --enable-custom-calls='false'" subsys=daemon
level=info msg="  --enable-endpoint-health-checking='true'" subsys=daemon
level=info msg="  --enable-endpoint-routes='false'" subsys=daemon
level=info msg="  --enable-external-ips='true'" subsys=daemon
level=info msg="  --enable-health-check-nodeport='true'" subsys=daemon
level=info msg="  --enable-health-checking='true'" subsys=daemon
level=info msg="  --enable-host-firewall='false'" subsys=daemon
level=info msg="  --enable-host-legacy-routing='false'" subsys=daemon
level=info msg="  --enable-host-port='true'" subsys=daemon
level=info msg="  --enable-host-reachable-services='false'" subsys=daemon
level=info msg="  --enable-hubble='false'" subsys=daemon
level=info msg="  --enable-hubble-recorder-api='true'" subsys=daemon
level=info msg="  --enable-icmp-rules='false'" subsys=daemon
level=info msg="  --enable-identity-mark='true'" subsys=daemon
level=info msg="  --enable-ip-masq-agent='false'" subsys=daemon
level=info msg="  --enable-ipsec='false'" subsys=daemon
level=info msg="  --enable-ipv4='true'" subsys=daemon
level=info msg="  --enable-ipv4-egress-gateway='false'" subsys=daemon
level=info msg="  --enable-ipv4-fragment-tracking='true'" subsys=daemon
level=info msg="  --enable-ipv4-masquerade='true'" subsys=daemon
level=info msg="  --enable-ipv6='false'" subsys=daemon
level=info msg="  --enable-ipv6-masquerade='false'" subsys=daemon
level=info msg="  --enable-ipv6-ndp='false'" subsys=daemon
level=info msg="  --enable-k8s-api-discovery='false'" subsys=daemon
level=info msg="  --enable-k8s-endpoint-slice='true'" subsys=daemon
level=info msg="  --enable-k8s-event-handover='true'" subsys=daemon
level=info msg="  --enable-k8s-terminating-endpoint='true'" subsys=daemon
level=info msg="  --enable-l2-neigh-discovery='true'" subsys=daemon
level=info msg="  --enable-l7-proxy='true'" subsys=daemon
level=info msg="  --enable-local-node-route='true'" subsys=daemon
level=info msg="  --enable-local-redirect-policy='false'" subsys=daemon
level=info msg="  --enable-mke='false'" subsys=daemon
level=info msg="  --enable-monitor='true'" subsys=daemon
level=info msg="  --enable-node-port='true'" subsys=daemon
level=info msg="  --enable-policy='default'" subsys=daemon
level=info msg="  --enable-recorder='false'" subsys=daemon
level=info msg="  --enable-remote-node-identity='true'" subsys=daemon
level=info msg="  --enable-selective-regeneration='true'" subsys=daemon
level=info msg="  --enable-service-topology='false'" subsys=daemon
level=info msg="  --enable-session-affinity='false'" subsys=daemon
level=info msg="  --enable-svc-source-range-check='true'" subsys=daemon
level=info msg="  --enable-tracing='false'" subsys=daemon
level=info msg="  --enable-well-known-identities='true'" subsys=daemon
level=info msg="  --enable-wireguard='false'" subsys=daemon
level=info msg="  --enable-wireguard-userspace-fallback='false'" subsys=daemon
level=info msg="  --enable-xdp-prefilter='false'" subsys=daemon
level=info msg="  --enable-xt-socket-fallback='true'" subsys=daemon
level=info msg="  --encrypt-interface=''" subsys=daemon
level=info msg="  --encrypt-node='false'" subsys=daemon
level=info msg="  --endpoint-gc-interval='5m0s'" subsys=daemon
level=info msg="  --endpoint-interface-name-prefix='lxc+'" subsys=daemon
level=info msg="  --endpoint-queue-size='25'" subsys=daemon
level=info msg="  --endpoint-status=''" subsys=daemon
level=info msg="  --envoy-log=''" subsys=daemon
level=info msg="  --exclude-local-address=''" subsys=daemon
level=info msg="  --fixed-identity-mapping=''" subsys=daemon
level=info msg="  --force-local-policy-eval-at-source='true'" subsys=daemon
level=info msg="  --fqdn-regex-compile-lru-size='1024'" subsys=daemon
level=info msg="  --gops-port='9890'" subsys=daemon
level=info msg="  --host-reachable-services-protos='tcp,udp'" subsys=daemon
level=info msg="  --http-403-msg=''" subsys=daemon
level=info msg="  --http-idle-timeout='0'" subsys=daemon
level=info msg="  --http-max-grpc-timeout='0'" subsys=daemon
level=info msg="  --http-normalize-path='true'" subsys=daemon
level=info msg="  --http-request-timeout='3600'" subsys=daemon
level=info msg="  --http-retry-count='3'" subsys=daemon
level=info msg="  --http-retry-timeout='0'" subsys=daemon
level=info msg="  --hubble-disable-tls='false'" subsys=daemon
level=info msg="  --hubble-event-buffer-capacity='4095'" subsys=daemon
level=info msg="  --hubble-event-queue-size='0'" subsys=daemon
level=info msg="  --hubble-export-file-compress='false'" subsys=daemon
level=info msg="  --hubble-export-file-max-backups='5'" subsys=daemon
level=info msg="  --hubble-export-file-max-size-mb='10'" subsys=daemon
level=info msg="  --hubble-export-file-path=''" subsys=daemon
level=info msg="  --hubble-listen-address=''" subsys=daemon
level=info msg="  --hubble-metrics=''" subsys=daemon
level=info msg="  --hubble-metrics-server=''" subsys=daemon
level=info msg="  --hubble-recorder-sink-queue-size='1024'" subsys=daemon
level=info msg="  --hubble-recorder-storage-path='/var/run/cilium/pcaps'" subsys=daemon
level=info msg="  --hubble-socket-path='/var/run/cilium/hubble.sock'" subsys=daemon
level=info msg="  --hubble-tls-cert-file=''" subsys=daemon
level=info msg="  --hubble-tls-client-ca-files=''" subsys=daemon
level=info msg="  --hubble-tls-key-file=''" subsys=daemon
level=info msg="  --identity-allocation-mode='crd'" subsys=daemon
level=info msg="  --identity-change-grace-period='5s'" subsys=daemon
level=info msg="  --identity-restore-grace-period='10m0s'" subsys=daemon
level=info msg="  --install-iptables-rules='true'" subsys=daemon
level=info msg="  --install-no-conntrack-iptables-rules='false'" subsys=daemon
level=info msg="  --ip-allocation-timeout='2m0s'" subsys=daemon
level=info msg="  --ip-masq-agent-config-path='/etc/config/ip-masq-agent'" subsys=daemon
level=info msg="  --ipam='kubernetes'" subsys=daemon
level=info msg="  --ipsec-key-file=''" subsys=daemon
level=info msg="  --iptables-lock-timeout='5s'" subsys=daemon
level=info msg="  --iptables-random-fully='false'" subsys=daemon
level=info msg="  --ipv4-native-routing-cidr=''" subsys=daemon
level=info msg="  --ipv4-node='auto'" subsys=daemon
level=info msg="  --ipv4-pod-subnets=''" subsys=daemon
level=info msg="  --ipv4-range='auto'" subsys=daemon
level=info msg="  --ipv4-service-loopback-address='169.254.42.1'" subsys=daemon
level=info msg="  --ipv4-service-range='auto'" subsys=daemon
level=info msg="  --ipv6-cluster-alloc-cidr='f00d::/64'" subsys=daemon
level=info msg="  --ipv6-mcast-device=''" subsys=daemon
level=info msg="  --ipv6-node='auto'" subsys=daemon
level=info msg="  --ipv6-pod-subnets=''" subsys=daemon
level=info msg="  --ipv6-range='auto'" subsys=daemon
level=info msg="  --ipv6-service-range='auto'" subsys=daemon
level=info msg="  --ipvlan-master-device='undefined'" subsys=daemon
level=info msg="  --join-cluster='false'" subsys=daemon
level=info msg="  --k8s-api-server=''" subsys=daemon
level=info msg="  --k8s-heartbeat-timeout='30s'" subsys=daemon
level=info msg="  --k8s-kubeconfig-path=''" subsys=daemon
level=info msg="  --k8s-namespace='kube-system'" subsys=daemon
level=info msg="  --k8s-require-ipv4-pod-cidr='false'" subsys=daemon
level=info msg="  --k8s-require-ipv6-pod-cidr='false'" subsys=daemon
level=info msg="  --k8s-service-cache-size='128'" subsys=daemon
level=info msg="  --k8s-service-proxy-name=''" subsys=daemon
level=info msg="  --k8s-sync-timeout='3m0s'" subsys=daemon
level=info msg="  --k8s-watcher-endpoint-selector='metadata.name!=kube-scheduler,metadata.name!=kube-controller-manager,metadata.name!=etcd-operator,metadata.name!=gcp-controller-manager'" subsys=daemon
level=info msg="  --keep-config='false'" subsys=daemon
level=info msg="  --kube-proxy-replacement='strict'" subsys=daemon
level=info msg="  --kube-proxy-replacement-healthz-bind-address=''" subsys=daemon
level=info msg="  --kvstore='etcd'" subsys=daemon
level=info msg="  --kvstore-connectivity-timeout='2m0s'" subsys=daemon
level=info msg="  --kvstore-lease-ttl='15m0s'" subsys=daemon
level=info msg="  --kvstore-max-consecutive-quorum-errors='2'" subsys=daemon
level=info msg="  --kvstore-opt='{\"etcd.config\":,\"/var/lib/etcd-config/etcd.config\"}'" subsys=daemon
level=info msg="  --kvstore-periodic-sync='5m0s'" subsys=daemon
level=info msg="  --label-prefix-file=''" subsys=daemon
level=info msg="  --labels=''" subsys=daemon
level=info msg="  --lib-dir='/var/lib/cilium'" subsys=daemon
level=info msg="  --local-router-ipv4=''" subsys=daemon
level=info msg="  --local-router-ipv6=''" subsys=daemon
level=info msg="  --log-driver=''" subsys=daemon
level=info msg="  --log-opt=''" subsys=daemon
level=info msg="  --log-system-load='false'" subsys=daemon
level=info msg="  --max-controller-interval='0'" subsys=daemon
level=info msg="  --metrics=''" subsys=daemon
level=info msg="  --mke-cgroup-mount=''" subsys=daemon
level=info msg="  --monitor-aggregation='medium'" subsys=daemon
level=info msg="  --monitor-aggregation-flags='syn,fin,rst'" subsys=daemon
level=info msg="  --monitor-aggregation-interval='5s'" subsys=daemon
level=info msg="  --monitor-queue-size='0'" subsys=daemon
level=info msg="  --mtu='0'" subsys=daemon
level=info msg="  --nat46-range='0:0:0:0:0:FFFF::/96'" subsys=daemon
level=info msg="  --native-routing-cidr=''" subsys=daemon
level=info msg="  --node-port-acceleration='disabled'" subsys=daemon
level=info msg="  --node-port-algorithm='random'" subsys=daemon
level=info msg="  --node-port-bind-protection='true'" subsys=daemon
level=info msg="  --node-port-mode='snat'" subsys=daemon
level=info msg="  --node-port-range='30000,32767'" subsys=daemon
level=info msg="  --policy-audit-mode='false'" subsys=daemon
level=info msg="  --policy-queue-size='100'" subsys=daemon
level=info msg="  --policy-trigger-interval='1s'" subsys=daemon
level=info msg="  --pprof='false'" subsys=daemon
level=info msg="  --pprof-port='6060'" subsys=daemon
level=info msg="  --preallocate-bpf-maps='false'" subsys=daemon
level=info msg="  --prefilter-device='undefined'" subsys=daemon
level=info msg="  --prefilter-mode='native'" subsys=daemon
level=info msg="  --prepend-iptables-chains='true'" subsys=daemon
level=info msg="  --prometheus-serve-addr=''" subsys=daemon
level=info msg="  --proxy-connect-timeout='1'" subsys=daemon
level=info msg="  --proxy-gid='1337'" subsys=daemon
level=info msg="  --proxy-prometheus-port='0'" subsys=daemon
level=info msg="  --read-cni-conf=''" subsys=daemon
level=info msg="  --restore='true'" subsys=daemon
level=info msg="  --route-metric='0'" subsys=daemon
level=info msg="  --sidecar-istio-proxy-image='cilium/istio_proxy'" subsys=daemon
level=info msg="  --single-cluster-route='false'" subsys=daemon
level=info msg="  --socket-path='/var/run/cilium/cilium.sock'" subsys=daemon
level=info msg="  --sockops-enable='false'" subsys=daemon
level=info msg="  --state-dir='/var/run/cilium'" subsys=daemon
level=info msg="  --tofqdns-dns-reject-response-code='refused'" subsys=daemon
level=info msg="  --tofqdns-enable-dns-compression='true'" subsys=daemon
level=info msg="  --tofqdns-endpoint-max-ip-per-hostname='50'" subsys=daemon
level=info msg="  --tofqdns-idle-connection-grace-period='0s'" subsys=daemon
level=info msg="  --tofqdns-max-deferred-connection-deletes='10000'" subsys=daemon
level=info msg="  --tofqdns-min-ttl='0'" subsys=daemon
level=info msg="  --tofqdns-pre-cache=''" subsys=daemon
level=info msg="  --tofqdns-proxy-port='0'" subsys=daemon
level=info msg="  --tofqdns-proxy-response-max-delay='100ms'" subsys=daemon
level=info msg="  --trace-payloadlen='128'" subsys=daemon
level=info msg="  --tunnel='vxlan'" subsys=daemon
level=info msg="  --tunnel-port='0'" subsys=daemon
level=info msg="  --version='false'" subsys=daemon
level=info msg="  --vlan-bpf-bypass=''" subsys=daemon
level=info msg="  --write-cni-conf-when-ready=''" subsys=daemon
level=info msg="     _ _ _" subsys=daemon
level=info msg=" ___|_| |_|_ _ _____" subsys=daemon
level=info msg="|  _| | | | | |     |" subsys=daemon
level=info msg="|___|_|_|_|___|_|_|_|" subsys=daemon
level=info msg="Cilium 1.11.8 ff1ed93 2022-08-12T17:17:38+02:00 go version go1.17.13 linux/amd64" subsys=daemon
level=info msg="cilium-envoy  version: 3b70fad0b9514720f33db82841907821202c1f02/1.21.2/Distribution/RELEASE/BoringSSL" subsys=daemon
level=info msg="clang (10.0.0) and kernel (5.15.0) versions: OK!" subsys=linux-datapath
level=info msg="linking environment: OK!" subsys=linux-datapath
level=info msg="Detected mounted BPF filesystem at /sys/fs/bpf" subsys=bpf
level=info msg="Mounted cgroupv2 filesystem at /run/cilium/cgroupv2" subsys=cgroups
level=info msg="Parsing base label prefixes from default label list" subsys=labels-filter
level=info msg="Parsing additional label prefixes from user inputs: []" subsys=labels-filter
level=info msg="Final label prefixes to be used for identity evaluation:" subsys=labels-filter
level=info msg=" - reserved:.*" subsys=labels-filter
level=info msg=" - :io\\.kubernetes\\.pod\\.namespace" subsys=labels-filter
level=info msg=" - :io\\.cilium\\.k8s\\.namespace\\.labels" subsys=labels-filter
level=info msg=" - :app\\.kubernetes\\.io" subsys=labels-filter
level=info msg=" - !:io\\.kubernetes" subsys=labels-filter
level=info msg=" - !:kubernetes\\.io" subsys=labels-filter
level=info msg=" - !:.*beta\\.kubernetes\\.io" subsys=labels-filter
level=info msg=" - !:k8s\\.io" subsys=labels-filter
level=info msg=" - !:pod-template-generation" subsys=labels-filter
level=info msg=" - !:pod-template-hash" subsys=labels-filter
level=info msg=" - !:controller-revision-hash" subsys=labels-filter
level=info msg=" - !:annotation.*" subsys=labels-filter
level=info msg=" - !:etcd_node" subsys=labels-filter
level=info msg="Using autogenerated IPv4 allocation range" subsys=node v4Prefix=10.2.0.0/16
level=info msg="Initializing daemon" subsys=daemon
level=info msg="Establishing connection to apiserver" host="https://api.internal.cilium.k8s.local:443" subsys=k8s
level=info msg="Connected to apiserver" subsys=k8s
level=info msg="Trying to auto-enable \"enable-node-port\", \"enable-external-ips\", \"enable-host-reachable-services\", \"enable-host-port\", \"enable-session-affinity\" features" subsys=daemon
level=info msg="Inheriting MTU from external network interface" device=ens4 ipAddr=10.0.32.2 mtu=1460 subsys=mtu
level=info msg="Restored services from maps" failed=0 restored=3 subsys=service
level=info msg="Reading old endpoints..." subsys=daemon
level=info msg="Envoy: Starting xDS gRPC server listening on /var/run/cilium/xds.sock" subsys=envoy-manager
level=info msg="Reusing previous DNS proxy port: 42657" subsys=daemon
level=info msg="Waiting until all Cilium CRDs are available" subsys=k8s
level=info msg="All Cilium CRDs have been found and are available" subsys=k8s
level=info msg="Retrieved node information from kubernetes node" nodeName=nodes-europe-west3-1z9l subsys=k8s
level=info msg="Received own node information from API server" ipAddr.ipv4=10.0.32.2 ipAddr.ipv6="<nil>" k8sNodeIP=10.0.32.2 labels="map[beta.kubernetes.io/arch:amd64 beta.kubernetes.io/instance-type:n1-standard-8 beta.kubernetes.io/os:linux failure-domain.beta.kubernetes.io/region:europe-west3 failure-domain.beta.kubernetes.io/zone:europe-west3-a kubernetes.io/arch:amd64 kubernetes.io/hostname:nodes-europe-west3-1z9l kubernetes.io/os:linux node-role.kubernetes.io/node: node.kubernetes.io/instance-type:n1-standard-8 topology.gke.io/zone:europe-west3-a topology.kubernetes.io/region:europe-west3 topology.kubernetes.io/zone:europe-west3-a]" nodeName=nodes-europe-west3-1z9l subsys=k8s v4Prefix=100.96.4.0/24 v6Prefix="<nil>"
level=info msg="Restored router IPs from node information" ipv4=100.96.4.18 ipv6="<nil>" subsys=k8s
level=info msg="k8s mode: Allowing localhost to reach local endpoints" subsys=daemon
level=info msg="Direct routing device detected" direct-routing-device=ens4 subsys=daemon
level=info msg="Detected devices" devices="[ens4]" subsys=daemon
level=info msg="BPF host routing requires enable-bpf-masquerade. Falling back to legacy host routing (enable-host-legacy-routing=true)." subsys=daemon
level=info msg="Enabling k8s event listener" subsys=k8s-watcher
level=info msg="Removing stale endpoint interfaces" subsys=daemon
level=info msg="Waiting until all pre-existing resources have been received" subsys=k8s-watcher
level=info msg="Creating etcd client" ConfigPath=/var/lib/etcd-config/etcd.config KeepAliveHeartbeat=15s KeepAliveTimeout=25s RateLimit=20 subsys=kvstore
level=info msg="Connecting to etcd server..." config=/var/lib/etcd-config/etcd.config endpoints="[https://api.internal.cilium.k8s.local:4003]" subsys=kvstore
level=info msg="Restored router address from node_config" file=/var/run/cilium/state/globals/node_config.h ipv4=100.96.4.18 ipv6="<nil>" subsys=node
level=info msg="Initializing node addressing" subsys=daemon
level=info msg="Initializing kubernetes IPAM" subsys=ipam v4Prefix=100.96.4.0/24 v6Prefix="<nil>"
level=info msg="Restoring endpoints..." subsys=daemon
level=info msg="Endpoints restored" failed=0 restored=3 subsys=daemon
level=info msg="Addressing information:" subsys=daemon
level=info msg="  Cluster-Name: default" subsys=daemon
level=info msg="  Cluster-ID: 0" subsys=daemon
level=info msg="  Local node-name: nodes-europe-west3-1z9l" subsys=daemon
level=info msg="  Node-IPv6: <nil>" subsys=daemon
level=info msg="  External-Node IPv4: 10.0.32.2" subsys=daemon
level=info msg="  Internal-Node IPv4: 100.96.4.18" subsys=daemon
level=info msg="  IPv4 allocation prefix: 100.96.4.0/24" subsys=daemon
level=info msg="  Loopback IPv4: 169.254.42.1" subsys=daemon
level=info msg="  Local IPv4 addresses:" subsys=daemon
level=info msg="  - 10.0.32.2" subsys=daemon
level=info msg="  - 100.96.4.18" subsys=daemon
level=info msg="Creating or updating CiliumNode resource" node=nodes-europe-west3-1z9l subsys=nodediscovery
level=info msg="Adding local node to cluster" node="{nodes-europe-west3-1z9l default [{InternalIP 10.0.32.2} {CiliumInternalIP 100.96.4.18}] 100.96.4.0/24 <nil> 100.96.4.187 <nil> 0 local 0 map[beta.kubernetes.io/arch:amd64 beta.kubernetes.io/instance-type:n1-standard-8 beta.kubernetes.io/os:linux failure-domain.beta.kubernetes.io/region:europe-west3 failure-domain.beta.kubernetes.io/zone:europe-west3-a kubernetes.io/arch:amd64 kubernetes.io/hostname:nodes-europe-west3-1z9l kubernetes.io/os:linux node-role.kubernetes.io/node: node.kubernetes.io/instance-type:n1-standard-8 topology.gke.io/zone:europe-west3-a topology.kubernetes.io/region:europe-west3 topology.kubernetes.io/zone:europe-west3-a] 1 }" subsys=nodediscovery
level=info msg="All pre-existing resources have been received; continuing" subsys=k8s-watcher
level=info msg="Annotating k8s node" subsys=daemon v4CiliumHostIP.IPv4=100.96.4.18 v4Prefix=100.96.4.0/24 v4healthIP.IPv4=100.96.4.187 v6CiliumHostIP.IPv6="<nil>" v6Prefix="<nil>" v6healthIP.IPv6="<nil>"
level=info msg="Initializing identity allocator" subsys=identity-cache
level=info msg="Cluster-ID is not specified, skipping ClusterMesh initialization" subsys=daemon
level=info msg="Setting up BPF datapath" bpfClockSource=ktime bpfInsnSet=v3 subsys=datapath-loader
level=info msg="Setting sysctl" subsys=sysctl sysParamName=net.core.bpf_jit_enable sysParamValue=1
level=info msg="Setting sysctl" subsys=sysctl sysParamName=net.ipv4.conf.all.rp_filter sysParamValue=0
level=info msg="Setting sysctl" subsys=sysctl sysParamName=net.ipv4.fib_multipath_use_neigh sysParamValue=1
level=info msg="Setting sysctl" subsys=sysctl sysParamName=kernel.unprivileged_bpf_disabled sysParamValue=1
level=info msg="Setting sysctl" subsys=sysctl sysParamName=kernel.timer_migration sysParamValue=0
level=info msg="Setting sysctl" subsys=sysctl sysParamName=net.ipv4.conf.cilium_host.forwarding sysParamValue=1
level=info msg="Setting sysctl" subsys=sysctl sysParamName=net.ipv4.conf.cilium_host.rp_filter sysParamValue=0
level=info msg="Setting sysctl" subsys=sysctl sysParamName=net.ipv4.conf.cilium_host.accept_local sysParamValue=1
level=info msg="Setting sysctl" subsys=sysctl sysParamName=net.ipv4.conf.cilium_host.send_redirects sysParamValue=0
level=info msg="Setting sysctl" subsys=sysctl sysParamName=net.ipv4.conf.cilium_net.forwarding sysParamValue=1
level=info msg="Setting sysctl" subsys=sysctl sysParamName=net.ipv4.conf.cilium_net.rp_filter sysParamValue=0
level=info msg="Setting sysctl" subsys=sysctl sysParamName=net.ipv4.conf.cilium_net.accept_local sysParamValue=1
level=info msg="Setting sysctl" subsys=sysctl sysParamName=net.ipv4.conf.cilium_net.send_redirects sysParamValue=0
level=info msg="regenerating all endpoints" reason="one or more identities created or deleted" subsys=endpoint-manager
level=info msg="regenerating all endpoints" reason="kube-apiserver identity updated" subsys=endpoint-manager
level=info msg="regenerating all endpoints" reason="kube-apiserver identity updated" subsys=endpoint-manager
level=info msg="regenerating all endpoints" reason="kube-apiserver identity updated" subsys=endpoint-manager
level=info msg="Iptables rules installed" subsys=iptables
level=info msg="Adding new proxy port rules for cilium-dns-egress:42657" proxy port name=cilium-dns-egress subsys=proxy
level=info msg="Iptables proxy rules installed" subsys=iptables
level=info msg="Serving cilium node monitor v1.2 API at unix:///var/run/cilium/monitor1_2.sock" subsys=monitor-agent
level=info msg="Validating configured node address ranges" subsys=daemon
level=info msg="Starting connection tracking garbage collector" subsys=daemon
level=info msg="Starting IP identity watcher" subsys=ipcache
level=info msg="Initial scan of connection tracking completed" subsys=ct-gc
level=info msg="Regenerating restored endpoints" numRestored=3 subsys=daemon
level=info msg="Conntrack garbage collector interval recalculated" deleteRatio=0.0003662109375 newInterval=7m30s subsys=map-ct
level=info msg="Datapath signal listener running" subsys=signal
level=info msg="New endpoint" containerID=a717e743ce datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=1544 identity=22440 ipv4=100.96.4.72 ipv6= k8sPodName=kube-system/coredns-77bcddd996-9zggf subsys=endpoint
level=info msg="New endpoint" containerID= datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=89 identity=1 ipv4= ipv6= k8sPodName=/ subsys=endpoint
level=info msg="New endpoint" containerID=8db7bc5171 datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=83 identity=9498 ipv4=100.96.4.177 ipv6= k8sPodName=kube-system/coredns-autoscaler-975545559-hqfrm subsys=endpoint
level=info msg="Successfully restored endpoint. Scheduling regeneration" endpointID=83 subsys=daemon
level=info msg="Successfully restored endpoint. Scheduling regeneration" endpointID=89 subsys=daemon
level=info msg="Successfully restored endpoint. Scheduling regeneration" endpointID=1544 subsys=daemon
level=info msg="Reusing existing global key" key="k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=kube-system;k8s:io.cilium.k8s.policy.cluster=default;k8s:io.cilium.k8s.policy.serviceaccount=coredns;k8s:io.kubernetes.pod.namespace=kube-system;k8s:k8s-app=kube-dns;k8s:kops.k8s.io/managed-by=kops;" subsys=allocator
level=info msg="Reusing existing global key" key="k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=kube-system;k8s:io.cilium.k8s.policy.cluster=default;k8s:io.cilium.k8s.policy.serviceaccount=coredns-autoscaler;k8s:io.kubernetes.pod.namespace=kube-system;k8s:k8s-app=coredns-autoscaler;k8s:kops.k8s.io/managed-by=kops;" subsys=allocator
level=info msg="Removed endpoint" containerID= datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=2066 identity=4 ipv4=100.96.4.125 ipv6= k8sPodName=/ subsys=endpoint
level=info msg="Launching Cilium health daemon" subsys=daemon
level=info msg="Launching Cilium health endpoint" subsys=daemon
level=info msg="Started healthz status API server" address="127.0.0.1:9879" subsys=daemon
level=info msg="Initializing Cilium API" subsys=daemon
level=info msg="Daemon initialization completed" bootstrapTime=15.796690621s subsys=daemon
level=info msg="Hubble server is disabled" subsys=hubble
level=info msg="Serving cilium API at unix:///var/run/cilium/cilium.sock" subsys=daemon
level=info msg="New endpoint" containerID= datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=1839 ipv4= ipv6= k8sPodName=/ subsys=endpoint
level=info msg="Resolving identity labels (blocking)" containerID= datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=1839 identityLabels="reserved:health" ipv4= ipv6= k8sPodName=/ subsys=endpoint
level=info msg="Identity of endpoint changed" containerID= datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=1839 identity=4 identityLabels="reserved:health" ipv4= ipv6= k8sPodName=/ oldIdentity="no identity" subsys=endpoint
{"level":"warn","ts":"2022-11-24T14:00:04.350Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0004b76c0/api.internal.cilium.k8s.local:4003","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
level=info msg="Compiled new BPF template" BPFCompilationTime=6.739865291s file-path=/var/run/cilium/state/templates/f70528e01c114c09b4eaaef92b4e2ad28a64a08ac8dc90eb77a86c5ffdbe0913/bpf_lxc.o subsys=datapath-loader
level=info msg="Rewrote endpoint BPF program" containerID= datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=1839 identity=4 ipv4= ipv6= k8sPodName=/ subsys=endpoint
{"level":"warn","ts":"2022-11-24T14:00:34.081Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0004b76c0/api.internal.cilium.k8s.local:4003","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
{"level":"warn","ts":"2022-11-24T14:01:03.813Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0004b76c0/api.internal.cilium.k8s.local:4003","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
level=info msg="Serving cilium health API at unix:///var/run/cilium/health.sock" subsys=health-server
{"level":"warn","ts":"2022-11-24T14:01:33.544Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0004b76c0/api.internal.cilium.k8s.local:4003","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
{"level":"warn","ts":"2022-11-24T14:02:03.276Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0004b76c0/api.internal.cilium.k8s.local:4003","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
{"level":"warn","ts":"2022-11-24T14:02:33.008Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0004b76c0/api.internal.cilium.k8s.local:4003","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
level=error msg="Unable to initialize local node. Retrying..." error="timeout while retrieving initial list of objects from kvstore" subsys=nodediscovery
{"level":"warn","ts":"2022-11-24T14:03:02.740Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0004b76c0/api.internal.cilium.k8s.local:4003","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
{"level":"warn","ts":"2022-11-24T14:03:32.471Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0004b76c0/api.internal.cilium.k8s.local:4003","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}

9. Anything else we need to know?

I've inspected the node-to-master-<> firewall rule and it does not include port 4003. I added port 4003 to the firewall rule by hand; after that, the pods on the worker nodes stopped restarting and there were no more error messages in the logs.
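
For reference, a rough sketch of checking and patching the rule with gcloud (the rule name below is illustrative; kops derives it from the cluster name):

❯ gcloud compute firewall-rules describe node-to-master-my-fake-name-k8s-local

# --allow replaces the whole list, so re-specify the existing protocols/ports plus tcp:4003
❯ gcloud compute firewall-rules update node-to-master-my-fake-name-k8s-local \
    --allow "<existing protocols/ports>,tcp:4003"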

hakman commented 1 year ago

Thanks for reporting this @fbozic. The fix for the issue should be part of today's releases: https://github.com/kubernetes/kops/releases/tag/v1.24.5 https://github.com/kubernetes/kops/releases/tag/v1.25.3
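
For anyone hitting this on an existing cluster, a minimal sketch of picking up the fixed release after installing kops >= 1.25.3 (state store and cluster name as above); the rolling update may not be strictly required for a change that only adds a firewall rule:

❯ kops update cluster --name my-fake-name.k8s.local --state gs://my-fake-state-store --yes
❯ kops rolling-update cluster --name my-fake-name.k8s.local --state gs://my-fake-state-store --yes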

fbozic commented 1 year ago

That was really fast @hakman, thank you.