projectcalico / vpp-dataplane

VPP dataplane implementation for Calico
Apache License 2.0

Kubernetes Master I/O Timeout #202

Closed infinitydon closed 3 years ago

infinitydon commented 3 years ago

Hi,

I'm currently trying to install calico-vpp on a bare-metal deployment; the details of my setup are below.

The calico-kube-controllers pod keeps crashing; status of the pods:

NAME                                         READY   STATUS             RESTARTS   AGE    IP              NODE            NOMINATED NODE   READINESS GATES
calico-kube-controllers-57c5b6487c-khsdx     0/1     Running            23         97m    192.168.41.1    rke-master-01   <none>           <none>
calico-vpp-node-v2ztb                        2/2     Running            0          97m    172.18.56.50    rke-master-01   <none>           <none>
etcd-rke-master-01                           1/1     Running            0          110m   172.18.56.50    rke-master-01   <none>           <none>
helm-install-rke2-coredns-msgfz              0/1     Completed          0          111m   172.18.56.50    rke-master-01   <none>           <none>
helm-install-rke2-ingress-nginx-mgv64        0/1     CrashLoopBackOff   18         111m   192.168.41.53   rke-master-01   <none>           <none>
helm-install-rke2-kube-proxy-gxcxq           0/1     Completed          0          111m   172.18.56.50    rke-master-01   <none>           <none>
helm-install-rke2-metrics-server-hdlnq       0/1     CrashLoopBackOff   18         111m   192.168.41.51   rke-master-01   <none>           <none>
kube-apiserver-rke-master-01                 1/1     Running            0          110m   172.18.56.50    rke-master-01   <none>           <none>
kube-controller-manager-rke-master-01        1/1     Running            0          110m   172.18.56.50    rke-master-01   <none>           <none>
kube-proxy-swvm9                             1/1     Running            0          110m   172.18.56.50    rke-master-01   <none>           <none>
kube-scheduler-rke-master-01                 1/1     Running            0          110m   172.18.56.50    rke-master-01   <none>           <none>
rke2-coredns-rke2-coredns-65d668ddf9-rw2nv   0/1     Running            0          110m   192.168.41.44   rke-master-01   <none>           <none>

Log output from the calico kube controller:

2021-07-12 10:22:44.190 [INFO][1] main.go 88: Loaded configuration from environment config=&config.Config{LogLevel:"info", WorkloadEndpointWorkers:1, ProfileWorkers:1, PolicyWorkers:1, NodeWorkers:1, Kubeconfig:"", DatastoreType:"kubernetes"}
W0712 10:22:44.192511       1 client_config.go:543] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
2021-07-12 10:22:44.193 [INFO][1] main.go 109: Ensuring Calico datastore is initialized
2021-07-12 10:22:54.193 [ERROR][1] client.go 261: Error getting cluster information config ClusterInformation="default" error=Get "https://10.96.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
2021-07-12 10:22:54.194 [FATAL][1] main.go 114: Failed to initialize Calico datastore error=Get "https://10.96.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
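
A quick way to confirm the symptom is to compare reachability of the service ClusterIP with reachability of the API server itself (just a sketch, assuming the default 10.96.0.0/12 service CIDR, a pullable busybox image, and that RKE2's API server listens on 6443; the node IP is taken from the pod listing above):

# From a pod on the pod network: can the kubernetes service ClusterIP be reached at all?
kubectl run netcheck --rm -it --image=busybox --restart=Never -- \
  wget -qO- --no-check-certificate https://10.96.0.1:443/version

# For comparison, hit the API server directly on the node address:
curl -k https://172.18.56.50:6443/version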

CoreDNS shows that communication with the k8s master is failing:

I0712 10:24:39.691009       1 trace.go:116] Trace[1080751526]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.17.4/tools/cache/reflector.go:105 (started: 2021-07-12 10:24:09.689693225 +0000 UTC m=+6139.198133169) (total time: 30.001259358s):
Trace[1080751526]: [30.001259358s] [30.001259358s] END
E0712 10:24:39.691063       1 reflector.go:153] pkg/mod/k8s.io/client-go@v0.17.4/tools/cache/reflector.go:105: Failed to list *v1.Service: Get "https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
I0712 10:24:39.724351       1 trace.go:116] Trace[1713144196]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.17.4/tools/cache/reflector.go:105 (started: 2021-07-12 10:24:09.723314287 +0000 UTC m=+6139.231754311) (total time: 30.000963097s):
Trace[1713144196]: [30.000963097s] [30.000963097s] END
E0712 10:24:39.724449       1 reflector.go:153] pkg/mod/k8s.io/client-go@v0.17.4/tools/cache/reflector.go:105: Failed to list *v1.Endpoints: Get "https://10.96.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
I0712 10:24:39.730202       1 trace.go:116] Trace[1171208825]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.17.4/tools/cache/reflector.go:105 (started: 2021-07-12 10:24:09.727949723 +0000 UTC m=+6139.236389751) (total time: 30.001052086s):
Trace[1171208825]: [30.001052086s] [30.001052086s] END
E0712 10:24:39.730279       1 reflector.go:153] pkg/mod/k8s.io/client-go@v0.17.4/tools/cache/reflector.go:105: Failed to list *v1.Namespace: Get "https://10.96.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
[INFO] plugin/ready: Still waiting on: "kubernetes"

I have tried different pod CIDRs, but the issue remains the same. Any assistance will be appreciated.
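
One thing worth cross-checking at this point is that the service_prefix configured for calico-vpp matches the service CIDR the API server was started with, since a mismatch can produce exactly this kind of timeout to 10.96.0.1. A sketch of the check follows; the ConfigMap name, namespace, and key name are assumptions based on the calico-vpp manifests of that era and may differ in your install:

# Service CIDR the API server was started with:
kubectl -n kube-system get pod kube-apiserver-rke-master-01 -o yaml | grep service-cluster-ip-range

# Locate the calico-vpp ConfigMap (name/namespace depend on the manifest applied):
kubectl get configmap -A | grep -i vpp

# Then check that its service_prefix matches the flag above, for example:
kubectl -n calico-vpp-dataplane get configmap calico-vpp-config -o yaml | grep service_prefix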

AloysAugustin commented 3 years ago

Hi @infinitydon, could you share the logs of both containers in the calico-vpp-node pod? Also, I realized the wiki wasn't up to date (I'll update it to point to the Calico docs, which are now authoritative). You may have more luck with the latest version described there: https://docs.projectcalico.org/getting-started/kubernetes/vpp/getting-started (the "Install on any cluster" tab). There should be very few changes between the wiki and this doc, apart from the calico-vpp version.

infinitydon commented 3 years ago

@AloysAugustin - Thanks for the response. I have tried the installation using the link you shared, but the issue remains.

Attached are the vpp, vpp-agent, calico-node, and calico-controller logs:

calico-kube-controller-log.txt calico-node-log.txt vpp-agent-log.txt vpp-log.txt

AloysAugustin commented 3 years ago

Hi @infinitydon, sorry for the delay in getting back to you. It looks like vpp-log.txt and vpp-agent-log.txt are the same; it would be good if you could repost the logs from the vpp container to be sure. From what I can see in these, everything looks nominal.

I tried to reproduce the issue in a similar single-node setup, but I didn't encounter the same problem. At this point I think we should start tracing the packets to understand where they are dropped: https://docs.projectcalico.org/maintenance/troubleshoot/vpp. This can be a bit of a tricky process; let us know if we can help at any point.
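
For reference, a minimal tracing sketch (assumptions: the dataplane pod runs in kube-system as in the listing above, the VPP container is named vpp, and the uplink uses the af_packet driver; the input node to trace depends on the uplink driver, e.g. virtio-input or dpdk-input):

# Start capturing packets arriving on the uplink, then reproduce the failing request:
kubectl exec -it -n kube-system calico-vpp-node-v2ztb -c vpp -- vppctl trace add af-packet-input 50

# ... reproduce the failure (e.g. a request to https://10.96.0.1:443) ...

# Dump, then clear, the captured trace:
kubectl exec -it -n kube-system calico-vpp-node-v2ztb -c vpp -- vppctl show trace
kubectl exec -it -n kube-system calico-vpp-node-v2ztb -c vpp -- vppctl clear trace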

infinitydon commented 3 years ago

@AloysAugustin - For comparison's sake, could you kindly share details about your setup (virtualization software, Linux distro and kernel version, k8s distro and version, pod CIDR, service CIDR, the physical IP used to communicate with the master, etc.)?

AloysAugustin commented 3 years ago

Sure, it was a single-node cluster with:

infinitydon commented 3 years ago

Thanks @AloysAugustin

I just figured out where I mixed it up:

The address configured on this interface must be the node address in Kubernetes (kubectl get nodes -o wide).

My mind was somehow still stuck on the idea that we could pass a separate network interface to VPP (apart from the one that k8s was initialized with).
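
Concretely, the uplink handed to VPP has to be the interface that carries the node address shown by kubectl get nodes -o wide (172.18.56.50 here). A sketch of the relevant ConfigMap entries follows; the key names are assumptions based on the calico-vpp manifests of that time and the interface name is hypothetical, so double-check against the yaml that was actually applied:

# Excerpt from the calico-vpp ConfigMap (illustrative values):
service_prefix: 10.96.0.0/12        # must match the cluster's service CIDR
vpp_dataplane_interface: enp1s0     # the interface that holds the node IP (172.18.56.50)
vpp_uplink_driver: ""               # empty selects a default; can be set to af_packet, af_xdp, virtio, dpdk, ...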

So far I have tested virtio and af_xdp; both seem to work okay.

But dpdk does not seem to work; I think it's better to open a new issue to track that.