projectcalico / calico

Cloud native networking and network security
https://docs.tigera.io/calico/latest/about/
Apache License 2.0

Context deadline exceeded error while accessing webhook #3809

Closed · satishrao84 closed this issue 4 years ago

satishrao84 commented 4 years ago

I have a Kubernetes 1.17.8 five-node cluster with Calico 3.11. There seem to be issues with general connectivity to services while deploying. For example, when trying to deploy a resource, a call to an admission controller webhook fails with a "context deadline exceeded" error. Below are the logs from the kube-apiserver:

I0719 17:40:57.210006 1 endpoint.go:68] ccResolverWrapper: sending new addresses to cc: [{https://127.0.0.1:2379 0 }]
I0719 17:40:57.242676 1 client.go:361] parsed scheme: "endpoint"
I0719 17:40:57.242719 1 endpoint.go:68] ccResolverWrapper: sending new addresses to cc: [{https://127.0.0.1:2379 0 }]
I0719 17:46:09.471881 1 trace.go:116] Trace[1869591222]: "Call mutating webhook" configuration:couchbase-operator-admission,webhook:couchbase-operator-admission.default.svc,resource:couchbase.com/v2, Resource=couchbasebuckets,subresource:,operation:CREATE,UID:91264f7a-7844-4e82-af64-4bce312f2dc8 (started: 2020-07-19 17:45:39.471540102 +0000 UTC m=+505595.125356740) (total time: 30.00024599s):
Trace[1869591222]: [30.00024599s] [30.00024599s] END
W0719 17:46:09.471959 1 dispatcher.go:168] Failed calling webhook, failing open couchbase-operator-admission.default.svc: failed calling webhook "couchbase-operator-admission.default.svc": Post https://couchbase-operator-admission.default.svc:443/couchbaseclusters/mutate?timeout=30s: context deadline exceeded
E0719 17:46:09.471986 1 dispatcher.go:169] failed calling webhook "couchbase-operator-admission.default.svc": Post https://couchbase-operator-admission.default.svc:443/couchbaseclusters/mutate?timeout=30s: context deadline exceeded
I0719 17:46:13.471113 1 trace.go:116] Trace[1719736496]: "Call validating webhook" configuration:couchbase-operator-admission,webhook:couchbase-operator-admission.default.svc,resource:couchbase.com/v2, Resource=couchbasebuckets,subresource:,operation:CREATE,UID:bc50c0a0-82b3-4718-86de-f87375ea6c53 (started: 2020-07-19 17:46:09.472365456 +0000 UTC m=+505625.126182007) (total time: 3.998680644s):
Trace[1719736496]: [3.998680644s] [3.998680644s] END
W0719 17:46:13.471168 1 dispatcher.go:128] Failed calling webhook, failing open couchbase-operator-admission.default.svc: failed calling webhook "couchbase-operator-admission.default.svc": Post https://couchbase-operator-admission.default.svc:443/couchbaseclusters/validate?timeout=4s: context deadline exceeded
E0719 17:46:13.471196 1 dispatcher.go:129] failed calling webhook "couchbase-operator-admission.default.svc": Post https://couchbase-operator-admission.default.svc:443/couchbaseclusters/validate?timeout=4s: context deadline exceeded
E0719 17:46:13.475362 1 status.go:71] apiserver received an error that is not an metav1.Status: context.deadlineExceededError{}
I0719 17:46:13.475738 1 trace.go:116] Trace[378760673]: "Create" url:/apis/couchbase.com/v2/namespaces/default/couchbasebuckets,user-agent:kubectl/v1.17.8 (linux/amd64) kubernetes/35dc4cd,client:10.22.76.244 (started: 2020-07-19 17:45:39.470793251 +0000 UTC m=+505595.124609780) (total time: 34.004906277s):
Trace[378760673]: [30.001251082s] [30.000964091s] About to store object in database
Trace[378760673]: [34.004906277s] [4.003655195s] END
I0719 17:46:43.481534 1 trace.go:116] Trace[1385062493]: "Call mutating webhook" configuration:couchbase-operator-admission,webhook:couchbase-operator-admission.default.svc,resource:couchbase.com/v2, Resource=couchbaseclusters,subresource:,operation:CREATE,UID:162d9f9d-890d-4243-9fa5-5a054b713220 (started: 2020-07-19 17:46:13.481299836 +0000 UTC m=+505629.135116403) (total time: 30.000185091s):
Trace[1385062493]: [30.000185091s] [30.000185091s] END
W0719 17:46:43.481607 1 dispatcher.go:168] Failed calling webhook, failing open couchbase-operator-admission.default.svc: failed calling webhook "couchbase-operator-admission.default.svc": Post https://couchbase-operator-admission.default.svc:443/couchbaseclusters/mutate?timeout=30s: context deadline exceeded
E0719 17:46:43.481633 1 dispatcher.go:169] failed calling webhook "couchbase-operator-admission.default.svc": Post https://couchbase-operator-admission.default.svc:443/couchbaseclusters/mutate?timeout=30s: context deadline exceeded
I0719 17:46:47.481061 1 trace.go:116] Trace[373230098]: "Call validating webhook" configuration:couchbase-operator-admission,webhook:couchbase-operator-admission.default.svc,resource:couchbase.com/v2, Resource=couchbaseclusters,subresource:,operation:CREATE,UID:1896885d-da68-4c65-be4b-9f4a5359ca95 (started: 2020-07-19 17:46:43.482220459 +0000 UTC m=+505659.136036982) (total time: 3.998791779s):
Trace[373230098]: [3.998791779s] [3.998791779s] END
W0719 17:46:47.481105 1 dispatcher.go:128] Failed calling webhook, failing open couchbase-operator-admission.default.svc: failed calling webhook "couchbase-operator-admission.default.svc": Post https://couchbase-operator-admission.default.svc:443/couchbaseclusters/validate?timeout=4s: context deadline exceeded
E0719 17:46:47.481123 1 dispatcher.go:129] failed calling webhook "couchbase-operator-admission.default.svc": Post https://couchbase-operator-admission.default.svc:443/couchbaseclusters/validate?timeout=4s: context deadline exceeded
E0719 17:46:47.483563 1 status.go:71] apiserver received an error that is not an metav1.Status: context.deadlineExceededError{}
I0719 17:46:47.483793 1 trace.go:116] Trace[83637915]: "Create" url:/apis/couchbase.com/v2/namespaces/default/couchbaseclusters,user-agent:kubectl/v1.17.8 (linux/amd64) kubernetes/35dc4cd,client:10.22.76.244 (started: 2020-07-19 17:46:13.480783912 +0000 UTC m=+505629.134600444) (total time: 34.002982294s):
Trace[83637915]: [30.000917697s] [30.000529434s] About to store object in database
Trace[83637915]: [34.002982294s] [4.002064597s] END
I0719 17:52:59.286869 1 trace.go:116] Trace[140515269]: "Get" url:/api/v1/namespaces/kube-system/pods/calico-kube-controllers-58c67bc699-zg88w/log,user-agent:kubectl/v1.17.8 (linux/amd64) kubernetes/35dc4cd,client:10.22.76.244 (started: 2020-07-19 17:49:47.539335812 +0000 UTC m=+505843.193152325) (total time: 3m11.747440224s):
Trace[140515269]: [3m11.747438263s] [3m11.745459258s] Transformed response object
I0719 17:54:10.931639 1 trace.go:116] Trace[1070139994]: "Get" url:/api/v1/namespaces/kube-system/pods/calico-kube-controllers-58c67bc699-zg88w/log,user-agent:kubectl/v1.17.8 (linux/amd64) kubernetes/35dc4cd,client:10.22.76.244 (started: 2020-07-19 17:53:55.336563795 +0000 UTC m=+506090.990380369) (total time: 15.595006355s):
Trace[1070139994]: [15.595004759s] [15.593091615s] Transformed response object
I0719 17:54:53.898575 1 trace.go:116] Trace[1957621698]: "Get" url:/api/v1/namespaces/kube-system/pods/calico-node-ljhxz/log,user-agent:kubectl/v1.17.8 (linux/amd64) kubernetes/35dc4cd,client:10.22.76.244 (started: 2020-07-19 17:54:51.997584286 +0000 UTC m=+506147.651400820) (total time: 1.900944205s):
Trace[1957621698]: [1.90094165s] [1.897939411s] Transformed response object
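As a sanity check (assuming the webhook Service name and namespace match the URL couchbase-operator-admission.default.svc in the log above), the following should show whether the Service exists and has backing endpoints; an empty endpoints list would point at the webhook deployment rather than the network:

kubectl get svc couchbase-operator-admission -n default
kubectl get endpoints couchbase-operator-admission -n default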

The Calico components are deployed in the kube-system namespace:

[root@lpdkubpoc01a couchbase-autonomous-operator-kubernetes_2.0.1-linux-x86_64]# kubectl get pods -n kube-system
NAME                                                READY   STATUS    RESTARTS   AGE
calico-kube-controllers-58c67bc699-zg88w            1/1     Running   0          5d20h
calico-node-97c46                                   0/1     Running   0          5d20h
calico-node-b2srp                                   0/1     Running   0          5d20h
calico-node-ljhxz                                   1/1     Running   0          5d19h
calico-node-qbxzg                                   0/1     Running   0          5d19h
calico-node-rlrhh                                   1/1     Running   0          5d19h
coredns-598947db54-k4jq7                            1/1     Running   0          5d20h
coredns-598947db54-pjrpc                            1/1     Running   0          5d20h
etcd-lpdkubpoc01a.phx.aexp.com                      1/1     Running   0          5d20h
kube-apiserver-lpdkubpoc01a.phx.aexp.com            1/1     Running   0          5d20h
kube-controller-manager-lpdkubpoc01a.phx.aexp.com   1/1     Running   0          5d20h
kube-proxy-59r5g                                    1/1     Running   0          5d20h
kube-proxy-rfd2t                                    1/1     Running   0          5d20h
kube-proxy-tj6x5                                    1/1     Running   0          5d19h
kube-proxy-wlf2f                                    1/1     Running   0          5d19h
kube-proxy-xjgzt                                    1/1     Running   0          5d19h
kube-scheduler-lpdkubpoc01a.phx.aexp.com            1/1     Running   0          5d20h
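To dig into why three of the calico-node pods report 0/1, something like the following should surface the failing readiness probe (pod name taken from the listing above; the -c calico-node container name assumes the stock Calico manifest):

kubectl describe pod calico-node-97c46 -n kube-system                  # check the Events section for readiness probe failures
kubectl logs calico-node-97c46 -n kube-system -c calico-node --tail 50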

I see no logs coming from the CoreDNS pods:

[root@lpdkubpoc01a couchbase-autonomous-operator-kubernetes_2.0.1-linux-x86_64]# kubectl logs coredns-598947db54-k4jq7  -n kube-system --tail 100
[root@lpdkubpoc01a couchbase-autonomous-operator-kubernetes_2.0.1-linux-x86_64]# kubectl logs coredns-598947db54-pjrpc   -n kube-system --tail 100
[root@lpdkubpoc01a couchbase-autonomous-operator-kubernetes_2.0.1-linux-x86_64]# 

I don't see any logs from Calico's kube-controllers pod either:

[root@lpdkubpoc01a couchbase-autonomous-operator-kubernetes_2.0.1-linux-x86_64]# kubectl logs  calico-kube-controllers-58c67bc699-zg88w -n kube-system --tail 100
[root@lpdkubpoc01a couchbase-autonomous-operator-kubernetes_2.0.1-linux-x86_64]# 
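Since the running containers produce no output at all, two hedged fallbacks are to look at a previous container instance and at the namespace events (standard kubectl options, nothing Calico-specific):

kubectl logs calico-kube-controllers-58c67bc699-zg88w -n kube-system --previous
kubectl get events -n kube-system --sort-by=.metadata.creationTimestamp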

For reference, our cluster has the following IP networks:

Node:     10.22.76.0/23
Cluster:  192.168.0.0/16
Service:  10.96.0.1
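A hedged way to confirm that Calico's IP pool actually matches the 192.168.0.0/16 cluster network above (the first command assumes the Kubernetes datastore is in use, the second assumes calicoctl is installed):

kubectl get ippools.crd.projectcalico.org -o yaml   # spec.cidr should match the cluster network
calicoctl get ippool -o wide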

I am unable to resolve any service names:

root@utils:/# curl https://couchbase-operator-admission.default.svc:443

curl: (6) Could not resolve host: couchbase-operator-admission.default.svc
root@utils:/# 
root@utils:/# nslookup couchbase-operator-admission.default.svc
;; connection timed out; no servers could be reached

root@utils:/# 
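One hedged follow-up is to query the CoreDNS service IP directly, which separates "cannot reach the DNS pods" from a resolv.conf problem (kube-dns is the default Service name created by kubeadm):

kubectl get svc kube-dns -n kube-system -o wide                         # run on the admin node; note the CLUSTER-IP
nslookup kubernetes.default.svc.cluster.local <CLUSTER-IP-from-above>   # run inside the utils pod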

Basically, I am not sure how to troubleshoot this further. Can someone please point me in the right direction?

caseydavenport commented 4 years ago

@satishrao84 this sounds like it might be a more general issue with pod networking rather than just webhooks, since you are also unable to resolve service names via cluster DNS.

I would try to verify which packet paths are working: for example, pod-to-pod on the same node, pod-to-pod across nodes, pod-to-service via the cluster IP, and pod-to-cluster-DNS. Identifying which paths work and which do not will help isolate the cause.
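As a rough sketch of those checks, run from a test pod such as the existing utils pod (the target IPs are placeholders to fill in from kubectl get pods -o wide and kubectl get svc):

ping <pod-IP-of-a-pod-on-the-same-node>                    # pod-to-pod, same node
ping <pod-IP-of-a-pod-on-another-node>                     # pod-to-pod, across nodes
curl -vk https://<cluster-IP-of-the-webhook-service>:443   # pod-to-service via cluster IP, bypassing DNS
nslookup kubernetes.default.svc.cluster.local              # pod-to-cluster-DNS

And from a node, ping the IP of a pod running on that node to check node-to-pod connectivity.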