Closed navinag closed 5 years ago
@ihcsim: Thanks for the response. Please find the answers below.
@navinag Hi, thanks for bringing this up. The CNI
Timeout: request did not complete within requested timeout 30s
error is a bit too general. We will need your help to gather more Linkerd-relevant info. Here are a few things to verify:
- Does your pod come up healthy when the Linkerd proxy isn't injected?
After installing Linkerd in the cluster, pods don't come up even without the proxy injected. Once I uninstall Linkerd from the cluster, everything works fine. I also tried injecting the Linkerd proxy into the same pod in a GKE cluster without private nodes, and everything worked smoothly.
- Is this affecting only a particular workload? Anything about this application that you think will be relevant to a request timeout error (http, grpc, tcp, stateful database etc.)
It's affecting all the pods in the cluster.
- Does your cluster have enough resources to run your workload? Try
kubectl describe node
to check for any memory/cpu pressure
Everything seems fine here.

```
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                   Requests      Limits
  cpu                        4011m (50%)   7623m (96%)
  memory                     5006Mi (18%)  4332Mi (16%)
  ephemeral-storage          0 (0%)        0 (0%)
  attachable-volumes-gce-pd  0             0
```
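For that kind of node-pressure check, one quick way to surface just the pressure conditions from every node (a sketch, not from the thread; assumes a working `kubectl` context):

```shell
# Print each node's conditions; MemoryPressure, DiskPressure and PIDPressure
# should all be False on a healthy node.
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{range .status.conditions[*]}{.type}={.status}{" "}{end}{"\n"}{end}'
```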
- What does
linkerd check [-n ns] --proxy
return? If this hangs, it means some Linkerd proxies didn't come up cleanly
All the checks are good.

```
linkerd check --proxy
√ can initialize the client
√ can query the Kubernetes API
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane components ready
√ no unschedulable pods
√ controller pod is running
√ can initialize the client
√ can query the control plane API
√ control plane pods are ready
√ control plane self-check
√ [kubernetes] control plane can talk to Kubernetes
√ [prometheus] control plane can talk to Prometheus
√ no invalid service profiles
√ can determine the latest version
‼ cli is up-to-date
    is running version 19.5.3 but the latest edge version is 19.5.4
    see https://linkerd.io/checks/#l5d-version-cli for hints
√ data plane namespace exists
√ data plane proxies are ready
√ data plane proxy metrics are present in Prometheus
‼ data plane is up-to-date
    linkerd/linkerd-identity-67b5689989-47nsc: is running version 19.5.3 but the latest edge version is 19.5.4
    see https://linkerd.io/checks/#l5d-data-plane-version for hints
√ data plane and cli versions match

Status check results are √
```
To validate that the issue is not with my service, I created a fresh GKE private cluster and tried installing the demo app. The app comes up, but it's not injecting the linkerd-proxy.
Steps
Created the cluster using this command:

```
gcloud beta container --project "prod-v1" clusters create "standard-cluster-2" --region "us-central1" --no-enable-basic-auth --cluster-version "1.12.7-gke.10" --machine-type "n1-standard-1" --image-type "COS" --disk-type "pd-standard" --disk-size "100" --scopes "https://www.googleapis.com/auth/devstorage.read_only","https://www.googleapis.com/auth/logging.write","https://www.googleapis.com/auth/monitoring","https://www.googleapis.com/auth/servicecontrol","https://www.googleapis.com/auth/service.management.readonly","https://www.googleapis.com/auth/trace.append" --num-nodes "1" --enable-stackdriver-kubernetes --enable-private-nodes --master-ipv4-cidr "172.21.0.0/28" --enable-ip-alias --network "projects/prod-v1/global/networks/default" --subnetwork "projects/prod-v1/regions/us-central1/subnetworks/default" --default-max-pods-per-node "110" --enable-master-authorized-networks --master-authorized-networks 0.0.0.0/0 --addons HorizontalPodAutoscaling,HttpLoadBalancing --enable-autoupgrade --enable-autorepair
```
```
linkerd version
Client version: edge-19.5.4
Server version: edge-19.5.4
```
Followed the steps in getting started.
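(For context, the getting-started guide installs the demo along these lines; the manifest URL may differ by version, so treat this as a sketch:)

```shell
# Fetch the emojivoto manifests, inject the Linkerd proxy sidecar, and apply.
curl -sL https://run.linkerd.io/emojivoto.yml | linkerd inject - | kubectl apply -f -
```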
Output of `kubectl describe pods -n emojivoto emoji`. Note: it has the annotation, but the Linkerd proxy container is missing.
Name: emoji-f6685f6dd-67hr2
Namespace: emojivoto
Priority: 0
PriorityClassName: <none>
Node: gke-standard-cluster-2-default-pool-e926bc11-tx2s/10.128.0.97
Start Time: Sat, 01 Jun 2019 13:34:46 -0700
Labels: app=emoji-svc
pod-template-hash=f6685f6dd
Annotations: linkerd.io/inject: enabled
Status: Running
IP: 10.32.1.11
Controlled By: ReplicaSet/emoji-f6685f6dd
Containers:
emoji-svc:
Container ID: docker://7a425f882f16d640e269d8c60b030440e7b05f0f7524015b1e7944285c9d4e4f
Image: buoyantio/emojivoto-emoji-svc:v8
Image ID: docker-pullable://buoyantio/emojivoto-emoji-svc@sha256:9401c5849955b29f76ef097e22c564afde905ee6b02e66e4d3ef8bcc7d546fd1
Port: 8080/TCP
Host Port: 0/TCP
State: Running
Started: Sat, 01 Jun 2019 13:34:48 -0700
Ready: True
Restart Count: 0
Requests:
cpu: 100m
Environment:
GRPC_PORT: 8080
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from emoji-token-khvj2 (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
emoji-token-khvj2:
Type: Secret (a volume populated by a Secret)
SecretName: emoji-token-khvj2
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 38m default-scheduler Successfully assigned emojivoto/emoji-f6685f6dd-67hr2 to gke-standard-cluster-2-default-pool-e926bc11-tx2s
Normal Pulling 38m kubelet, gke-standard-cluster-2-default-pool-e926bc11-tx2s pulling image "buoyantio/emojivoto-emoji-svc:v8"
Normal Pulled 38m kubelet, gke-standard-cluster-2-default-pool-e926bc11-tx2s Successfully pulled image "buoyantio/emojivoto-emoji-svc:v8"
Normal Created 38m kubelet, gke-standard-cluster-2-default-pool-e926bc11-tx2s Created container
Normal Started 38m kubelet, gke-standard-cluster-2-default-pool-e926bc11-tx2s Started container
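When a pod carries the `linkerd.io/inject: enabled` annotation but no proxy container appears, the proxy-injector admission webhook is the usual place to look. A hypothetical check (the resource names match Linkerd's defaults, but verify them against your own install):

```shell
# Confirm the mutating webhook is registered with the API server.
kubectl get mutatingwebhookconfiguration linkerd-proxy-injector-webhook-config

# Check the injector's logs for admission requests (or the absence of them).
kubectl -n linkerd logs deploy/linkerd-proxy-injector proxy-injector
```

If no admission requests show up in the logs at all, the API server is likely unable to reach the webhook, which is exactly the private-cluster firewall symptom discussed below in this thread.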
To validate the issue is only with GKE private cluster, created one without that configuration and everything worked smoothly.
Command used to create
gcloud beta container --project "prod-v1" clusters create "standard-cluster-3" --region "us-central1" --no-enable-basic-auth --cluster-version "1.12.7-gke.10" --machine-type "n1-standard-1" --image-type "COS" --disk-type "pd-standard" --disk-size "100" --scopes "https://www.googleapis.com/auth/devstorage.read_only","https://www.googleapis.com/auth/logging.write","https://www.googleapis.com/auth/monitoring","https://www.googleapis.com/auth/servicecontrol","https://www.googleapis.com/auth/service.management.readonly","https://www.googleapis.com/auth/trace.append" --num-nodes "1" --enable-stackdriver-kubernetes --enable-ip-alias --network "projects/prod-v1/global/networks/default" --subnetwork "projects/prod-v1/regions/us-central1/subnetworks/default" --default-max-pods-per-node "110" --addons HorizontalPodAutoscaling,HttpLoadBalancing --enable-autoupgrade --enable-autorepair
Thanks for the additional details. I am able to reproduce this problem with a GKE private cluster (1.12-gke, 1.13-gke), when network policy is enabled. Specifically, using a single-node GKE private cluster with alias IP and network policy enabled, all injected and uninjected workloads got stuck with the same CNI timeout error after the Linkerd control plane was installed. Without the Linkerd control plane everything seems to work fine. Does this resemble what you are seeing?
If you aren't deploying any network policies, the workaround for now is to disable it on your cluster.
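The workaround above can be applied to an existing cluster along these lines (a sketch; the cluster name and region are placeholders from this thread's examples, and GKE recreates node pools when enforcement is toggled):

```shell
# Disable network policy enforcement on the nodes, then disable the addon.
gcloud container clusters update standard-cluster-2 --region us-central1 \
    --no-enable-network-policy
gcloud container clusters update standard-cluster-2 --region us-central1 \
    --update-addons NetworkPolicy=DISABLED
```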
At the moment, there aren't any useful logs or events in the kube-system
namespace (or elsewhere) to help locate the problem. I'm gonna mark this issue for triage, and we will prioritise from there.
Also, for the record, because private clusters only pull images from GCR and its Docker Hub mirrors, I had to change the Linkerd Prometheus image to one that I manually pushed to GCR.
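Mirroring an image into GCR for a private cluster looks roughly like this (a sketch; the project ID and Prometheus tag are placeholders, not taken from the thread):

```shell
# Pull the upstream image, retag it under your GCR project, and push it.
docker pull prom/prometheus:v2.10.0
docker tag prom/prometheus:v2.10.0 gcr.io/prod-v1/prometheus:v2.10.0
docker push gcr.io/prod-v1/prometheus:v2.10.0
```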
Thanks for the additional details. I am able to reproduce this problem with a GKE private cluster (1.12-gke, 1.13-gke), when network policy is enabled. Specifically, using a single-node GKE private cluster with alias IP and network policy enabled, all injected and uninjected workloads got stuck with the same CNI timeout error after the Linkerd control plane was installed. Without the Linkerd control plane everything seems to work fine. Does this resemble what you are seeing?
Yes, this resembles what I am seeing.
If you aren't deploying any network policies, the workaround for now is to disable it on your cluster.
I tried creating one without network policy; see the command above. With that, the pods come up, but the Linkerd proxy doesn't get injected.
@navinag can you get the kubelet logs for us? It should have something around the CNI setup specifically. As CNI runs before we do anything, it is particularly confusing that you're having issues there. I've run into issues with GKE 1.13 and network policy, maybe you're running into something along those lines?
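One way to pull kubelet logs from a GKE node running the COS image (a sketch; the node name comes from the describe output above, but the zone is an assumption):

```shell
# SSH to the node and dump the most recent kubelet journal entries,
# which should include the CNI sandbox setup errors.
gcloud compute ssh gke-standard-cluster-2-default-pool-e926bc11-tx2s \
    --zone us-central1-a -- sudo journalctl -u kubelet --no-pager | tail -n 200
```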
Folks, please see also https://github.com/linkerd/linkerd2/issues/2940
@navinag Can you try this again on a private GKE cluster, but making sure that the firewall rule between your GKE master and nodes is updated to whitelist the proxy injector's 8443 port, as described in https://github.com/linkerd/linkerd2/issues/2940#issuecomment-502950606? This ensures that the k8s api server can send its admission requests to the proxy injector (at the alias/secondary IP range). I just tested a private GKE 1.13 cluster and Linkerd edge-19.6.2, with network policy enabled; it seems to work for me.
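A sketch of that firewall change: find the auto-created master-to-node rule for your cluster and add port 8443 to it. The name filter and rule name below are assumptions; verify them against your own project.

```shell
# Locate the auto-created GKE master firewall rule for the cluster.
gcloud compute firewall-rules list --filter="name~gke-standard-cluster-2"

# Add the proxy injector's 8443 port alongside the existing kubelet/API ports.
gcloud compute firewall-rules update <rule-name> --allow tcp:10250,tcp:443,tcp:8443
```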
@navinag any update?
I just re-tested this using Linkerd edge-19.7.1 on a private GKE 1.13.7-gke.8 cluster, with alias IP and network policy enabled. After updating the GCP firewall rule per https://github.com/linkerd/linkerd2/issues/2875#issuecomment-504597957, everything seems to be working. I recommend either closing this issue or moving it out of the 2.4 - Release project.
I'll be closing this out, please reopen if it still isn't working for you!
Forgot to comment before: adding the firewall rules resolved the issue. Also, this documentation is really helpful: https://linkerd.io/2/reference/cluster-configuration/#private-clusters
I have a GKE private cluster. Following the steps here and using the edge release, after injecting the Linkerd proxy, pods get stuck in the ContainerCreating state.
Exception while creating the pod:

```
Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "5fa2ed13e05ff37acb63c8784a617142e11d929d9241b52ff2dd19df02d0c34f" network for pod "search-85d9cf767d-2fx6h": NetworkPlugin cni failed to set up pod "search-85d9cf767d-2fx6h_default" network: Timeout: request did not complete within requested timeout 30s
```
```
linkerd version
Client version: edge-19.5.3
Server version: edge-19.5.3

kubectl version --short
Client Version: v1.12.7
Server Version: v1.12.7-gke.17
```
```
linkerd check

kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API

kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version

linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist

linkerd-existence
-----------------
√ control plane components ready
√ no unschedulable pods
√ controller pod is running
√ can initialize the client
√ can query the control plane API

linkerd-api
-----------
√ control plane pods are ready
√ control plane self-check
√ [kubernetes] control plane can talk to Kubernetes
√ [prometheus] control plane can talk to Prometheus
√ no invalid service profiles

linkerd-version
---------------
√ can determine the latest version
√ cli is up-to-date

control-plane-version
---------------------
√ control plane is up-to-date
√ control plane and cli versions match

Status check results are √
```