linkerd / linkerd2

Ultralight, security-first service mesh for Kubernetes. Main repo for Linkerd 2.x.
https://linkerd.io
Apache License 2.0

GKE private cluster pods stuck in ContainerCreating #2875

Closed navinag closed 5 years ago

navinag commented 5 years ago

I have a GKE private cluster. Following the steps here and using the edge release, after injecting the linkerd proxy, pods get stuck in the ContainerCreating state.

Exception while creating pod

Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "5fa2ed13e05ff37acb63c8784a617142e11d929d9241b52ff2dd19df02d0c34f" network for pod "search-85d9cf767d-2fx6h": NetworkPlugin cni failed to set up pod "search-85d9cf767d-2fx6h_default" network: Timeout: request did not complete within requested timeout 30s

linkerd version
Client version: edge-19.5.3
Server version: edge-19.5.3

kubectl version --short
Client Version: v1.12.7
Server Version: v1.12.7-gke.17

linkerd check

kubernetes-api
√ can initialize the client
√ can query the Kubernetes API

kubernetes-version
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version

linkerd-config
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist

linkerd-existence
√ control plane components ready
√ no unschedulable pods
√ controller pod is running
√ can initialize the client
√ can query the control plane API

linkerd-api
√ control plane pods are ready
√ control plane self-check
√ [kubernetes] control plane can talk to Kubernetes
√ [prometheus] control plane can talk to Prometheus
√ no invalid service profiles

linkerd-version
√ can determine the latest version
√ cli is up-to-date

control-plane-version
√ control plane is up-to-date
√ control plane and cli versions match

Status check results are √

ihcsim commented 5 years ago

@navinag Hi, thanks for bringing this up. The CNI error "Timeout: request did not complete within requested timeout 30s" is a bit too general. We will need your help to gather more Linkerd-relevant info. Here are a few things to verify:

  1. Does your pod come up healthy when the Linkerd proxy isn't injected?
  2. Is this affecting only a particular workload? Anything about this application that might be relevant to a request timeout error will help (HTTP, gRPC, TCP, stateful database, etc.)
  3. Does your cluster have enough resources to run your workload? Try kubectl describe node to check for any memory/cpu pressure
  4. What does linkerd check [-n ns] --proxy return? If this hangs, it means some Linkerd proxies didn't come up cleanly
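
For reference, these can be checked with roughly the following commands (a sketch only; the namespace and pod name are placeholders):

# Is the uninjected workload healthy, and what events does the stuck pod show?
kubectl -n <ns> get pods
kubectl -n <ns> describe pod <pod-name>

# Any memory/CPU pressure on the nodes?
kubectl describe nodes | grep -A 8 "Allocated resources"

# Did the injected proxies come up cleanly?
linkerd check -n <ns> --proxy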
navinag commented 5 years ago

@ihcsim : Thanks for the response. Please find the answers below.

@navinag Hi, thanks for bringing this up. The CNI error "Timeout: request did not complete within requested timeout 30s" is a bit too general. We will need your help to gather more Linkerd-relevant info. Here are a few things to verify:

  1. Does your pod come up healthy when the Linkerd proxy isn't injected?

After installing Linkerd in the cluster, pods don't come up even without the proxy injected. Once I uninstall Linkerd from the cluster, everything works fine again. I also tried injecting the Linkerd proxy into the same pod in a non-private GKE cluster, and everything worked smoothly.

  2. Is this affecting only a particular workload? Anything about this application that might be relevant to a request timeout error will help (HTTP, gRPC, TCP, stateful database, etc.)

It's affecting all the pods in the cluster.

  3. Does your cluster have enough resources to run your workload? Try kubectl describe node to check for any memory/cpu pressure

Everything seems fine here.

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                   Requests      Limits
  --------                   --------      ------
  cpu                        4011m (50%)   7623m (96%)
  memory                     5006Mi (18%)  4332Mi (16%)
  ephemeral-storage          0 (0%)        0 (0%)
  attachable-volumes-gce-pd  0             0

  4. What does linkerd check [-n ns] --proxy return? If this hangs, it means some Linkerd proxies didn't come up cleanly

All the checks are good.

linkerd check --proxy

kubernetes-api
√ can initialize the client
√ can query the Kubernetes API

kubernetes-version
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version

linkerd-config
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist

linkerd-existence
√ control plane components ready
√ no unschedulable pods
√ controller pod is running
√ can initialize the client
√ can query the control plane API

linkerd-api
√ control plane pods are ready
√ control plane self-check
√ [kubernetes] control plane can talk to Kubernetes
√ [prometheus] control plane can talk to Prometheus
√ no invalid service profiles

linkerd-version
√ can determine the latest version
‼ cli is up-to-date
    is running version 19.5.3 but the latest edge version is 19.5.4
    see https://linkerd.io/checks/#l5d-version-cli for hints

linkerd-data-plane
√ data plane namespace exists
√ data plane proxies are ready
√ data plane proxy metrics are present in Prometheus
‼ data plane is up-to-date
    linkerd/linkerd-identity-67b5689989-47nsc: is running version 19.5.3 but the latest edge version is 19.5.4
    see https://linkerd.io/checks/#l5d-data-plane-version for hints
√ data plane and cli versions match

Status check results are √

navinag commented 5 years ago

To validate that the issue is not with my service, I created a fresh GKE private cluster and tried installing the demo app. The app comes up, but the linkerd-proxy is not being injected.

Steps

Created the cluster using this command:

gcloud beta container --project "prod-v1" clusters create "standard-cluster-2" --region "us-central1" --no-enable-basic-auth --cluster-version "1.12.7-gke.10" --machine-type "n1-standard-1" --image-type "COS" --disk-type "pd-standard" --disk-size "100" --scopes "https://www.googleapis.com/auth/devstorage.read_only","https://www.googleapis.com/auth/logging.write","https://www.googleapis.com/auth/monitoring","https://www.googleapis.com/auth/servicecontrol","https://www.googleapis.com/auth/service.management.readonly","https://www.googleapis.com/auth/trace.append" --num-nodes "1" --enable-stackdriver-kubernetes --enable-private-nodes --master-ipv4-cidr "172.21.0.0/28" --enable-ip-alias --network "projects/prod-v1/global/networks/default" --subnetwork "projects/prod-v1/regions/us-central1/subnetworks/default" --default-max-pods-per-node "110" --enable-master-authorized-networks --master-authorized-networks 0.0.0.0/0 --addons HorizontalPodAutoscaling,HttpLoadBalancing --enable-autoupgrade --enable-autorepair

linkerd version
Client version: edge-19.5.4
Server version: edge-19.5.4

Followed the steps in getting started.

Output of kubectl describe pods -n emojivoto emoji

Note: it has the linkerd.io/inject: enabled annotation, but the linkerd-proxy container is missing.

Name:               emoji-f6685f6dd-67hr2
Namespace:          emojivoto
Priority:           0
PriorityClassName:  <none>
Node:               gke-standard-cluster-2-default-pool-e926bc11-tx2s/10.128.0.97
Start Time:         Sat, 01 Jun 2019 13:34:46 -0700
Labels:             app=emoji-svc
                    pod-template-hash=f6685f6dd
Annotations:        linkerd.io/inject: enabled
Status:             Running
IP:                 10.32.1.11
Controlled By:      ReplicaSet/emoji-f6685f6dd
Containers:
  emoji-svc:
    Container ID:   docker://7a425f882f16d640e269d8c60b030440e7b05f0f7524015b1e7944285c9d4e4f
    Image:          buoyantio/emojivoto-emoji-svc:v8
    Image ID:       docker-pullable://buoyantio/emojivoto-emoji-svc@sha256:9401c5849955b29f76ef097e22c564afde905ee6b02e66e4d3ef8bcc7d546fd1
    Port:           8080/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Sat, 01 Jun 2019 13:34:48 -0700
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:  100m
    Environment:
      GRPC_PORT:  8080
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from emoji-token-khvj2 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  emoji-token-khvj2:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  emoji-token-khvj2
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type    Reason     Age   From                                                        Message
  ----    ------     ----  ----                                                        -------
  Normal  Scheduled  38m   default-scheduler                                           Successfully assigned emojivoto/emoji-f6685f6dd-67hr2 to gke-standard-cluster-2-default-pool-e926bc11-tx2s
  Normal  Pulling    38m   kubelet, gke-standard-cluster-2-default-pool-e926bc11-tx2s  pulling image "buoyantio/emojivoto-emoji-svc:v8"
  Normal  Pulled     38m   kubelet, gke-standard-cluster-2-default-pool-e926bc11-tx2s  Successfully pulled image "buoyantio/emojivoto-emoji-svc:v8"
  Normal  Created    38m   kubelet, gke-standard-cluster-2-default-pool-e926bc11-tx2s  Created container
  Normal  Started    38m   kubelet, gke-standard-cluster-2-default-pool-e926bc11-tx2s  Started container
navinag commented 5 years ago

To validate that the issue occurs only with a GKE private cluster, I created one without that configuration, and everything worked smoothly.

Command used to create

gcloud beta container --project "prod-v1" clusters create "standard-cluster-3" --region "us-central1" --no-enable-basic-auth --cluster-version "1.12.7-gke.10" --machine-type "n1-standard-1" --image-type "COS" --disk-type "pd-standard" --disk-size "100" --scopes "https://www.googleapis.com/auth/devstorage.read_only","https://www.googleapis.com/auth/logging.write","https://www.googleapis.com/auth/monitoring","https://www.googleapis.com/auth/servicecontrol","https://www.googleapis.com/auth/service.management.readonly","https://www.googleapis.com/auth/trace.append" --num-nodes "1" --enable-stackdriver-kubernetes --enable-ip-alias --network "projects/prod-v1/global/networks/default" --subnetwork "projects/prod-v1/regions/us-central1/subnetworks/default" --default-max-pods-per-node "110" --addons HorizontalPodAutoscaling,HttpLoadBalancing --enable-autoupgrade --enable-autorepair

ihcsim commented 5 years ago

Thanks for the additional details. I am able to reproduce this problem with a GKE private cluster (1.12-gke, 1.13-gke), when network policy is enabled. Specifically, using a single-node GKE private cluster with alias IP and network policy enabled, all injected and uninjected workloads got stuck with the same CNI timeout error after the Linkerd control plane was installed. Without the Linkerd control plane everything seems to work fine. Does this resemble what you are seeing?

If you aren't deploying any network policies, the workaround for now is to disable network policy on your cluster.
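
For reference, disabling network policy on an existing GKE cluster is roughly the two-step update below; the cluster name and region are placeholders taken from the commands above, so double-check the exact steps against the GKE docs before running it.

# Disable network policy enforcement on the nodes (recreates node pools)
gcloud container clusters update standard-cluster-2 \
  --region us-central1 \
  --no-enable-network-policy

# Then disable the network policy add-on itself
gcloud container clusters update standard-cluster-2 \
  --region us-central1 \
  --update-addons NetworkPolicy=DISABLED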

At the moment, there aren't any useful logs or events in the kube-system namespace (or elsewhere) to help locate the problem. I'm gonna mark this issue for triage, and we will prioritise from there.

Also, for the record, because private clusters only pull images from GCR and its dockerhub mirrors, I have to change the Linkerd Prometheus image to one that I manually push to GCR.
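
One way to do that, sketched roughly below, is to mirror the image into GCR and rewrite the image reference in the rendered manifests before applying. The GCR path and tag here are placeholders, not the actual image I used; check the output of linkerd install for the exact Prometheus image your version renders.

# Mirror the upstream Prometheus image into GCR (tag must match what linkerd install renders)
docker pull prom/prometheus:<tag>
docker tag prom/prometheus:<tag> gcr.io/<project>/prometheus:<tag>
docker push gcr.io/<project>/prometheus:<tag>

# Rewrite the image reference in the rendered manifests, then apply
linkerd install \
  | sed 's|prom/prometheus|gcr.io/<project>/prometheus|' \
  | kubectl apply -f -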

navinag commented 5 years ago

Thanks for the additional details. I am able to reproduce this problem with a GKE private cluster (1.12-gke, 1.13-gke), when network policy is enabled. Specifically, using a single-node GKE private cluster with alias IP and network policy enabled, all injected and uninjected workloads got stuck with the same CNI timeout error after the Linkerd control plane was installed. Without the Linkerd control plane everything seems to work fine. Does this resemble what you are seeing?

Yes, this resembles what I am seeing.

If you aren't deploying any network policies, the workaround for now is to disable it on your cluster.

I tried creating one without network policy (see the command above). With this, the pods come up, but the linkerd proxy doesn't get injected.

At the moment, there aren't any useful logs or events in the kube-system namespace (or elsewhere) to help locate the problem. I'm gonna mark this issue for triage, and we will prioritise from there.

Also, for the record, because private clusters only pull images from GCR and its dockerhub mirrors, I have to change the Linkerd Prometheus image to one that I manually push to GCR.

grampelberg commented 5 years ago

@navinag can you get the kubelet logs for us? It should have something around the CNI setup specifically. As CNI runs before we do anything, it is particularly confusing that you're having issues there. I've run into issues with GKE 1.13 and network policy, maybe you're running into something along those lines?
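
For anyone following along: on a COS node the kubelet runs under systemd, so its logs can usually be pulled roughly like this. The zone is a placeholder, and on a private cluster you may need to SSH over the internal IP or an IAP tunnel.

# SSH to the node and grep the kubelet journal for CNI errors
gcloud compute ssh gke-standard-cluster-2-default-pool-e926bc11-tx2s \
  --zone us-central1-a \
  -- 'sudo journalctl -u kubelet --no-pager | grep -i cni'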

bzon commented 5 years ago

Folks, please see also https://github.com/linkerd/linkerd2/issues/2940

ihcsim commented 5 years ago

@navinag Can you try this again on a private GKE cluster, making sure that the firewall rule between your GKE master and nodes is updated to whitelist the proxy injector's port 8443, as described in https://github.com/linkerd/linkerd2/issues/2940#issuecomment-502950606? This ensures that the k8s API server can send its admission requests to the proxy injector (at the alias/secondary IP range). I just tested a private GKE 1.13 cluster with Linkerd edge-19.6.2 and network policy enabled; it seems to work for me.
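
Roughly, the firewall update looks like the following. The rule name here is a placeholder: on GKE the auto-created master-to-node rule is usually named gke-<cluster-name>-<hash>-master, and you should keep whichever ports it already allows.

# Find the auto-created master -> nodes firewall rule
gcloud compute firewall-rules list --filter "name~gke-standard-cluster-2.*master"

# Add 8443 (the proxy injector's webhook port), keeping the defaults GKE opened
gcloud compute firewall-rules update gke-standard-cluster-2-xxxx-master \
  --allow tcp:10250,tcp:443,tcp:8443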

siggy commented 5 years ago

@navinag any update?

ihcsim commented 5 years ago

I just re-tested this using Linkerd edge-19.7.1 on a private GKE 1.13.7-gke.8 cluster, with alias IP and network policy enabled. After updating the GCP firewall rule per https://github.com/linkerd/linkerd2/issues/2875#issuecomment-504597957, everything seems to be working. Recommend either closing this issue or moving it out of the 2.4 - Release project.

grampelberg commented 5 years ago

I'll be closing this out, please reopen if it still isn't working for you!

navinag commented 4 years ago

Forgot to comment before: adding the firewall rules resolved the issue. Also, this documentation is really helpful: https://linkerd.io/2/reference/cluster-configuration/#private-clusters