kubernetes / autoscaler

Autoscaling components for Kubernetes
Apache License 2.0

VPA continuously restarting PODs for hamster app #2789

Closed GreyOBox closed 4 years ago

GreyOBox commented 4 years ago

Kubernetes:

kubectl version
Client Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.10", GitCommit:"e3c134023df5dea457638b614ee17ef234dc34a6", GitTreeState:"clean", BuildDate:"2019-07-08T03:50:59Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.9-eks-c0eccc", GitCommit:"c0eccca51d7500bb03b2f163dd8d534ffeb2f7a2", GitTreeState:"clean", BuildDate:"2019-12-22T23:14:11Z", GoVersion:"go1.12.12", Compiler:"gc", Platform:"linux/amd64"}

VPA: Fresh master (0.6.3)

Deployed to Kubernetes as described in the docs; all VPA pods are running, the service is OK, and there are no errors in the logs. Deployed the example from the docs: kubectl create -f examples/hamster.yaml
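
For reference, the example boils down to roughly this (an abridged sketch of examples/hamster.yaml from memory, not verbatim): a two-replica CPU-burner Deployment with small requests, plus a VPA targeting it:

apiVersion: autoscaling.k8s.io/v1beta2
kind: VerticalPodAutoscaler
metadata:
  name: hamster-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: hamster
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hamster
spec:
  replicas: 2
  selector:
    matchLabels:
      app: hamster
  template:
    metadata:
      labels:
        app: hamster
    spec:
      containers:
        - name: hamster
          image: k8s.gcr.io/ubuntu-slim:0.1
          resources:
            requests:
              cpu: 100m     # the baseline requests VPA is expected to raise
              memory: 50Mi
          command: ["/bin/sh"]
          args:
            - "-c"
            - "while true; do timeout 0.5s yes >/dev/null; sleep 0.5s; done"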

For the first few hours kubectl describe vpa shows an empty Status and Events. In the vpa-recommender logs I see only errors like:

E0129 12:02:44.091276       1 recommender.go:124] Cannot update VPA hamster-vpa object. Reason: Timeout: request did not complete within requested timeout 30s
E0129 12:01:12.380558       1 cluster_feeder.go:476] Error while fetching legacy selector. Reason: v1beta1 selector not found

No errors in other VPA logs.

After a few hours kubectl describe vpa shows the following:

Name:         hamster-vpa
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  autoscaling.k8s.io/v1beta2
Kind:         VerticalPodAutoscaler
Metadata:
  Creation Timestamp:  2020-01-30T14:50:36Z
  Generation:          11
  Resource Version:    10475917
  Self Link:           /apis/autoscaling.k8s.io/v1beta2/namespaces/default/verticalpodautoscalers/hamster-vpa
  UID:                 e005a60a-436f-11ea-9a97-06752d6e751a
Spec:
  Target Ref:
    API Version:  apps/v1
    Kind:         Deployment
    Name:         hamster
Status:
  Conditions:
    Last Transition Time:  2020-01-30T16:15:37Z
    Status:                True
    Type:                  RecommendationProvided
  Recommendation:
    Container Recommendations:
      Container Name:  hamster
      Lower Bound:
        Cpu:     585m
        Memory:  262144k
      Target:
        Cpu:     587m
        Memory:  262144k
      Uncapped Target:
        Cpu:     587m
        Memory:  262144k
      Upper Bound:
        Cpu:     1195m
        Memory:  262144k
Events:          <none>

and VPA starts restarting the hamster pods continuously (a restart every few minutes):

kubectl get pods | grep hamster
hamster-c7d89d6db-r67hw                                           1/1     Running            0          2m44s
hamster-c7d89d6db-z6nv8                                           1/1     Terminating        0          44s

Checked the FAQ entry about VPA restarting pods; everything there looks fine on my side.
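
Side note in case it helps anyone else hitting the same loop: as far as I understand the docs, evictions can be paused without deleting the VPA object by switching the update mode to "Off", which keeps recommendations coming but stops the updater from evicting pods. A minimal sketch, assuming the same v1beta2 API as above:

apiVersion: autoscaling.k8s.io/v1beta2
kind: VerticalPodAutoscaler
metadata:
  name: hamster-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: hamster
  updatePolicy:
    updateMode: "Off"   # recommendations only; the updater stops evicting pods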

Enabled debug logs:

I0130 14:52:09.577009       1 round_trippers.go:438] PATCH https://10.100.0.1:443/apis/autoscaling.k8s.io/v1beta2/namespaces/default/verticalpodautoscalers/hamster-vpa 504 Gateway Timeout in 30001 milliseconds
I0130 14:52:09.577037       1 round_trippers.go:444] Response Headers:
I0130 14:52:09.577047       1 round_trippers.go:447]     Audit-Id: df17f0fa-f7f8-4ea5-bda5-1ce34d83cbcf
I0130 14:52:09.577057       1 round_trippers.go:447]     Content-Type: application/json
I0130 14:52:09.577066       1 round_trippers.go:447]     Content-Length: 187
I0130 14:52:09.577074       1 round_trippers.go:447]     Date: Thu, 30 Jan 2020 14:52:09 GMT
I0130 14:52:09.577131       1 request.go:942] Response Body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Timeout: request did not complete within requested timeout 30s","reason":"Timeout","details":{},"code":504}
E0130 14:52:09.577616       1 recommender.go:124] Cannot update VPA hamster-vpa object. Reason: Timeout: request did not complete within requested timeout 30s

but the VPA service is available. All other requests in the log show HTTP 200, like this:

I0130 14:51:52.883615       1 round_trippers.go:419] curl -k -v -XGET  -H "User-Agent: recommender/v0.0.0 (linux/amd64) kubernetes/$Format" -H "Accept: application/json, */*" -H "Authorization: Bearer XXX...XXX" 'https://10.100.0.1:443/apis/apps/v1/deployments?resourceVersion=10275675&timeout=7m20s&timeoutSeconds=440&watch=true'
I0130 14:51:52.885154       1 round_trippers.go:438] GET https://10.100.0.1:443/apis/apps/v1/deployments?resourceVersion=10275675&timeout=7m20s&timeoutSeconds=440&watch=true 200 OK in 1 milliseconds

Also:

I0130 14:52:09.581723       1 round_trippers.go:438] PATCH https://10.100.0.1:443/apis/autoscaling.k8s.io/v1beta2/namespaces/default/verticalpodautoscalercheckpoints/hamster-vpa-hamster 404 Not Found in 3 milliseconds

No idea why there's hamster-vpa-hamster instead of hamster-vpa.

The vpa-updater logs contain repeated lines like:

I0131 14:05:32.988354       1 request.go:942] Response Body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Success","code":201}
I0131 14:05:32.988723       1 updater.go:205] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"hamster-c7d89d6db-8zmh2", UID:"894076df-4432-11ea-bff3-027400fb9bbc", APIVersion:"v1", ResourceVersion:"10503734", FieldPath:""}): type: 'Normal' reason: 'EvictedByVPA' Pod was evicted by VPA Updater to apply resource recommendation.
I0131 14:05:32.988756       1 request.go:942] Request Body: {"kind":"Event","apiVersion":"v1","metadata":{"name":"hamster-c7d89d6db-8zmh2.15eefdc2f28e8b34","namespace":"default","creationTimestamp":null},"involvedObject":{"kind":"Pod","namespace":"default","name":"hamster-c7d89d6db-8zmh2","uid":"894076df-4432-11ea-bff3-027400fb9bbc","apiVersion":"v1","resourceVersion":"10503734"},"reason":"EvictedByVPA","message":"Pod was evicted by VPA Updater to apply resource recommendation.","source":{"component":"vpa-updater"},"firstTimestamp":"2020-01-31T14:05:32Z","lastTimestamp":"2020-01-31T14:05:32Z","count":1,"type":"Normal","eventTime":null,"reportingComponent":"","reportingInstance":""}
I0131 14:05:32.988868       1 round_trippers.go:419] curl -k -v -XPOST  -H "Content-Type: application/json" -H "User-Agent: updater/v0.0.0 (linux/amd64) kubernetes/$Format" -H "Authorization: Bearer XXX.XXX" -H "Accept: application/json, */*" 'https://10.100.0.1:443/api/v1/namespaces/default/events'
I0131 14:05:32.994305       1 round_trippers.go:438] POST https://10.100.0.1:443/api/v1/namespaces/default/events 201 Created in 5 milliseconds
I0131 14:05:32.994334       1 round_trippers.go:444] Response Headers:
I0131 14:05:32.994347       1 round_trippers.go:447]     Content-Type: application/json
I0131 14:05:32.994358       1 round_trippers.go:447]     Content-Length: 811
I0131 14:05:32.994370       1 round_trippers.go:447]     Date: Fri, 31 Jan 2020 14:05:32 GMT
I0131 14:05:32.994448       1 request.go:942] Response Body: {"kind":"Event","apiVersion":"v1","metadata":{"name":"hamster-c7d89d6db-8zmh2.15eefdc2f28e8b34","namespace":"default","selfLink":"/api/v1/namespaces/default/events/hamster-c7d89d6db-8zmh2.15eefdc2f28e8b34","uid":"bee42e90-4432-11ea-9a97-06752d6e751a","resourceVersion":"10503945","creationTimestamp":"2020-01-31T14:05:32Z"},"involvedObject":{"kind":"Pod","namespace":"default","name":"hamster-c7d89d6db-8zmh2","uid":"894076df-4432-11ea-bff3-027400fb9bbc","apiVersion":"v1","resourceVersion":"10503734"},"reason":"EvictedByVPA","message":"Pod was evicted by VPA Updater to apply resource recommendation.","source":{"component":"vpa-updater"},"firstTimestamp":"2020-01-31T14:05:32Z","lastTimestamp":"2020-01-31T14:05:32Z","count":1,"type":"Normal","eventTime":null,"reportingComponent":"","reportingInstance":""}

(the POD name changes every time)

Nothing interesting in the vpa-admission-controller logs.

And VPA doesn't update the resources for the hamster PODs:

kubectl describe pods hamster-c7d89d6db-hzxrv
...
    Requests:
      cpu:        100m
      memory:     50Mi
...
bskiba commented 4 years ago

The log from the recommender seems to indicate an unhealthy (overloaded?) API server.

Cannot update VPA hamster-vpa object. Reason: Timeout: request did not complete within requested timeout 30s

As for continuously updating the pods, this happens if the admission controller is not working or not being called by the API server. Do the admission-controller logs contain lines saying it's actually updating the pod spec?
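
For example (a rough sketch; if I remember correctly the admission controller adds a vpaUpdates annotation to every pod it patches, so an empty value means the webhook never touched the pod):

# any hamster-related activity in the admission controller?
kubectl -n kube-system logs deploy/vpa-admission-controller | grep -i hamster

# pods the webhook actually patched should carry the vpaUpdates annotation
kubectl get pods -l app=hamster \
  -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.metadata.annotations.vpaUpdates}{"\n"}{end}'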

Is this a hosted Kubernetes or do you have access to the master logs?

GreyOBox commented 4 years ago

The log from the recommender seems to indicate an unhealthy (overloaded?) API server.

But as far as I understand, all other requests to the API are fine:

I0130 14:51:52.883615       1 round_trippers.go:419] curl -k -v -XGET  -H "User-Agent: recommender/v0.0.0 (linux/amd64) kubernetes/$Format" -H "Accept: application/json, */*" -H "Authorization: Bearer XXX...XXX" 'https://10.100.0.1:443/apis/apps/v1/deployments?resourceVersion=10275675&timeout=7m20s&timeoutSeconds=440&watch=true'
I0130 14:51:52.885154       1 round_trippers.go:438] GET https://10.100.0.1:443/apis/apps/v1/deployments?resourceVersion=10275675&timeout=7m20s&timeoutSeconds=440&watch=true 200 OK in 1 milliseconds

Do the admission-controller logs contain lines saying it's actually updating the pod spec?

I see just one request related to hamster in the logs:

I0203 05:33:28.514971       1 round_trippers.go:438] GET https://10.100.0.1:443/apis/autoscaling.k8s.io/v1beta1/verticalpodautoscalers?limit=500&resourceVersion=0 200 OK in 1 milliseconds
I0203 05:33:28.514993       1 round_trippers.go:444] Response Headers:
I0203 05:33:28.515005       1 round_trippers.go:447]     Audit-Id: d5346770-ddda-43ba-a4f6-3812a29d6c87
I0203 05:33:28.515014       1 round_trippers.go:447]     Content-Type: application/json
I0203 05:33:28.515024       1 round_trippers.go:447]     Content-Length: 1045
I0203 05:33:28.515034       1 round_trippers.go:447]     Date: Mon, 03 Feb 2020 05:33:28 GMT
I0203 05:33:28.515086       1 request.go:942] Response Body: {"apiVersion":"autoscaling.k8s.io/v1beta1","items":[{"apiVersion":"autoscaling.k8s.io/v1beta1","kind":"VerticalPodAutoscaler","metadata":{"creationTimestamp":"2020-01-30T14:50:36Z","generation":55,"name":"hamster-vpa","namespace":"default","resourceVersion":"11079077","selfLink":"/apis/autoscaling.k8s.io/v1beta1/namespaces/default/verticalpodautoscalers/hamster-vpa","uid":"e005a60a-436f-11ea-9a97-06752d6e751a"},"spec":{"targetRef":{"apiVersion":"apps/v1","kind":"Deployment","name":"hamster"}},"status":{"conditions":[{"lastTransitionTime":"2020-01-30T16:15:37Z","status":"True","type":"RecommendationProvided"}],"recommendation":{"containerRecommendations":[{"containerName":"hamster","lowerBound":{"cpu":"586m","memory":"262144k"},"target":{"cpu":"587m","memory":"262144k"},"uncappedTarget":{"cpu":"587m","memory":"262144k"},"upperBound":{"cpu":"748m","memory":"262144k"}}]}}}],"kind":"VerticalPodAutoscalerList","metadata":{"continue":"","resourceVersion":"11098746","selfLink":"/apis/autoscaling.k8s.io/v1beta1/verticalpodautoscalers"}}

Is this a hosted Kubernetes or do you have access to the master logs?

It's Amazon EKS. I can see many messages like the ones below in the kube-apiserver logs in CloudWatch:

W0202 08:27:02.991885       1 dispatcher.go:70] Failed calling webhook, failing open vpa.k8s.io: failed calling webhook "vpa.k8s.io": Post https://vpa-webhook.kube-system.svc:443/?timeout=30s: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
E0202 08:27:02.991922       1 dispatcher.go:71] failed calling webhook "vpa.k8s.io": Post https://vpa-webhook.kube-system.svc:443/?timeout=30s: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
I0202 08:27:06.672272       1 trace.go:81] Trace[429795704]: "Patch /apis/autoscaling.k8s.io/v1beta2/namespaces/default/verticalpodautoscalers/hamster-vpa" (started: 2020-02-02 08:26:36.671909863 +0000 UTC m=+3406838.753779259) (total time: 30.000336994s):
Trace[429795704]: [30.000336994s] [29.999949547s] END

From a POD inside the cluster I can send a request to this service:

curl -vvv --insecure https://vpa-webhook.kube-system.svc:443/?timeout=30s -XPOST
*   Trying 10.100.237.112...
> POST /?timeout=30s HTTP/2
> Host: vpa-webhook.kube-system.svc
> User-Agent: curl/7.58.0
> Accept: */*
>
* Connection state changed (MAX_CONCURRENT_STREAMS updated)!
< HTTP/2 200
< content-length: 0
< date: Mon, 03 Feb 2020 08:38:42 GMT
<
* Connection #0 to host vpa-webhook.kube-system.svc left intact
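
For completeness, these are the other checks I ran on the webhook registration and the service endpoints (assuming the MutatingWebhookConfiguration that the admission controller self-registers is named vpa-webhook-config); nothing looked off:

# clientConfig should point at the vpa-webhook service in kube-system
kubectl get mutatingwebhookconfiguration vpa-webhook-config -o yaml

# and the service should have a ready endpoint behind it
kubectl -n kube-system get endpoints vpa-webhook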
bskiba commented 4 years ago

Ah right, the failed VPA object update is due to the API server being unable to call the webhook.

The setup seems fine from my point of view, so there is something EKS-specific in play, which I am unfortunately not familiar with. #1547 seems similar, but I understand from your report that the admission controller is already listening on port 443?

GreyOBox commented 4 years ago

yes, it's listening and available via service:

~$ kubectl get pods -n kube-system | grep vpa
vpa-admission-controller-5dd89f888f-chrxh                1/1     Running                      0          3d20h
vpa-recommender-5cdb5d8576-mjd7p                         1/1     Running                      0          3d20h
vpa-updater-78897b4bdd-j6xdf                             1/1     Running                      0          3d20h

~$ kubectl describe pods -n kube-system vpa-admission-controller-5dd89f888f-chrxh
Name:               vpa-admission-controller-5dd89f888f-chrxh
Namespace:          kube-system
Priority:           0
PriorityClassName:  <none>
Node:               ip-172-30-72-105.ap-southeast-1.compute.internal/172.30.72.105
Start Time:         Thu, 30 Jan 2020 22:46:32 +0800
Labels:             app=vpa-admission-controller
                    pod-template-hash=5dd89f888f
Annotations:        cni.projectcalico.org/podIP: 10.254.198.133/32
                    kubernetes.io/psp: eks.privileged
Status:             Running
IP:                 10.254.198.133
Controlled By:      ReplicaSet/vpa-admission-controller-5dd89f888f
Containers:
  admission-controller:
    Container ID:  docker://23257afb27c5364ae0919812edbaecac92c0f3ca2a58044567e9f83b59ae4f9d
    Image:         k8s.gcr.io/vpa-admission-controller:0.6.3
    Image ID:      docker-pullable://k8s.gcr.io/vpa-admission-controller@sha256:59e415c3fe88fe54647882b137ec570d7af39efa58e65f5d4f119047e0c100cb
    Port:          8000/TCP
    Host Port:     0/TCP
    Args:
      --v=10
    State:          Running
      Started:      Thu, 30 Jan 2020 22:46:34 +0800
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     200m
      memory:  500Mi
    Requests:
      cpu:     50m
      memory:  200Mi
    Environment:
      NAMESPACE:  kube-system (v1:metadata.namespace)
    Mounts:
      /etc/tls-certs from tls-certs (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from vpa-admission-controller-token-499fm (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  tls-certs:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  vpa-tls-certs
    Optional:    false
  vpa-admission-controller-token-499fm:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  vpa-admission-controller-token-499fm
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:          <none>

~$ kubectl describe svc -n kube-system vpa-webhook
Name:              vpa-webhook
Namespace:         kube-system
Labels:            <none>
Annotations:       <none>
Selector:          app=vpa-admission-controller
Type:              ClusterIP
IP:                10.100.237.112
Port:              <unset>  443/TCP
TargetPort:        8000/TCP
Endpoints:         10.254.198.133:8000
Session Affinity:  None
Events:            <none>

~$ kubectl exec -it -n ips debug-kube-0 bash
root@debug-kube-0:/# curl -v --insecure https://vpa-webhook.kube-system/
*   Trying 10.100.237.112...
* TCP_NODELAY set
* Connected to vpa-webhook.kube-system (10.100.237.112) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* (304) (OUT), TLS handshake, Client hello (1):
* (304) (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Client hello (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN, server accepted to use h2
* Server certificate:
*  subject: CN=vpa-webhook.kube-system.svc
*  start date: Jan 30 14:46:28 2020 GMT
*  expire date: Nov 14 14:46:28 2293 GMT
*  issuer: CN=vpa_webhook_ca
*  SSL certificate verify result: unable to get local issuer certificate (20), continuing anyway.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x55a2b6284580)
> GET / HTTP/2
> Host: vpa-webhook.kube-system
> User-Agent: curl/7.58.0
> Accept: */*
>
* Connection state changed (MAX_CONCURRENT_STREAMS updated)!
< HTTP/2 200
< content-length: 0
< date: Mon, 03 Feb 2020 11:31:24 GMT
<
* Connection #0 to host vpa-webhook.kube-system left intact

I tried to remove it (vpa-down.sh) and install it again with the fix from #1547:

vertical-pod-autoscaler/deploy$ git diff .
diff --git a/vertical-pod-autoscaler/deploy/admission-controller-deployment.yaml b/vertical-pod-autoscaler/deploy/admission-controller-deployment.yaml
index da47060..fc80a61 100644
--- a/vertical-pod-autoscaler/deploy/admission-controller-deployment.yaml
+++ b/vertical-pod-autoscaler/deploy/admission-controller-deployment.yaml
@@ -18,6 +18,9 @@ spec:
       containers:
         - name: admission-controller
           image: k8s.gcr.io/vpa-admission-controller:0.6.3
+          args:
+            - "--v=10"
+            - "--port=443"
           imagePullPolicy: Always
           env:
             - name: NAMESPACE
@@ -50,6 +53,6 @@ metadata:
 spec:
   ports:
     - port: 443
-      targetPort: 8000
+      targetPort: 443
   selector:
     app: vpa-admission-controller

Checked all VPA resources: everything looks fine and the service is available. But it's still not working with the example application; the hamster PODs keep restarting, and the same messages appear in the VPA logs and the AWS logs.
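
For reference, the remove/reinstall cycle for each config change was roughly the following (run from the vertical-pod-autoscaler directory):

./hack/vpa-down.sh                            # tear down the VPA components
# ... edit deploy/*.yaml ...
./hack/vpa-up.sh                              # bring them back up
kubectl -n kube-system get pods | grep vpa    # all three pods should be Running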

bskiba commented 4 years ago

Then I am afraid I am unable to help; this seems to be a problem with setting up a webhook on EKS, not with the VPA itself. Perhaps you can try asking on a forum with some AWS experts?

bskiba commented 4 years ago

I'm closing this since it seems we can't do anything here, but please feel free to reopen if there is something VPA-specific that we can help with. /close

k8s-ci-robot commented 4 years ago

@bskiba: Closing this issue.

In response to [this](https://github.com/kubernetes/autoscaler/issues/2789#issuecomment-581876410):

> I'm closing this since it seems we can't do anything here, but please feel free to reopen if there is something VPA-specific that we can help with.
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
GreyOBox commented 4 years ago

I'll add our solution here in case someone else hits the same issue. In our case:

  • We didn't change the port as described in #1547
  • We use Calico CNI, not AWS CNI
  • All traffic between the EKS master and the workers is allowed
  • We just had to move the VPA PODs to the host network:

autoscaler/vertical-pod-autoscaler/deploy$ git diff .
diff --git a/vertical-pod-autoscaler/deploy/admission-controller-deployment.yaml b/vertical-pod-autoscaler/deploy/admission-controller-deployment.yaml
index da47060..5236d8d 100644
--- a/vertical-pod-autoscaler/deploy/admission-controller-deployment.yaml
+++ b/vertical-pod-autoscaler/deploy/admission-controller-deployment.yaml
@@ -15,6 +15,8 @@ spec:
         app: vpa-admission-controller
     spec:
       serviceAccountName: vpa-admission-controller
+      hostNetwork: true
+      dnsPolicy: ClusterFirstWithHostNet
       containers:
         - name: admission-controller
           image: k8s.gcr.io/vpa-admission-controller:0.6.3
diff --git a/vertical-pod-autoscaler/deploy/recommender-deployment.yaml b/vertical-pod-autoscaler/deploy/recommender-deployment.yaml
index e0b0356..32cf90b 100644
--- a/vertical-pod-autoscaler/deploy/recommender-deployment.yaml
+++ b/vertical-pod-autoscaler/deploy/recommender-deployment.yaml
@@ -21,6 +21,8 @@ spec:
         app: vpa-recommender
     spec:
       serviceAccountName: vpa-recommender
+      hostNetwork: true
+      dnsPolicy: ClusterFirstWithHostNet
       containers:
       - name: recommender
         image: k8s.gcr.io/vpa-recommender:0.6.3
diff --git a/vertical-pod-autoscaler/deploy/updater-deployment.yaml b/vertical-pod-autoscaler/deploy/updater-deployment.yaml
index 47d1241..21f06aa 100644
--- a/vertical-pod-autoscaler/deploy/updater-deployment.yaml
+++ b/vertical-pod-autoscaler/deploy/updater-deployment.yaml
@@ -21,6 +21,8 @@ spec:
         app: vpa-updater
     spec:
       serviceAccountName: vpa-updater
+      hostNetwork: true
+      dnsPolicy: ClusterFirstWithHostNet
       containers:
         - name: updater
           image: k8s.gcr.io/vpa-updater:0.6.3

As far as we can tell, the EKS-managed control plane can only reach the workers' host network; with Calico the pods live on an overlay the masters can't route to, so the webhook was unreachable until the admission controller listened on the node itself.

@bskiba thanks for help!

bskiba commented 4 years ago

I'm glad you managed to solve it!

bzeitler69 commented 4 years ago

@GreyOBox, that worked for me too. Thanks for leaving the solution here!