Closed: GreyOBox closed this issue 4 years ago.

Kubernetes:
VPA: Fresh master (0.6.3)

Deployed to Kubernetes as described in the docs; all VPA pods are running, the service is OK, and there are no errors in the logs. Deployed the example from the docs:

kubectl create -f examples/hamster.yaml

For the first few hours kubectl describe vpa shows an empty Status and Events, and in the vpa-recommender logs I see only errors like:

Cannot update VPA hamster-vpa object. Reason: Timeout: request did not complete within requested timeout 30s

No errors in the other VPA logs. After a few hours kubectl describe vpa shows a recommendation, and VPA starts restarting the hamster pods continuously (a restart every few minutes). Checked the FAQ entry about VPA restarting pods - everything there looks fine.

Enabled debug logs: the VPA service is available, and all other curl lines in the log show HTTP 200. Also, no idea why there's hamster-vpa-hamster instead of hamster-vpa. The vpa-updater logs repeat the same lines (the pod name changes every time), and there is nothing interesting in the vpa-admission-controller logs. VPA doesn't update resources for the hamster pods.
The log from the recommender seems to indicate an unhealthy (overloaded?) API server.
Cannot update VPA hamster-vpa object. Reason: Timeout: request did not complete within requested timeout 30s
As for the pods being continuously updated: this happens if the admission controller is not working or is not being called by the API server. Do the admission-controller logs contain lines saying it's actually updating the pod spec?
Is this a hosted Kubernetes or do you have access to the master logs?
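For reference, one way to pull those logs (a sketch assuming the default kube-system namespace and the app=vpa-admission-controller label from the upstream manifests):

kubectl logs -n kube-system -l app=vpa-admission-controller --tail=200

With verbose logging the admission controller should log the patches it applies to pod specs; if no such lines ever appear, the API server is likely never reaching the webhook.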
> The log from the recommender seems to indicate an unhealthy (overloaded?) API server.

But as far as I understand, all other requests to the API are fine:
I0130 14:51:52.883615 1 round_trippers.go:419] curl -k -v -XGET -H "User-Agent: recommender/v0.0.0 (linux/amd64) kubernetes/$Format" -H "Accept: application/json, */*" -H "Authorization: Bearer XXX...XXX" 'https://10.100.0.1:443/apis/apps/v1/deployments?resourceVersion=10275675&timeout=7m20s&timeoutSeconds=440&watch=true'
I0130 14:51:52.885154 1 round_trippers.go:438] GET https://10.100.0.1:443/apis/apps/v1/deployments?resourceVersion=10275675&timeout=7m20s&timeoutSeconds=440&watch=true 200 OK in 1 milliseconds
> Do the admission-controller logs contain lines saying it's actually updating the pod spec?

I see just one line related to hamster in the logs:
I0203 05:33:28.514971 1 round_trippers.go:438] GET https://10.100.0.1:443/apis/autoscaling.k8s.io/v1beta1/verticalpodautoscalers?limit=500&resourceVersion=0 200 OK in 1 milliseconds
I0203 05:33:28.514993 1 round_trippers.go:444] Response Headers:
I0203 05:33:28.515005 1 round_trippers.go:447] Audit-Id: d5346770-ddda-43ba-a4f6-3812a29d6c87
I0203 05:33:28.515014 1 round_trippers.go:447] Content-Type: application/json
I0203 05:33:28.515024 1 round_trippers.go:447] Content-Length: 1045
I0203 05:33:28.515034 1 round_trippers.go:447] Date: Mon, 03 Feb 2020 05:33:28 GMT
I0203 05:33:28.515086 1 request.go:942] Response Body: {"apiVersion":"autoscaling.k8s.io/v1beta1","items":[{"apiVersion":"autoscaling.k8s.io/v1beta1","kind":"VerticalPodAutoscaler","metadata":{"creationTimestamp":"2020-01-30T14:50:36Z","generation":55,"name":"hamster-vpa","namespace":"default","resourceVersion":"11079077","selfLink":"/apis/autoscaling.k8s.io/v1beta1/namespaces/default/verticalpodautoscalers/hamster-vpa","uid":"e005a60a-436f-11ea-9a97-06752d6e751a"},"spec":{"targetRef":{"apiVersion":"apps/v1","kind":"Deployment","name":"hamster"}},"status":{"conditions":[{"lastTransitionTime":"2020-01-30T16:15:37Z","status":"True","type":"RecommendationProvided"}],"recommendation":{"containerRecommendations":[{"containerName":"hamster","lowerBound":{"cpu":"586m","memory":"262144k"},"target":{"cpu":"587m","memory":"262144k"},"uncappedTarget":{"cpu":"587m","memory":"262144k"},"upperBound":{"cpu":"748m","memory":"262144k"}}]}}}],"kind":"VerticalPodAutoscalerList","metadata":{"continue":"","resourceVersion":"11098746","selfLink":"/apis/autoscaling.k8s.io/v1beta1/verticalpodautoscalers"}}
> Is this a hosted Kubernetes or do you have access to the master logs?

It's Amazon EKS. I can see many messages like the ones below in the CloudWatch logs for kube-apiserver:
W0202 08:27:02.991885 1 dispatcher.go:70] Failed calling webhook, failing open vpa.k8s.io: failed calling webhook "vpa.k8s.io": Post https://vpa-webhook.kube-system.svc:443/?timeout=30s: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
E0202 08:27:02.991922 1 dispatcher.go:71] failed calling webhook "vpa.k8s.io": Post https://vpa-webhook.kube-system.svc:443/?timeout=30s: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
I0202 08:27:06.672272 1 trace.go:81] Trace[429795704]: "Patch /apis/autoscaling.k8s.io/v1beta2/namespaces/default/verticalpodautoscalers/hamster-vpa" (started: 2020-02-02 08:26:36.671909863 +0000 UTC m=+3406838.753779259) (total time: 30.000336994s):
Trace[429795704]: [30.000336994s] [29.999949547s] END
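One thing worth checking at this point is the webhook registration itself. A sketch (vpa-webhook-config is the MutatingWebhookConfiguration name the VPA admission controller registers by default; list all configurations first if unsure):

kubectl get mutatingwebhookconfigurations
kubectl get mutatingwebhookconfiguration vpa-webhook-config -o yaml

The clientConfig should point at the vpa-webhook service in kube-system on port 443, i.e. the same URL the apiserver errors above are failing against.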
From a pod inside the cluster I can send a request to this service:
curl -vvv --insecure https://vpa-webhook.kube-system.svc:443/?timeout=30s -XPOST
* Trying 10.100.237.112...
> POST /?timeout=30s HTTP/2
> Host: vpa-webhook.kube-system.svc
> User-Agent: curl/7.58.0
> Accept: */*
>
* Connection state changed (MAX_CONCURRENT_STREAMS updated)!
< HTTP/2 200
< content-length: 0
< date: Mon, 03 Feb 2020 08:38:42 GMT
<
* Connection #0 to host vpa-webhook.kube-system.svc left intact
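Worth noting: the successful curl above runs pod-to-pod over the CNI network, while the failing request originates from the EKS-managed control plane, which takes a different network path and may not be able to route to Calico pod IPs at all. A rough intermediate check is to probe the webhook pod directly from a worker node (a sketch; the pod IP and target port come from the vpa-webhook service endpoints):

curl -k https://10.254.198.133:8000/

Even if this succeeds, the master-to-pod path can still be broken, since the EKS apiserver itself is not accessible for testing.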
Ah right, the failed VPA object update is due to API server being unable to call webhook.
The setup seems fine from my point of view, so there is something EKS-specific in play which, unfortunately, I am not familiar with. #1547 seems similar, but I understand from your report that the admission controller is already listening on port 443?
Yes, it's listening and available via the service:
~$ kubectl get pods -n kube-system | grep vpa
vpa-admission-controller-5dd89f888f-chrxh 1/1 Running 0 3d20h
vpa-recommender-5cdb5d8576-mjd7p 1/1 Running 0 3d20h
vpa-updater-78897b4bdd-j6xdf 1/1 Running 0 3d20h
~$ kubectl describe pods -n kube-system vpa-admission-controller-5dd89f888f-chrxh
Name: vpa-admission-controller-5dd89f888f-chrxh
Namespace: kube-system
Priority: 0
PriorityClassName: <none>
Node: ip-172-30-72-105.ap-southeast-1.compute.internal/172.30.72.105
Start Time: Thu, 30 Jan 2020 22:46:32 +0800
Labels: app=vpa-admission-controller
pod-template-hash=5dd89f888f
Annotations: cni.projectcalico.org/podIP: 10.254.198.133/32
kubernetes.io/psp: eks.privileged
Status: Running
IP: 10.254.198.133
Controlled By: ReplicaSet/vpa-admission-controller-5dd89f888f
Containers:
admission-controller:
Container ID: docker://23257afb27c5364ae0919812edbaecac92c0f3ca2a58044567e9f83b59ae4f9d
Image: k8s.gcr.io/vpa-admission-controller:0.6.3
Image ID: docker-pullable://k8s.gcr.io/vpa-admission-controller@sha256:59e415c3fe88fe54647882b137ec570d7af39efa58e65f5d4f119047e0c100cb
Port: 8000/TCP
Host Port: 0/TCP
Args:
--v=10
State: Running
Started: Thu, 30 Jan 2020 22:46:34 +0800
Ready: True
Restart Count: 0
Limits:
cpu: 200m
memory: 500Mi
Requests:
cpu: 50m
memory: 200Mi
Environment:
NAMESPACE: kube-system (v1:metadata.namespace)
Mounts:
/etc/tls-certs from tls-certs (ro)
/var/run/secrets/kubernetes.io/serviceaccount from vpa-admission-controller-token-499fm (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
tls-certs:
Type: Secret (a volume populated by a Secret)
SecretName: vpa-tls-certs
Optional: false
vpa-admission-controller-token-499fm:
Type: Secret (a volume populated by a Secret)
SecretName: vpa-admission-controller-token-499fm
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events: <none>
~$ kubectl describe svc -n kube-system vpa-webhook
Name: vpa-webhook
Namespace: kube-system
Labels: <none>
Annotations: <none>
Selector: app=vpa-admission-controller
Type: ClusterIP
IP: 10.100.237.112
Port: <unset> 443/TCP
TargetPort: 8000/TCP
Endpoints: 10.254.198.133:8000
Session Affinity: None
Events: <none>
~$ kubectl exec -it -n ips debug-kube-0 bash
root@debug-kube-0:/# curl -v --insecure https://vpa-webhook.kube-system/
* Trying 10.100.237.112...
* TCP_NODELAY set
* Connected to vpa-webhook.kube-system (10.100.237.112) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
* CAfile: /etc/ssl/certs/ca-certificates.crt
CApath: /etc/ssl/certs
* (304) (OUT), TLS handshake, Client hello (1):
* (304) (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Client hello (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN, server accepted to use h2
* Server certificate:
* subject: CN=vpa-webhook.kube-system.svc
* start date: Jan 30 14:46:28 2020 GMT
* expire date: Nov 14 14:46:28 2293 GMT
* issuer: CN=vpa_webhook_ca
* SSL certificate verify result: unable to get local issuer certificate (20), continuing anyway.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x55a2b6284580)
> GET / HTTP/2
> Host: vpa-webhook.kube-system
> User-Agent: curl/7.58.0
> Accept: */*
>
* Connection state changed (MAX_CONCURRENT_STREAMS updated)!
< HTTP/2 200
< content-length: 0
< date: Mon, 03 Feb 2020 11:31:24 GMT
<
* Connection #0 to host vpa-webhook.kube-system left intact
I tried to remove it (vpa-down.sh) and install it again with the fix from #1547:
vertical-pod-autoscaler/deploy$ git diff .
diff --git a/vertical-pod-autoscaler/deploy/admission-controller-deployment.yaml b/vertical-pod-autoscaler/deploy/admission-controller-deployment.yaml
index da47060..fc80a61 100644
--- a/vertical-pod-autoscaler/deploy/admission-controller-deployment.yaml
+++ b/vertical-pod-autoscaler/deploy/admission-controller-deployment.yaml
@@ -18,6 +18,9 @@ spec:
       containers:
         - name: admission-controller
           image: k8s.gcr.io/vpa-admission-controller:0.6.3
+          args:
+            - "--v=10"
+            - "--port=443"
           imagePullPolicy: Always
           env:
             - name: NAMESPACE
@@ -50,6 +53,6 @@ metadata:
 spec:
   ports:
     - port: 443
-      targetPort: 8000
+      targetPort: 443
   selector:
     app: vpa-admission-controller
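A quick sanity check after redeploying with this change (a sketch, using the same command as above):

~$ kubectl describe svc -n kube-system vpa-webhook

The Endpoints line should now show the pod on port 443 instead of 8000.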
Checked all VPA resources - everything looks fine, the service is available. But it still doesn't work with the example application: the hamster pods keep restarting and the same messages appear in the VPA logs and the AWS logs.
Then I am afraid I am unable to help; this seems to be a problem with setting up a webhook on EKS, not with the VPA itself. Perhaps you can try asking on a forum with AWS experts?
I'm closing this since it seems we can't do anything here, but please feel free to reopen if there is something VPA-specific that we can help with. /close
@bskiba: Closing this issue.
I'll add our solution here in case someone else has the same issue. In our case:
- We didn't change the port as described in #1547.
- We use the Calico CNI, not the AWS CNI.
- All traffic between the EKS master and the workers is allowed.
- We just had to move the VPA pods to the host network, as shown in the diff below:

autoscaler/vertical-pod-autoscaler/deploy$ git diff .
diff --git a/vertical-pod-autoscaler/deploy/admission-controller-deployment.yaml b/vertical-pod-autoscaler/deploy/admission-controller-deployment.yaml
index da47060..5236d8d 100644
--- a/vertical-pod-autoscaler/deploy/admission-controller-deployment.yaml
+++ b/vertical-pod-autoscaler/deploy/admission-controller-deployment.yaml
@@ -15,6 +15,8 @@ spec:
         app: vpa-admission-controller
     spec:
       serviceAccountName: vpa-admission-controller
+      hostNetwork: true
+      dnsPolicy: ClusterFirstWithHostNet
       containers:
         - name: admission-controller
           image: k8s.gcr.io/vpa-admission-controller:0.6.3
diff --git a/vertical-pod-autoscaler/deploy/recommender-deployment.yaml b/vertical-pod-autoscaler/deploy/recommender-deployment.yaml
index e0b0356..32cf90b 100644
--- a/vertical-pod-autoscaler/deploy/recommender-deployment.yaml
+++ b/vertical-pod-autoscaler/deploy/recommender-deployment.yaml
@@ -21,6 +21,8 @@ spec:
         app: vpa-recommender
     spec:
       serviceAccountName: vpa-recommender
+      hostNetwork: true
+      dnsPolicy: ClusterFirstWithHostNet
       containers:
         - name: recommender
           image: k8s.gcr.io/vpa-recommender:0.6.3
diff --git a/vertical-pod-autoscaler/deploy/updater-deployment.yaml b/vertical-pod-autoscaler/deploy/updater-deployment.yaml
index 47d1241..21f06aa 100644
--- a/vertical-pod-autoscaler/deploy/updater-deployment.yaml
+++ b/vertical-pod-autoscaler/deploy/updater-deployment.yaml
@@ -21,6 +21,8 @@ spec:
         app: vpa-updater
     spec:
       serviceAccountName: vpa-updater
+      hostNetwork: true
+      dnsPolicy: ClusterFirstWithHostNet
       containers:
         - name: updater
           image: k8s.gcr.io/vpa-updater:0.6.3
@bskiba thanks for help!
I'm glad you managed to solve it!
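A plausible explanation for why hostNetwork helps here: the EKS-managed control plane can route to worker-node IPs, but not necessarily to Calico overlay pod IPs, so the apiserver's call to the webhook times out; with hostNetwork: true the admission controller binds on the node's own address. A sketch of a post-redeploy check:

~$ kubectl get endpoints -n kube-system vpa-webhook

The endpoint should now show a node IP (e.g. 172.30.72.105 in the setup above) rather than a Calico pod IP such as 10.254.198.133.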
@GreyOBox, That worked for me too. Thanks for leaving the solution here!