Closed: liugb1029 closed this issue 1 year ago.
Can you give me some more information?
I can see that pods are called hamster. Does it mean you're running vpa-full tests and you're seeing the problem in those tests?
Same observation here. I have a dozen VPAs in my cluster, some of them in Auto update mode. On one VPA (targeting a DaemonSet for a Datadog agent with 4 containers), pods are being evicted by the VPA updater, but the mutating webhook does not patch anything.
There was no such behavior with v0.9.2.
I had some time to gather some data. Let me know if you need me to investigate further.
VPA object:
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: datadog
  namespace: monitoring
spec:
  resourcePolicy:
    containerPolicies:
    - containerName: trace-agent
      maxAllowed:
        cpu: 700m
        memory: 700Mi
      minAllowed:
        cpu: 100m
        memory: 100Mi
    - containerName: process-agent
      maxAllowed:
        cpu: 700m
        memory: 700Mi
      minAllowed:
        cpu: 100m
        memory: 100Mi
    - containerName: agent
      maxAllowed:
        cpu: 700m
        memory: 700Mi
      minAllowed:
        cpu: 100m
        memory: 100Mi
  targetRef:
    apiVersion: apps/v1
    kind: DaemonSet
    name: datadog
  updatePolicy:
    updateMode: Auto
Logs (formatted) from the VPA admission controller (verbosity level -v=20):
Sending patches: [
{add /spec/containers/0/resources/requests/cpu 224m}
{add /spec/containers/0/resources/requests/memory 203699302}
{add /spec/containers/0/resources/limits/cpu 1120m}
{add /spec/containers/0/resources/limits/memory 1018496510}
{add /spec/containers/1/resources/requests/cpu 100m}
{add /spec/containers/1/resources/requests/memory 100Mi}
{add /spec/containers/1/resources/limits/cpu 500m}
{add /spec/containers/1/resources/limits/memory 500Mi}
{add /spec/containers/2/resources/requests/cpu 100m}
{add /spec/containers/2/resources/requests/memory 100Mi}
{add /spec/containers/2/resources/limits/cpu 500m}
{add /spec/containers/2/resources/limits/memory 500Mi}
{add /metadata/annotations/vpaUpdates Pod resources updated by datadog: container 0: cpu request, memory request, cpu limit, memory limit; container 1: cpu request, memory request, cpu limit, memory limit; container 2: cpu request, memory request, cpu limit, memory limit}
{add /metadata/annotations/vpaObservedContainers agent, process-agent, trace-agent}]
Screenshot from the API-server audit-log, stored in AWS Cloudwatch:
Exactly the same setup, but just changing the tag of the image used in the "vpa-admission-controller" deployment from 0.10.0 to 0.9.2:
I have enabled the API server logs to see if it would complain about something not expected in the response from the mutating webhook, but nothing showed up as far as I can tell. I searched for "vpa-webhook-config" in those logs.
We observed this issue while testing gardener with VPA 0.10.0 as well. We saw the following error in the kube-apiserver logs:
Failed calling webhook, failing open vpa.k8s.io: failed calling webhook "vpa.k8s.io": converting (v1.AdmissionReview) to (v1beta1.AdmissionReview): unknown conversion
We realized that our vpa-webhook-config MutatingWebhookConfiguration only listed v1beta1 in admissionReviewVersions, but VPA 0.10.0 is returning v1:
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: vpa-webhook-config-seed
  ...
webhooks:
- admissionReviewVersions:
  - v1beta1
Once we changed the above version to v1, the errors in the kube-apiserver logs were gone, and the issue itself disappeared.
It seems that 0.10.0 requires changing to (or adding) the v1 admission review version in the MutatingWebhookConfiguration. Perhaps this should be considered a breaking change and documented properly?
/cc @voelzmo
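For readers managing the webhook registration themselves, a minimal sketch of the change described above (the configuration and webhook names follow the ones mentioned in this thread; all other fields are abbreviated and illustrative):

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: vpa-webhook-config
webhooks:
- name: vpa.k8s.io
  # VPA 0.10.0 responds with v1 AdmissionReview objects, so v1 must be
  # allowed here; listing it first makes the API server prefer it.
  admissionReviewVersions:
  - v1
  - v1beta1
  # clientConfig, rules, failurePolicy, etc. stay as in the existing
  # registration and are omitted here.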
I guess this is the commit where the relevant change was introduced.
Thank you for investigating this. I'm trying to reproduce the problem.
I tried to reproduce the situation @stoyanr describes in this comment:
And VPA works correctly. The webhook in the cluster is defined with v1, because when the VPA Admission Controller starts it deletes the existing webhook (if there is one) and then creates a new one.
It looks like something else is modifying the webhook after the VPA Admission Controller starts (the Admission Controller would die if it couldn't delete the old webhook or create the new one).
I'll try modifying the webhook after Admission Controller 0.10.0 starts.
I modified the webhook to use the v1beta1 API after the Admission Controller started, and now the e2e tests I'm running on the cluster are failing. I can see that the Updater is evicting pods, recommendations are changing, and the Admission Controller is processing pods, but requests are not changing.
So it looks like something other than the VPA Admission Controller is changing the webhook, and as a result the VPA Admission Controller doesn't work properly.
Another possibility is that you're running the VPA Admission Controller with registerWebhook set to false.
Another possibility is that you're running VPA Admission controller with registerWebhook set to false.
Yes, we are doing that, since we need better control over the webhook registration. The point is that it is currently not obvious from the release notes of 0.10.0 that in this case we must also update the webhook registration to v1 when upgrading to 0.10.0, so we had to find that out the hard way.
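For context, a rough sketch of how self-registration is disabled when the webhook is managed externally, using the admission controller's --register-webhook flag; the names, namespace, and image path in this Deployment excerpt are illustrative rather than taken from this thread:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vpa-admission-controller
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vpa-admission-controller
  template:
    metadata:
      labels:
        app: vpa-admission-controller
    spec:
      containers:
      - name: admission-controller
        image: k8s.gcr.io/autoscaling/vpa-admission-controller:0.10.0
        args:
        # Skip self-registration of the mutating webhook. The
        # MutatingWebhookConfiguration is then owned by an external tool
        # (e.g. a GitOps controller) and must itself list v1 in
        # admissionReviewVersions when running 0.10.0.
        - --register-webhook=false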
I'm using a gitops workflow to deploy and reconcile technical tools on my clusters, so yes, FluxCD is overwriting the mutating admission webhook configuration that VPA has updated on start-up.
I'll change the webhook configuration in the git repository that FluxCD is using. Thanks for the analysis.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
Which component are you using?: vertical-pod-autoscaler
What version of the component are you using?:
Component version: 0.10.0
What k8s version are you using (kubectl version)?:
What environment is this in?:
What did you expect to happen?:
What happened instead?:
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?: