Native sidecars lead to huge amounts of errors in recommender logs

fullykubed commented 6 months ago

Which component are you using?:

vertical-pod-autoscaler

What version of the component are you using?:

Component version: 1.0.0

What k8s version are you using (kubectl version)?:

$ kubectl version

Client Version: v1.29.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.1-eks-508b6b3

What environment is this in?:

EKS

What did you expect to happen?:

Native sidecars (init containers with a restartPolicy of Always) should have their metric samples collected and applied.

What happened instead?:

The recommender generates many errors when the pod contains a native sidecar:

W0404 17:28:12.384947       1 cluster_feeder.go:433] Error adding metric sample for container {{vertical-pod-autoscaler vpa-updater-84dc96ccf9-959q9} linkerd-proxy}: KeyError: {{vertical-pod-autoscaler vpa-updater-84dc96ccf9-959q9} linkerd-proxy}
W0404 17:28:12.384961       1 cluster_feeder.go:433] Error adding metric sample for container {{vertical-pod-autoscaler vpa-updater-84dc96ccf9-959q9} linkerd-proxy}: KeyError: {{vertical-pod-autoscaler vpa-updater-84dc96ccf9-959q9} linkerd-proxy}
W0404 17:29:12.319116       1 cluster_feeder.go:433] Error adding metric sample for container {{alb-controller alb-controller-c8d69bfd-npscc} linkerd-proxy}: KeyError: {{alb-controller alb-controller-c8d69bfd-npscc} linkerd-proxy}
W0404 17:29:12.319279       1 cluster_feeder.go:433] Error adding metric sample for container {{alb-controller alb-controller-c8d69bfd-npscc} linkerd-proxy}: KeyError: {{alb-controller alb-controller-c8d69bfd-npscc} linkerd-proxy}
W0404 17:29:12.319378       1 cluster_feeder.go:433] Error adding metric sample for container {{alb-controller alb-controller-c8d69bfd-zg277} linkerd-proxy}: KeyError: {{alb-controller alb-controller-c8d69bfd-zg277} linkerd-proxy}
W0404 17:29:12.319442       1 cluster_feeder.go:433] Error adding metric sample for container {{alb-controller alb-controller-c8d69bfd-zg277} linkerd-proxy}: KeyError: {{alb-controller alb-controller-c8d69bfd-zg277} linkerd-proxy}
W0404 17:29:12.319541       1 cluster_feeder.go:433] Error adding metric sample for container {{authentik authentik-server-5db94f9788-955sv} linkerd-proxy}: KeyError: {{authentik authentik-server-5db94f9788-955sv} linkerd-proxy}
W0404 17:29:12.319599       1 cluster_feeder.go:433] Error adding metric sample for container {{authentik authentik-server-5db94f9788-955sv} linkerd-proxy}: KeyError: {{authentik authentik-server-5db94f9788-955sv} linkerd-proxy}
W0404 17:29:12.319705       1 cluster_feeder.go:433] Error adding metric sample for container {{authentik authentik-server-5db94f9788-ds4sr} linkerd-proxy}: KeyError: {{authentik authentik-server-5db94f9788-ds4sr} linkerd-proxy}
W0404 17:29:12.319764       1 cluster_feeder.go:433] Error adding metric sample for container {{authentik authentik-server-5db94f9788-ds4sr} linkerd-proxy}: KeyError: {{authentik authentik-server-5db94f9788-ds4sr} linkerd-proxy}
W0404 17:29:12.319804       1 cluster_feeder.go:433] Error adding metric sample for container {{authentik authentik-worker-66998c9459-9dfs6} linkerd-proxy}: KeyError: {{authentik authentik-worker-66998c9459-9dfs6} linkerd-proxy}
W0404 17:29:12.319866       1 cluster_feeder.go:433] Error adding metric sample for container {{authentik authentik-worker-66998c9459-9dfs6} linkerd-proxy}: KeyError: {{authentik authentik-worker-66998c9459-9dfs6} linkerd-proxy}
W0404 17:29:12.319961       1 cluster_feeder.go:433] Error adding metric sample for container {{authentik redis-622b-node-0} linkerd-proxy}: KeyError: {{authentik redis-622b-node-0} linkerd-proxy}
W0404 17:29:12.320051       1 cluster_feeder.go:433] Error adding metric sample for container {{authentik redis-622b-node-0} linkerd-proxy}: KeyError: {{authentik redis-622b-node-0} linkerd-proxy}
W0404 17:29:12.320162       1 cluster_feeder.go:433] Error adding metric sample for container {{authentik redis-622b-node-1} linkerd-proxy}: KeyError: {{authentik redis-622b-node-1} linkerd-proxy}
W0404 17:29:12.320219       1 cluster_feeder.go:433] Error adding metric sample for container {{authentik redis-622b-node-1} linkerd-proxy}: KeyError: {{authentik redis-622b-node-1} linkerd-proxy}
W0404 17:29:12.320309       1 cluster_feeder.go:433] Error adding metric sample for container {{authentik redis-622b-node-2} linkerd-proxy}: KeyError: {{authentik redis-622b-node-2} linkerd-proxy}
W0404 17:29:12.320362       1 cluster_feeder.go:433] Error adding metric sample for container {{authentik redis-622b-node-2} linkerd-proxy}: KeyError: {{authentik redis-622b-node-2} linkerd-proxy}
W0404 17:29:12.320457       1 cluster_feeder.go:433] Error adding metric sample for container {{aws-ebs-csi-driver ebs-csi-controller-59bdd4f68d-fxzvw} linkerd-proxy}: KeyError: {{aws-ebs-csi-driver ebs-csi-controller-59bdd4f68d-fxzvw} linkerd-proxy}
W0404 17:29:12.320506       1 cluster_feeder.go:433] Error adding metric sample for container {{aws-ebs-csi-driver ebs-csi-controller-59bdd4f68d-fxzvw} linkerd-proxy}: KeyError: {{aws-ebs-csi-driver ebs-csi-controller-59bdd4f68d-fxzvw} linkerd-proxy}
W0404 17:29:12.320657       1 cluster_feeder.go:433] Error adding metric sample for container {{aws-ebs-csi-driver ebs-csi-controller-59bdd4f68d-zp68v} linkerd-proxy}: KeyError: {{aws-ebs-csi-driver ebs-csi-controller-59bdd4f68d-zp68v} linkerd-proxy}
W0404 17:29:12.320723       1 cluster_feeder.go:433] Error adding metric sample for container {{aws-ebs-csi-driver ebs-csi-controller-59bdd4f68d-zp68v} linkerd-proxy}: KeyError: {{aws-ebs-csi-driver ebs-csi-controller-59bdd4f68d-zp68v} linkerd-proxy}
W0404 17:29:12.321375       1 cluster_feeder.go:433] Error adding metric sample for container {{cert-manager cert-manager-585d84f6ff-cdxsh} linkerd-proxy}: KeyError: {{cert-manager cert-manager-585d84f6ff-cdxsh} linkerd-proxy}
W0404 17:29:12.321449       1 cluster_feeder.go:433] Error adding metric sample for container {{cert-manager cert-manager-585d84f6ff-cdxsh} linkerd-proxy}: KeyError: {{cert-manager cert-manager-585d84f6ff-cdxsh} linkerd-proxy}
W0404 17:29:12.321545       1 cluster_feeder.go:433] Error adding metric sample for container {{cert-manager cert-manager-585d84f6ff-kcb72} linkerd-proxy}: KeyError: {{cert-manager cert-manager-585d84f6ff-kcb72} linkerd-proxy}
W0404 17:29:12.321604       1 cluster_feeder.go:433] Error adding metric sample for container {{cert-manager cert-manager-585d84f6ff-kcb72} linkerd-proxy}: KeyError: {{cert-manager cert-manager-585d84f6ff-kcb72} linkerd-proxy}
W0404 17:29:12.321694       1 cluster_feeder.go:433] Error adding metric sample for container {{cert-manager cert-manager-cainjector-66fc5bbd98-4tw2f} linkerd-proxy}: KeyError: {{cert-manager cert-manager-cainjector-66fc5bbd98-4tw2f} linkerd-proxy}
W0404 17:29:12.321757       1 cluster_feeder.go:433] Error adding metric sample for container {{cert-manager cert-manager-cainjector-66fc5bbd98-4tw2f} linkerd-proxy}: KeyError: {{cert-manager cert-manager-cainjector-66fc5bbd98-4tw2f} linkerd-proxy}
W0404 17:29:12.321865       1 cluster_feeder.go:433] Error adding metric sample for container {{cert-manager cert-manager-cainjector-66fc5bbd98-d6ng2} linkerd-proxy}: KeyError: {{cert-manager cert-manager-cainjector-66fc5bbd98-d6ng2} linkerd-proxy}
W0404 17:29:12.322028       1 cluster_feeder.go:433] Error adding metric sample for container {{cert-manager cert-manager-cainjector-66fc5bbd98-d6ng2} linkerd-proxy}: KeyError: {{cert-manager cert-manager-cainjector-66fc5bbd98-d6ng2} linkerd-proxy}
W0404 17:29:12.322140       1 cluster_feeder.go:433] Error adding metric sample for container {{cert-manager cert-manager-webhook-57b998d467-ft7l9} linkerd-proxy}: KeyError: {{cert-manager cert-manager-webhook-57b998d467-ft7l9} linkerd-proxy}
W0404 17:29:12.322163       1 cluster_feeder.go:433] Error adding metric sample for container {{cert-manager cert-manager-webhook-57b998d467-ft7l9} linkerd-proxy}: KeyError: {{cert-manager cert-manager-webhook-57b998d467-ft7l9} linkerd-proxy}
W0404 17:29:12.322189       1 cluster_feeder.go:433] Error adding metric sample for container {{cert-manager cert-manager-webhook-57b998d467-qjsdc} linkerd-proxy}: KeyError: {{cert-manager cert-manager-webhook-57b998d467-qjsdc} linkerd-proxy}
W0404 17:29:12.322202       1 cluster_feeder.go:433] Error adding metric sample for container {{cert-manager cert-manager-webhook-57b998d467-qjsdc} linkerd-proxy}: KeyError: {{cert-manager cert-manager-webhook-57b998d467-qjsdc} linkerd-proxy}
W0404 17:29:12.322350       1 cluster_feeder.go:433] Error adding metric sample for container {{cloudnative-pg cloudnative-pg-6d4bd4d6dd-wmjmz} linkerd-proxy}: KeyError: {{cloudnative-pg cloudnative-pg-6d4bd4d6dd-wmjmz} linkerd-proxy}
W0404 17:29:12.322367       1 cluster_feeder.go:433] Error adding metric sample for container {{cloudnative-pg cloudnative-pg-6d4bd4d6dd-wmjmz} linkerd-proxy}: KeyError: {{cloudnative-pg cloudnative-pg-6d4bd4d6dd-wmjmz} linkerd-proxy}
W0404 17:29:12.322380       1 cluster_feeder.go:433] Error adding metric sample for container {{cloudnative-pg cloudnative-pg-6d4bd4d6dd-xmdjr} linkerd-proxy}: KeyError: {{cloudnative-pg cloudnative-pg-6d4bd4d6dd-xmdjr} linkerd-proxy}
W0404 17:29:12.322393       1 cluster_feeder.go:433] Error adding metric sample for container {{cloudnative-pg cloudnative-pg-6d4bd4d6dd-xmdjr} linkerd-proxy}: KeyError: {{cloudnative-pg cloudnative-pg-6d4bd4d6dd-xmdjr} linkerd-proxy}
W0404 17:29:12.322432       1 cluster_feeder.go:433] Error adding metric sample for container {{descheduler descheduler-858756ccb-5q522} linkerd-proxy}: KeyError: {{descheduler descheduler-858756ccb-5q522} linkerd-proxy}
W0404 17:29:12.322445       1 cluster_feeder.go:433] Error adding metric sample for container {{descheduler descheduler-858756ccb-5q522} linkerd-proxy}: KeyError: {{descheduler descheduler-858756ccb-5q522} linkerd-proxy}
W0404 17:29:12.322471       1 cluster_feeder.go:433] Error adding metric sample for container {{descheduler descheduler-858756ccb-9djhf} linkerd-proxy}: KeyError: {{descheduler descheduler-858756ccb-9djhf} linkerd-proxy}
W0404 17:29:12.322485       1 cluster_feeder.go:433] Error adding metric sample for container {{descheduler descheduler-858756ccb-9djhf} linkerd-proxy}: KeyError: {{descheduler descheduler-858756ccb-9djhf} linkerd-proxy}
W0404 17:29:12.322512       1 cluster_feeder.go:433] Error adding metric sample for container {{external-dns external-dns-adf33d646e2cbc4c-6d9f465c88-xsqg6} linkerd-proxy}: KeyError: {{external-dns external-dns-adf33d646e2cbc4c-6d9f465c88-xsqg6} linkerd-proxy}
W0404 17:29:12.322527       1 cluster_feeder.go:433] Error adding metric sample for container {{external-dns external-dns-adf33d646e2cbc4c-6d9f465c88-xsqg6} linkerd-proxy}: KeyError: {{external-dns external-dns-adf33d646e2cbc4c-6d9f465c88-xsqg6} linkerd-proxy}
W0404 17:29:12.322555       1 cluster_feeder.go:433] Error adding metric sample for container {{external-dns external-dns-df11375e15f02742-79c99d7749-kvxwt} linkerd-proxy}: KeyError: {{external-dns external-dns-df11375e15f02742-79c99d7749-kvxwt} linkerd-proxy}

The metrics are not recorded in the VPACheckpoints.

How to reproduce it (as minimally and precisely as possible):

1. Create a Deployment with a native sidecar.
2. Assign a VPA to it.
3. See the errors in the recommender logs.

Anything else we need to know?:

No

voelzmo commented 6 months ago

Hey @fullykubed thanks for the detailed description! You're writing about "native sidecars", which makes me think that you're saying those sidecars are not injected during runtime. But your log output seems to contain lots of linkerd-proxy containers, which afaik are injected? I might misunderstand how linkerd works, sorry for my limited experience with it.

If this is a case of VPA not providing recommendations for injected sidecar containers: this is a feature, not a bug. In https://github.com/kubernetes/autoscaler/issues/5617 we have a more detailed discussion around why that feature was built and how we could help people who decide that they do want VPA managing their injected containers.

I propose we take our discussion there and close this ticket. Feel free to re-open with additional information, in case I misunderstood what you described.

/close

k8s-ci-robot commented 6 months ago

@voelzmo: Closing this issue.

In response to [this](https://github.com/kubernetes/autoscaler/issues/6691#issuecomment-2039259960):

>Hey @fullykubed thanks for the detailed description!
>You're writing about "native sidecars", which makes me think that you're saying those sidecars *are not* injected during runtime. But your log output seems to contain lots of `linkerd-proxy` containers, which afaik are injected? I might misunderstand how linkerd works, sorry for my limited experience with it.
>
>If this is a case of VPA not providing recommendations for injected sidecar containers: this is a feature, not a bug. In https://github.com/kubernetes/autoscaler/issues/5617 we have a more detailed discussion around why that feature was built and how we could help people who decide that they _do_ want VPA managing their injected containers.
>
>I propose we take our discussion there and close this ticket. Feel free to re-open with additional information, in case I misunderstood what you described.
>
>/close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

fullykubed commented 6 months ago

Thanks for the feedback and pointing me to the discussion @voelzmo .

For further context here, the "native sidecar" functionality I am referencing is a new Kubernetes feature that was enabled by default in 1.29. These sidecars are still injected but they run as init containers rather than normal containers. Docs here.
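As a rough illustration of what distinguishes a native sidecar from an ordinary init container, here is a minimal Go sketch. The `Container` and `PodSpec` types below are simplified stand-ins written for this example (they only mirror the shape of the corresponding Kubernetes API fields), and `nativeSidecars` is a hypothetical helper, not part of any library.

```go
package main

import "fmt"

// ContainerRestartPolicy is a simplified stand-in for the per-container
// restart policy field in the Kubernetes API.
type ContainerRestartPolicy string

const ContainerRestartPolicyAlways ContainerRestartPolicy = "Always"

// Container keeps only the fields needed for this example.
type Container struct {
	Name          string
	RestartPolicy *ContainerRestartPolicy // nil for ordinary init containers
}

// PodSpec keeps only the two container lists.
type PodSpec struct {
	InitContainers []Container
	Containers     []Container
}

// nativeSidecars (hypothetical helper) returns the names of init containers
// whose restart policy is Always, i.e. the long-running "native sidecars".
func nativeSidecars(spec PodSpec) []string {
	var names []string
	for _, c := range spec.InitContainers {
		if c.RestartPolicy != nil && *c.RestartPolicy == ContainerRestartPolicyAlways {
			names = append(names, c.Name)
		}
	}
	return names
}

func main() {
	always := ContainerRestartPolicyAlways
	spec := PodSpec{
		InitContainers: []Container{
			{Name: "migrate"}, // runs to completion before the app starts
			{Name: "linkerd-proxy", RestartPolicy: &always}, // keeps running
		},
		Containers: []Container{{Name: "app"}},
	}
	fmt.Println(nativeSidecars(spec))
}
```

The container names `migrate` and `app` are made up for the example; only `linkerd-proxy` comes from the logs above.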

Based on the discussion you linked, it makes sense why they (or normal sidecars) wouldn't be tracked. However, these error logs started to appear in abundance only after I started using the native sidecar functionality. I think that means something unexpected is happening in the recommender, as it seems odd to generate 100+ of these log lines per minute at v=1, but I will defer to you on whether this is the intended behavior.

voelzmo commented 5 months ago

Hey @fullykubed thanks for the docs on native sidecars, this was something I apparently missed entirely! Now that I understand this is about long-running init containers, I see why you're getting these errors so often: VPA looks only at regular containers when creating its internal data structures, which results in issues later on when adding the metrics.
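To make the mismatch concrete, here is a minimal Go sketch of the failure mode. It is not the recommender's actual code: `PodID`, `ContainerID`, `PodSpec`, and `ClusterState` are hypothetical simplifications. The only assumption is the one stated above, namely that pods are registered from their regular containers while metric samples arrive for every running container, native sidecars included.

```go
package main

import "fmt"

// PodID and ContainerID are hypothetical keys for this sketch.
type PodID struct {
	Namespace, PodName string
}

type ContainerID struct {
	PodID
	ContainerName string
}

// PodSpec holds container names only.
type PodSpec struct {
	ID             PodID
	Containers     []string // names from spec.containers
	InitContainers []string // names from spec.initContainers (never registered below)
}

// ClusterState counts samples per known container.
type ClusterState struct {
	samples map[ContainerID]int
}

func NewClusterState() *ClusterState {
	return &ClusterState{samples: make(map[ContainerID]int)}
}

// AddPod registers regular containers only, which is the gap being illustrated.
func (c *ClusterState) AddPod(p PodSpec) {
	for _, name := range p.Containers {
		c.samples[ContainerID{p.ID, name}] = 0
	}
}

// AddSample rejects samples for containers that were never registered.
func (c *ClusterState) AddSample(id ContainerID) error {
	if _, ok := c.samples[id]; !ok {
		return fmt.Errorf("KeyError: %v", id)
	}
	c.samples[id]++
	return nil
}

func main() {
	pod := PodID{"demo", "app-0"}
	state := NewClusterState()
	state.AddPod(PodSpec{
		ID:             pod,
		Containers:     []string{"app"},
		InitContainers: []string{"linkerd-proxy"},
	})
	// The metrics source reports every running container, sidecar included.
	for _, name := range []string{"app", "linkerd-proxy"} {
		id := ContainerID{pod, name}
		if err := state.AddSample(id); err != nil {
			fmt.Printf("Error adding metric sample for container %v: %v\n", id, err)
		}
	}
}
```

In this model the warning repeats on every metrics fetch for every pod with a native sidecar, which would be consistent with the log volume reported earlier in the thread.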

Would evicting and recreating the Pod be something that you'd want in order to get better resource recommendations for those native sidecars? Or is this something that VPA should ignore, just like it does now, but without polluting the log with error messages?

/reopen
/title Native sidecars lead to huge amounts of errors in recommender logs

k8s-ci-robot commented 5 months ago

@voelzmo: Reopened this issue.

In response to [this](https://github.com/kubernetes/autoscaler/issues/6691#issuecomment-2082239644):

>Hey @fullykubed thanks for the docs on native sidecars, this was something I apparently missed entirely! Understanding now that this is about long-running init containers, I understand why you're seeing the errors you're describing this often:
>VPA [looks at regular containers only when creating its internal data structures](https://github.com/kubernetes/autoscaler/blob/4f1c8e69a8a4031a531596c26718a262d4b6b716/vertical-pod-autoscaler/pkg/recommender/input/cluster_feeder.go#L410), which results in issues later on when adding the metrics.
>
>Would evicting and recreating the Pod be something that you'd want in order to get better resource recommendations for those native sidecars? Or is this something that VPA should ignore, just like it does now, but without polluting the log with error messages?
>
>/reopen
>/title Native sidecars lead to huge amounts of errors in recommender logs

voelzmo commented 5 months ago

/retitle Native sidecars lead to huge amounts of errors in recommender logs

voelzmo commented 5 months ago

/triage accepted

fullykubed commented 5 months ago

Hi @voelzmo , no worries!

My initial goal was to have these sidecars tracked and then cause pod evictions when appropriate just as with normal containers.

I am having a little trouble understanding whether this is possible based on #5617. I am noticing that the sidecars are not in the vpaObservedContainers annotation. However, based on your comment here it seems like they should be as the VPA mutating webhook runs after the linkerd sidecar injector (my understanding is that they run in alphabetical order). What seems even odder is that even though they aren't "observed," for some reason they are still generating these error messages.
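One way to check this by hand is to compare the annotation against the pod's container names. The Go sketch below assumes the `vpaObservedContainers` annotation value is a comma-separated list of container names; that format is an assumption of this example, and `observedContainers` and `unobserved` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"strings"
)

// Annotation key as named in the comment above.
const observedAnnotation = "vpaObservedContainers"

// observedContainers parses a comma-separated annotation value into a set.
// The comma-separated format is an assumption made for this sketch.
func observedContainers(value string) map[string]bool {
	set := make(map[string]bool)
	for _, part := range strings.Split(value, ",") {
		if name := strings.TrimSpace(part); name != "" {
			set[name] = true
		}
	}
	return set
}

// unobserved returns the container names that are missing from the annotation.
func unobserved(annotations map[string]string, containerNames []string) []string {
	seen := observedContainers(annotations[observedAnnotation])
	var missing []string
	for _, name := range containerNames {
		if !seen[name] {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	annotations := map[string]string{observedAnnotation: "app"}
	// Names taken from both spec.containers and spec.initContainers.
	fmt.Println(unobserved(annotations, []string{"app", "linkerd-proxy"}))
}
```

The container name `app` is made up for the example.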

If it isn't possible yet, it would be nice to at least not generate the error messages.

voelzmo commented 4 months ago

I think right now, VPA doesn't even look at anything defined in initContainers and just looks at the containers section of the spec. Therefore, all the mechanisms around the annotation I mentioned above don't work for native sidecars.

We now have two options:

1. Make sure that we ignore metrics of initContainers when adding them, such that we don't spam KeyErrors all over the place.
2. Add a new feature to have VPA also work for native sidecars (meaning: no longer blindly ignore initContainers).

The second option might be much harder to achieve, so I guess for now it would make sense to merge a small change taking care of the KeyErrors and then think about whether it makes sense to also support native sidecars.
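The first option could look roughly like the Go sketch below: drop samples that belong to init containers before they reach the code path that emits the KeyError. This is only an illustration under assumed types. `Sample` and `dropInitContainerSamples` are hypothetical names, and how the real recommender would learn the init container names of each pod is not shown.

```go
package main

import "fmt"

// Sample is a hypothetical, stripped-down metric sample.
type Sample struct {
	Pod       string
	Container string
	CPUMillis int64
}

// dropInitContainerSamples keeps only samples whose container is not listed
// as an init container of its pod. initContainers maps pod name -> set of
// init container names.
func dropInitContainerSamples(samples []Sample, initContainers map[string]map[string]bool) []Sample {
	kept := make([]Sample, 0, len(samples))
	for _, s := range samples {
		if initContainers[s.Pod][s.Container] {
			continue // would otherwise trigger a KeyError later
		}
		kept = append(kept, s)
	}
	return kept
}

func main() {
	samples := []Sample{
		{Pod: "app-0", Container: "app", CPUMillis: 120},
		{Pod: "app-0", Container: "linkerd-proxy", CPUMillis: 15},
	}
	initContainers := map[string]map[string]bool{
		"app-0": {"linkerd-proxy": true},
	}
	for _, s := range dropInitContainerSamples(samples, initContainers) {
		fmt.Printf("%s/%s: %dm\n", s.Pod, s.Container, s.CPUMillis)
	}
}
```

Filtering like this silences the warnings but also discards the sidecar's usage data, so it does nothing toward the second option.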

fullykubed commented 4 months ago

I agree with that assessment!

I do think that as long-running initContainers are going to become the norm for users of service meshes, this change in the ecosystem will cause a fairly impactful regression in the utility of the VPA.

As I lack the historical context, is there a reason that initContainers are ignored? It looks like tracking initContainers has been requested a couple times over the years, but never had any definitive action taken.

Taking a quick look at the internals, it seems like the VPA already accounts for containers in pods that are not currently running, but perhaps there is another blocker to this functionality I am not considering?

I'd be interested in helping to bring this to fruition including building the PoC code. What would be the best way to propose and get approval for such an enhancement?

fullykubed commented 4 months ago

Also note that when using the Prometheus history provider, long-running initContainers are part of the returned metrics and get added to the VPA object as recommendations, but they do not get applied.

This doesn't cause any errors, but it can be confusing to see a recommendation but not see it applied.

adrianmoisey commented 2 months ago

/area vertical-pod-autoscaler

Native sidecars lead to huge amounts of errors in recommender logs #6691