Ways for VPA recommender to support other sources of data.

lallydd commented 2 years ago

Which component are you using?: vertical-pod-autoscaler/pkg/recommender

Is your feature request designed to solve a problem? If so describe the problem this feature should solve.: We collect our own metrics separately. metrics-server is redundant.

Describe the solution you'd like.: Enable some way:

via RPC to another metrics client system
some plugin mechanism
(or something else)

To have the recommender use a separate metrics source than metrics-server.

Describe any alternative solutions you've considered.: Nothing good comes to mind. We have a custom fork that's almost all redundant code with the stock recommender.

Additional context.: In our case, Datadog already collects metrics for all of our Kubernetes clusters, and has a simple API to get the metric values. Feeding metrics from the API to the recommender is substantially easier than maintaining a deployment to metrics-server.

jbartosik commented 2 years ago

Can you come to SIG meeting to talk about this?

I think it makes sense to allow getting data from different sources.
I won't have capacity to design or implement this but I can help (discuss, review, make suggestions).
I'm worried about testing and maintenance:
- Can we have automated e2e tests to make sure this keeps working?
- I don't have any experience with Datadog so what happens if it turns out it needs some work (likely I won't be able to help)?
- This reminds me of discussions Cluster Autoscaler recently were having about their support for cloud providers, maybe we can ask them what they worked out and see if something similar would work here?

lallydd commented 2 years ago

Hi! From the SIG meeting, this is my recollection (please correct me when I've remembered incorrectly) of the discussion:

The VPA recommender can follow the cluster-autoscaler model, where different metrics sources provide different implementations of an interface
3rd parties (like us at Datadog) may provide alternative implementations of the interface
The implementations must (a) compile and (b) not have too many dependencies and (c) not have conflicting dependencies with what's already in the recommender
The 3rd parties are responsible for the interface implementations.
Interface discussion is ongoing, but can start with a proposal in this Issue.

I propose using the metrics.v1beta1.PodMetricsGetter and its included PodMetricsInterface as the interface:


// PodMetricsesGetter has a method to return a PodMetricsInterface.
// A group's client should implement this interface.
type PodMetricsesGetter interface {
    PodMetricses(namespace string) PodMetricsInterface
}

// PodMetricsInterface has methods to work with PodMetrics resources.
type PodMetricsInterface interface {
    Get(ctx context.Context, name string, opts v1.GetOptions) (*v1beta1.PodMetrics, error)
    List(ctx context.Context, opts v1.ListOptions) (*v1beta1.PodMetricsList, error)
    Watch(ctx context.Context, opts v1.ListOptions) (watch.Interface, error)
    PodMetricsExpansion
}

It's installed as the resourceClient.PodMetricsGetter arg in the call to input_metrics.NewMetricsClient() in recommender/main.go.

lallydd commented 2 years ago

RE: the automated tests, I've used a recorded result from the Datadog API as an offline test.

jbartosik commented 2 years ago

Hi, I took a look. I think I need to look some more. I'm still thinking / asking around. I'm not very confident in what I think but I thought I should let you know:

I'm not sure about using PodMetricsesGetter: a. We don't control the interface. If we need to update the dependency and the interface changes we'll need to update all metric providers b. It might be bigger than we need, MetricsClient might be all we need (I still need to check if it's all we need),
1. I think we also need to tell VPA which metrics provider it should use. How do we do that? My first thought is have a function which returns implementation chosen by a flag. I want to check how CA(Cluster Autoscaler) does this with cloud providers.
2. I want to check how CA allows excluding cloud providers from some of its builds (IIRC someone said during SIG meeting that they do)
3. I want to check what images CA releases with regards to built in cloud providers (and follow their example, to check if we'll need to provide more images in release)

jbartosik commented 2 years ago

I talked about this with @x13n

Daniel pointed out that rather that setting up support for DataDog specifically we could instead add support for external & custom metrics to VPA.

I like the idea:

It looks like Datadog supports this:
- here is an article. It refers to HPA but we can add support for custom / external metrics to VPA and get those metrics the same way.
I think it will be better for adding new metric sources (if someone wants to add new metric source they implement support for 1 API, no need to do work for VPA in particular).
It's less work to add just support in this case. To minimize problems with security and compilation errors "metric providers" I initially suggested should run as sidecars (and that's likely a lot of work).
Generally it looks like a much cleaner design.

What do you think about adding support for custom / external metrics to VPA as a way to support using Datadog metrics?

lallydd commented 2 years ago

So, roughly, does this mean using this interface https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/podautoscaler/metrics/interfaces.go#L40-L57 inside the recommender?

I see the appeal, and it seeems reasonable. My only concern is the user configuration for the metric. Generally we only want the measurement of each resource that corresponds to its recommendation (e.g., RSS for RAM, top(1) CPU usage for CPU). Is there some way to avoid or strongly discourage misconfiguration?

Or would you configure the recommender to just use the External Metrics Provider and there would be some automagic (e.g., standardized names that each provider would map to their own internal names) metric name that did the right thing?

jbartosik commented 2 years ago

Yes, VPA recommender would use MetricsClient interface to get metrics.

I'm not sure I understand your question about misconfiguration.

Are you considering which of two approaches:

Allow users to configure which metrics to use on per-VPA-object level
Allow configuring VPA recommender which metrics to use (for all VPA objects it generates recommendations for)

?

lallydd commented 2 years ago

For configuration, different measurement systems (e.g., metrics-server, Datadog, others) will likely have different names for container CPU / Memory usage. So we'll have to configure that somehow.

As for the two approaches, I think (2) is likely necessary - we shouldn't require each VerticalPodAutoscaler to have more configuration when it's all the same for every instance in a cluster. As for (1), I think that's a separate and longer discussion.

lallydd commented 2 years ago

Clarifying: when calling

 GetExternalMetric(metricName string, namespace string, selector labels.Selector)

The metricName arg needs a value, and we have to call it twice, once for CPU metrics, once for RAM.

So, if the recommender took three more command-line args:

  --use-external-metrics
  --external-metrics-cpu-metric=kubernetes.cpu.usage.total
  --external-metrics-memory-metric=kubernetes.memory.usage

This could configure the recommender to both call the external metrics provider and use the right values for the metricName parameter of GetExternalMetric. I used the appropriate Datadog metric names here as an example.

jbartosik commented 2 years ago

Sounds good to me. We could look at HPAs e2e tests here (dashboard) and have similar tests to ensure we notice if the feature breaks.

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

lallydd commented 1 year ago

/remove-lifecycle stale

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

lallydd commented 1 year ago

/remove-lifecycle stale

On Sun, May 7, 2023, 2:50 PM Kubernetes Triage Robot < @.***> wrote:

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied

After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied

After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale

Close this issue with /close

Offer to help out with Issue Triage https://www.kubernetes.dev/docs/guide/issue-triage/

Please send feedback to sig-contributor-experience at kubernetes/community https://github.com/kubernetes/community.

/lifecycle stale

— Reply to this email directly, view it on GitHub https://github.com/kubernetes/autoscaler/issues/5153#issuecomment-1537515753, or unsubscribe https://github.com/notifications/unsubscribe-auth/AOK7D5LP577BNSWD7LWULNTXE7VGLANCNFSM6AAAAAAQBTD7EQ . You are receiving this because you authored the thread.Message ID: @.***>

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle rotten
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue with /reopen
Mark this issue as fresh with /remove-lifecycle rotten
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 1 year ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes/autoscaler/issues/5153#issuecomment-1624331629): >The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. > >This bot triages issues according to the following rules: >- After 90d of inactivity, `lifecycle/stale` is applied >- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied >- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed > >You can: >- Reopen this issue with `/reopen` >- Mark this issue as fresh with `/remove-lifecycle rotten` >- Offer to help out with [Issue Triage][1] > >Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community). > >/close not-planned > >[1]: https://www.kubernetes.dev/docs/guide/issue-triage/ Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

jbartosik commented 1 year ago

/remove-lifecycle rotten

kubernetes / autoscaler

Ways for VPA recommender to support other sources of data. #5153