kubernetes / autoscaler

Autoscaling components for Kubernetes

Ways for VPA recommender to support other sources of data. #5153

Closed: lallydd closed this issue 1 year ago

lallydd commented 2 years ago

Which component are you using?: vertical-pod-autoscaler/pkg/recommender

Is your feature request designed to solve a problem? If so describe the problem this feature should solve.: We collect our own metrics separately. metrics-server is redundant.

Describe the solution you'd like.: Enable some way for the recommender to use a metrics source other than metrics-server.

Describe any alternative solutions you've considered.: Nothing good comes to mind. We maintain a custom fork, but it's almost entirely redundant with the stock recommender's code.

Additional context.: In our case, Datadog already collects metrics for all of our Kubernetes clusters, and has a simple API to get the metric values. Feeding metrics from that API to the recommender is substantially easier than maintaining a metrics-server deployment.

jbartosik commented 2 years ago

Can you come to a SIG meeting to talk about this?

lallydd commented 2 years ago

Hi! This is my recollection of the discussion from the SIG meeting (please correct me if I've remembered anything incorrectly):

I propose using the metrics.v1beta1.PodMetricsesGetter and its included PodMetricsInterface as the interface:


// PodMetricsesGetter has a method to return a PodMetricsInterface.
// A group's client should implement this interface.
type PodMetricsesGetter interface {
    PodMetricses(namespace string) PodMetricsInterface
}

// PodMetricsInterface has methods to work with PodMetrics resources.
type PodMetricsInterface interface {
    Get(ctx context.Context, name string, opts v1.GetOptions) (*v1beta1.PodMetrics, error)
    List(ctx context.Context, opts v1.ListOptions) (*v1beta1.PodMetricsList, error)
    Watch(ctx context.Context, opts v1.ListOptions) (watch.Interface, error)
    PodMetricsExpansion
}

It's installed as the resourceClient.PodMetricsesGetter arg in the call to input_metrics.NewMetricsClient() in recommender/main.go.
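
For illustration, a Datadog-backed getter could be shaped roughly like this sketch. Nothing here is settled design: the datadogmetrics package, datadogAPI, and newDatadogPodMetrics are hypothetical names, and the Get/List/Watch methods of the returned PodMetricsInterface are elided.

package datadogmetrics

import (
    metricsv1beta1 "k8s.io/metrics/pkg/client/clientset/versioned/typed/metrics/v1beta1"
)

// datadogAPI is a hypothetical thin wrapper around the Datadog query API.
type datadogAPI struct {
    apiKey string
    appKey string
}

// DatadogPodMetricsesGetter implements PodMetricsesGetter, so it could stand
// in wherever the recommender takes the metrics-server-backed client today.
type DatadogPodMetricsesGetter struct {
    api datadogAPI
}

// PodMetricses returns a PodMetricsInterface whose Get/List/Watch methods
// would be backed by Datadog queries instead of metrics-server.
func (g DatadogPodMetricsesGetter) PodMetricses(namespace string) metricsv1beta1.PodMetricsInterface {
    return newDatadogPodMetrics(g.api, namespace)
}

// newDatadogPodMetrics would construct that implementation; the actual
// queries against the Datadog API are elided from this sketch.
func newDatadogPodMetrics(api datadogAPI, namespace string) metricsv1beta1.PodMetricsInterface {
    panic("sketch: Datadog-backed Get/List/Watch go here")
}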

lallydd commented 2 years ago

RE: the automated tests, I've used a recorded result from the Datadog API as an offline test.
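
Concretely, that style of offline test can be sketched with httptest: serve a payload recorded from the live Datadog API and replay it from a local server, so the test is deterministic and needs no network. The fixture path and test below are illustrative, not the actual test code.

package datadogmetrics

import (
    "io"
    "net/http"
    "net/http/httptest"
    "os"
    "testing"
)

func TestWithRecordedDatadogResponse(t *testing.T) {
    // Load a response previously captured from the real Datadog API.
    recorded, err := os.ReadFile("testdata/datadog_pod_metrics.json")
    if err != nil {
        t.Fatal(err)
    }
    // Replay it from a local server so the test runs offline.
    srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        w.Header().Set("Content-Type", "application/json")
        w.Write(recorded)
    }))
    defer srv.Close()

    // In the real test, the Datadog-backed metrics client would be built
    // with srv.URL as its base URL and its output checked against goldens.
    resp, err := http.Get(srv.URL)
    if err != nil {
        t.Fatal(err)
    }
    defer resp.Body.Close()
    if body, _ := io.ReadAll(resp.Body); string(body) != string(recorded) {
        t.Fatal("replayed payload does not match the recording")
    }
}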

jbartosik commented 2 years ago

Hi, I took a look, and I think I need to look some more. I'm still thinking and asking around. I'm not very confident in my conclusions yet, but I thought I should let you know where I stand:

  1. I'm not sure about using PodMetricsesGetter:
     a. We don't control the interface. If we need to update the dependency and the interface changes, we'll need to update all metric providers.
     b. It might be bigger than we need; MetricsClient might be all we need (I still need to check whether it is).
  2. I think we also need to tell VPA which metrics provider it should use. How do we do that? My first thought is to have a function that returns the implementation chosen by a flag (see the sketch after this list). I want to check how CA (Cluster Autoscaler) does this with cloud providers.
  3. I want to check how CA allows excluding cloud providers from some of its builds (IIRC someone said during the SIG meeting that it does).
  4. I want to check which images CA releases with regard to built-in cloud providers (and follow their example, to check whether we'll need to provide more images in the release).
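
To make point 2 concrete, here's a rough sketch of what "a function which returns the implementation chosen by a flag" could look like. Everything in it is hypothetical: the --metrics-provider flag, the constructor names, and the stand-in MetricsClient interface are illustrations, not settled design. If we follow CA, build tags could then compile specific providers out of some builds (point 3).

package metrics

import (
    "flag"
    "fmt"
)

// MetricsClient stands in for the recommender's metrics client interface.
type MetricsClient interface{}

type metricsServerClient struct{}  // today's metrics-server-backed client
type datadogMetricsClient struct{} // a hypothetical Datadog-backed client

var metricsProvider = flag.String("metrics-provider", "metrics-server",
    "Metrics source the recommender should use (hypothetical flag).")

// newMetricsClient returns the implementation chosen by the flag, in the
// spirit of how Cluster Autoscaler selects among cloud providers.
func newMetricsClient() (MetricsClient, error) {
    switch *metricsProvider {
    case "metrics-server":
        return metricsServerClient{}, nil
    case "datadog":
        return datadogMetricsClient{}, nil
    default:
        return nil, fmt.Errorf("unknown metrics provider %q", *metricsProvider)
    }
}
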
jbartosik commented 2 years ago

I talked about this with @x13n.

Daniel pointed out that, rather than setting up support for Datadog specifically, we could instead add support for external & custom metrics to VPA.

I like the idea.

What do you think about adding support for custom / external metrics to VPA as a way to support using Datadog metrics?

lallydd commented 2 years ago

So, roughly, does this mean using this interface https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/podautoscaler/metrics/interfaces.go#L40-L57 inside the recommender?
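
For reference, the external-metrics method on that linked interface looks like this (abridged; as I read it, the []int64 carries one milli-scaled sample per object matching the selector):

GetExternalMetric(metricName string, namespace string, selector labels.Selector) ([]int64, time.Time, error)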

I see the appeal, and it seems reasonable. My only concern is the user configuration for the metric. Generally we only want the measurement of each resource that corresponds to its recommendation (e.g., RSS for RAM, top(1) CPU usage for CPU). Is there some way to avoid or strongly discourage misconfiguration?

Or would you configure the recommender to just use the External Metrics Provider, with some automagic metric name that did the right thing (e.g., standardized names that each provider would map to its own internal names)?

jbartosik commented 2 years ago

Yes, the VPA recommender would use the MetricsClient interface to get metrics.

I'm not sure I understand your question about misconfiguration.

Which of these two approaches are you considering?

  1. Allow users to configure which metrics to use at the per-VPA-object level.
  2. Allow configuring the VPA recommender which metrics to use (for all the VPA objects it generates recommendations for).

lallydd commented 2 years ago

For configuration, different measurement systems (e.g., metrics-server, Datadog, others) will likely have different names for container CPU/memory usage, so we'll have to configure that somehow.

As for the two approaches, I think (2) is likely necessary: we shouldn't require each VerticalPodAutoscaler to carry extra configuration when it's the same for every instance in a cluster. As for (1), I think that's a separate and longer discussion.

lallydd commented 2 years ago

Clarifying: when calling

 GetExternalMetric(metricName string, namespace string, selector labels.Selector)

The metricName arg needs a value, and we have to call it twice: once for CPU metrics, once for RAM.

So, if the recommender took three more command-line args:

  --use-external-metrics
  --external-metrics-cpu-metric=kubernetes.cpu.usage.total
  --external-metrics-memory-metric=kubernetes.memory.usage

This could configure the recommender to both call the external metrics provider and use the right values for the metricName parameter of GetExternalMetric. I used the appropriate Datadog metric names here as an example.
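
As a rough sketch, the wiring could look like this; the plumbing below is illustrative, and only GetExternalMetric's signature comes from the interface discussed above:

package metrics

import (
    "flag"
    "time"

    "k8s.io/apimachinery/pkg/labels"
)

var (
    useExternalMetrics = flag.Bool("use-external-metrics", false,
        "Use an External Metrics Provider instead of metrics-server.")
    externalCPUMetric = flag.String("external-metrics-cpu-metric", "",
        "External metric name to use for CPU usage.")
    externalMemoryMetric = flag.String("external-metrics-memory-metric", "",
        "External metric name to use for memory usage.")
)

// externalMetricsClient stands in for whatever client ends up implementing
// GetExternalMetric from the HPA metrics interface quoted earlier.
type externalMetricsClient interface {
    GetExternalMetric(metricName string, namespace string, selector labels.Selector) ([]int64, time.Time, error)
}

// fetchUsage makes the two calls described above: one for CPU, one for memory.
func fetchUsage(c externalMetricsClient, namespace string, selector labels.Selector) (cpu, mem []int64, err error) {
    cpu, _, err = c.GetExternalMetric(*externalCPUMetric, namespace, selector)
    if err != nil {
        return nil, nil, err
    }
    mem, _, err = c.GetExternalMetric(*externalMemoryMetric, namespace, selector)
    if err != nil {
        return nil, nil, err
    }
    return cpu, mem, nil
}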

jbartosik commented 2 years ago

Sounds good to me. We could look at the HPA's e2e tests here (dashboard) and add similar tests to ensure we notice if the feature breaks.

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

lallydd commented 1 year ago

/remove-lifecycle stale

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

lallydd commented 1 year ago

/remove-lifecycle stale


k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 1 year ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes/autoscaler/issues/5153#issuecomment-1624331629):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues according to the following rules:
>
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
>
> - Reopen this issue with `/reopen`
> - Mark this issue as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close not-planned

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

jbartosik commented 1 year ago

/remove-lifecycle rotten