kubecost / features-bugs

A public repository for filing Kubecost feature requests and bugs. Please read the issue guidelines before filing an issue here.

[Feature] Support `algorithmCPU` parameter for "Continuous Request Right-Sizing" #23

Open janslow opened 8 months ago

janslow commented 8 months ago

Problem Statement

"Continuous Request Right-Sizing" currently uses the max algorithm for its recommendations, which causes services with high start-up CPU usage to be over-provisioned.

For example, some of our services spike to ~2 cores at start-up, then drop down to ~0.3 cores when stable. This has two negative effects:

  1. The CPU request is over-provisioned, increasing cost (for some services, our CPU efficiency has dropped to 1% since enabling this feature)
  2. If a pod is left running for longer than the window, the start-up spike is no longer considered, so the pod is right-sized down to a reasonable 0.3 CPU. However, that causes the pods to be re-created, which re-introduces the start-up spike, so they are then "right-sized" back up to 2 CPU
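
To illustrate the difference between the two algorithms, here is a minimal sketch using made-up usage samples. The sample data and the nearest-rank quantile method are assumptions for illustration only, not how Kubecost actually computes its recommendations.

```python
import math


def quantile_nearest_rank(samples, q):
    """Return the q-th quantile of samples using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    ordered = sorted(samples)
    rank = math.ceil(q * len(ordered))  # 1-based rank into the sorted samples
    return ordered[rank - 1]


# Hypothetical CPU usage in cores: 5 start-up samples, then 295 steady-state samples.
cpu_usage = [2.0] * 5 + [0.3] * 295

max_recommendation = max(cpu_usage)
quantile_recommendation = quantile_nearest_rank(cpu_usage, 0.95)

print(f"max: {max_recommendation} cores")
print(f"quantile (q=0.95): {quantile_recommendation} cores")
```

Because the spike covers fewer than 5% of the samples, the 0.95 quantile lands on the steady-state value, while max is driven entirely by start-up.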

Solution Description

Introduce `cpu.request.autoscaling.kubecost.com/algorithm` and `cpu.request.autoscaling.kubecost.com/q` annotations (or similar) to allow the `algorithmCPU` and `qCPU` right-sizing recommendation parameters to be set on a per-workload basis.

It probably makes sense to introduce this for memory as well, for consistency.
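
As a rough sketch of how per-workload annotations could map onto recommendation parameters: the CPU annotation names are the ones proposed above, while the memory annotation names, the `algorithmRAM`/`qRAM` targets, the set of allowed algorithms, and the `recommendation_params` helper are all assumptions for illustration, not existing Cluster Controller behaviour.

```python
# CPU annotation names follow the proposal; the memory entries are hypothetical.
ANNOTATION_TO_PARAM = {
    "cpu.request.autoscaling.kubecost.com/algorithm": "algorithmCPU",
    "cpu.request.autoscaling.kubecost.com/q": "qCPU",
    "memory.request.autoscaling.kubecost.com/algorithm": "algorithmRAM",
    "memory.request.autoscaling.kubecost.com/q": "qRAM",
}

# Assumed set of valid algorithm names.
ALLOWED_ALGORITHMS = {"max", "quantile"}


def recommendation_params(annotations):
    """Translate workload annotations into recommendation query parameters."""
    params = {}
    for annotation, param in ANNOTATION_TO_PARAM.items():
        value = annotations.get(annotation)
        if value is None:
            continue  # Annotation not set: leave the default in place.
        if param.startswith("algorithm"):
            if value not in ALLOWED_ALGORITHMS:
                raise ValueError(f"unsupported algorithm {value!r} in {annotation}")
        else:
            q = float(value)  # Raises ValueError if the value is not numeric.
            if not 0 < q <= 1:
                raise ValueError(f"{annotation} must be in (0, 1], got {value!r}")
        params[param] = value
    return params


workload_annotations = {
    "cpu.request.autoscaling.kubecost.com/algorithm": "quantile",
    "cpu.request.autoscaling.kubecost.com/q": "0.95",
    "app.kubernetes.io/name": "example",  # Unrelated annotations are ignored.
}
print(recommendation_params(workload_annotations))
```

Workloads without these annotations would produce an empty result and so keep the current default.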

Alternatives

Allow arbitrary query parameters to be added to the recommendation API requests (e.g., `request.autoscaling.kubecost.com/extraRecommendationParameters: "algorithmCPU=quantile&qCPU=0.95"`)

This could be useful for allowing the use of alpha/experimental parameters, without making it part of the Cluster Controller's API.
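
A minimal sketch of what this alternative might look like: the annotation value is parsed as a query string and merged over whatever parameters the request would otherwise carry. The `build_query` helper and the base parameters shown are hypothetical; only the annotation name and the example value come from the proposal above.

```python
from urllib.parse import parse_qsl, urlencode

EXTRA_PARAMS_ANNOTATION = (
    "request.autoscaling.kubecost.com/extraRecommendationParameters"
)


def build_query(base_params, annotations):
    """Merge extra parameters from the annotation over the base parameters."""
    merged = dict(base_params)
    extra = annotations.get(EXTRA_PARAMS_ANNOTATION, "")
    if extra:
        # strict_parsing=True raises ValueError on malformed pairs such as "foo".
        merged.update(parse_qsl(extra, strict_parsing=True))
    return urlencode(merged)


# Hypothetical base parameters for a recommendation request.
base = {"window": "2d"}
annotations = {EXTRA_PARAMS_ANNOTATION: "algorithmCPU=quantile&qCPU=0.95"}

print(build_query(base, annotations))
```

In this sketch the annotation wins over a base parameter with the same name, which is a design choice that would need deciding either way.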

Additional Context

No response

Troubleshooting

AjayTripathy commented 8 months ago

@michaelmdresser any thoughts here?

michaelmdresser commented 8 months ago

This is well-reasoned; I wanted to add quantile algorithm support to Continuous Request Right-Sizing from the start but did not have time. The primary proposed solution is the one I endorse, though I would also be okay with the alternative.