kubernetes / autoscaler

Autoscaling components for Kubernetes
Apache License 2.0

Timing issue when loading history from AWS AMP #6050

Open Evedel opened 10 months ago

Evedel commented 10 months ago

Which component are you using?: Vertical Pod Autoscaler (Recommender only)

Is your feature request designed to solve a problem? If so, describe the problem this feature should solve.: In my case, the VPA runs with AWS AMP as the history provider. The pods also use IAM role-based permissions. That means the VPA recommender Deployment consists of two containers (sketched below):

  1. aws-sigv4-proxy (repo, aws docs) with:
      - --host
      - aps-workspaces.${REGION}.amazonaws.com
      - --port
      - :8005
  2. VPA recommender with:
       - --storage=prometheus
       - --prometheus-address=http://localhost:8005/workspaces/${WORKSPACE_ID}

This works when the proxy starts faster than the recommender: there are no errors in the logs at any verbosity level, and memory consumption is ~2Gi.
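For context, here is a minimal sketch of such a Deployment. Container names, image references, and the ServiceAccount name are illustrative assumptions, not the exact manifest used here, and only the proxy flags mentioned above are shown:

```yaml
# Sketch of a VPA recommender Deployment with an aws-sigv4-proxy sidecar.
# Names and image tags are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vpa-recommender
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vpa-recommender
  template:
    metadata:
      labels:
        app: vpa-recommender
    spec:
      serviceAccountName: vpa-recommender   # IAM-role-annotated ServiceAccount (assumed name)
      containers:
        - name: aws-sigv4-proxy
          image: public.ecr.aws/aws-observability/aws-sigv4-proxy:latest   # placeholder tag
          args:                              # only the flags mentioned above; other signing flags omitted
            - --host
            - aps-workspaces.${REGION}.amazonaws.com
            - --port
            - ":8005"
        - name: recommender
          image: registry.k8s.io/autoscaling/vpa-recommender:1.0.0         # placeholder tag
          args:
            - --storage=prometheus
            - --prometheus-address=http://localhost:8005/workspaces/${WORKSPACE_ID}
```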

However, there is a timing issue. Roughly half of the time the recommender starts before the proxy, tries to load history, fails on the first query with a connection error, and gives up on loading history entirely. Memory consumption is then ~50Mi. Recommendations are still produced within approximately the same ranges.

The logs show the following error:

Cannot get cluster history: cannot get usage history: cannot get timeseries for cpu: Post "http://localhost:8005/workspaces/${WORKSPACE_ID}/api/v1/query_range": dial tcp localhost:80: connect: connection refused

The error comes from this call chain: https://github.com/kubernetes/autoscaler/blob/e1b03fac9958791790bfc18eeba9fab5cac0ccc1/vertical-pod-autoscaler/pkg/recommender/main.go#L188

https://github.com/kubernetes/autoscaler/blob/e1b03fac9958791790bfc18eeba9fab5cac0ccc1/vertical-pod-autoscaler/pkg/recommender/input/cluster_feeder.go#L199

https://github.com/kubernetes/autoscaler/blob/e1b03fac9958791790bfc18eeba9fab5cac0ccc1/vertical-pod-autoscaler/pkg/recommender/input/history/history_provider.go#L216
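As a workaround for the race itself, container startup order can be enforced on the deployment side, for example with a native sidecar. A minimal sketch, assuming Kubernetes 1.29+ with the SidecarContainers feature (names and image tags are the same placeholders as above):

```yaml
# Sketch: run the proxy as a restartable init container so the kubelet
# waits for its startup probe to succeed before starting the recommender.
spec:
  template:
    spec:
      initContainers:
        - name: aws-sigv4-proxy
          image: public.ecr.aws/aws-observability/aws-sigv4-proxy:latest   # placeholder tag
          restartPolicy: Always              # marks this init container as a sidecar
          args:
            - --host
            - aps-workspaces.${REGION}.amazonaws.com
            - --port
            - ":8005"
          startupProbe:
            tcpSocket:
              port: 8005                     # succeeds once the proxy is listening
            periodSeconds: 1
            failureThreshold: 30
      containers:
        - name: recommender
          image: registry.k8s.io/autoscaling/vpa-recommender:1.0.0         # placeholder tag
          args:
            - --storage=prometheus
            - --prometheus-address=http://localhost:8005/workspaces/${WORKSPACE_ID}
```

With this ordering the recommender's first history query should no longer hit a refused connection, although an explicit flag would still make the behaviour deterministic.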

Describe the solution you'd like.: As the recommender works fine without historical data, would it be possible to add an argument to explicitly skip history initialisation?

An alternative (and opposite) solution might be to strictly require Prometheus history initialisation to succeed, so that a failure like the one above is not silently swallowed.

Please let me know what you think. Also, I would be keen to implement/contribute the solution.
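On the manifest side, the proposal could look roughly like this; the --history-load flag and its values are purely hypothetical and do not exist in the recommender today:

```yaml
# Hypothetical recommender args illustrating the proposal; --history-load
# is not a real flag and is shown only to make the two options concrete.
args:
  - --storage=prometheus
  - --prometheus-address=http://localhost:8005/workspaces/${WORKSPACE_ID}
  - --history-load=skip        # proposed: skip history initialisation explicitly
  # - --history-load=require   # proposed alternative: fail hard if history cannot be loaded
```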

k8s-triage-robot commented 5 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 4 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

Shubham82 commented 4 months ago

/remove-lifecycle rotten
/triage accepted