[vpa] Documentation for configuration options

chris-vest commented 3 years ago

Which component are you using?:

VPA recommender.

Is your feature request designed to solve a problem? If so describe the problem this feature should solve.:

Better user experience and easier debugging.

Describe the solution you'd like.:

Documentation for these VPA recommender configuration options:

--add-dir-header="false"
--address=":8942"
--alsologtostderr="false"
--checkpoints-gc-interval="10m0s"
--checkpoints-timeout="1m0s"
--container-name-label="name"
--container-namespace-label="namespace"
--container-pod-name-label="pod_name"
--cpu-histogram-decay-half-life="24h0m0s"
--history-length="8d"
--history-resolution="1h"
--kube-api-burst="10"
--kube-api-qps="5"
--log-backtrace-at=":0"
--log-dir=""
--log-file=""
--log-file-max-size="1800"
--logtostderr="true"
--memory-aggregation-interval="24h0m0s"
--memory-aggregation-interval-count="8"
--memory-histogram-decay-half-life="24h0m0s"
--memory-saver="false"
--metric-for-pod-labels="up{job=\"kubernetes-pods\"}"
--min-checkpoints="10"
--pod-label-prefix="pod_label_"
--pod-name-label="kubernetes_pod_name"
--pod-namespace-label="kubernetes_namespace"
--pod-recommendation-min-cpu-millicores="15"
--pod-recommendation-min-memory-mb="100"
--prometheus-address="http://thanos-querier.monitoring.svc.cluster.local:9090"
--prometheus-cadvisor-job-name="kubernetes-nodes-cadvisor"
--prometheus-query-timeout="5m"
--recommendation-margin-fraction="0.15"
--recommender-interval="1m0s"
--skip-headers="false"
--skip-log-headers="false"
--stderrthreshold="2"
--storage="prometheus"
--v="6"
--vmodule=""
--vpa-object-namespace=""

Granted, some of these are pretty self-explanatory, but some are not obvious. For example, the pod-label-prefix configuration option: how is it used, and do I need to configure it? I know other people might wonder the same, because I certainly did. Users shouldn't have to dig through the code in order to understand what these options do.

Describe any alternative solutions you've considered.:

Little to no documentation, as it stands now - I feel like that's not an ideal scenario.

bskiba commented 3 years ago

We do not have up-to-date documentation of the parameters (I suppose it would get out of date very quickly), but you can run the binary with the --help option to get the flag descriptions:

docker run -it k8s.gcr.io/autoscaling/vpa-recommender:0.9.0 ./vpa-recommender --help

fejta-bot commented 3 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale

fejta-bot commented 3 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten

jppitout commented 3 years ago

Please could we get more complete documentation on setting up Prometheus as a history provider for the VPA recommender component?

For example how to customize it and verify that it is indeed working? Also including whether CPU and memory queries are customizable or not?

It would be nice if the documentation could include which jobs get queried e.g. is it the "kubernetes-nodes-cadvisor" and "kubernetes-pods", just the one, or are there more?

Running the previously recommended command:

docker run -it k8s.gcr.io/autoscaling/vpa-recommender:0.9.2 ./vpa-recommender --help

The descriptions of these options are too similar i.e. "Label name to look for container names"... are they all looking for container names (or is one looking for pod names)?... are they used in conjunction or either/or? :

      --container-name-label string                   Label name to look for container names (default "name")
      --container-namespace-label string              Label name to look for container names (default "namespace")
      --container-pod-name-label string               Label name to look for container names (default "pod_name")
      --pod-name-label string                         Label name to look for container names (default "kubernetes_pod_name")
      --pod-namespace-label string                    Label name to look for container names (default "kubernetes_namespace")

I'm having trouble wrapping my head around when above and below options should be used:

      --metric-for-pod-labels string                  Which metric to look for pod labels in metrics (default "up{job=\"kubernetes-pods\"}")
      --pod-label-prefix string                       Which prefix to look for pod labels in metrics (default "pod_label_")

Would it be possible to provide examples and/or elaborate on all of the above?
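
One way to check what a candidate --metric-for-pod-labels expression actually returns is to run it as an instant query against the Prometheus HTTP API (the /api/v1/query endpoint) and inspect the label names on the resulting series. The sketch below only builds the URL; the address and the expression are the values from the flag dump at the top of this issue, and the helper name instant_query_url is hypothetical. It does not show how the recommender itself issues queries.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def instant_query_url(prometheus_address: str, query: str) -> str:
    """Build a Prometheus HTTP API instant-query URL for a PromQL expression."""
    base = prometheus_address.rstrip("/")
    return f"{base}/api/v1/query?{urlencode({'query': query})}"


# Values copied from the flag dump in this issue; adjust for your cluster.
url = instant_query_url(
    "http://thanos-querier.monitoring.svc.cluster.local:9090",
    'up{job="kubernetes-pods"}',
)
print(url)

# Round-trip check: the URL-encoded query decodes back to the original expression.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(decoded)
```

Fetching that URL (for example with curl from inside the cluster) should return a JSON document whose series carry the labels that --pod-name-label, --pod-namespace-label and --pod-label-prefix are expected to match. If the result is empty, the default expression probably does not fit the local scrape configuration.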

jppitout commented 3 years ago

Here are some instances where such docs might have helped:

vertical-pod-autoscaler: how to configure for integration with prometheus? https://github.com/kubernetes/autoscaler/issues/1551
[BUG] VPA Recommender InitFromHistoryProvider not working and logs filled with "Error adding metric sample" warnings https://github.com/kubernetes/autoscaler/issues/3376
[VPA] Prometheus labels for cadvisor have changed https://github.com/kubernetes/autoscaler/issues/3439

jppitout commented 3 years ago

/remove-lifecycle rotten

k8s-triage-robot commented 3 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 3 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

davidquarles commented 3 years ago

> Please could we get more complete documentation on setting up Prometheus as a history provider for the VPA recommender component?
>
> For example how to customize it and verify that it is indeed working? Also including whether CPU and memory queries are customizable or not?
>
> It would be nice if the documentation could include which jobs get queried e.g. is it the "kubernetes-nodes-cadvisor" and "kubernetes-pods", just the one, or are there more?
>
> Running the previously recommended command:
>
> docker run -it k8s.gcr.io/autoscaling/vpa-recommender:0.9.2 ./vpa-recommender --help
>
> The descriptions of these options are too similar i.e. "Label name to look for container names"... are they all looking for container names (or is one looking for pod names)?... are they used in conjunction or either/or? :
>
>       --container-name-label string                   Label name to look for container names (default "name")
>       --container-namespace-label string              Label name to look for container names (default "namespace")
>       --container-pod-name-label string               Label name to look for container names (default "pod_name")
>       --pod-name-label string                         Label name to look for container names (default "kubernetes_pod_name")
>       --pod-namespace-label string                    Label name to look for container names (default "kubernetes_namespace")
>
> I'm having trouble wrapping my head around when above and below options should be used:
>
>       --metric-for-pod-labels string                  Which metric to look for pod labels in metrics (default "up{job=\"kubernetes-pods\"}")
>       --pod-label-prefix string                       Which prefix to look for pod labels in metrics (default "pod_label_")
>
> Would it be possible to provide examples and/or elaborate on all of the above?

I am / we are absurdly grateful for this project and the value it provides, having used it extensively over the last few years, but after fighting the Prometheus integration setup for the first time for a while last night, I agree with this. It is hard to figure out what is going on with these options, and doing so requires a detailed analysis of the underlying codebase. Even after doing so, I wasn't successful.

It also isn't super clear what happens to the existing checkpoints when migrating storage backends and what behavior one can expect to occur in this process, which is a bit scary given that we've already littered our production environment with VPA.

My fragile understanding thus far:

The cadvisor metrics are range-queried in bulk and the --container-*-label flags map to the labels in those metrics
Those metrics are matched against pod series contained in the --metric-for-pod-labels metric / query, i.e. container-namespace-label == pod-namespace-label && container-pod-name-label == pod-name-label
Additional labels found on the metric-for-pod-labels series are parsed into memory (anything prefixed by pod-label-prefix, i.e. with --pod-label-prefix=label_, label_foo="bar" => foo: bar)
I'm guessing the labels are then used to match against the VPA target's selector? I did not get that far up the callstack when stepping through the code.

Is that all accurate? I tried using the kube-state-metrics kube_pod_labels metric for --metric-for-pod-labels, since our Prometheus config only scrapes pods with the scrape annotation and the default up{job="kubernetes-pods"} is thus filtered, but something is still amiss and I was seeing lots of these before I gave up for the evening:

Error adding metric sample for container {{velero velero-6778d944c5-t5xqj} velero}: sample discarded (invalid or out of order)
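
The matching described in the list above can be sketched roughly as follows. This illustrates the understanding stated in this comment, not the recommender's actual code. The flag values are the defaults from the --help output, the helper names are hypothetical, and the sample series (including the pod_label_app label) are made up for the example.

```python
from typing import Dict, List, Tuple

Labels = Dict[str, str]

# Flag values: defaults from the recommender's --help output.
CONTAINER_NAME_LABEL = "name"                  # --container-name-label
CONTAINER_NAMESPACE_LABEL = "namespace"        # --container-namespace-label
CONTAINER_POD_NAME_LABEL = "pod_name"          # --container-pod-name-label
POD_NAME_LABEL = "kubernetes_pod_name"         # --pod-name-label
POD_NAMESPACE_LABEL = "kubernetes_namespace"   # --pod-namespace-label
POD_LABEL_PREFIX = "pod_label_"                # --pod-label-prefix


def pod_labels_by_id(pod_series: List[Labels]) -> Dict[Tuple[str, str], Labels]:
    """Index pod series by (namespace, pod name).

    Only labels starting with POD_LABEL_PREFIX are kept, with the prefix stripped.
    """
    index: Dict[Tuple[str, str], Labels] = {}
    for series in pod_series:
        key = (series[POD_NAMESPACE_LABEL], series[POD_NAME_LABEL])
        index[key] = {
            name[len(POD_LABEL_PREFIX):]: value
            for name, value in series.items()
            if name.startswith(POD_LABEL_PREFIX)
        }
    return index


def join_container_to_pod(
    container_series: Labels, index: Dict[Tuple[str, str], Labels]
) -> Tuple[str, Labels]:
    """Return (container name, pod labels) for one cAdvisor series."""
    key = (
        container_series[CONTAINER_NAMESPACE_LABEL],
        container_series[CONTAINER_POD_NAME_LABEL],
    )
    return container_series[CONTAINER_NAME_LABEL], index.get(key, {})


# Hypothetical series, shaped like what --metric-for-pod-labels and cAdvisor might return.
pod_series = [{
    "kubernetes_namespace": "velero",
    "kubernetes_pod_name": "velero-6778d944c5-t5xqj",
    "pod_label_app": "velero",
    "job": "kubernetes-pods",
}]
container_series = {
    "namespace": "velero",
    "pod_name": "velero-6778d944c5-t5xqj",
    "name": "velero",
}

index = pod_labels_by_id(pod_series)
container, labels = join_container_to_pod(container_series, index)
print(container, labels)
```

If the label names on either side do not match the flag values (for example, cAdvisor exposing pod instead of pod_name), the join key is never found. Under this reading, that would leave samples without a matching pod.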

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue or PR with /reopen
Mark this issue or PR as fresh with /remove-lifecycle rotten
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-ci-robot commented 2 years ago

@k8s-triage-robot: Closing this issue.

In response to [this](https://github.com/kubernetes/autoscaler/issues/3784#issuecomment-973178199):

>The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
>This bot triages issues and PRs according to the following rules:
>- After 90d of inactivity, `lifecycle/stale` is applied
>- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
>- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
>You can:
>- Reopen this issue or PR with `/reopen`
>- Mark this issue or PR as fresh with `/remove-lifecycle rotten`
>- Offer to help out with [Issue Triage][1]
>
>Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
>/close
>
>[1]: https://www.kubernetes.dev/docs/guide/issue-triage/

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

michaelswierszcz commented 2 years ago

Usage of /recommender:
      --add-dir-header                                If true, adds the file directory to the header
      --address string                                The address to expose Prometheus metrics. (default ":8942")
      --alsologtostderr                               log to standard error as well as files
      --checkpoints-gc-interval duration              How often orphaned checkpoints should be garbage collected (default 10m0s)
      --checkpoints-timeout duration                  Timeout for writing checkpoints since the start of the recommender's main loop (default 1m0s)
      --container-name-label string                   Label name to look for container names (default "name")
      --container-namespace-label string              Label name to look for container names (default "namespace")
      --container-pod-name-label string               Label name to look for container names (default "pod_name")
      --cpu-histogram-decay-half-life duration        The amount of time it takes a historical CPU usage sample to lose half of its weight. (default 24h0m0s)
      --history-length string                         How much time back prometheus have to be queried to get historical metrics (default "8d")
      --history-resolution string                     Resolution at which Prometheus is queried for historical metrics (default "1h")
      --kube-api-burst float                          QPS burst limit when making requests to Kubernetes apiserver (default 10)
      --kube-api-qps float                            QPS limit when making requests to Kubernetes apiserver (default 5)
      --log-backtrace-at traceLocation                when logging hits line file:N, emit a stack trace (default :0)
      --log-dir string                                If non-empty, write log files in this directory
      --log-file string                               If non-empty, use this log file
      --log-file-max-size uint                        Defines the maximum size a log file can grow to. Unit is megabytes. If the value is 0, the maximum file size is unlimited. (default 1800)
      --logtostderr                                   log to standard error instead of files (default true)
      --memory-aggregation-interval duration          The length of a single interval, for which the peak memory usage is computed. Memory usage peaks are aggregated in multiples of this interval. In other words there is one memory usage sample per interval (the maximum usage over that interval) (default 24h0m0s)
      --memory-aggregation-interval-count int         The number of consecutive memory-aggregation-intervals which make up the MemoryAggregationWindowLength which in turn is the period for memory usage aggregation by VPA. In other words, MemoryAggregationWindowLength = memory-aggregation-interval * memory-aggregation-interval-count. (default 8)
      --memory-histogram-decay-half-life duration     The amount of time it takes a historical memory usage sample to lose half of its weight. In other words, a fresh usage sample is twice as 'important' as one with age equal to the half life period. (default 24h0m0s)
      --memory-saver                                  If true, only track pods which have an associated VPA
      --metric-for-pod-labels string                  Which metric to look for pod labels in metrics (default "up{job=\"kubernetes-pods\"}")
      --min-checkpoints int                           Minimum number of checkpoints to write per recommender's main loop (default 10)
      --pod-label-prefix string                       Which prefix to look for pod labels in metrics (default "pod_label_")
      --pod-name-label string                         Label name to look for container names (default "kubernetes_pod_name")
      --pod-namespace-label string                    Label name to look for container names (default "kubernetes_namespace")
      --pod-recommendation-min-cpu-millicores float   Minimum CPU recommendation for a pod (default 25)
      --pod-recommendation-min-memory-mb float        Minimum memory recommendation for a pod (default 250)
      --prometheus-address string                     Where to reach for Prometheus metrics
      --prometheus-cadvisor-job-name string           Name of the prometheus job name which scrapes the cAdvisor metrics (default "kubernetes-cadvisor")
      --prometheus-query-timeout string               How long to wait before killing long queries (default "5m")
      --recommendation-margin-fraction float          Fraction of usage added as the safety margin to the recommended request (default 0.15)
      --recommender-interval duration                 How often metrics should be fetched (default 1m0s)
      --skip-headers                                  If true, avoid header prefixes in the log messages
      --skip-log-headers                              If true, avoid headers when opening log files
      --stderrthreshold severity                      logs at or above this threshold go to stderr (default 2)
      --storage string                                Specifies storage mode. Supported values: prometheus, checkpoint (default)
  -v, --v Level                                       number for the log level verbosity
      --vmodule moduleSpec                            comma-separated list of pattern=N settings for file-filtered logging
      --vpa-object-namespace string                   Namespace to search for VPA objects and pod stats. Empty means all namespaces will be used.
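
A few of the numeric flags above are easier to read with the arithmetic written out. The sketch below follows only the wording of the help text: the window length formula is quoted there directly, the decay weight assumes plain exponential decay (which is what "loses half of its weight" per half-life suggests), and the margin assumes the fraction is simply added on top of usage. The function names are hypothetical and this is not the recommender's implementation.

```python
from datetime import timedelta

# Defaults from the help output above.
memory_aggregation_interval = timedelta(hours=24)   # --memory-aggregation-interval
memory_aggregation_interval_count = 8               # --memory-aggregation-interval-count
half_life = timedelta(hours=24)                     # --cpu/memory-histogram-decay-half-life
recommendation_margin_fraction = 0.15               # --recommendation-margin-fraction

# MemoryAggregationWindowLength = interval * count (stated in the help text).
memory_aggregation_window = memory_aggregation_interval * memory_aggregation_interval_count


def decay_weight(age: timedelta, half_life: timedelta) -> float:
    """Relative weight of a sample of the given age, assuming exponential decay."""
    return 0.5 ** (age / half_life)


def with_margin(usage: float, fraction: float) -> float:
    """Usage plus a safety margin expressed as a fraction of that usage."""
    return usage * (1.0 + fraction)


print(memory_aggregation_window)                      # 8 days of memory peaks
print(decay_weight(timedelta(hours=24), half_life))   # one half-life old
print(decay_weight(timedelta(hours=48), half_life))   # two half-lives old
print(with_margin(200.0, recommendation_margin_fraction))
```

With the defaults, a sample one day old counts half as much as a fresh one, and a sample two days old counts a quarter as much. The memory window covers eight one-day peaks.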

jianlong0808 commented 6 months ago

I think this source code can explain

[vpa] Documentation for configuration options #3784