knative / serving

Kubernetes-based, scale-to-zero, request-driven compute
https://knative.dev/docs/serving/
Apache License 2.0

metrics.reporting-period-seconds doesn't work #15435

Open vividcloudpark opened 1 month ago

vividcloudpark commented 1 month ago

What version of Knative?

1.14.1

Expected Behavior

https://knative.dev/docs/serving/observability/metrics/collecting-metrics/#understanding-the-collector https://knative.dev/docs/serving/services/service-metrics/#exposing-queue-proxy-metrics

Per these articles, I expected each metric to be reported at a 30s interval when I set metrics.reporting-period-seconds to 30 in config-observability, even with the Prometheus scrape interval set to 10s.

Actual Behavior

The value in Prometheus changed at a 10s interval (if the config worked, the value should change at a 30s interval). Even when I set the Prometheus scrape interval to 25s, the value changed at a 25s interval.

I restarted both the autoscaler and activator deployments, but it still doesn't work. It looks like metrics.request-metrics-reporting-period-seconds doesn't work.

Steps to Reproduce the Problem

Set config-observability as below:

  metrics.reporting-period-seconds: "30"
  metrics.request-metrics-reporting-period-seconds: "30"

Set prometheus.yaml as below:

    - job_name: activator
      scrape_interval: 25s
      scrape_timeout: 10s
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_pod_label_app, __meta_kubernetes_pod_container_port_name]
        action: keep
        regex: knative-serving;activator;metrics
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      - source_labels: [__meta_kubernetes_service_name]
        target_label: service

skonto commented 1 month ago

Hi @vividcloudpark

Initially the reporting period was fixed for the OpenTelemetry collector (push model, see https://github.com/knative/serving/pull/14019). I think we need to update the docs, because for the Prometheus exporter this setting has no effect even though the reporting period (metrics.request-metrics-reporting-period-seconds) is set correctly. Keep in mind that we use OpenCensus and unfortunately the library is now archived. Also, I am not sure it makes sense to scrape a pod multiple times when you know that the metrics are not being updated (scrape period << reporting period). Note as well that metrics.request-metrics-reporting-period-seconds was meant to configure the queue-proxy (QP) only, unlike metrics.reporting-period-seconds, which is meant for all the other components.

Now, here is why the reporting period has no effect for Prometheus. When the reporting period is changed in the exporter, it is set here: https://github.com/knative/serving/blob/main/vendor/go.opencensus.io/stats/view/worker_commands.go#L178-L185 (I verified that part). Then, once every reporting period, the worker tries to export metrics: https://github.com/knative/serving/blob/main/vendor/go.opencensus.io/stats/view/worker.go#L296. ReportUsage calls reportView, which in turn calls exportView: https://github.com/knative/serving/blob/main/vendor/go.opencensus.io/stats/view/worker.go#L376.

The Prometheus and ocagent (used with the OpenTelemetry collector) exporters have different implementations. The ocagent exporter does ship the metrics: https://github.com/knative/serving/blob/main/vendor/contrib.go.opencensus.io/exporter/ocagent/ocagent.go#L436 while the Prometheus one does not export anything at that point, because it follows the pull model: https://github.com/knative/serving/blob/main/vendor/contrib.go.opencensus.io/exporter/prometheus/prometheus.go#L102-L110

// Deprecated: in lieu of metricexport.Reader interface.
func (e *Exporter) ExportView(vd *view.Data) {
}
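
To make the difference concrete, here is a minimal, stdlib-only sketch of the two behaviours. It is not the OpenCensus code: counter, pushExporter, pullExporter and simulate are hypothetical names made up for illustration. A counter is incremented once per simulated second, both exporters get ExportView called once per reporting period, and only the pull exporter is scraped.

```go
package main

import "fmt"

// counter stands in for a recorded metric (hypothetical, not an OpenCensus type).
type counter struct{ n int }

func (c *counter) Inc() { c.n++ }

// exporter is the hook the reporting worker calls once per reporting period.
type exporter interface{ ExportView(snapshot int) }

// pushExporter ships a snapshot every time the worker calls ExportView.
type pushExporter struct{ shipped []int }

func (p *pushExporter) ExportView(snapshot int) { p.shipped = append(p.shipped, snapshot) }

// pullExporter ignores ExportView and reads the live value at scrape time.
type pullExporter struct{ src *counter }

func (p *pullExporter) ExportView(int) {}
func (p *pullExporter) Scrape() int    { return p.src.n }

// simulate runs `total` one-second steps and returns what the push exporter
// shipped and what each scrape of the pull exporter observed.
func simulate(total, reportEvery, scrapeEvery int) (pushed, scraped []int) {
	c := &counter{}
	push := &pushExporter{}
	pull := &pullExporter{src: c}
	exporters := []exporter{push, pull}
	for t := 1; t <= total; t++ {
		c.Inc()
		if t%reportEvery == 0 {
			for _, e := range exporters {
				e.ExportView(c.n)
			}
		}
		if t%scrapeEvery == 0 {
			scraped = append(scraped, pull.Scrape())
		}
	}
	return push.shipped, scraped
}

func main() {
	pushed, scraped := simulate(60, 30, 10)
	fmt.Println(pushed)  // [30 60]
	fmt.Println(scraped) // [10 20 30 40 50 60]
}
```

With a 30s reporting period and a 10s scrape interval, the pushed values change every 30s while the scraped values change on every scrape, which matches the behaviour reported in this issue.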

Now, when we create the Prometheus exporter we do use the reader interface, but it does nothing except get called to export all metrics whenever an HTTP request is made, e.g. when Prometheus scrapes. Read more here: https://github.com/knative/serving/blob/main/vendor/contrib.go.opencensus.io/exporter/prometheus/prometheus.go#L137-L139.

// Collect is invoked every time a prometheus.Gatherer is run
// for example when the HTTP endpoint is invoked by Prometheus.
func (c *collector) Collect(ch chan<- prometheus.Metric) {
    me := &metricExporter{c: c, metricCh: ch}
    c.reader.ReadAndExport(me)
}
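
The same idea at the HTTP level, as a rough sketch using only the Go standard library. The requestCount variable, metricsHandler and scrape helper are assumptions made up for this example and are not part of Knative or the Prometheus client. The handler plays the role of Collect: it reads the current value at the moment a request arrives, so no timer is involved.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// requestCount is a hypothetical metric updated by application code.
var requestCount int64

// metricsHandler reads the metric when the request arrives, the way Collect
// reads and exports on every scrape. No reporting period is consulted.
func metricsHandler(w http.ResponseWriter, _ *http.Request) {
	fmt.Fprintf(w, "requests_total %d\n", requestCount)
}

// scrape issues one in-memory GET against the handler and returns the body.
func scrape() string {
	rec := httptest.NewRecorder()
	metricsHandler(rec, httptest.NewRequest(http.MethodGet, "/metrics", nil))
	return rec.Body.String()
}

func main() {
	requestCount = 3
	fmt.Print(scrape()) // requests_total 3
	requestCount = 5
	fmt.Print(scrape()) // requests_total 5
}
```

Two scrapes taken back to back already see different values, regardless of any configured reporting period.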

Note: There is an IntervalReader that calls ReadAndExport periodically (by default every minute) https://github.com/knative/serving/blob/main/vendor/go.opencensus.io/metric/metricexport/reader.go#L148 but the Prometheus exporter provided by the OpenCensus library does not use it https://github.com/knative/serving/blob/main/vendor/contrib.go.opencensus.io/exporter/prometheus/prometheus.go#L148. It uses the simple reader instead, because it relies on the HTTP call to report metrics.
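
For contrast, an interval-driven reader could look roughly like the sketch below. This is not the OpenCensus IntervalReader API: runIntervalReader and collectN are hypothetical helpers, and a pre-filled channel replaces a real ticker so that the example is deterministic.

```go
package main

import (
	"fmt"
	"time"
)

// runIntervalReader is a hypothetical stand-in for an interval-driven reader:
// on every tick it reads the current value and hands it to export.
// It returns when the ticks channel is closed.
func runIntervalReader(ticks <-chan time.Time, read func() int, export func(int)) {
	for range ticks {
		export(read())
	}
}

// collectN feeds n ticks to the reader and returns what was exported.
// A real program would pass time.NewTicker(period).C instead.
func collectN(n int) []int {
	ticks := make(chan time.Time, n)
	for i := 0; i < n; i++ {
		ticks <- time.Now()
	}
	close(ticks)

	value := 0
	var exported []int
	runIntervalReader(ticks,
		func() int { value += 10; return value }, // metric grows between ticks
		func(v int) { exported = append(exported, v) },
	)
	return exported
}

func main() {
	fmt.Println(collectN(3)) // [10 20 30]
}
```

Here the export cadence is owned by the ticker, so a reporting period would matter. In the pull model the cadence is owned by whoever sends the HTTP request.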