knative / pkg

Knative common packages
Apache License 2.0
255 stars 330 forks source link

Metric names between OpenCensus & Prometheus exporters are different #2174

Open dprotaso opened 3 years ago

dprotaso commented 3 years ago

/area monitoring /kind cleanup

We currently prepend a 'domain' and the 'component' name to the metrics. This done for our OpenCensus exporter which might have been cargo cult'd from our stack driver exporter implementation. The prometheus exporter only prefixes the 'component' name to the metric.

https://github.com/knative/pkg/blob/cf1994e9cc24db63cbc8a21e1a84152b732ad3f0/metrics/opencensus_exporter.go#L39

We should determine whether these should be consistent, if there should be a prefix and it's format

cc @upodroid @skonto @evankanderson @MontyCarter @pianohacker

dprotaso commented 3 years ago

From @pianohacker re: prepending a domain prefix for prometheus https://knative.slack.com/archives/CEPGNPHP1/p1624558123099100

This level of namespacing on the metric name isn't typically done for Prometheus metrics; see https://prometheus.io/docs/practices/naming/ The source (as opposed to the kind) of a Prometheus-exposed metric is intended to be distinguished by its labels (in this case, namespace). In that vein, could we expose the ability to configure one or more labels on the exposed metrics?

dprotaso commented 3 years ago

cc @evankanderson who reviewed the OpenCensus exporter changes

skonto commented 3 years ago

Regarding Prometheus this is what I read and what we have does not seem to be a violation of the recommendations unless I misread something:

...should have a (single-word) application prefix relevant to the domain the metric belongs to. The prefix is sometimes referred to as namespace by client libraries. For metrics specific to an application, the prefix is usually the application name itself. Sometimes, however, metrics are more generic, like standardized metrics exported by client libraries. Examples:
prometheus_notifications_total (specific to the Prometheus server)
process_cpu_seconds_total (exported by many client libraries)
http_request_duration_seconds (for all HTTP requests)

I am not sure what value the domain adds here we already use the component name as the prefix and we know the domain for each component, it is straightforward for the control plane ones eg. Serving, Eventing. Maybe there is some value for the user services to be able to correlate queue proxy metrics with some domain eg. my_business_domain but then Prom queries will have to detect that with some regex and does not make life easier (especially if we have arbitrary domains as prefixes) eg. for building generic dashboards. Also not though having queue_proxy as a metric prefix works well so far. However, there is a valid need to be able to configure existing/add custom labels eg. tag metrics with special labels to avoid joining metrics info based on kube-state-metrics or other sources. I am facing this downstream as I would like to tag metrics with the func tag for Boson funcs, as not all services are functions. What issues do we see with the current status? Are there users of the Opencensus exporter that benefit from the extra domain info?

dprotaso commented 3 years ago

@skonto @evankanderson during the operation wg meeting mentioned you wrote a doc about stable metric names and a potential deprecation policy. Do you have a link to that?

Whatever we decide with this change we'll want to roll it out gradually with a flag.

evankanderson commented 3 years ago

The domain name is used today by at least Stackdriver, though they could set up a mapping to enrich the content:

https://cloud.google.com/monitoring/api/metrics_knative

While Prometheus tends to prefer short, non-namespaced metric names, some other products like Stackdriver and Wavefront tend to prefer somewhat more nested names (deeply for stackdriver, medium for Wavefront). Based on the principle that it's easier to remove data than to add it, I incorporated the domain and component labels in OpenCensus, though I'd be happy with an alternate mechanism that didn't discard these labels.

Note that there are some metrics-exporting components from both Serving and Eventing that run in a user's namespace, so having some mechanism (metric name or labels) to distinguish metrics between the two would be useful. At the same time, common metrics like go GC information should have the same metric name for both.

Maybe a few examples from our existing documentation:

metric (opencensus name) type prefix reason
serving/activator/request_count HTTP request count Prefix with serving component, because activator request count may be double-counted with e.g. queue-proxy counts
serving/revision/request_count HTTP request count Prefix with revision; counted at the queue-proxy. May double-count with activator.
go_alloc go memory stats No prefix, common go instrumentation across components.
eventing/broker/event_count events received count Prefix with eventing component, because events are counted at several places in the system.
eventing/broker/request_count HTTP request count Prefix with eventing component; not all HTTP requests may result in an event receipt.
eventing/trigger/event_count Post-filter event delivery counts Prefix with eventing component, because events are counted at several places in the system.
eventing/trigger/http_requests Post-filter HTTP request counts Prefix with eventing component; not all HTTP requests may
MontyCarter commented 3 years ago

Yes, Stackdriver is using the domain, and the domain is different based on the component. However, the metric domains are pretty static, so it would be manageable keep a map in an otel collector config and prepend the domain to the metric name based on the sender.

With that said, I think it would be much nicer to pass the domain as a label rather than simply removing the domain from the metric. This would make it easy to rewrite the name from the domain label in an otel collector config.

evankanderson commented 3 years ago

Summary of discussion from 6 July:

dprotaso commented 3 years ago

Direct link to the meeting notes (2021/07/06)

dprotaso commented 3 years ago

We should also look at the standardization going on in https://github.com/open-telemetry/opentelemetry-specification/tree/main/specification/metrics/semantic_conventions, particularly http-metrics.md).

Probably would be part of https://github.com/knative/pkg/issues/855

dprotaso commented 3 years ago

/assign @dprotaso

github-actions[bot] commented 2 years ago

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

upodroid commented 2 years ago

/remove-lifecycle stale

github-actions[bot] commented 2 years ago

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

dprotaso commented 2 years ago

/reopen /lifecycle frozen