Metric names between OpenCensus & Prometheus exporters are different

dprotaso commented 3 years ago

/area monitoring /kind cleanup

We currently prepend a 'domain' and the 'component' name to the metrics. This done for our OpenCensus exporter which might have been cargo cult'd from our stack driver exporter implementation. The prometheus exporter only prefixes the 'component' name to the metric.

https://github.com/knative/pkg/blob/cf1994e9cc24db63cbc8a21e1a84152b732ad3f0/metrics/opencensus_exporter.go#L39

We should determine whether these should be consistent, if there should be a prefix and it's format

cc @upodroid @skonto @evankanderson @MontyCarter @pianohacker

dprotaso commented 3 years ago

From @pianohacker re: prepending a domain prefix for prometheus https://knative.slack.com/archives/CEPGNPHP1/p1624558123099100

This level of namespacing on the metric name isn't typically done for Prometheus metrics; see https://prometheus.io/docs/practices/naming/ The source (as opposed to the kind) of a Prometheus-exposed metric is intended to be distinguished by its labels (in this case, namespace). In that vein, could we expose the ability to configure one or more labels on the exposed metrics?

dprotaso commented 3 years ago

cc @evankanderson who reviewed the OpenCensus exporter changes

skonto commented 3 years ago

Regarding Prometheus this is what I read and what we have does not seem to be a violation of the recommendations unless I misread something:

...should have a (single-word) application prefix relevant to the domain the metric belongs to. The prefix is sometimes referred to as namespace by client libraries. For metrics specific to an application, the prefix is usually the application name itself. Sometimes, however, metrics are more generic, like standardized metrics exported by client libraries. Examples:
prometheus_notifications_total (specific to the Prometheus server)
process_cpu_seconds_total (exported by many client libraries)
http_request_duration_seconds (for all HTTP requests)

I am not sure what value the domain adds here we already use the component name as the prefix and we know the domain for each component, it is straightforward for the control plane ones eg. Serving, Eventing. Maybe there is some value for the user services to be able to correlate queue proxy metrics with some domain eg. my_business_domain but then Prom queries will have to detect that with some regex and does not make life easier (especially if we have arbitrary domains as prefixes) eg. for building generic dashboards. Also not though having queue_proxy as a metric prefix works well so far. However, there is a valid need to be able to configure existing/add custom labels eg. tag metrics with special labels to avoid joining metrics info based on kube-state-metrics or other sources. I am facing this downstream as I would like to tag metrics with the func tag for Boson funcs, as not all services are functions. What issues do we see with the current status? Are there users of the Opencensus exporter that benefit from the extra domain info?

dprotaso commented 3 years ago

@skonto @evankanderson during the operation wg meeting mentioned you wrote a doc about stable metric names and a potential deprecation policy. Do you have a link to that?

Whatever we decide with this change we'll want to roll it out gradually with a flag.

evankanderson commented 3 years ago

The domain name is used today by at least Stackdriver, though they could set up a mapping to enrich the content:

https://cloud.google.com/monitoring/api/metrics_knative

While Prometheus tends to prefer short, non-namespaced metric names, some other products like Stackdriver and Wavefront tend to prefer somewhat more nested names (deeply for stackdriver, medium for Wavefront). Based on the principle that it's easier to remove data than to add it, I incorporated the domain and component labels in OpenCensus, though I'd be happy with an alternate mechanism that didn't discard these labels.

Note that there are some metrics-exporting components from both Serving and Eventing that run in a user's namespace, so having some mechanism (metric name or labels) to distinguish metrics between the two would be useful. At the same time, common metrics like go GC information should have the same metric name for both.

Maybe a few examples from our existing documentation:

metric (opencensus name)	type	prefix reason
serving/activator/request_count	HTTP request count	Prefix with serving component, because activator request count may be double-counted with e.g. queue-proxy counts
serving/revision/request_count	HTTP request count	Prefix with revision; counted at the queue-proxy. May double-count with activator.
go_alloc	go memory stats	No prefix, common go instrumentation across components.
eventing/broker/event_count	events received count	Prefix with eventing component, because events are counted at several places in the system.
eventing/broker/request_count	HTTP request count	Prefix with eventing component; not all HTTP requests may result in an event receipt.
eventing/trigger/event_count	Post-filter event delivery counts	Prefix with eventing component, because events are counted at several places in the system.
eventing/trigger/http_requests	Post-filter HTTP request counts	Prefix with eventing component; not all HTTP requests may

MontyCarter commented 3 years ago

Yes, Stackdriver is using the domain, and the domain is different based on the component. However, the metric domains are pretty static, so it would be manageable keep a map in an otel collector config and prepend the domain to the metric name based on the sender.

With that said, I think it would be much nicer to pass the domain as a label rather than simply removing the domain from the metric. This would make it easy to rewrite the name from the domain label in an otel collector config.

evankanderson commented 3 years ago

Summary of discussion from 6 July:

We're going to write up a plan in metrics/README.md for the desired future state
Metric renames will happen probably behind a ConfigMap flag, and we'll announce them ahead-of-time to knative-users so people can pre-set the ConfigMap value.
We're going to extract the component and domain parts into separate labels for OTel support behind the flag, so the short metric names match the prometheus ones. (We should also look at the standardization going on in https://github.com/open-telemetry/opentelemetry-specification/tree/main/specification/metrics/semantic_conventions, particularly http-metrics.md).

dprotaso commented 3 years ago

Direct link to the meeting notes (2021/07/06)

dprotaso commented 3 years ago

We should also look at the standardization going on in https://github.com/open-telemetry/opentelemetry-specification/tree/main/specification/metrics/semantic_conventions, particularly http-metrics.md).

Probably would be part of https://github.com/knative/pkg/issues/855

dprotaso commented 3 years ago

/assign @dprotaso

github-actions[bot] commented 2 years ago

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

upodroid commented 2 years ago

/remove-lifecycle stale

github-actions[bot] commented 2 years ago

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

dprotaso commented 2 years ago

/reopen /lifecycle frozen

knative / pkg

Metric names between OpenCensus & Prometheus exporters are different #2174