opentracing-contrib / java-metrics

Apache License 2.0

Naming scheme for metrics #1

Open objectiser opened 7 years ago

objectiser commented 7 years ago

The current idea for a naming scheme is to align with the OpenTracing concept of a Span, and enable metrics to be reported on a per Span basis.

The metric types will be:

To further categorize the individual sampled metrics, labels will be attached for the following standard fields:

The proposed mechanism will enable the application to:

yurishkuro commented 7 years ago

One issue we ran into when trying to define a standard dashboard template based on Jaeger's RPC metrics (currently in the Go client) was in categorizing errors. The yarpc approach is to separate successes from client-fault errors and server-fault errors. I am not sold on the client/server distinction, as I don't think we generally have enough data in the span to make it. But separating success from error is interesting, because success metrics (counts + latency) generally require no additional tags, while error metrics can always be tagged with the type of error. That works fine until we get to HTTP, which (a) has too high a cardinality for errors (e.g. 400-599), and (b) also sub-divides success metrics via the 200s vs. 300s distinction, which people often like to see in charts.

So the point is, when thinking about the naming scheme for the metrics, let's also think about what kind of standard dashboard can be built from them (ideally a small set of well-parameterized panels).
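To make the exact-vs-summary trade-off concrete: a single derivation function can collapse the ~200 possible status codes into a handful of summary labels. A minimal Java sketch (the class and method names here are illustrative, not part of any proposed API):

```java
// Collapse an exact HTTP status code into a low-cardinality summary label.
// Hypothetical helper for illustration only, not library code.
public class StatusLabels {
    public static String statusFamily(int status) {
        if (status >= 100 && status <= 599) {
            return (status / 100) + "xx";   // e.g. 404 -> "4xx", 204 -> "2xx"
        }
        return "unknown";                   // out-of-range codes
    }

    public static void main(String[] args) {
        System.out.println(statusFamily(204)); // 2xx
        System.out.println(statusFamily(503)); // 5xx
    }
}
```

Five summary values (1xx-5xx) is also enough to keep the 200s/300s distinction visible on a chart without per-code cardinality.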

jotak commented 7 years ago

Hi,

It seems that you can query with a regexp in Prometheus, so the high cardinality can be worked around. For example, to get all 4xx: span_duration{status=~"^4..$"}. I'm no expert in Prometheus, but it's described in the docs, so I guess it could be used in such situations when building dashboards. So you could tag with the exact HTTP status, maybe even without distinguishing successes from errors?
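For what it's worth, the `^4..$` pattern in that query selects exactly the 4xx range (three characters, first one "4"). A quick Java check of the same expression, purely illustrative and unrelated to the library itself:

```java
import java.util.regex.Pattern;

// The same regexp a Prometheus =~ matcher would apply to the status label value.
public class StatusRegex {
    static final Pattern FOUR_XX = Pattern.compile("^4..$");

    public static boolean isFourXx(String status) {
        return FOUR_XX.matcher(status).matches();
    }
}
```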

One thing I have also often found useful while building dashboards (for Hawkular, but Prometheus is similar) is being able to discriminate metrics by host name. Think for instance of a pod in Kubernetes that is scaled up, where you want to focus on one specific instance. Filtering on hostname allows that, but I don't know whether it's also relevant in the OpenTracing context, or whether you already have that kind of discrimination at another level.

Also, a small remark: I think histograms in Prometheus implicitly provide a counter, so you wouldn't need two metrics ( https://prometheus.io/docs/concepts/metric_types/#histogram ).
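For context on why the separate counter is redundant in Prometheus: histogram buckets are cumulative, so the +Inf bucket always equals the total observation count (exposed as the _count series). A small illustrative sketch of that mechanic, not library code:

```java
import java.util.TreeMap;

// Illustrates why a Prometheus-style histogram subsumes a counter:
// buckets are cumulative, so the +Inf bucket is the total count.
public class HistogramSketch {
    private final TreeMap<Double, Long> buckets = new TreeMap<>();
    private double sum;

    public HistogramSketch(double... upperBounds) {
        for (double le : upperBounds) buckets.put(le, 0L);
        buckets.put(Double.POSITIVE_INFINITY, 0L); // the +Inf bucket
    }

    public void observe(double value) {
        sum += value;
        // Cumulative: increment every bucket whose upper bound covers the value.
        for (double le : buckets.keySet()) {
            if (value <= le) buckets.merge(le, 1L, Long::sum);
        }
    }

    public long count() {
        // Same number a standalone counter would report.
        return buckets.get(Double.POSITIVE_INFINITY);
    }

    public double sum() { return sum; }
}
```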

objectiser commented 7 years ago

Hi Joel

Thanks for the feedback. I'll let Yuri comment on your first point.

For the host name, this can be provided as an extra label by the application. However, from my experiments with Prometheus in Kubernetes, it automatically adds the pod.

Good point about the histogram - it would be better to just use the one metric type.

Regards Gary


objectiser commented 7 years ago

Updated the PR to just use the histogram and it seems to be fine:

span_bucket{operation="GET",span_kind="server",error="false",le="0.005",} 1.0
span_bucket{operation="GET",span_kind="server",error="false",le="0.01",} 1.0
span_bucket{operation="GET",span_kind="server",error="false",le="0.025",} 2.0
span_bucket{operation="GET",span_kind="server",error="false",le="0.05",} 2.0
span_bucket{operation="GET",span_kind="server",error="false",le="0.075",} 2.0
span_bucket{operation="GET",span_kind="server",error="false",le="0.1",} 2.0
span_bucket{operation="GET",span_kind="server",error="false",le="0.25",} 2.0
span_bucket{operation="GET",span_kind="server",error="false",le="0.5",} 2.0
span_bucket{operation="GET",span_kind="server",error="false",le="0.75",} 2.0
span_bucket{operation="GET",span_kind="server",error="false",le="1.0",} 2.0
span_bucket{operation="GET",span_kind="server",error="false",le="2.5",} 2.0
span_bucket{operation="GET",span_kind="server",error="false",le="5.0",} 2.0
span_bucket{operation="GET",span_kind="server",error="false",le="7.5",} 2.0
span_bucket{operation="GET",span_kind="server",error="false",le="10.0",} 2.0
span_bucket{operation="GET",span_kind="server",error="false",le="+Inf",} 2.0
span_count{operation="GET",span_kind="server",error="false",} 2.0
span_sum{operation="GET",span_kind="server",error="false",} 0.010556
span_bucket{operation="GET",span_kind="client",error="false",le="0.005",} 0.0
span_bucket{operation="GET",span_kind="client",error="false",le="0.01",} 0.0
span_bucket{operation="GET",span_kind="client",error="false",le="0.025",} 0.0
span_bucket{operation="GET",span_kind="client",error="false",le="0.05",} 0.0
span_bucket{operation="GET",span_kind="client",error="false",le="0.075",} 0.0
span_bucket{operation="GET",span_kind="client",error="false",le="0.1",} 0.0
span_bucket{operation="GET",span_kind="client",error="false",le="0.25",} 1.0
span_bucket{operation="GET",span_kind="client",error="false",le="0.5",} 1.0
span_bucket{operation="GET",span_kind="client",error="false",le="0.75",} 1.0
span_bucket{operation="GET",span_kind="client",error="false",le="1.0",} 1.0
span_bucket{operation="GET",span_kind="client",error="false",le="2.5",} 1.0
span_bucket{operation="GET",span_kind="client",error="false",le="5.0",} 1.0
span_bucket{operation="GET",span_kind="client",error="false",le="7.5",} 1.0
span_bucket{operation="GET",span_kind="client",error="false",le="10.0",} 1.0
span_bucket{operation="GET",span_kind="client",error="false",le="+Inf",} 1.0
span_count{operation="GET",span_kind="client",error="false",} 1.0
span_sum{operation="GET",span_kind="client",error="false",} 0.117374

yurishkuro commented 7 years ago

Re regexp - yes, it's possible, but not all metrics systems support it. We should not make a design decision based purely on the capabilities of Prometheus. Also, when this runs at scale, rationing of metrics is fairly common in the enterprise (e.g. see the recent Monitorama talk by Netflix; we have the same story at Uber), so there's a sizable difference between two tag values (4xx, 5xx) and 200 tag values (400-599). I think ideally this lib should have a config option to control whether the user wants exact status codes in the tag, or summary labels 2xx, 3xx, 4xx, 5xx.

Re count vs. histogram - again, this is a feature of Prometheus, but not of all metrics backends. Having an explicit request count metric is better, imo.

@objectiser in your example, one problem I have is with the error=false tag. Using such a tag allows a single metric for successes and errors, but the errors should have another dimension, "error.kind" (in the case of HTTP it could be the status code), and "error.kind" does not apply to successes.
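The asymmetry being described is that error spans carry one extra label dimension that success spans never have. A sketch of what that label set could look like (method and label names are illustrative, not a proposed API):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the labeling asymmetry discussed above: successes carry no
// error.kind label, while errors add one (e.g. the HTTP status code).
// Hypothetical helper for illustration only.
public class SpanLabels {
    public static Map<String, String> labelsFor(String operation,
                                                boolean error,
                                                String errorKind) {
        Map<String, String> labels = new LinkedHashMap<>();
        labels.put("operation", operation);
        labels.put("error", Boolean.toString(error));
        if (error && errorKind != null) {
            labels.put("error.kind", errorKind); // extra dimension, errors only
        }
        return labels;
    }
}
```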

objectiser commented 7 years ago

@yurishkuro The current PR does provide a means to supply different ways of deriving a label value - so I think it would be easy enough to provide some "out of the box" options to support exact or summary codes.

So unless there are any objections, we will add another standard label, error.kind - but the actual way the value is derived will be configurable, to be discussed in a separate issue/PR.
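One way the configurable derivation could be expressed is as a function from the raw status code to the reported label value; "exact" and "summary" then become two interchangeable out-of-the-box strategies. A hypothetical sketch, not the PR's actual API:

```java
import java.util.function.Function;

// Hypothetical pluggable derivation of the error.kind label value from a
// raw HTTP status. Exact codes and summary codes are two strategies the
// user could choose between via configuration.
public class ErrorKindDerivation {
    public static final Function<Integer, String> EXACT =
            status -> Integer.toString(status);      // e.g. 404 -> "404"
    public static final Function<Integer, String> SUMMARY =
            status -> (status / 100) + "xx";         // e.g. 404 -> "4xx"

    public static String derive(Function<Integer, String> strategy, int status) {
        return strategy.apply(status);
    }
}
```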

About count vs. histogram - this is purely a Prometheus implementation detail, to avoid duplication. Conceptually, however, there are still two metric types: a count and a histogram.