opentracing-contrib / java-metrics

Apache License 2.0

Naming scheme for metrics #1

Open objectiser opened 7 years ago

objectiser commented 7 years ago

The current idea for a naming scheme is to align with the OpenTracing concept of a Span, and enable metrics to be reported on a per Span basis.

The metric types will be:

To further categorize the individual sampled metrics, labels will be attached for the following standard fields:

The proposed mechanism will enable the application to:

yurishkuro commented 7 years ago

One issue we ran into when trying to define a standard dashboard template based on Jaeger's RPC metrics (currently in the Go client) was in categorizing errors. The yarpc approach is to separate successes from client-fault errors and server-fault errors. I am not sold on the client/server distinction, as I don't think we generally have enough data in the span to make it. But separating success from error is interesting, because success metrics (counts + latency) generally require no additional tags, while error metrics can always be tagged with the type of error. That works fine until we get to HTTP, which (a) has too high a cardinality for errors (e.g. 400-599), and (b) also sub-divides success metrics via the 200s vs. 300s distinction, which people often like to see in charts.

So the point is, when thinking about the naming scheme for the metrics, let's also think about what kind of standard dashboard can be built from them (ideally a small set of well-parameterized panels).
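To make the exact-vs-summary trade-off concrete: a single derivation function can collapse the ~200 possible status codes into a handful of summary labels. A minimal Java sketch (the class and method names here are illustrative, not part of any proposed API):

```java
// Collapse an exact HTTP status code into a low-cardinality summary label.
// Hypothetical helper for illustration only, not library code.
public class StatusLabels {
    public static String statusFamily(int status) {
        if (status >= 100 && status <= 599) {
            return (status / 100) + "xx";   // e.g. 404 -> "4xx", 204 -> "2xx"
        }
        return "unknown";                   // out-of-range codes
    }

    public static void main(String[] args) {
        System.out.println(statusFamily(204)); // 2xx
        System.out.println(statusFamily(503)); // 5xx
    }
}
```

Five summary values (1xx-5xx) is also enough to keep the 200s/300s distinction visible on a chart without per-code cardinality.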

jotak commented 7 years ago

Hi,

It seems that you can query with a regexp in Prometheus, so the high cardinality can be worked around. For example, to get all 4xx: span_duration{status=~"^4..$"}. I'm no expert in Prometheus, but it's described in the docs, so I guess it could be used in such situations when building dashboards. So you could tag with the exact HTTP status, maybe even without distinguishing successes from errors?
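For what it's worth, the `^4..$` pattern in that query selects exactly the 4xx range (three characters, first one "4"). A quick Java check of the same expression, purely illustrative and unrelated to the library itself:

```java
import java.util.regex.Pattern;

// The same regexp a Prometheus =~ matcher would apply to the status label value.
public class StatusRegex {
    static final Pattern FOUR_XX = Pattern.compile("^4..$");

    public static boolean isFourXx(String status) {
        return FOUR_XX.matcher(status).matches();
    }
}
```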

One thing I have also often found useful while building dashboards (for Hawkular, but Prometheus is similar) is being able to discriminate metrics by host name. Think for instance of a pod in Kubernetes that is scaled up, where you want to focus on one specific instance. Filtering on hostname allows that, but I don't know whether it's also relevant in the OpenTracing context, or whether you already have that kind of discrimination at another level.

Also, a small remark: I think histograms in Prometheus implicitly provide a counter, so you wouldn't need two metrics ( https://prometheus.io/docs/concepts/metric_types/#histogram ).
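For context on why the separate counter is redundant in Prometheus: histogram buckets are cumulative, so the +Inf bucket always equals the total observation count (exposed as the _count series). A small illustrative sketch of that mechanic, not library code:

```java
import java.util.TreeMap;

// Illustrates why a Prometheus-style histogram subsumes a counter:
// buckets are cumulative, so the +Inf bucket is the total count.
public class HistogramSketch {
    private final TreeMap<Double, Long> buckets = new TreeMap<>();
    private double sum;

    public HistogramSketch(double... upperBounds) {
        for (double le : upperBounds) buckets.put(le, 0L);
        buckets.put(Double.POSITIVE_INFINITY, 0L); // the +Inf bucket
    }

    public void observe(double value) {
        sum += value;
        // Cumulative: increment every bucket whose upper bound covers the value.
        for (double le : buckets.keySet()) {
            if (value <= le) buckets.merge(le, 1L, Long::sum);
        }
    }

    public long count() {
        // Same number a standalone counter would report.
        return buckets.get(Double.POSITIVE_INFINITY);
    }

    public double sum() { return sum; }
}
```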

objectiser commented 7 years ago

Hi Joel

Thanks for the feedback. I'll let Yuri comment on your first point.

For the host name, this can be provided as an extra label by the application. However, from my experiments with Prometheus in Kubernetes, it automatically adds the pod.

Good point about the histogram - it would be better to just use the one metric type.

Regards Gary


objectiser commented 7 years ago

Updated the PR to just use the histogram and it seems to be fine:

span_bucket{operation="GET",span_kind="server",error="false",le="0.005",} 1.0
span_bucket{operation="GET",span_kind="server",error="false",le="0.01",} 1.0
span_bucket{operation="GET",span_kind="server",error="false",le="0.025",} 2.0
span_bucket{operation="GET",span_kind="server",error="false",le="0.05",} 2.0
span_bucket{operation="GET",span_kind="server",error="false",le="0.075",} 2.0
span_bucket{operation="GET",span_kind="server",error="false",le="0.1",} 2.0
span_bucket{operation="GET",span_kind="server",error="false",le="0.25",} 2.0
span_bucket{operation="GET",span_kind="server",error="false",le="0.5",} 2.0
span_bucket{operation="GET",span_kind="server",error="false",le="0.75",} 2.0
span_bucket{operation="GET",span_kind="server",error="false",le="1.0",} 2.0
span_bucket{operation="GET",span_kind="server",error="false",le="2.5",} 2.0
span_bucket{operation="GET",span_kind="server",error="false",le="5.0",} 2.0
span_bucket{operation="GET",span_kind="server",error="false",le="7.5",} 2.0
span_bucket{operation="GET",span_kind="server",error="false",le="10.0",} 2.0
span_bucket{operation="GET",span_kind="server",error="false",le="+Inf",} 2.0
span_count{operation="GET",span_kind="server",error="false",} 2.0
span_sum{operation="GET",span_kind="server",error="false",} 0.010556
span_bucket{operation="GET",span_kind="client",error="false",le="0.005",} 0.0
span_bucket{operation="GET",span_kind="client",error="false",le="0.01",} 0.0
span_bucket{operation="GET",span_kind="client",error="false",le="0.025",} 0.0
span_bucket{operation="GET",span_kind="client",error="false",le="0.05",} 0.0
span_bucket{operation="GET",span_kind="client",error="false",le="0.075",} 0.0
span_bucket{operation="GET",span_kind="client",error="false",le="0.1",} 0.0
span_bucket{operation="GET",span_kind="client",error="false",le="0.25",} 1.0
span_bucket{operation="GET",span_kind="client",error="false",le="0.5",} 1.0
span_bucket{operation="GET",span_kind="client",error="false",le="0.75",} 1.0
span_bucket{operation="GET",span_kind="client",error="false",le="1.0",} 1.0
span_bucket{operation="GET",span_kind="client",error="false",le="2.5",} 1.0
span_bucket{operation="GET",span_kind="client",error="false",le="5.0",} 1.0
span_bucket{operation="GET",span_kind="client",error="false",le="7.5",} 1.0
span_bucket{operation="GET",span_kind="client",error="false",le="10.0",} 1.0
span_bucket{operation="GET",span_kind="client",error="false",le="+Inf",} 1.0
span_count{operation="GET",span_kind="client",error="false",} 1.0
span_sum{operation="GET",span_kind="client",error="false",} 0.117374

yurishkuro commented 7 years ago

Re regexp - yes, it's possible, but not all metrics systems support it. We should not make a design decision based purely on the capabilities of Prometheus. Also, when this runs at scale, rationing of metrics is fairly common in the enterprise (e.g. see the recent Monitorama talk by Netflix; we have the same story at Uber), so there's a sizable difference between two tag values (4xx, 5xx) and 200 tag values (400-599). I think ideally this lib should have a config option to control whether the user wants exact status codes in the tag, or summary labels 2xx, 3xx, 4xx, 5xx.

Re count vs. histogram - again, this is a feature of Prometheus, but not of all metrics backends. Having an explicit request count metric is better, imo.

@objectiser in your example, one problem I have is with the error=false tag. Using such a tag allows a single metric for successes and errors, but the errors should have another dimension, "error.kind" (in the case of HTTP it could be the status code), and "error.kind" does not apply to successes.
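The asymmetry being described is that error spans carry one extra label dimension that success spans never have. A sketch of what that label set could look like (method and label names are illustrative, not a proposed API):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the labeling asymmetry discussed above: successes carry no
// error.kind label, while errors add one (e.g. the HTTP status code).
// Hypothetical helper for illustration only.
public class SpanLabels {
    public static Map<String, String> labelsFor(String operation,
                                                boolean error,
                                                String errorKind) {
        Map<String, String> labels = new LinkedHashMap<>();
        labels.put("operation", operation);
        labels.put("error", Boolean.toString(error));
        if (error && errorKind != null) {
            labels.put("error.kind", errorKind); // extra dimension, errors only
        }
        return labels;
    }
}
```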

objectiser commented 7 years ago

@yurishkuro The current PR does provide a means to supply different ways of deriving a label value - so I think it would be easy enough to provide some "out of the box" options to support exact or summary codes.

So unless there are any objections, we will add another standard label, error.kind - but the actual way the value is derived will be configurable, to be discussed in a separate issue/PR.
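One way the configurable derivation could be expressed is as a function from the raw status code to the reported label value; "exact" and "summary" then become two interchangeable out-of-the-box strategies. A hypothetical sketch, not the PR's actual API:

```java
import java.util.function.Function;

// Hypothetical pluggable derivation of the error.kind label value from a
// raw HTTP status. Exact codes and summary codes are two strategies the
// user could choose between via configuration.
public class ErrorKindDerivation {
    public static final Function<Integer, String> EXACT =
            status -> Integer.toString(status);      // e.g. 404 -> "404"
    public static final Function<Integer, String> SUMMARY =
            status -> (status / 100) + "xx";         // e.g. 404 -> "4xx"

    public static String derive(Function<Integer, String> strategy, int status) {
        return strategy.apply(status);
    }
}
```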

About count vs. histogram - this is purely a Prometheus implementation detail, to avoid duplication. Conceptually, however, there are still two metric types: a count and a histogram.