On second thought, maybe this doesn't satisfy the original intention, i.e. to show what these metrics mean. A good way to solve that could be to release a Zipkin dashboard on https://grafana.net/dashboards.
Unless both are required in one step, it's probably good to merge this (as it at least helps people understand and test the metrics integration, regardless of how the metrics are interpreted).
While this should work fine without a dashboard, I went ahead and drafted a dashboard that includes all the currently exported data. I tried to organize it to the best of my ability: https://grafana.net/dashboards/1598 (currently a draft, meaning it's viewable via this link but doesn't show up in searches - at least I think that's what it means).
I also created an OpenZipkin org on grafana.net so that we can share write access: https://grafana.net/openzipkin. Drop me your grafana.net username to get access.
So I'd say: let's do one round of iteration on the dashboard (either by review or direct edit). I'll go ahead and update the README accordingly; once done, we can release the dashboard and merge this.
@kristofa @klette what are your thoughts on this?
@abesto Looks good! Could we pre-configure the grafana docker container with the Zipkin dashboard you made on grafana.net? I can have a look at the dashboard. My grafana.net username is kristofa.
@kristofa Initially I thought the only way to do that would be to commit the database file into the repo, but on second reading there may be a better way (the `dashboards.json` config value). Checking into that now. Added you to the grafana.net org.
Update: that doesn't work for the dashboard format used by grafana.net, and it doesn't support adding data sources. Created a small shell script to set up the data source and the dashboard on startup (pulling the dashboard from grafana.net)
I just signed up in grafana as adriancole
@adriancole added you on grafana.net to the org.
I noticed the dashboard doesn't seem to expose any metrics that indicate failures.
Don't we have metrics that, for example, expose HTTP 5xx responses?
After I invoked some requests against zipkin-web, I noticed new metrics appeared in Prometheus indicating the number of requests per endpoint (e.g. `status_200_api_v1_services`). Do we have `status_5xx_api_v2_services` or similar? That would be useful.
Also, the `status_200_...` metrics are more interesting than the `httpsessions_active` metric IMHO because they are grouped by endpoint. The `status_...` metrics are counters, so that means we'll have to use `rate` to get the req/sec.
If we want to make the dashboard production-ready we might also `sum` the metrics to show the total over all instances. For the Docker container case that doesn't really matter since we only have one instance running at port 9411, but in production we are likely to have more instances. For example, a graph showing the successful requests/sec to the api_v1_services endpoint across all service instances might use the following Prometheus expression:
`sum(rate(status_200_api_v1_services{job="zipkin"}[1m]))`
@kristofa I'm pretty sure we can make our API return a 500 :) (then answer the question about the metrics path). It is standard Spring Boot metrics.. At any rate, we ought to think about testing this integration at some point. For example, we already have some Docker tests for storage. It would be neat to have a TestContainers test to ensure whatever we say here actually works moving forward.
cc @shakuzen @bsideup
@adriancole Shouldn't be hard :) I'm a bit out of context, but let me know if you need any help configuring TestContainers for it :)
@kristofa Totally agree on the `status_.*` metrics, not sure how I missed them. I went digging and learned that, for the response times at least, we can use a query like `{__name__=~"^response_.*", job="zipkin"}` - this lets the dashboard pick up new response time metrics as they're added. Unfortunately, when aggregation is applied it seems to merge the metrics - for instance `rate({__name__=~"^status_200_.*", job="zipkin"}[1m])` is a single metric. Am I missing something? Do we really have to hard-code the endpoints into the dashboard? (I also played around with Grafana templating, but the end result is effectively the same.)
Side note on doing a `sum` on top of the `rate`: generally speaking, I prefer having individual metrics and setting Grafana to stack the appropriate values. This provides the same value at the "top" of the stack, but more detail is immediately visible (say, if one node in the cluster is misbehaving). In this case, however, having both the "node" and the "endpoint" axis is not feasible, and endpoint is the more important one. Maybe this is a case where Grafana templating can help (set up a query variable for nodes?)
@abesto We stepped away from having metric names which include dimensions that are interesting for filtering, like statusClass, and use labels instead. So the `status_200_api_v1_services` metric could instead look like `status_api_total` with labels `version`, `statusClass` and `path`, and you could filter by specifying `status_api_total{version="v1", statusClass="2xx", path="services"}`. We learned from the Prometheus developers that this is a nicer way to define metrics. Also, new endpoints would automatically show up as values for the `path` label.
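To make that concrete, the exposed series could then look roughly like this (the names here are just the illustrative ones from the example above, not what Zipkin exposes today):

```
# current style: dimensions baked into the metric name, one name per combination
status_200_api_v1_services 42
status_200_api_v1_spans 17

# suggested style: a single metric name, dimensions as labels
status_api_total{version="v1", statusClass="2xx", path="services"} 42
status_api_total{version="v1", statusClass="2xx", path="spans"} 17
```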
We don't typically do aggregation on the client/Grafana side, and we often define Prometheus recording rules. These are pre-calculated on the server side so that expensive, frequently used queries don't have to be evaluated for every client request. Triggering expensive queries and getting more detailed data to the client might work, depending on how many time series your Prometheus server has to maintain and serve. But indeed, the more you aggregate, the more likely it is you have to run a separate query to get more detail, for example to find a single misbehaving instance.
@kristofa So the way Zipkin exposes metrics currently is not following the established best practices of the Prometheus community / devs. Thanks for educating!
Let me test my understanding. Even after we restructure the metrics as you described (for instance, to have one metric called `http_response_status_count` with labels `version`, `path`, and `statusClass`), we'll still hit the same problem on rate calculation, right? After the refactor, the query would be something like `sum(rate(http_response_status_count))`, which will still become a single metric (same as the old query using `__name__`). So if we want to both 1. keep the metric type a gauge and 2. not update the Grafana dashboard each time a new endpoint is added, we'd need to set up a recording rule on the Prometheus side, something like `http_response_status_count_rate = sum(rate(http_response_status_count[1m])) by (host)`? And then in Grafana the query would be just `http_response_status_count_rate`, showing one (per-minute rate) metric per HTTP endpoint (with new endpoints showing up automatically).
Is that correct? Is this, do you think, the idiomatic way to approach this?
Sorry for the late response.
We can use an expression like `sum(rate(http_response_status_count[1m])) by (path, statusClass)`. When using this, you get a new entry in your graph when a new path is added, without having to update the graph definition. In my opinion this is also more readable than expressions like `{__name__=~"^response_.*"}`.
The rate function calculates the per-second average rate. See [rate explanation](https://prometheus.io/docs/querying/functions/#rate()).
The recording rules I talked about make sense when the expression is expensive to calculate: the rule moves the calculation to the server side, where it is evaluated at scheduled intervals instead of for every client request.
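As an illustration (the rule name and the `http_response_status_count` metric are just the example names from above, and this uses the YAML rule-file format), such a recording rule could look like:

```yaml
# rules.yml - evaluated on the Prometheus server at the configured interval
groups:
  - name: zipkin
    rules:
      - record: zipkin:http_response_status_count:rate1m
        expr: sum(rate(http_response_status_count[1m])) by (path, statusClass)
```

Grafana would then query `zipkin:http_response_status_count:rate1m` directly instead of re-evaluating the sum/rate on every dashboard refresh.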
Here are the Prometheus metric / label name conventions: [Metric and label naming](https://prometheus.io/docs/practices/naming/)
I didn't put two and two together, thanks for explaining things to a Prometheus newbie! I've just understood how the whole aggregation / `by` clause business works. Totally agree that using labels is more readable. (I tried to get a reasonably meaningful HTTP status code graph from the current metrics, but `by (name)` and `by (__name__)` don't seem to work. Which is fine, we need to do the Right Thing anyway.)
I see two roads ahead at this point, looking mostly to @adriancole for advice: either restructure the metrics first to use labels (something like `http_status{path="/api/v1/services"}` instead of `status_200_api_v1_services`) and then release the dashboard, or release the dashboard against the current metrics now.

I would get the dashboard out as it is now. It is already useful to show the integration and existing metrics. We can always iterate later.
FYI, I noticed that since we added Prometheus to Zipkin, upstream formalized it.. maybe we should consider their metrics endpoint before formalizing this (or maybe we do it after): https://github.com/openzipkin/zipkin/pull/1144#issuecomment-288762140
Rebased on top of master.
Re. replacing metrics collection with the upstream client: that won't change the format of the metrics exposed; the official client pretty much does the same as what our current exporter does. It does add some new metrics, with which we can extend the dashboard later (see https://github.com/openzipkin/zipkin/pull/1609)
We can also do some more magic on the Prometheus server side to get nicer response count metrics, will try to do that now (see https://github.com/prometheus/client_java/pull/255#issuecomment-307587314)
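Roughly, the idea is to use `metric_relabel_configs` to split the old-style metric names into labels at scrape time. A sketch (illustrative only - not the exact rules from the gist linked below) could look like:

```yaml
scrape_configs:
  - job_name: zipkin
    static_configs:
      - targets: ['zipkin:9411']
    metric_relabel_configs:
      # pull the status code out of names like status_200_api_v1_services
      - source_labels: [__name__]
        regex: 'status_(\d+)_(.+)'
        target_label: status
        replacement: '$1'
      # pull the endpoint part into its own label
      - source_labels: [__name__]
        regex: 'status_(\d+)_(.+)'
        target_label: path
        replacement: '$2'
      # finally collapse them all under a single metric name (name is illustrative)
      - source_labels: [__name__]
        regex: 'status_\d+_.+'
        target_label: __name__
        replacement: http_requests_by_status
```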
With the relabeling rules the dashboard can now be significantly smarter, with automatically populated response code count and response time graphs. Updated the dashboard on grafana.net, looks something like this:
@kristofa @adriancole Some time has passed, and there are new changes. I think this is ready to merge, waiting for a nod from you :)
Added message count (received and dropped), spans received count, and bytes received graphs to the dashboard. They automatically pick up transports (see the new relabeling rules).
After updating the rewrite rules for the changes in openzipkin/zipkin#1609, with https://gist.github.com/abesto/642cd049cc75643213b6e4c23bad7734, here's the current state. One thing to note: the series are broken down per instance (`metric{instance=...}`).

very nice.. going out for zipkin 2.2
Cool!
⚠️ Watch out: the version of the dashboard currently on grafana.com works with the pre-Spring metrics. Once 2.2 is released, https://grafana.com/dashboards/1598/ needs to be updated with the JSON in https://gist.github.com/abesto/642cd049cc75643213b6e4c23bad7734.
uploaded!
Add Prometheus and Grafana to `docker-compose.yml`. Document basic usage and setup. See #121 for reasoning.

I attempted to "ship" with a state where Grafana is already configured, but it's not completely trivial (where do you store the initial state? how do you get it into Grafana?), leading to complexity. It would also remove some of the learning we expect to provide - automating it here means users would need to re-discover it later.
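For reference, the additions boil down to something like the sketch below (images, ports and paths here are illustrative - see docker-compose.yml in this PR for the actual definitions):

```yaml
# illustrative sketch only
version: '2'
services:
  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
    links:
      - zipkin
  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
    links:
      - prometheus
```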