openzipkin-attic / docker-zipkin

Docker images for OpenZipkin
Apache License 2.0
688 stars 329 forks

Fix #121: Add example prometheus setup #135

Closed abesto closed 7 years ago

abesto commented 7 years ago

Add Prometheus, Grafana to docker-compose.yml. Document basic usage and setup. See #121 for reasoning.
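To give a sense of the shape of the change, a minimal sketch of how the two services might be wired in (service names, images, and the config path are illustrative, not the exact contents of this PR):

version: '2'
services:
  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"
    links:
      - zipkin
    volumes:
      # scrape config pointing Prometheus at zipkin:9411
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
    links:
      - prometheus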

I attempted to "ship" with a state where Grafana is already configured, but it's not completely trivial (where do you store the initial state? How do you get it to Grafana?), and it adds complexity. It would also remove some of the learning we expect to provide: automating it here means users would need to re-discover it later.

abesto commented 7 years ago

On second thought, maybe this doesn't satisfy the original intention, i.e. to show what these metrics mean. A good way to solve that could be to release a Zipkin dashboard on https://grafana.net/dashboards.

codefromthecrypt commented 7 years ago

Unless both are required in one step, it's probably good to merge this (at least it helps people understand and test the integration of metrics, regardless of how they are interpreted).

abesto commented 7 years ago

While this should work fine without a dashboard, I went ahead and drafted a dashboard that includes all the currently exported data. I tried to organize it to the best of my ability: https://grafana.net/dashboards/1598 (currently draft, meaning it's viewable via this link, but doesn't show up in searches. At least I think that's what it means)

[screenshot]

I also created an OpenZipkin org on grafana.net so that we can share write access: https://grafana.net/openzipkin. Drop me your grafana.net username to get access.

So I'd say: let's do one round of iteration on the dashboard (either by review or direct edit). I'll go ahead and update the README accordingly; once done, we can release the dashboard and merge this.

abesto commented 7 years ago

@kristofa @klette what are your thoughts on this?

kristofa commented 7 years ago

@abesto Looks good! Could we pre-configure the grafana docker container with the Zipkin dashboard you made on grafana.net? I can have a look at the dashboard. My grafana.net username is kristofa.

abesto commented 7 years ago

@kristofa Initially I thought the only way to do that would be to commit the database file into the repo, but on second reading there may be a better way (dashboards.json config value). Checking into that now. Added you to the grafana.net org.

Update: that doesn't work for the dashboard format used by grafana.net, and it doesn't support adding data sources. I created a small shell script to set up the data source and the dashboard on startup (pulling the dashboard from grafana.net).
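For the record, roughly what such a startup script might look like. This is a sketch: the grafana.net download URL, the admin:admin default credentials, and the jq dependency are assumptions, and it glosses over the data source input substitution the grafana.net dashboard format may need.

#!/bin/sh
GRAFANA=http://admin:admin@localhost:3000

# register Prometheus as the default data source
curl -s -X POST "$GRAFANA/api/datasources" \
  -H 'Content-Type: application/json' \
  -d '{"name":"Prometheus","type":"prometheus","url":"http://prometheus:9090","access":"proxy","isDefault":true}'

# pull the dashboard JSON from grafana.net and import it
curl -s https://grafana.net/api/dashboards/1598/revisions/latest/download \
  | jq '{dashboard: ., overwrite: true}' \
  | curl -s -X POST "$GRAFANA/api/dashboards/db" \
      -H 'Content-Type: application/json' -d @-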

codefromthecrypt commented 7 years ago

I just signed up in grafana as adriancole


abesto commented 7 years ago

@adriancole added you on grafana.net to the org.

kristofa commented 7 years ago

I noticed we don't seem to expose any metrics in the dashboard that indicate failures. Don't we have metrics that, for example, expose HTTP 5xx responses?
After sending some requests to zipkin-web, I noticed new metrics appeared in Prometheus indicating the number of requests per endpoint (e.g. status_200_api_v1_services). Do we have status_5xx_api_v2_services or similar? That would be useful.

Also, the status_200_... metrics are more interesting than the httpsessions_active metric IMHO because they are grouped by endpoint. The status metrics are counters, so we'll have to use rate to get the req/sec.

If we want to make the dashboard production-ready, we might also sum the metrics to show the total over all instances. For the Docker container case that doesn't really matter, since we only have 1 instance running at port 9411, but in production we are likely to have more instances. For example, a graph showing the successful requests / sec to the api_v1_services endpoint across all service instances might use the following Prometheus expression:

sum(rate(status_200_api_v1_services{job="zipkin"}[1m]))

codefromthecrypt commented 7 years ago

@kristofa I'm pretty sure we can make our api 500 :) (then answer the question about the metrics path). It is standard spring boot metrics. At any rate, we ought to think about testing this integration at some point. For example, we have some docker tests already for storage. It would be neat to have a Testcontainers test to ensure whatever we say here actually works moving forward.

cc @shakuzen @bsideup

bsideup commented 7 years ago

@adriancole shouldn't be hard :) I'm a bit out of context, but let me know if you need any help configuring TestContainers for it :)

abesto commented 7 years ago

@kristofa Totally agree on the status_.* metrics, not sure how I missed them. I went digging and learned that, for the response times at least, we can use a query like {__name__=~"^response_.*", job="zipkin"} - this will let the dashboard pick up new response time metrics as they're added. Unfortunately when aggregation is applied, it seems to merge the metrics - for instance rate({__name__=~"^status_200_.*", job="zipkin"}[1m]) is a single metric. Am I missing something? Do we really have to hard-code the endpoints into the dashboard? (Also played around with Grafana templating, but the end result is effectively the same).

Side-note on doing a sum on top of the rate: generally speaking, I prefer having individual metrics and setting Grafana to stack the appropriate values. This provides the same value at the "top" of the stack, but more detail is immediately visible (say, if one node in the cluster is misbehaving). In this case, however, having both the "node" and the "endpoint" axis is not feasible, and endpoint is the more important one. Maybe this is a case where Grafana templating can help (set up a query variable for nodes?)

kristofa commented 7 years ago

@abesto We stepped away from metric names that encode dimensions you'd want to filter on, like statusClass, and use labels instead. So the status_200_api_v1_services metric could instead look like status_api_total with labels version, statusClass, and path, and you could filter by specifying status_api_total{version="v1", statusClass="2xx", path="services"}. We learned from the Prometheus developers that this is a nicer way to define metrics. Also, new endpoints would automatically show up as values of the path label.
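For illustration, a couple of queries this scheme would enable, using the hypothetical status_api_total metric from above (# lines are comments):

# 5xx rate per endpoint; a newly added path shows up automatically
sum(rate(status_api_total{statusClass="5xx"}[1m])) by (path)

# overall success ratio across all endpoints
sum(rate(status_api_total{statusClass="2xx"}[1m])) / sum(rate(status_api_total[1m]))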

We don't typically do aggregation on the client / grafana side, and we often define Prometheus recording rules. These are pre-calculated on the server side so that expensive, frequently used queries aren't recalculated for every client request. Triggering expensive queries and sending more detailed data to the client might work, depending on how many time series your Prometheus server has to maintain and serve. But indeed, the more you aggregate, the more likely it is you'll have to invoke a separate query to get more detail, for example to find a single misbehaving instance.

abesto commented 7 years ago

@kristofa So the way Zipkin currently exposes metrics doesn't follow the established best practices of the Prometheus community / devs. Thanks for educating!

Let me test my understanding. Even after we restructure the metrics as you described (for instance, to have one metric called http_response_status_count with labels version, path, and statusClass), we'll still hit the same problem on rate calculation, right? After the refactor, the query would be something like sum(rate(http_response_status_count)), which will still become a single metric (same as the old query using __name__). So if we want to both 1. keep the metric type a gauge and 2. not update the Grafana dashboard each time a new endpoint is added, we'd need to set up a recording rule on the Prometheus side, something like http_response_status_count_rate = sum(rate(http_response_status_count[1m])) by (host)? And then in Grafana the query will be just http_response_status_count_rate, showing one (per-minute rate) metric per HTTP endpoint (with new endpoints showing up automatically).

Is that correct? Is this, do you think, the idiomatic way to approach this?

kristofa commented 7 years ago

Sorry for the late response.

We can use an expression like sum(rate(http_response_status_count[1m])) by (path, statusClass). When using this you will get a new entry in your graph when a new path is added without having to update the graph definition. In my opinion this also is more readable compared to using expressions like {__name__=~"^response_.*"}.

The rate function calculates the per-second average rate; see the rate() documentation: https://prometheus.io/docs/querying/functions/#rate()

The recording rules I talked about make sense when the expression is expensive to calculate: the rule moves the calculation to the server side, where it runs at scheduled intervals instead of for every client request.
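For illustration, such a rule in the pre-2.0 rule file syntax might look like this (the rule name is made up, and it reuses the hypothetical status_api_total metric from above):

# evaluated server-side at each rule evaluation interval
path_statusclass:status_api:rate1m = sum(rate(status_api_total[1m])) by (path, statusClass)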

Here are the Prometheus metric / label naming conventions: https://prometheus.io/docs/practices/naming/

abesto commented 7 years ago

I didn't put two and two together, thanks for explaining things to a Prometheus newbie! I've just understood how the whole aggregation / by clause business works. Totally agree that using labels is more readable. (I tried to get a reasonably meaningful HTTP status code graph given the current metrics, but by (name) and by (__name__) don't seem to work. Which is fine, we need to do the Right Thing anyway.)

I see two roads ahead at this point, looking mostly to @adriancole for advice:

  1. We can merge this PR as is, and get the basic example out there. Then restructure the Prometheus metrics whenever we get around to it (breaking the current dashboard), and update the dashboard on grafana.net. Pro: the example is out there ASAP. Con: dashboards will break, unless we double-publish metrics as both http_status{path="/api/v1/services"} and status_200_api_v1_services.
  2. We can put this PR on hold until we get around to restructuring the Prometheus metrics. Pro: the first released dashboard will be kickass, with no compatibility problems. Con: delays releasing the example.

kristofa commented 7 years ago

I would get the dashboard out as it is now. It is already useful to show the integration and existing metrics. We can always iterate later.

codefromthecrypt commented 7 years ago

fyi I noticed that since we added prometheus to zipkin, upstream formalized it. Maybe we should consider their metrics endpoint before formalizing this (or maybe we do it after): https://github.com/openzipkin/zipkin/pull/1144#issuecomment-288762140

abesto commented 7 years ago

Rebased on top of master.

Re. replacing metrics collection with the upstream client: that won't change the format of the metrics exposed; the official client does pretty much the same as our current exporter. It does add some new metrics, which we can use to extend the dashboard later (see https://github.com/openzipkin/zipkin/pull/1609).

We can also do some more magic on the Prometheus server side to get nicer response count metrics; I'll try that now (see https://github.com/prometheus/client_java/pull/255#issuecomment-307587314).
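For a rough idea of what that server-side magic looks like, a metric_relabel_configs sketch; the regex and the resulting metric/label names are illustrative, and the real rules ended up in the gist linked later in this thread:

scrape_configs:
  - job_name: zipkin
    static_configs:
      - targets: ['zipkin:9411']
    # rewrite e.g. status_200_api_v1_services into a single metric
    # with status and path labels
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'status_(\d+)_(.+)'
        replacement: '${1}'
        target_label: status
      - source_labels: [__name__]
        regex: 'status_(\d+)_(.+)'
        replacement: '${2}'
        target_label: path
      - source_labels: [__name__]
        regex: 'status_(\d+)_(.+)'
        replacement: 'http_requests_total'
        target_label: __name__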

abesto commented 7 years ago

With the relabeling rules, the dashboard can now be significantly smarter, with automatically populated response code count and response time graphs. I updated the dashboard on grafana.net; it looks something like this:

[screenshot: Zipkin Prometheus dashboard]

@kristofa @adriancole Some time has passed, and there are new changes. I think this is ready to merge, waiting for a nod from you :)

abesto commented 7 years ago

Added message count (received and dropped), spans received count, and bytes received graphs to the dashboard. They automatically pick up transports (see the new relabeling rules).
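The new graphs boil down to queries of this shape (the metric and label names here assume the relabeling rules and are illustrative, not the exact series Zipkin exports):

# messages received per second, per transport; a new transport
# shows up automatically as a new label value
sum(rate(zipkin_collector_messages_total[1m])) by (transport)

# bytes received per second, per transport
sum(rate(zipkin_collector_bytes_total[1m])) by (transport)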

[screenshot: Zipkin Prometheus dashboard with per-transport graphs]

abesto commented 7 years ago

After updating the rewrite rules for the changes in openzipkin/zipkin#1609, with https://gist.github.com/abesto/642cd049cc75643213b6e4c23bad7734, here's the current state. Things to note:

[screenshot: updated Zipkin Prometheus dashboard]

codefromthecrypt commented 7 years ago

very nice.. going out for zipkin 2.2

abesto commented 7 years ago

Cool!

⚠️ Watch out: the version of the dashboard currently on grafana.com works with the pre-Spring metrics. Once 2.2 is released, https://grafana.com/dashboards/1598/ needs to be updated with the JSON in https://gist.github.com/abesto/642cd049cc75643213b6e4c23bad7734.

codefromthecrypt commented 7 years ago

⚠️ Watch out: the version of the dashboard currently on grafana.com works with the pre-Spring metrics. Once 2.2 is released, https://grafana.com/dashboards/1598/ needs to be updated with the JSON in https://gist.github.com/abesto/642cd049cc75643213b6e4c23bad7734.

uploaded!