suggestion: chart based exploration/analysis

Dieterbe commented 7 years ago

jaeger is a fantastic tool, and we've only been using it for a few weeks, so maybe i'm missing something. but i wanted to start this conversation and see if it resonates with anyone.

jaeger has some limitations:

- the span result set is often overwhelming, which makes it hard to find interesting spans. sorting by span length is useful but doesn't take into account legitimate reasons why certain spans may naturally take longer (e.g. a specific tenant is known to do slower queries)
- it can be hard to know what to look for

suggestion: a chart builder interface that shows 1) charts over time 2) statistical breakdowns (e.g. histograms) not over time.

the idea is that if you're able to modify: (A) which spans/services are being plotted, (B) which tags are included/excluded for the data that makes it into the chart, and (C) which tag's values will create separate charts or separate lines on the chart,

you can gather insights into the overall health of the system first, spot interesting anomalies, and from there dive deeper into specific spans.
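
a minimal sketch of the A/B/C idea, in plain Python. everything here is hypothetical and only for illustration: the span dicts, the field names and the `chart_series` helper are assumptions, not part of jaeger.

```python
from collections import defaultdict

# Hypothetical, simplified spans; real Jaeger spans carry many more fields.
SPANS = [
    {"service": "api", "operation": "query", "duration_ms": 120, "tags": {"tenant": "a"}},
    {"service": "api", "operation": "query", "duration_ms": 950, "tags": {"tenant": "b"}},
    {"service": "api", "operation": "query", "duration_ms": 130, "tags": {"tenant": "a"}},
    {"service": "db", "operation": "select", "duration_ms": 40, "tags": {"tenant": "a"}},
]


def chart_series(spans, service, include=None, exclude=None, group_by=None, bucket_ms=100):
    """Return {group value: {bucket start in ms: span count}}."""
    include = include or {}
    exclude = exclude or {}
    series = defaultdict(lambda: defaultdict(int))
    for span in spans:
        tags = span["tags"]
        # A: which spans/services are plotted
        if span["service"] != service:
            continue
        # B: tags that must match (include) or must not match (exclude)
        if any(tags.get(k) != v for k, v in include.items()):
            continue
        if any(tags.get(k) == v for k, v in exclude.items()):
            continue
        # C: the tag whose values become separate charts/lines
        group = tags.get(group_by, "unknown") if group_by else "all"
        bucket = (span["duration_ms"] // bucket_ms) * bucket_ms
        series[group][bucket] += 1
    return {group: dict(buckets) for group, buckets in series.items()}
```

grouping by `tenant` would put the known-slow tenant on its own line, so its long spans stop drowning out real anomalies in the other tenants.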

a commercial/proprietary example of something like this would be https://honeycomb.io/

jpkrohling commented 7 years ago

I see Jaeger as a tracing backend and it's very good at it. In an application that is already instrumented with OpenTracing (OT), it should be possible to also report data to a metrics system, like Prometheus. With that, the data can be shown on a custom dashboard in, say, Grafana :)

In fact, there's a blog post showing exactly that: http://www.hawkular.org/blog/2017/06/26/opentracing-appmetrics.html

What might be interesting is to have this extra reporter built into the Jaeger collector, so that the target app won't need to ship with a metrics reporter.
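
A rough sketch of what such a reporter could compute, assuming it sees each finished span as a simple dict. The `span_metrics` helper and the span field names are hypothetical and do not reflect the collector's actual data model; the only real convention used is the OpenTracing `error` tag. A real reporter would export these counters to the metrics system instead of returning a dict.

```python
from collections import defaultdict


def span_metrics(spans):
    """Fold spans into request/error/duration counters per (service, operation)."""
    metrics = defaultdict(lambda: {"count": 0, "errors": 0, "duration_ms_sum": 0})
    for span in spans:
        entry = metrics[(span["service"], span["operation"])]
        entry["count"] += 1
        entry["duration_ms_sum"] += span["duration_ms"]
        # OpenTracing marks failed operations with a boolean `error` tag.
        if span.get("tags", {}).get("error"):
            entry["errors"] += 1
    return dict(metrics)
```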

vprithvi commented 7 years ago

I agree with @Dieterbe, discoverability of traces is among the biggest problems we are tackling right now.

We have internal tools that aggregate spans into traces, and allow visualizing histograms using data from these traces. They allow slicing by inbounds/outbounds and support drill down from a latency histogram to individual traces. These tools have internal dependencies, and we are at a very early stage of moving them to open source.
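
To illustrate the drill-down idea only (this is not the internal tooling described above), here is a small Python sketch under simplifying assumptions: spans are plain dicts with hypothetical `trace_id`, `start_ms` and `duration_ms` fields, and a trace's latency is taken as the time from its earliest span start to its latest span end.

```python
from collections import defaultdict


def trace_durations(spans):
    """Aggregate spans into traces: latency = latest span end - earliest span start."""
    bounds = {}
    for span in spans:
        start = span["start_ms"]
        end = start + span["duration_ms"]
        lo, hi = bounds.get(span["trace_id"], (start, end))
        bounds[span["trace_id"]] = (min(lo, start), max(hi, end))
    return {trace_id: hi - lo for trace_id, (lo, hi) in bounds.items()}


def histogram_with_traces(spans, bucket_ms=100):
    """Map each latency bucket to the trace IDs that fall into it."""
    buckets = defaultdict(list)
    for trace_id, duration in trace_durations(spans).items():
        buckets[(duration // bucket_ms) * bucket_ms].append(trace_id)
    return dict(buckets)
```

Keeping the trace IDs next to each bucket is what makes the drill down possible: picking a bar in the histogram yields the individual traces to open.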

I agree that having traces connected to something like Apache Zeppelin would greatly aid power users in performing arbitrarily complex analysis. We have plans to make an Apache Flink-based trace ID pipeline available as open source in a few months, which would make such integrations easier.

yurishkuro commented 7 years ago

@Dieterbe please see http://jaeger.readthedocs.io/en/latest/roadmap/#latency-histograms

Dieterbe commented 7 years ago

that looks very useful, and regular histograms seem like a better tool than over-time plots for most cases.

mnovinger commented 6 years ago

@yurishkuro any progress on the histograms? Or is that work available somewhere for contribution?

yurishkuro commented 6 years ago

@mnovinger no, unfortunately. At Uber we're mostly (so far) focused on comprehensive trace analysis rather than individual service optimizations, so histograms are not a high priority (since they are single-service). They can also be obtained from regular per-endpoint metrics (without partitioning by upstream caller as in the link I shared), and once you know a range of latencies that you want to investigate, you can use the duration-based search to find sampled traces. It's a workaround; having histograms directly would be better, of course.
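
As a sketch of the duration-based search step, the snippet below only builds a search URL. It assumes the query service is reachable at the given base URL and that the `/api/traces` endpoint used by the UI accepts `service`, `minDuration`, `maxDuration` and `limit` parameters; that endpoint is internal rather than a stable public API, so treat the path and parameter names as assumptions. The `duration_search_url` helper is hypothetical.

```python
from urllib.parse import urlencode


def duration_search_url(base_url, service, min_duration, max_duration, limit=20):
    """Build a trace search URL restricted to a latency range (e.g. "500ms" to "1s")."""
    params = {
        "service": service,
        "minDuration": min_duration,
        "maxDuration": max_duration,
        "limit": limit,
    }
    return f"{base_url}/api/traces?{urlencode(params)}"
```

For example, after spotting a latency bump between 500ms and 1s in per-endpoint metrics, `duration_search_url("http://localhost:16686", "api", "500ms", "1s")` would give a URL for fetching sampled traces in that range.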

We're very close to open sourcing our Flink-based data pipeline (the Kafka streaming is already merged on master). Doing latency histograms will be fairly trivial once it's there.

sobvan commented 5 years ago

This is an old topic, but may I suggest simply using Kibana with Elasticsearch for chart-based exploration/analysis? It is really powerful and easy.
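
For anyone who wants to try this, here is a hedged sketch of the kind of aggregation a Kibana histogram runs underneath, written as a Python dict that could be sent to the Elasticsearch `_search` endpoint. It assumes Jaeger's Elasticsearch storage keeps span latency in a `duration` field (in microseconds) and the service name in `process.serviceName`; please check your own index mapping. The `latency_histogram_query` helper is hypothetical.

```python
import json


def latency_histogram_query(service, interval_us=100_000):
    """Build an Elasticsearch request body for a latency histogram of one service."""
    return {
        "size": 0,  # only the aggregation is needed, not the matching spans
        "query": {"term": {"process.serviceName": service}},
        "aggs": {
            "latency": {
                "histogram": {
                    "field": "duration",      # assumed to be in microseconds
                    "interval": interval_us,  # bucket width, 100 ms by default
                    "min_doc_count": 1,       # skip empty buckets
                }
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(latency_histogram_query("api"), indent=2))
```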

jpkrohling commented 5 years ago

@istvanszoboszlai have you done this already? It would be a nice blog post.

jaegertracing / jaeger

suggestion: chart based exploration/analysis #358