opensearch-project / dashboards-observability

Visualize and explore your logs, traces and metrics data in OpenSearch Dashboards
https://opensearch.org/docs/latest/observability-plugin/index/
Apache License 2.0

[BUG] Trace Analytics - Jaeger - coalesce all breakdown into one table, one graph/chart #212

Open · pjfitzgibbons opened 1 year ago

pjfitzgibbons commented 1 year ago

What is the bug? Currently, the Trace Analytics dashboard, when viewed with a Jaeger data source, presents the following graph and table (see screenshot):

Note that the "Error Rate" / "Throughput" radio buttons change both the displayed visualization and the table. "Latency" is not offered as a visualization.

How can one reproduce the bug? Steps to reproduce the behavior:

  1. Go to '...'
  2. Click on '....'
  3. Scroll down to '....'
  4. See error

What is the expected behavior? The aggregations of Error Rate, Latency, and Throughput are interrelated, yet somewhat orthogonal, measurements of an application's operational quality. These three measurements mean different things, yet together they can narrow the possible surface area of an operational issue.

Examples:

  1. A spiked Error Rate combined with a coordinated spike in Throughput. Thumbnail analysis: a traffic spike is taxing the system and causing availability problems.
  2. A reduction in Throughput combined with an increase in Latency. This could represent a dependency issue, or a change in application logic that has affected per-request performance.

These measurements are very often analyzed in concert while monitoring and troubleshooting a system.

Recommendations:

  1. Extend the table to include columns for Error Rate, Throughput, AND Latency. Column sorting can then easily present the user with a "Top 5" of each measurement, and a "Top 5 Worst Endpoints" view could be achieved by a weighted sort over the three measurements combined; this would act as a sort of heat map of trouble in the monitored system (see the sorting sketch after this list). The column sort should be mirrored in the URL (query string?) to allow sharing the display as configured when discussing measurements in a specific context.

  2. Display the visualization with breakdown lines for each of Error Rate, Throughput, and Latency. One may be selected "by default" as a lone graph; if so, allow the user to configure which breakdown is displayed by default. Add checkboxes to the visualization legend to show/hide each breakdown line (see the legend-state sketch below). The reasoning for this functionality follows the background above: these measurements are often analyzed in concert, and the visualization is the quickest way for humans to correlate anomalous or proportional changes across them.
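A minimal TypeScript sketch of the weighted "Top 5 Worst Endpoints" sort and the URL mirroring from recommendation 1. The row shape, weights, and query parameter names are all assumptions for illustration, not the plugin's actual data model or API:

```typescript
// Hypothetical shape of a per-endpoint aggregation row; the field names
// are assumptions, not the plugin's actual data model.
interface EndpointMetrics {
  endpoint: string;
  errorRate: number;   // fraction of requests that errored, 0..1
  throughput: number;  // requests per interval
  latencyMs: number;   // e.g. p95 latency in milliseconds
}

// Scale a metric into 0..1 relative to the maximum observed value, so the
// three measurements can be combined on a common scale.
function normalizer(values: number[]): (v: number) => number {
  const max = Math.max(...values, Number.EPSILON);
  return (v) => v / max;
}

// Weighted "badness" score across the three measurements; the default
// weights are illustrative and would presumably be user-tunable.
function worstEndpoints(
  rows: EndpointMetrics[],
  weights = { errorRate: 0.5, latencyMs: 0.3, throughput: 0.2 },
  topN = 5,
): EndpointMetrics[] {
  const normError = normalizer(rows.map((r) => r.errorRate));
  const normLatency = normalizer(rows.map((r) => r.latencyMs));
  const normThroughput = normalizer(rows.map((r) => r.throughput));

  const score = (r: EndpointMetrics) =>
    weights.errorRate * normError(r.errorRate) +
    weights.latencyMs * normLatency(r.latencyMs) +
    weights.throughput * normThroughput(r.throughput);

  // Descending score: the worst endpoints come first.
  return [...rows].sort((a, b) => score(b) - score(a)).slice(0, topN);
}

// Mirror the active column sort in the URL so a configured view can be
// shared; the query parameter names here are assumptions.
function sortStateToQueryString(
  field: keyof EndpointMetrics,
  direction: 'asc' | 'desc',
): string {
  const params = new URLSearchParams(window.location.search);
  params.set('sortField', field);
  params.set('sortDirection', direction);
  return `?${params.toString()}`;
}
```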
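Similarly, a hedged sketch of the legend-driven show/hide state from recommendation 2; the series keys and the configurable default are assumptions about how the feature might be modeled:

```typescript
// The three breakdown series; keys are illustrative.
type Series = 'errorRate' | 'throughput' | 'latency';

interface BreakdownState {
  // Which breakdown lines are currently drawn on the chart.
  visible: Record<Series, boolean>;
  // The lone series shown before the user customizes the view,
  // itself user-configurable per recommendation 2.
  defaultSeries: Series;
}

// Initial state: only the default series is drawn.
function initialState(defaultSeries: Series): BreakdownState {
  return {
    defaultSeries,
    visible: {
      errorRate: defaultSeries === 'errorRate',
      throughput: defaultSeries === 'throughput',
      latency: defaultSeries === 'latency',
    },
  };
}

// Pure state update for a legend checkbox toggle, suitable for a
// React-style reducer or setState call.
function toggleSeries(state: BreakdownState, series: Series): BreakdownState {
  return {
    ...state,
    visible: { ...state.visible, [series]: !state.visible[series] },
  };
}
```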

What is your host/environment?

Do you have any screenshots? If applicable, add screenshots to help explain your problem.

Do you have any additional context? Add any other context about the problem.

pjfitzgibbons commented 1 year ago

@kavck @kgcreative @derek-ho Putting this on our backlog for discussion and consideration. I have time to collaborate on a UX mock of my expected design if that is appropriate.

kgcreative commented 1 year ago

@pjfitzgibbons That'd be super helpful!