SPIKE Dashboards in Grafana

madwort commented 1 year ago

⌚ Maximum two days

See if we can set up a dashboard to show ~~Job Server metrics. The dashboard would be expected to show the current levels of web traffic, a graph of P50/P90/P95 response times, maybe some slow pages and other useful things.~~ job-runner metrics. The dashboard will explore what visualisations are available, and what's possible to produce from our existing traces & metrics.

We expect this will be done using the existing OTel traces & metrics.

Questions to answer during this spike:

What are the advantages/disadvantages to using Grafana Cloud compared to hosting our own version of Grafana?
- from what I've seen, Grafana Cloud really is just "someone else runs Grafana for you" - so the advantages is a clean "you have to do less work" and the disadvantage is "if you need to change a config you have to ask support". Also, it's unclear right now whether support have a lever to change the option about metric generation ingest timeout (although the underlying Grafana server does)
Can we use the existing OTel traces & metrics or will we need to create some new ones? Should we use OTel for this at all?
- I really think we should use OTel for everything
- One thing that may need to change is the TICK trace mechanism - this is tailored to Honeycomb's UI & I'm not convinced the metrics generation system will be able to make much sense of it. In which case we should probably emit our own custom OTel metrics to do exactly what we want (& then maybe filter TICK to only go to HC & the new metric to only go to Grafana)
Does Seb easily understand the dashboard? Do other members of the tech team?
- unclear
Does the Pipeline Team feel this would provide additional information when trying to debug an issue with Job Server?
- We can provide a different style of interface, but Honeycomb has been working well for us recently with debugging Job Server - I think Grafana may be better for surfacing general "how are things going" type feedback to the wider team (i.e. Seb)
Could we use this solution to provide graphs to our users?
- Yes
What would be needed in order for us to give researchers access to these graphs (maybe not the graphs created in the spike but future ones)?
- We can make dashboards public
- Apparently we can also created embeddable widgets, although I haven't investigated that yet.
What would be needed in order for us to graph some Job Runner metrics and is this something we'd want to do?
- I think we'd need to create some new custom metrics in Job Runner. I don't think this is a huge implementation task once we have a plan, but we may to consider carefully what we want to capture.

Implementation notes

~the Grafana wizard directs you to use Grafana Agent in between OTel Collector & GrafanaCloud, however other docs give alternatives https://grafana.com/docs/opentelemetry/collector/send-otlp-to-grafana-cloud-databases/~ we can send OTel directly from the Collector to GrafanaCloud
Grafana config: https://github.com/opensafely-core/sysadmin/pull/122 - needed some fixes along the way:
- https://github.com/opensafely-core/sysadmin/pull/123
- https://github.com/opensafely-core/sysadmin/pull/124
credentials used: instance_id 725008 and I think the current key is the otlptest key from the stack-725008-otlptest-integration access policy
metrics - metrics meaning numerical values sent from the server/app is currently limited to some server metrics (cpu, memory, etc), they should appear in "Prometheus" when we start sending production data (currently test metrics are visible)
traces - we should be able to view traces in a useful way with Grafana, once we deploy the otel-gateway to send prod traces
monitoring - what I think we're currently doing with Honeycomb & hoping to do with Grafana is metrics from spans - sadly "Metrics generation is disabled by default. Contact Grafana Support to enable metrics generation for your organization." - I have emailed support asking for this.
support are going to enable the metrics from spans (see below) - further reading https://grafana.com/docs/tempo/latest/metrics-generator/ or https://grafana.com/docs/grafana-cloud/monitor-infrastructure/traces/metrics-generator/ and https://grafana.com/docs/grafana-cloud/monitor-infrastructure/traces/metrics-summary-api/ - it's a bit unclear to me exactly what this is going to do, but might make more sense when we can try it out - Update: this was enabled in dry-run mode, then allegedly enabled for real, but I'm still seeing no data.
public dashboards are also a preview feature, I have emailed support asking them to enable it for us https://grafana.com/docs/grafana/latest/dashboards/dashboard-public/
I've been experimenting with TraceQL / creating a dashboard etc - it doesn't seem as intuitive as honeycomb, but maybe I'm just habituated to honeycomb

email update 2023-08-25 on metrics from spans

Regarding metrics generator, we can certainly enable that for you.

The metrics generated as part of this feature are written into your Hosted Prometheus instance so they count as active series that are billed like regular metrics: https://grafana.com/docs/grafana-cloud/data-configuration/traces/metrics-generator/#constraints-and-good-to-know

The enablement of the metrics generator is a two-step process. First, we perform a dry run of the service where we generate the metrics but do not write the metrics to your hosted Prometheus instance. The dry run allows you to gauge the number of additional writes we plan to expect from your current usage. The extra write volume can be found in the grafanacloud-usage data source with the grafanacloud_traces_instance_metrics_generator_active_series metric.

At this moment I've place a request to enable the dry run, once it's completed I'll let you know so you can monitor the series generated by this feature and evaluate if you'd like to proceed with enabling remote writes.

Regarding metrics generator, we can certainly enable that for you.

The metrics generated as part of this feature are written into your Hosted Prometheus instance so they count as active series that are billed like regular metrics: https://grafana.com/docs/grafana-cloud/data-configuration/traces/metrics-generator/#constraints-and-good-to-know

The enablement of the metrics generator is a two-step process. First, we perform a dry run of the service where we generate the metrics but do not write the metrics to your hosted Prometheus instance. The dry run allows you to gauge the number of additional writes we plan to expect from your current usage. The extra write volume can be found in the grafanacloud-usage data source with the grafanacloud_traces_instance_metrics_generator_active_series metric.

At this moment I've place a request to enable the dry run, once it's completed I'll let you know so you can monitor the series generated by this feature and evaluate if you'd like to proceed with enabling remote writes.

Great! I have put in a request to enable remote writes with the limit of 400. I will let you know once the request has been approved and deployed.

Public dashboard test https://bennettinstitute.grafana.net/public-dashboards/d8ffe2a1e16845bc89b5fbd74ce7788d

Interesting comment on custom metrics from traces this is what we're trying to do the open source self-hosted Tempo version of this more on span metrics

Thanks! So it looks like some of this data is being rejected/discarded for the reason "outside_metrics_ingestion_slack". Here's a link to the explore search I performed to find this information: https://bennettinstitute.grafana.net/goto/HpN218kSR?orgId=1

According to the metrics generator docs, in the "Monitoring the Metrics-generator" section, it defines this discard reason, outside_metrics_ingestion_slack, as:

The time between the creation of the span and when it was ingested was too large and the span is deemed outdated. Processing this span and including it a current metrics sample would skew the data.

The default value in Grafana Cloud of the configuration option metrics_ingestion_time_range_slack, which determines when spans sent to the metrics generator are discarded or rejected, is 30 seconds: https://grafana.com/docs/tempo/latest/configuration/#metrics-generator.

Let me find out if this is something we can increase for you in Grafana Cloud. I will get back to you once I have more information.

example public dashboard https://bennettinstitute.grafana.net/public-dashboards/d8ffe2a1e16845bc89b5fbd74ce7788d

madwort commented 1 year ago

if we do https://github.com/opensafely-core/job-server/issues/3531 we could (theoretically) (relatively) easily try out a dashboard for that project

lucyb commented 1 year ago

I think this has been done and I'll put a link into the dashboards/metrics one pager for reference. @madwort if you have any extra information to add, please add it to this ticket.

opensafely-core / job-runner

SPIKE Dashboards in Grafana #649