opensafely-core / job-runner

A client for running jobs in an OpenSAFELY secure environment, requested via job-server (q.v.)

SPIKE Dashboards in Grafana #649

Closed madwort closed 10 months ago

madwort commented 1 year ago

⌚ Maximum two days

See if we can set up a dashboard to show job-runner metrics. (For comparison, a Job Server dashboard would be expected to show current levels of web traffic, a graph of P50/P90/P95 response times, maybe some slow pages, and other useful things.) The dashboard will explore what visualisations are available and what's possible to produce from our existing traces & metrics.

We expect this will be done using the existing OTel traces & metrics.
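For reference, traces and metrics of this kind are typically emitted via the OpenTelemetry Python API along these lines (a minimal sketch; the tracer/meter names and attributes are illustrative, not job-runner's actual instrumentation):

```python
# Illustrative sketch of the general OTel pattern, not job-runner's real instrumentation.
from opentelemetry import metrics, trace

tracer = trace.get_tracer("jobrunner")  # hypothetical instrumentation name
meter = metrics.get_meter("jobrunner")

# A counter metric; a dashboard panel can graph its rate, split by label values.
jobs_completed = meter.create_counter(
    "jobs_completed", description="Number of jobs that finished"
)

def run_job(job_id: str) -> None:
    # Each execution becomes a span; span attributes are queryable in Tempo
    # and can become Prometheus labels via the metrics generator.
    with tracer.start_as_current_span("run_job") as span:
        span.set_attribute("job.id", job_id)
        ...  # do the work
        jobs_completed.add(1, {"state": "succeeded"})
```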

Questions to answer during this spike:

Implementation notes


email update 2023-08-25 on metrics from spans

Regarding metrics generator, we can certainly enable that for you.

The metrics generated as part of this feature are written into your Hosted Prometheus instance so they count as active series that are billed like regular metrics: https://grafana.com/docs/grafana-cloud/data-configuration/traces/metrics-generator/#constraints-and-good-to-know

The enablement of the metrics generator is a two-step process. First, we perform a dry run of the service where we generate the metrics but do not write them to your hosted Prometheus instance. The dry run allows you to gauge the number of additional writes to expect from your current usage. The extra write volume can be found in the grafanacloud-usage data source with the grafanacloud_traces_instance_metrics_generator_active_series metric.

At this moment I've placed a request to enable the dry run; once it's completed I'll let you know so you can monitor the series generated by this feature and evaluate whether you'd like to proceed with enabling remote writes.



Great! I have put in a request to enable remote writes with a limit of 400. I will let you know once the request has been approved and deployed.
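For reference, a sketch of how the dry-run series count could be watched against that limit by querying the grafanacloud-usage data source over the standard Prometheus HTTP API (the URL and token below are placeholders; only the metric name comes from the emails above):

```python
# Sketch: query the grafanacloud-usage data source for the metric named in the
# support email. The endpoint URL and auth token below are placeholders.
import requests

PROM_URL = "https://example.grafana.net/api/prom/api/v1/query"  # placeholder
QUERY = "sum(grafanacloud_traces_instance_metrics_generator_active_series)"

resp = requests.get(
    PROM_URL,
    params={"query": QUERY},
    headers={"Authorization": "Bearer <api-token>"},  # placeholder credential
    timeout=10,
)
resp.raise_for_status()
result = resp.json()["data"]["result"]
active_series = float(result[0]["value"][1]) if result else 0.0
print(f"metrics-generator active series: {active_series:.0f} (requested limit: 400)")
```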


Public dashboard test https://bennettinstitute.grafana.net/public-dashboards/d8ffe2a1e16845bc89b5fbd74ce7788d


Interesting links: a comment on custom metrics from traces (this is what we're trying to do), the open source self-hosted Tempo version of this, and more on span metrics.
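To illustrate the span-metrics idea, these are the kinds of queries a dashboard panel might run against the generated series (the names follow Tempo's spanmetrics conventions; the exact metric names and labels in our stack may differ):

```python
# Hypothetical PromQL for dashboard panels, kept as Python strings for reference.
# Metric names follow Tempo's spanmetrics conventions; ours may differ.
REQUEST_RATE = 'sum(rate(traces_spanmetrics_calls_total{service="job-runner"}[5m]))'
P95_LATENCY = (
    "histogram_quantile(0.95, sum(rate("
    'traces_spanmetrics_latency_bucket{service="job-runner"}[5m])) by (le))'
)
```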


Thanks! So it looks like some of this data is being rejected/discarded for the reason "outside_metrics_ingestion_slack". Here's a link to the explore search I performed to find this information: https://bennettinstitute.grafana.net/goto/HpN218kSR?orgId=1

The metrics generator docs, in the "Monitoring the Metrics-generator" section, define this discard reason, outside_metrics_ingestion_slack, as:

The time between the creation of the span and when it was ingested was too large, and the span is deemed outdated. Processing this span and including it in a current metrics sample would skew the data.

The default value in Grafana Cloud of the configuration option metrics_ingestion_time_range_slack, which determines when spans sent to the metrics generator are discarded or rejected, is 30 seconds: https://grafana.com/docs/tempo/latest/configuration/#metrics-generator.

Let me find out if this is something we can increase for you in Grafana Cloud. I will get back to you once I have more information.
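One thing worth checking on our side in the meantime: spans that are emitted well after the events they describe (long-running jobs, or spans reconstructed from recorded timestamps) will fall outside that 30-second window regardless of the Grafana Cloud setting. A diagnostic sketch, assuming the standard OpenTelemetry Python SDK (not our actual code), that flags such spans as they are processed:

```python
# Diagnostic sketch (assumed setup, not our actual code): flag spans whose end
# timestamp is already more than 30s old when the SDK processes them, since the
# metrics generator would discard them as outside_metrics_ingestion_slack.
import time

from opentelemetry.sdk.trace import ReadableSpan, SpanProcessor


class StalenessLogger(SpanProcessor):
    def on_end(self, span: ReadableSpan) -> None:
        lag_seconds = time.time() - span.end_time / 1e9  # end_time is in nanoseconds
        if lag_seconds > 30:
            print(
                f"span {span.name!r} ended {lag_seconds:.0f}s before processing; "
                "the metrics generator would likely discard it"
            )

# Registered alongside the normal exporter, e.g.:
#   provider.add_span_processor(StalenessLogger())
```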


example public dashboard https://bennettinstitute.grafana.net/public-dashboards/d8ffe2a1e16845bc89b5fbd74ce7788d

madwort commented 1 year ago

If we do https://github.com/opensafely-core/job-server/issues/3531, we could (theoretically) try out a dashboard for that project relatively easily.

lucyb commented 1 year ago

I think this has been done, and I'll put a link into the dashboards/metrics one-pager for reference. @madwort, if you have any extra information to add, please add it to this ticket.