madwort closed 10 months ago
if we do https://github.com/opensafely-core/job-server/issues/3531 we could (theoretically) (relatively) easily try out a dashboard for that project
I think this has been done and I'll put a link into the dashboards/metrics one pager for reference. @madwort if you have any extra information to add, please add it to this ticket.
⌚ Maximum two days
See if we can set up a dashboard to show
- Job Server metrics. The dashboard would be expected to show the current levels of web traffic, a graph of P50/P90/P95 response times, maybe some slow pages and other useful things.
- job-runner metrics. The dashboard will explore what visualisations are available, and what's possible to produce from our existing traces & metrics.

We expect this will be done using the existing OTel traces & metrics.
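For reference, a minimal sketch of what the P50/P90/P95 figures mean, computed directly from a list of request durations with the nearest-rank method. This is not how the dashboard would calculate them (Grafana would normally derive percentiles from the span or histogram data). The function names and the sample durations are made up for illustration.

```python
def percentile(durations_ms, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not durations_ms:
        raise ValueError("no samples")
    if not isinstance(pct, int) or not 0 < pct <= 100:
        raise ValueError("pct must be an integer in 1..100")
    ordered = sorted(durations_ms)
    # Integer ceiling of pct * n / 100, avoiding floating point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


def summarise(durations_ms):
    """Return the three percentiles mentioned in the ticket."""
    return {f"p{p}": percentile(durations_ms, p) for p in (50, 90, 95)}


# Hypothetical response times in milliseconds, for illustration only.
sample = [12, 15, 18, 20, 22, 25, 30, 45, 120, 900]
print(summarise(sample))
```

The one slow request (900 ms) only shows up in P95 here, which is why a dashboard usually plots several percentiles rather than an average.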
Questions to answer during this spike:
Implementation notes
725008
and I think the current key is the otlptest key from the stack-725008-otlptest-integration access policy

email update 2023-08-25 on metrics from spans
Regarding metrics generator, we can certainly enable that for you.
The metrics generated as part of this feature are written into your Hosted Prometheus instance so they count as active series that are billed like regular metrics: https://grafana.com/docs/grafana-cloud/data-configuration/traces/metrics-generator/#constraints-and-good-to-know
The enablement of the metrics generator is a two-step process. First, we perform a dry run of the service where we generate the metrics but do not write the metrics to your hosted Prometheus instance. The dry run allows you to gauge the number of additional writes we plan to expect from your current usage. The extra write volume can be found in the grafanacloud-usage data source with the grafanacloud_traces_instance_metrics_generator_active_series metric.
At this moment I've placed a request to enable the dry run. Once it's completed, I'll let you know so you can monitor the series generated by this feature and evaluate if you'd like to proceed with enabling remote writes.
Great! I have put in a request to enable remote writes with the limit of 400. I will let you know once the request has been approved and deployed.
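A rough sketch of how the dry-run figure could be compared against the requested limit. It assumes the limit of 400 refers to active series, which the email does not spell out. The metric name comes from the email above; the 7d window, the helper names and the example peak values are assumptions for illustration.

```python
# Metric named in the support email; lives in the grafanacloud-usage data source.
USAGE_METRIC = "grafanacloud_traces_instance_metrics_generator_active_series"

# Limit quoted in the support email (assumed here to mean active series).
REMOTE_WRITE_LIMIT = 400


def peak_query(metric=USAGE_METRIC, window="7d"):
    """Build a PromQL expression for the peak value over a window.
    The 7d default is an arbitrary choice, not something from the email."""
    return f"max_over_time({metric}[{window}])"


def headroom(peak_active_series, limit=REMOTE_WRITE_LIMIT):
    """Series left before the limit is reached; negative means over the limit."""
    return limit - peak_active_series


print(peak_query())
# Hypothetical peak values read off the dry run, for illustration only.
for peak in (250, 400, 520):
    print(peak, "ok" if headroom(peak) >= 0 else "over limit", headroom(peak))
```

The query string can be pasted into Explore against the grafanacloud-usage data source; the check itself is just subtraction, but it makes the billing question explicit before remote writes are switched on.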
Public dashboard test https://bennettinstitute.grafana.net/public-dashboards/d8ffe2a1e16845bc89b5fbd74ce7788d
Interesting comment on custom metrics from traces
this is what we're trying to do
the open source self-hosted Tempo version of this
more on span metrics
Thanks! So it looks like some of this data is being rejected/discarded for the reason "outside_metrics_ingestion_slack". Here's a link to the explore search I performed to find this information: https://bennettinstitute.grafana.net/goto/HpN218kSR?orgId=1
According to the metrics generator docs, in the "Monitoring the Metrics-generator" section, it defines this discard reason, outside_metrics_ingestion_slack, as:
The default value in Grafana Cloud of the configuration option metrics_ingestion_time_range_slack, which determines when spans sent to the metrics generator are discarded or rejected, is 30 seconds: https://grafana.com/docs/tempo/latest/configuration/#metrics-generator.
Let me find out if this is something we can increase for you in Grafana Cloud. I will get back to you once I have more information.
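To make the discard reason concrete, here is a small sketch of the kind of check the slack setting implies: a span only produces metrics if its end time is within the slack window of the time the generator receives it. This is our reading of the docs, not Tempo's actual code. The function name and timestamps are hypothetical; the 30 second default is the value quoted above.

```python
from datetime import datetime, timedelta, timezone

# Default quoted in the support email for metrics_ingestion_time_range_slack.
DEFAULT_SLACK = timedelta(seconds=30)


def within_ingestion_slack(span_end, received_at, slack=DEFAULT_SLACK):
    """True if the span ended within +/- slack of when it was received.
    Spans outside this window are assumed to be discarded with the
    reason outside_metrics_ingestion_slack."""
    return abs(received_at - span_end) <= slack


# Hypothetical timestamps, for illustration only.
received = datetime(2023, 9, 1, 12, 5, 0, tzinfo=timezone.utc)
fresh_span_end = received - timedelta(seconds=10)
late_span_end = received - timedelta(minutes=5)

print(within_ingestion_slack(fresh_span_end, received))  # True
print(within_ingestion_slack(late_span_end, received))   # False
```

If spans are exported some time after they end (for example in batches), they would fall into the second case, which would match the discards seen in the explore search.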
example public dashboard https://bennettinstitute.grafana.net/public-dashboards/d8ffe2a1e16845bc89b5fbd74ce7788d