prometheus / alertmanager

Prometheus Alertmanager
https://prometheus.io
Apache License 2.0

Add basic arithmetic functions to templating funcmap #1188

Open roganartu opened 6 years ago

roganartu commented 6 years ago

Why

I link to Grafana dashboards from my Prometheus alerts, using some alert labels as Grafana variables in the URL to narrow down the dashboard queries. I would like to link to a specific range (e.g. 30 minutes before/after) around StartsAt and/or EndsAt instead of having to adjust the timeline after opening the link, but this requires basic arithmetic functions (i.e. add/sub). In addition, Grafana URL parameters expect Unix timestamps in milliseconds, which makes StartsAt and EndsAt unusable as they stand and forces me to return to the alert and manually copy the times into Grafana via a date picker.

Proposal

Add the following self-explanatory arithmetic functions to DefaultFuncs in https://github.com/prometheus/alertmanager/blob/master/template/template.go

  1. add
  2. sub
  3. div
  4. mul

From a user perspective, I don't want to worry about types here, so these functions will require type assertions to differentiate between floats and ints (and maybe strconv.Atoi if string, but I'm not sold on the usefulness of that). Still, this should be relatively simple to implement.
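A minimal sketch of what such type-agnostic helpers could look like, assuming Go's text/template and a FuncMap like the one in template.go. The helper names asInt, asFloat and binary are hypothetical, and the argument order (sub a b = a - b, called in prefix form rather than through a pipe) is an assumption, not something the proposal specifies:

```go
package main

import (
	"fmt"
	"os"
	"text/template"
)

// asInt reports whether v is an int or int64 and returns it widened to int64.
func asInt(v any) (int64, bool) {
	switch n := v.(type) {
	case int:
		return int64(n), true
	case int64:
		return n, true
	}
	return 0, false
}

// asFloat accepts float64 as well as the integer kinds handled by asInt.
func asFloat(v any) (float64, bool) {
	if f, ok := v.(float64); ok {
		return f, true
	}
	i, ok := asInt(v)
	return float64(i), ok
}

// binary builds a template function from an integer and a float implementation.
// Two integer operands keep an int64 result; any float operand yields float64.
func binary(intOp func(a, b int64) int64, floatOp func(a, b float64) float64) func(a, b any) (any, error) {
	return func(a, b any) (any, error) {
		ai, aIsInt := asInt(a)
		bi, bIsInt := asInt(b)
		if aIsInt && bIsInt {
			return intOp(ai, bi), nil
		}
		af, aOK := asFloat(a)
		bf, bOK := asFloat(b)
		if !aOK || !bOK {
			return nil, fmt.Errorf("unsupported operand types %T and %T", a, b)
		}
		return floatOp(af, bf), nil
	}
}

var (
	add = binary(func(a, b int64) int64 { return a + b }, func(a, b float64) float64 { return a + b })
	sub = binary(func(a, b int64) int64 { return a - b }, func(a, b float64) float64 { return a - b })
	mul = binary(func(a, b int64) int64 { return a * b }, func(a, b float64) float64 { return a * b })
)

// div always returns a float64 and rejects a zero divisor instead of panicking.
func div(a, b any) (float64, error) {
	af, aOK := asFloat(a)
	bf, bOK := asFloat(b)
	if !aOK || !bOK {
		return 0, fmt.Errorf("unsupported operand types %T and %T", a, b)
	}
	if bf == 0 {
		return 0, fmt.Errorf("division by zero")
	}
	return af / bf, nil
}

func main() {
	funcs := template.FuncMap{"add": add, "sub": sub, "mul": mul, "div": div}
	// StartMillis and EndMillis are illustrative field names, not Alertmanager data.
	tmpl := template.Must(template.New("link").Funcs(funcs).Parse(
		"from={{ sub .StartMillis 1800000 }}&to={{ add .EndMillis 1800000 }}\n"))
	data := map[string]int64{"StartMillis": 1700000000000, "EndMillis": 1700000600000}
	if err := tmpl.Execute(os.Stdout, data); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

One design point worth noting: in a Go template pipeline the piped value becomes the last argument, so with this argument order `x | sub 1800000` would compute 1800000 - x. An implementation meant to be used with pipes would have to pick its argument order with that in mind.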

Add the following two functions to allow the use of the above arithmetic functions to manipulate the timestamps in StartsAt and EndsAt:

  1. toUnix
  2. fromUnix
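As a rough illustration, the two conversion helpers could be as small as the following. This is a sketch only: the use of milliseconds (to match what Grafana expects, per the description above) and the exact signatures are assumptions, since the proposal does not fix either.

```go
package main

import (
	"fmt"
	"os"
	"text/template"
	"time"
)

// toUnix returns t as Unix milliseconds (assumed unit, chosen for Grafana URLs).
func toUnix(t time.Time) int64 { return t.UnixMilli() }

// fromUnix converts Unix milliseconds back into a time.Time in UTC.
func fromUnix(ms int64) time.Time { return time.UnixMilli(ms).UTC() }

func main() {
	funcs := template.FuncMap{"toUnix": toUnix, "fromUnix": fromUnix}
	tmpl := template.Must(template.New("ts").Funcs(funcs).Parse(
		"from={{ .StartsAt | toUnix }}\n"))
	// A stand-in for the alert data; only StartsAt is modelled here.
	data := map[string]time.Time{"StartsAt": time.Unix(1700000000, 0)}
	if err := tmpl.Execute(os.Stdout, data); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

Time.UnixMilli and time.UnixMilli require Go 1.17 or later.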

This is similar to #603 but that request has other concerns about manipulating/accessing current dates, sorting lists etc, so I thought it worth separating the two.

brian-brazil commented 6 years ago

Anything like this should be done down in alert templates in Prometheus.

roganartu commented 6 years ago

The problem with putting this in Prometheus alert templates is that it will cause a huge amount of duplication in this case.

To solve my given example with my proposal I would need to add something like the following to a single shared template in alertmanager:

{{ $url := (printf "%s?from=%s" $url (.StartsAt | toUnix | sub 1800000)) }}
{{ $url := (printf "%s&to=%s" $url (.EndsAt | toUnix | add 1800000)) }}

That I can then include wherever needed with {{ template "grafana.link.href.partial" . }}.

To achieve the same if these functions instead existed in Prometheus alert templates would require adding the following to every single rule:

annotations:
  start_unix: "{{ .StartsAt | toUnix | sub 1800000 }}"
  end_unix: "{{ .EndsAt | toUnix | add 1800000 }}"

As well as having a line in the alertmanager template to extract the annotation anyway.

Additionally, timestamp data seemingly isn't exposed to Prometheus alert templates. Unless I'm missing something, only labels and the raw sample value are exposed (https://github.com/prometheus/prometheus/blob/master/rules/alerting.go#L196-L202). Even if timestamp data were exposed to the template there, it wouldn't be the ActiveAt timestamp (same as StartsAt in alertmanager?), which is the useful one for this example.

brian-brazil commented 6 years ago

The StartsAt and EndsAt aren't exactly reliable, and may be zero depending on the current state of the alert. They're more an implementation detail than anything.

Usually you also want context on an alert, not merely when it gets bad enough to start firing. What I'd suggest is creating links to Grafana with fixed parameters such as &from=now-6h&to=now, or relying on the defaults for the dashboard, which (presumably) already have an appropriate value for the time range.

carlosflorencio commented 6 years ago

Ugly workaround for now:

groups:
- name: testalert
  rules:
  - record: grafanaFrom
    expr: vector((time() - (30*60))*1000)
  - record: grafanaTo
    expr: vector((time() + (30*60))*1000)
  - alert: IgnoreAlert
    expr: vector(1)
    for: 10s
    labels:
      severity: major
      grafana: "http://grafana.board.local?{{ printf \"from=%.0f&to=%.0f\" (query \"grafanaFrom\" | first | value) (query \"grafanaTo\" | first | value) }}"
    annotations:
      summary: Daily alert test summary
      description: Daily alert test description

simonpasquier commented 6 years ago

Note that the grafana link should be an annotation and not a label (see https://github.com/prometheus/prometheus/issues/4652 for the details).

ServerNinja commented 5 years ago

> The StartsAt and EndsAt aren't exactly reliable, and may be zero depending on the current state of the alert. They're more an implementation detail than anything.
>
> Usually you also want context on an alert, not merely when it gets bad enough to start firing. What I'd suggest is creating links to Grafana with fixed parameters such as &from=now-6h&to=now, or relying on the defaults for the dashboard, which (presumably) already have an appropriate value for the time range.

I would argue that being able to attach a graph to an alert covering the timeframe in which the alert occurred, as opposed to (now-6h to now), would be ideal for gathering data and graphs to prepare for postmortems. It seems like it would be very beneficial.

Ideally, one would do something like this in the alert template:

https://grafana.url:xxx/dashboard?var-pod_name={{ .Labels.pod_name }}&from={{ .StartsAt | UnixDate }}-15m&to={{ .EndsAt | UnixDate }}

Tyson1986 commented 5 years ago

Any updates? I tried to put Splunk and Grafana links with timestamps into the Splunk alert template, and I still haven't found a good solution. IMHO, using relative links like now-6h to now is bad practice: sometimes you want to use the link later, for example after the weekend. As of now, the closest solution is to use: {{ with query "time()" }}{{ . | first | value | printf "%.0f"}}{{ end }}

Yapcheekian commented 4 years ago

> Any updates? I tried to put Splunk and Grafana links with timestamps into the Splunk alert template, and I still haven't found a good solution. IMHO, using relative links like now-6h to now is bad practice: sometimes you want to use the link later, for example after the weekend. As of now, the closest solution is to use: {{ with query "time()" }}{{ . | first | value | printf "%.0f"}}{{ end }}

Do you have any idea how to trim the whitespace at the beginning and end of the timestamp?

bastibrunner commented 3 years ago

I found this thread while searching for grafana timerange but in alertmanager templates. This is my solution, maybe it helps someone else:

&time={{- (index .Alerts 0).StartsAt.Unix -}}000&time.window=600000

Alexander-Bartosh commented 3 years ago

Guys I needed StartsAt - 10m

This {{ (.StartsAt.Add -600000000000 ).Unix }}000 Did the trick for me with Grafana.

Logs: <{{ $.ExternalURL }}/explore?orgId=1&left=%5B%22{{ (.StartsAt.Add -600000000000 ).Unix }}000%22,%22{{if eq .Status "firing" }}now{{ else }}{{ .EndsAt.Unix }}000{{ end }}%22,%22Loki%22,%7B%22expr%22:%22{{ urlquery .Annotations.logsExpr | reReplaceAll "\+" "%20" | reReplaceAll "%5C" "%5C%5C" | reReplaceAll "%22" "%5C%22" }}%22%7D%5D|:chart_with_upwards_trend: Graph>
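The time-arithmetic part of this workaround can be checked outside Alertmanager with plain text/template. The sketch below uses a stand-in alert struct with only StartsAt and EndsAt (an assumption; the real template data has more fields) and falls back to "now" when EndsAt is zero, which may happen according to the earlier comments. The render helper is hypothetical and exists only for the demonstration.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"text/template"
	"time"
)

// alert mimics only the two timestamp fields discussed in this thread.
type alert struct {
	StartsAt time.Time
	EndsAt   time.Time
}

// Add takes a time.Duration in nanoseconds, so -600000000000 means "minus 10 minutes".
// Appending the literal 000 turns Unix seconds into the milliseconds Grafana expects.
const linkTemplate = `from={{ (.StartsAt.Add -600000000000).Unix }}000` +
	`&to={{ if .EndsAt.IsZero }}now{{ else }}{{ .EndsAt.Unix }}000{{ end }}`

// render executes linkTemplate for the given Unix seconds; endUnix <= 0 leaves EndsAt zero.
func render(startUnix, endUnix int64) (string, error) {
	a := alert{StartsAt: time.Unix(startUnix, 0)}
	if endUnix > 0 {
		a.EndsAt = time.Unix(endUnix, 0)
	}
	tmpl, err := template.New("link").Parse(linkTemplate)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, a); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	for _, end := range []int64{0, 1700000300} {
		link, err := render(1700000000, end)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		fmt.Println(link)
	}
}
```

The parentheses around `.StartsAt.Add -600000000000` matter: they make the template call Add first and then take .Unix of the result.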

roidelapluie commented 3 years ago

I also use {{ (.StartsAt.Add -600000000000).Unix }}000.

I think we can close this issue.

hanikesn commented 3 years ago

I think it makes sense to document the workaround in the official documentation as it isn't obvious for most people.

ismarslomic commented 3 years ago

I spent many hours finding this issue and its workarounds, so I definitely think the official docs should be updated with examples and tips. Linking to a Grafana dashboard with a time range is crucial. Even better would be to have variables and functions that support this directly.

Thanks to all contributing with useful workarounds!

diversario commented 2 years ago

There's still no basic math available, though.

grobinson-grafana commented 10 months ago

There are no integer or decimal fields in the template data as far as I can tell, so in what situations would having Math functions be useful? (template.go#L296-L317)

nikita2206 commented 10 months ago

@grobinson-grafana there is {{ $value }}, take for example kube_job_status_start_time and you could use that value (unix ts) to generate a link to logs with sensible timestamp bounds

grobinson-grafana commented 10 months ago

@nikita2206 There is $value in Prometheus. However, this issue is talking about Alertmanager, and there is no $value in Alertmanager as far as I know?

nikita2206 commented 10 months ago

@grobinson-grafana To be more specific, here is my use case: (including the workaround)

  - alert: KubeCronJobFailing2Hours
    expr: |
      (kube_job_failed{condition="true"} > 0)
        * on (job_name) group_right ()
          label_replace(kube_job_owner{owner_kind="CronJob"}, "cronjob", "$0", "owner_name", ".*")
        * on (job_name) group_left ()
          kube_job_status_start_time
      unless on (cronjob)
        label_replace(
          present_over_time(kube_job_status_completion_time[2h]),
          "cronjob", "$1", "job_name", "^(.+)-\\d+$")
    annotations:
      type: Job
      cronjob: "{{ $labels.cronjob }}"
      message: >
        CronJob `{{ $labels.cronjob }}` is failing and hasn't completed successfully for at least 2 hours,
        last attempt was at {{ $value | humanizeTimestamp }},
        <https://logs-backend.internal/logs?filter=trace-id%3D%27{{ $labels.job_name }}%27&startTime={{ (printf "vector(%f - 10)" .Value) | query | first | value | printf "%.0f" }}&endTime={{ (printf "vector(%f + 1800)" .Value) | query | first | value | printf "%.0f" }}|logs here>.

As you can see, I would like to include a link to the logs, which needs time bounds. Sensible time bounds, given that the start timestamp of the Job is known, would be something like '10 seconds before the job started' until '30 minutes after the job started'.