Rule: Failing evaluation of rules with no metric only number

FUSAKLA commented 6 years ago

Thanos, Prometheus and Golang version used Build from master 3fd740f

# thanos --version
thanos, version 0.1.0 (branch: non-git, revision: non-git)
  build user:       root@runner-aac4221e-project-8851-concurrent-0
  build date:       20181027-21:20:04
  go version:       go1.11.1

What happened I deployed rule node and put there some random alerts and rules we have just to test if it works and log is full of warnings such as.

thanos-rule-0 thanos-rule level=warn ts=2018-10-28T00:50:04.228567229Z caller=manager.go:343 component=rules group=processes-alerts msg="Evaluating rule failed" rule="record: alerting_threshold:last_successful_run_timestamp\nexpr: \"10000\"\nlabels:\n  app: foobar\n  team: xxx@yyyy.cz\n" err="json: cannot unmarshal number into Go value of type struct { Metric model.Metric \"json:\\\"metric\\\"\"; Value model.SamplePair \"json:\\\"value\\\"\" }"

The recording rule reffered in the log is:

record: alerting_threshold:last_successful_run_timestamp 
expr: 10000 
labels: 
  app: foobar 
  team: xxx@yyyy.cz

How to reproduce it (as minimally and precisely as possible): I suspect that he cause is the expr containing only number and no metric. I have more rules in there and only those used just as a thresholds causes those warnings.

The error message is not that completely clear but it also sounds like the issue is that when un-marshaling it's not expecting to get only a number.

On Prometheus instance version 2.3.2 it's working ok.

FUSAKLA commented 6 years ago

Could this comment be relevant?

https://github.com/improbable-eng/thanos/blob/3fd740ffcea8182728ee92e973d3e81aa831c026/cmd/thanos/rule.go#L597

bwplotka commented 6 years ago

Yup. What's the use case for scalar one then? (:

I feel like we can add scalar support if needed, but trying to understand the use case (:

FUSAKLA commented 6 years ago

Well generally we use those rules for threshold alerting where mostly the alert is still the same so we just join the metrics based on label with scalar threshold with particular label.

example:

 - record: alerting_threshold:barrel_creation_time_seconds
    expr: "8700"
    labels:
      name: coec-barrels
  - record: alerting_threshold:barrel_creation_time_seconds
    expr: "7000"
    labels:
      name: search-barrels
 # metric `barrel_creation_time` has label `name` which matches those `name` labels of ` alerting_threshold:barrel_creation_time_seconds`
  - alert: AgeOfBarrel
    expr: (time() - (barrel_creation_time > 0)) > ON(name) GROUP_LEFT() alerting_threshold:barrel_creation_time_seconds
    for: 5m
    labels:
      team: xxx
      severity: warning
      channel: xxx
    annotations:
      title: Age of {{$labels.app}} barrel is too old
      description: Barrel {{$labels.name}} of app {{$labels.app}} is too old ({{$value
        | humanizeDuration}}) on {{$labels.instance}}.{{$labels.locality}}).

This is pretty common I think. There is even blog post about this on Robust Perception https://www.robustperception.io/using-time-series-as-alert-thresholds.

I'm not blocked by this, as I said it was just testing case. Those alerts will be on Prometheis instances but I can imagine the alert could need data from multiple Prometheis instances. Than you'd need to record those threshold rules on one one of the Prometheis instances and use them in the Thanos rule node in the alert. But this would spread configuration of the one alert on two distinct components which would be kind of uncomfortable but still possible.

I'd say adding note about this to the documentation could be sufficient solution for now. Possibly the support can be added when needed as you mentioned, I'm not pushing it

bwplotka commented 6 years ago

Sounds like a valid use case for me, was looking exactly for this justification (:

thanos-io / thanos

Rule: Failing evaluation of rules with no metric only number #598