kubernetes-monitoring / kubernetes-mixin

A set of Grafana dashboards and Prometheus alerts for Kubernetes.
Apache License 2.0

SLO dashboards/alerts are not working in 1.18+ #394

Closed · povilasv closed this 2 days ago

povilasv commented 4 years ago

Some SLO dashboards/alerts are not working

Generated rule:

record: apiserver_request:availability30d
expr: 1
  - (sum(increase(apiserver_request_duration_seconds_count{job="kube-apiserver",verb=~"LIST|GET"}[30d]))
  - (sum(increase(apiserver_request_duration_seconds_bucket{job="kube-apiserver",le="0.1",scope="resource",verb=~"LIST|GET"}[30d]))
  + sum(increase(apiserver_request_duration_seconds_bucket{job="kube-apiserver",le="0.5",scope="namespace",verb=~"LIST|GET"}[30d]))
  + sum(increase(apiserver_request_duration_seconds_bucket{job="kube-apiserver",le="5",scope="cluster",verb=~"LIST|GET"}[30d])))
  + sum(code:apiserver_request_total:increase30d{code=~"5..",verb="read"}))
  / sum(code:apiserver_request_total:increase30d{verb="read"})
labels:
  verb: read

For example, this part returns no data:

sum(increase(apiserver_request_duration_seconds_bucket{job="kube-apiserver",le="0.1",scope="resource",verb=~"LIST|GET"}[30d]))

I believe we need to add OR on() vector(0) to all those sum(increase()) parts.
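
For illustration, the fallback applied to one of the inner sums could look like this (just a sketch; exactly where to add it across the nested sums is up for discussion):

sum(increase(apiserver_request_duration_seconds_bucket{job="kube-apiserver",le="0.1",scope="resource",verb=~"LIST|GET"}[30d]))
  or on() vector(0)

With the fallback, an empty inner sum contributes 0 instead of making the whole expression return no data.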

Prometheus:

[screenshot]

Grafana:

[screenshot]

cc @metalmatze @brancz

metalmatze commented 4 years ago

Can you check the logs of that Prometheus? I already got reports that the recording rules might load too many samples.

povilasv commented 4 years ago

Can't see anything interesting:

+ prometheus-0 › prometheus
prometheus-0 prometheus level=warn ts=2020-04-14T02:59:12.021Z caller=main.go:287 deprecation_notice="'storage.tsdb.retention' flag is deprecated use 'storage.tsdb.retention.time' instead."
prometheus-0 prometheus level=info ts=2020-04-14T02:59:12.022Z caller=main.go:333 msg="Starting Prometheus" version="(version=2.17.1, branch=master, revision=3df7a0b734c9a3bebddf22016e650ee81ade8357)"
prometheus-0 prometheus level=info ts=2020-04-14T02:59:12.022Z caller=main.go:334 build_context="(go=go1.14.2, user=root@a440986f0462, date=20200413-14:06:48)"
prometheus-0 prometheus level=info ts=2020-04-14T02:59:12.022Z caller=main.go:335 host_details="(Linux 5.6.2-1-ARCH #1 SMP Sun Apr 5 19:58:55 UTC 2020 aarch64 prometheus-0 (none))"
prometheus-0 prometheus level=info ts=2020-04-14T02:59:12.023Z caller=main.go:336 fd_limits="(soft=1048576, hard=1048576)"
prometheus-0 prometheus level=info ts=2020-04-14T02:59:12.023Z caller=main.go:337 vm_limits="(soft=unlimited, hard=unlimited)"
prometheus-0 prometheus level=info ts=2020-04-14T02:59:12.093Z caller=main.go:666 msg="Starting TSDB ..."
prometheus-0 prometheus level=info ts=2020-04-14T02:59:12.093Z caller=web.go:517 component=web msg="Start listening for connections" address=0.0.0.0:9090
prometheus-0 prometheus level=info ts=2020-04-14T02:59:12.227Z caller=repair.go:59 component=tsdb msg="Found healthy block" mint=1584872536373 maxt=1585137600000 ulid=01E4GKWFH76MGBM034EB04A90M
prometheus-0 prometheus level=info ts=2020-04-14T02:59:12.237Z caller=repair.go:59 component=tsdb msg="Found healthy block" mint=1585137600000 maxt=1585720800000 ulid=01E4TFHR7WZQQ4RWE2W4GH6Z3S
prometheus-0 prometheus level=info ts=2020-04-14T02:59:12.244Z caller=repair.go:59 component=tsdb msg="Found healthy block" mint=1585720800000 maxt=1586304000000 ulid=01E5BVR4KVAT9FNY8GK4WECPMS
prometheus-0 prometheus level=info ts=2020-04-14T02:59:12.252Z caller=repair.go:59 component=tsdb msg="Found healthy block" mint=1586304000000 maxt=1586498400000 ulid=01E5HN3XG8WXW7B74QNMJ3JCM5
prometheus-0 prometheus level=info ts=2020-04-14T02:59:12.255Z caller=repair.go:59 component=tsdb msg="Found healthy block" mint=1586498400000 maxt=1586692800000 ulid=01E5QEGG3AK621V17M40KZCGDP
prometheus-0 prometheus level=info ts=2020-04-14T02:59:12.256Z caller=repair.go:59 component=tsdb msg="Found healthy block" mint=1586692800000 maxt=1586757600000 ulid=01E5SC9PXTD6Z1KP8G9AF202VK
prometheus-0 prometheus level=info ts=2020-04-14T02:59:12.258Z caller=repair.go:59 component=tsdb msg="Found healthy block" mint=1586757600000 maxt=1586779200000 ulid=01E5T0WMK654F2P9WT66W1VZQW
prometheus-0 prometheus level=info ts=2020-04-14T02:59:12.267Z caller=repair.go:59 component=tsdb msg="Found healthy block" mint=1586800800000 maxt=1586808000000 ulid=01E5TNF0N9QJH6NS5CKTHB0T91
prometheus-0 prometheus level=info ts=2020-04-14T02:59:12.271Z caller=repair.go:59 component=tsdb msg="Found healthy block" mint=1586779200000 maxt=1586800800000 ulid=01E5TNFMFRAH052HZW7QZA98R6
prometheus-0 prometheus level=info ts=2020-04-14T02:59:12.275Z caller=repair.go:59 component=tsdb msg="Found healthy block" mint=1586808000000 maxt=1586815200000 ulid=01E5TWAPDNNTSX00P5XSFPR6TF
prometheus-0 prometheus level=info ts=2020-04-14T02:59:12.282Z caller=repair.go:59 component=tsdb msg="Found healthy block" mint=1586815200000 maxt=1586822400000 ulid=01E5V3DFM1BSG746DVBVHEKEYR
prometheus-0 prometheus level=info ts=2020-04-14T02:59:13.150Z caller=head.go:575 component=tsdb msg="Replaying WAL, this may take awhile"
prometheus-0 prometheus level=info ts=2020-04-14T02:59:21.253Z caller=head.go:600 component=tsdb msg="WAL checkpoint loaded"
prometheus-0 prometheus level=warn ts=2020-04-14T02:59:31.991Z caller=head.go:558 component=tsdb msg="Unknown series references" count=60
prometheus-0 prometheus level=info ts=2020-04-14T02:59:31.992Z caller=head.go:624 component=tsdb msg="WAL segment loaded" segment=329 maxSegment=332
prometheus-0 prometheus level=info ts=2020-04-14T02:59:32.976Z caller=head.go:624 component=tsdb msg="WAL segment loaded" segment=330 maxSegment=332
prometheus-0 prometheus level=info ts=2020-04-14T02:59:37.713Z caller=head.go:624 component=tsdb msg="WAL segment loaded" segment=331 maxSegment=332
prometheus-0 prometheus level=info ts=2020-04-14T02:59:37.714Z caller=head.go:624 component=tsdb msg="WAL segment loaded" segment=332 maxSegment=332
prometheus-0 prometheus level=info ts=2020-04-14T02:59:37.715Z caller=head.go:627 component=tsdb msg="WAL replay completed" duration=24.564294265s
prometheus-0 prometheus level=info ts=2020-04-14T02:59:40.261Z caller=main.go:682 fs_type=EXT4_SUPER_MAGIC
prometheus-0 prometheus level=info ts=2020-04-14T02:59:40.261Z caller=main.go:683 msg="TSDB started"
prometheus-0 prometheus level=info ts=2020-04-14T02:59:40.264Z caller=main.go:787 msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
prometheus-0 prometheus level=info ts=2020-04-14T02:59:40.297Z caller=kubernetes.go:253 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
prometheus-0 prometheus level=info ts=2020-04-14T02:59:40.348Z caller=kubernetes.go:253 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
prometheus-0 prometheus level=info ts=2020-04-14T02:59:40.376Z caller=kubernetes.go:253 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
prometheus-0 prometheus level=info ts=2020-04-14T02:59:40.394Z caller=kubernetes.go:253 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
prometheus-0 prometheus level=info ts=2020-04-14T02:59:40.410Z caller=kubernetes.go:253 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
prometheus-0 prometheus level=info ts=2020-04-14T02:59:40.982Z caller=main.go:815 msg="Completed loading of configuration file" filename=/etc/prometheus/prometheus.yml
prometheus-0 prometheus level=info ts=2020-04-14T02:59:41.010Z caller=main.go:634 msg="Server is ready to receive web requests."
prometheus-0 prometheus level=info ts=2020-04-14T03:00:26.671Z caller=compact.go:495 component=tsdb msg="write block" mint=1586822400000 maxt=1586829600000 ulid=01E5VA2706SVQKQTG83SB1H8X7 duration=24.488985727s
prometheus-0 prometheus level=info ts=2020-04-14T03:00:28.797Z caller=head.go:662 component=tsdb msg="Head GC completed" duration=789.952594ms
prometheus-0 prometheus level=info ts=2020-04-14T03:01:05.477Z caller=head.go:734 component=tsdb msg="WAL checkpoint complete" first=329 last=330 duration=36.679591967s
prometheus-0 prometheus level=info ts=2020-04-14T03:02:21.966Z caller=compact.go:441 component=tsdb msg="compact blocks" count=3 mint=1586800800000 maxt=1586822400000 ulid=01E5VA4628M6T2F5F40VZP6FW9 sources="[01E5TNF0N9QJH6NS5CKTHB0T91 01E5TWAPDNNTSX00P5XSFPR6TF 01E5V3DFM1BSG746DVBVHEKEYR]" duration=1m15.205954653s
prometheus-0 prometheus level=info ts=2020-04-14T03:03:22.440Z caller=compact.go:441 component=tsdb msg="compact blocks" count=3 mint=1586757600000 maxt=1586822400000 ulid=01E5VA6HF75XBRZZ28CZ682MK6 sources="[01E5T0WMK654F2P9WT66W1VZQW 01E5TNFMFRAH052HZW7QZA98R6 01E5VA4628M6T2F5F40VZP6FW9]" duration=58.466114211s
prometheus-0 prometheus level=info ts=2020-04-14T05:00:12.921Z caller=compact.go:495 component=tsdb msg="write block" mint=1586829600000 maxt=1586836800000 ulid=01E5VGXXNABWGAKEJ8P6986EWY duration=11.343420123s
prometheus-0 prometheus level=info ts=2020-04-14T05:00:14.100Z caller=head.go:662 component=tsdb msg="Head GC completed" duration=327.221953ms
povilasv commented 4 years ago

Additionally, I did kubectl proxy --port=8080 and looked at http://localhost:8080/metrics.

I grepped through the metrics for series with verb="LIST" or verb="GET" and scope="resource".

The only metric with scope="resource" has verb="CONNECT":

apiserver_request_duration_seconds_bucket{component="apiserver",dry_run="",group="",resource="nodes",scope="resource",subresource="proxy",verb="CONNECT",version="v1",le="0.05"} 5939
....
apiserver_request_duration_seconds_bucket{component="apiserver",dry_run="",group="",resource="nodes",scope="resource",subresource="proxy",verb="CONNECT",version="v1",le="+Inf"} 18027

So this query will definitely result in no data:

sum(increase(apiserver_request_duration_seconds_bucket{job="kube-apiserver",le="0.1",scope="resource",verb=~"LIST|GET"}[30d]))

@metalmatze are you sure those queries are correct?

I've attached the full dump of /metrics. metrics.txt

metalmatze commented 4 years ago

What version of Kubernetes are you running here @povilasv? I tested these rules against Kubernetes 1.16.x.

povilasv commented 4 years ago

1.18.

I also tested that on 1.15 / 1.16 and it actually works. Maybe it's my weird setup with 1.18. Will retest on minikube.

metalmatze commented 4 years ago

I could reproduce this on a Kubernetes 1.18 cluster too. It seems like this metric has changed. Maybe @brancz knows something about apiserver_request_duration_seconds_bucket changes from SIG Instrumentation? I'm not aware of anything.

metalmatze commented 4 years ago

To fix it on our side we could do

- sum(increase(apiserver_request_duration_seconds_bucket{job="kube-apiserver",le="0.1",scope="resource",verb=~"LIST|GET"}[30d]))
+ sum(increase(apiserver_request_duration_seconds_bucket{job="kube-apiserver",le="0.1",scope=~"resource|",verb=~"LIST|GET"}[30d]))

I'm not 100% sure if that's still the correct thing to calculate. Not sure what scope="" actually means here.
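
To poke at what actually lands in the empty scope, a query along these lines could help (a sketch; the label names are taken from the metric dump above):

count by (verb, subresource) (apiserver_request_duration_seconds_count{job="kube-apiserver",scope=""})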

povilasv commented 4 years ago

:+1: I would say let's fix it using scope=~"resource|"; the change is harmless for existing metrics and makes it work on 1.18.

brancz commented 4 years ago

can we first find out what the other possible values are for this label?

metalmatze commented 4 years ago

I've gone ahead to at least make it work on Kubernetes 1.18, but I agree with @brancz that we need to know. From a quick look at a 1.18 cluster it looks like the scopes are

  • "cluster"
  • "namespace"
  • ""

Essentially looking like it might be a regression in Kubernetes 1.18 where resource isn't properly populated?!
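
For reference, one quick way to list the values that actually show up is an aggregation over the scope label (a sketch; adjust the job selector to your setup):

count by (scope) (apiserver_request_duration_seconds_count{job="kube-apiserver"})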

logicalhan commented 4 years ago

can we first find out what the other possible values are for this label?

We do some weird stuff with this one.

logicalhan commented 4 years ago

I've gone ahead to at least make it work on Kubernetes 1.18, but I agree with @brancz that we need to know. From a quick look at a 1.18 cluster it looks like the scopes are

  • "cluster"
  • "namespace"
  • ""

Essentially looking like it might be a regression in Kubernetes 1.18 where resource isn't properly populated?!

There are two relevant pieces of code (to my knowledge). The first:

https://github.com/kubernetes/kubernetes/blob/54e1e5b62f9dd6d069399400afd5b015cabda351/staging/src/k8s.io/apiserver/pkg/endpoints/metrics/metrics.go#L315-L327

and the second:

https://github.com/kubernetes/kubernetes/blob/54e1e5b62f9dd6d069399400afd5b015cabda351/staging/src/k8s.io/apiserver/pkg/endpoints/installer.go#L581-L596

beorn7 commented 4 years ago

I already got reports that the recording rules might load too many samples.

Just ran into this problem. The 30d range rule is loading way too many samples even for moderately sized clusters.

This one:

record: code_verb:apiserver_request_total:increase30d
expr: sum by (code, verb) (increase(apiserver_request_total{job="default/kubernetes"}[30d]))

I guess we need to do something with a 1d or 1h increase rule and then do an avg_over_time on that in a 2nd rule.
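
A rough sketch of that two-step layout (hypothetical group and rule names, not a finished proposal):

groups:
- name: apiserver-availability-sketch   # hypothetical group name
  rules:
  - record: code_verb:apiserver_request_total:increase1h
    expr: sum by (code, verb) (increase(apiserver_request_total{job="default/kubernetes"}[1h]))
  - record: code_verb:apiserver_request_total:increase30d
    # average 1h increase over the last 30 days, scaled back up to an approximate 30d total
    expr: avg_over_time(code_verb:apiserver_request_total:increase1h[30d]) * 24 * 30

That keeps the per-evaluation sample load bounded by the 1h window, while the 30d rule only reads the already-aggregated series.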

Should I perhaps file a separate issue for that?

brancz commented 4 years ago

@beorn7 I'd say so yeah

beorn7 commented 4 years ago

Here you are: #411

github-actions[bot] commented 1 week ago

This issue has not had any activity in the past 30 days, so the stale label has been added to it.

Thank you for your contributions!