kubernetes-monitoring / kubernetes-mixin

A set of Grafana dashboards and Prometheus alerts for Kubernetes.
Apache License 2.0

CPUThrottlingHigh false positives #108

Open cbeneke opened 6 years ago

cbeneke commented 6 years ago

Hi,

since the alert CPUThrottlingHigh was added, it has been firing in my cluster for a lot of pods. As most of the affected pods are not even at their assigned CPU limit, I assume the expression for the alert is wrong (either a miscalculation or, what seems more likely, container_cpu_cfs_throttled_periods_total includes different types of throttling).

This needs further investigation to be sure where it comes from, but as it stands the alert is not useful. (With about 250 pods running and a 25% threshold I observe >100 alerts; with a 50% threshold, ~20 alerts.)

metalmatze commented 6 years ago

I have the same situation on my personal cluster with the overall load (1 - avg(rate(node_cpu{mode="idle"}[1m]))) being at ~20%.

/cc @gouthamve

metalmatze commented 6 years ago

It would be nice to actually debug this on the CFS (Completely Fair Scheduler) layer. Sadly, I have no clue how to do that, but others might. Anyone? :relaxed:

tomwilkie commented 6 years ago

Sorry for the slow response. This alert was added exactly for this reason: with low limits, spiky workloads can have low averages and still be throttled. Consider this: if we sample every 15s and do a rate[1m], even 12 seconds of maxed-out CPU will appear as 20% CPU utilisation.

What we've found is that raising our limits on container CPU (whilst keeping container CPU requests close to the 95th-percentile "average" usage*) has allowed us to have lower throttling and decent utilisation.

If you don't want this, you can set the threshold to something >25% in the _config field.
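
For example, a jsonnet override along these lines should do it (a sketch; the field name comes from config.libsonnet, and the import path depends on how you vendor the mixin):

```jsonnet
// example.jsonnet -- hypothetical entry point; adjust the import to your vendoring layout.
local kubernetesMixin = import 'kubernetes-mixin/mixin.libsonnet';

kubernetesMixin {
  _config+:: {
    // Only alert once more than 50% of a container's CFS periods are throttled.
    cpuThrottlingPercent: 50,
  },
}
```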

gouthamve commented 6 years ago

I've found this issue here: https://github.com/kubernetes/kubernetes/issues/67577

I might dig into this later though. See this for some more info: https://twitter.com/putadent/status/1047808685840334848

metalmatze commented 6 years ago

Alright. Thanks a bunch for the further info. I'll look into that for my personal cluster and try to get a better idea.

cbeneke commented 6 years ago

Hmm, that's interesting. But from what I read, that's actually a bug in the kernel/CFS. Especially taking https://gist.github.com/bobrik/2030ff040fad360327a5fab7a09c4ff1 into account, spiky workloads are throttled for no reason. I'm not getting how to mitigate the alert though. Afaict the only real mitigation is to just disable limits (which is not really an option). Question here: how is the alert supposed to be helpful then? I have pods running at 90-95% CPU throttling (according to this calculation) which do calculations only once a minute: they run at their CPU limit for 3-5 seconds and do nothing the rest of the time.

Imho the alert is more misleading / too trigger-happy as long as the mentioned bug(s) is/are not fixed (thanks for linking those).

metalmatze commented 6 years ago

Talking to @gouthamve again, I am now running this alert in my cluster with cpuThrottlingPercent: 50. Having bumped the node-exporter CPU limits from 102m to 200m, the alert isn't trigger-happy anymore. https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/f7ca48cca5d9cadc9a2203b8c0b3bb3eb85f3294/config.libsonnet#L44

Therefore the question: Should we set cpuThrottlingPercent in these mixins to 50, 60, or even 75 by default? What do you think?

szymonpk commented 6 years ago

@metalmatze I think your process is still throttled and it may affect its performance. So it is just hiding the real issue.

metalmatze commented 6 years ago

Sure. What do you propose instead @szymonpk? My comment was more about people silencing or removing this alert completely at the moment and how to temporarily mitigate that. :slightly_smiling_face:

szymonpk commented 6 years ago

@metalmatze Disabling cfs-quota or removing cpu limits for containers with small limits and spiky workloads.

dkozlov commented 5 years ago

https://github.com/helm/charts/issues/14801

bgagnon commented 4 years ago

@chiluk's recent talk at KubeCon19 revealed all the intricate details of CFS and throttling. Details about the kernel patch are now widely documented (see kubernetes/kubernetes#67577), but the bit that caught my eye is that calculating the throttling percentage based on seconds is apparently wrong:

Throttling seconds accumulate on the counter for every running thread. As such, one cannot come up with a percentage value without also knowing the number of threads at the time. Instead, the alert should be based on the ratio of periods, which are global for all cores, not on seconds.

I'm thinking the alert in this repo should be changed in that direction. Thoughts?

benjaminhuo commented 4 years ago

@bgagnon this alert is already based on periods, so I don't think we need to change that. Please refer to https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/master/alerts/resource_alerts.libsonnet#L108
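
For reference, the rule's shape is roughly the following ratio of throttled to total CFS periods (a paraphrase of alerts/resource_alerts.libsonnet, not the literal source; the 25 is the configurable cpuThrottlingPercent default and extra label selectors come from _config):

```jsonnet
// A sketch of the CPUThrottlingHigh rule's shape; see the file linked above for the real definition.
{
  alert: 'CPUThrottlingHigh',
  expr: |||
    sum(increase(container_cpu_cfs_throttled_periods_total{container!=""}[5m])) by (container, pod, namespace)
      /
    sum(increase(container_cpu_cfs_periods_total[5m])) by (container, pod, namespace)
      > ( 25 / 100 )
  |||,
}
```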

chiluk commented 4 years ago

The math @benjaminhuo pointed at looks correct. I suspect @bgagnon is probably hitting the inadvertent throttling covered in my talk, resulting in the increased throttled-period percentages he's seeing. I suspect installing kernels with the fixes will alleviate some of the throttling, such that the monitor threshold can be decreased. Hopefully, if these patches ever get accepted, bursty applications can have tighter limits with decreased throttling.

bgagnon commented 4 years ago

Thanks @benjaminhuo and @chiluk, I must have misread the alert definition!

omerlh commented 4 years ago

We actually just deployed a cluster with the fix (kernel version 4.14.154-128.181.amzn2.x86_64) and are still seeing the same issue with node-exporter:

[screenshot]

While the actual CPU usage is very low:

[screenshot]

I think there is another issue, because the actual usage is very low - ~6% of the request.

chiluk commented 4 years ago

Am I reading it correctly that your CPU request and CPU limit are set to .07 and .08 respectively? Think about what is going on here. Whenever your application is runnable, it is only able to execute for 8ms every 100ms on only 1 CPU before it hits throttling. Assuming a 3 GHz CPU clock, this is similar to giving your application a 210 MHz single-core CPU *(this reminds me of my days back in the '90s with a Cyrix 166+).

Depending on what it is or isn't doing, the in-kernel context-switch time alone could potentially be that expensive without your application doing anything *(you can thank Spectre/Meltdown for that). Basically, your requests and limits are bounded too tightly. They are set well below the threshold at which the kernel can reasonably account for usage reliably and with useful results.

I don't know what the minimum limit should be set to, but I do think you are well below that based on the throttling percentages you are seeing. This issue is solved. Your expectations of what can be reasonably accomplished with existing kernel constructs and hardware need to be re-evaluated.

omerlh commented 4 years ago

Thank you for your detailed information. I should have thought about that...

At first I used the default kube-prometheus resources (102m request / 250m limit), but I still experienced throttling. So I increased resources to 800m, which solved the issue - but I noticed node-exporter does not use them (it needs something like ~0.1m). So I reduced the resources back - and now I'm fighting with this alert.

AndrewSav commented 4 years ago

So does this explanation mean that the default values set by kube-prometheus are nonsensical / wrong?

omerlh commented 4 years ago

I don't know to be honest... there is a discussion here coreos/kube-prometheus#214.

It looks like the values make sense - node-exporter uses almost no CPU, so giving it very little resources is reasonable - but I do wonder why it still gets throttled...

cbeneke commented 4 years ago

Correct me if I'm wrong: as far as I understand it, node_exporter actually only uses CPU cycles when being scraped (which with a default Prometheus setting should be every 15 or 30 seconds). This means that on average the pod has a very flat line of CPU usage. But the problem is that container_cpu_cfs_periods_total only increases when the pod actually uses CPU (on my private cluster I see increases of around 4-12 periods per scrape, which equals 0.4-1.2 seconds of running time). Since container_cpu_cfs_throttled_periods_total increases almost equally - the pod, when running, hits the throttling limit in almost every period - the alert fires.

To be honest, I have no idea how to build a general-purpose alert for this then (or whether it is even relevant), since it depends heavily on the application. In the case of node-exporter it should be irrelevant (since Prometheus doesn't care about some overhead in scraping, and I've seen no cluster yet where the node_exporter scrape was slowed down to more than 3 seconds).

omerlh commented 4 years ago

So maybe we can add a selector for excluding containers from the alert, so users could easily ignore such containers?

chiluk commented 4 years ago

We do not use Prometheus, so I don't know about the default values etc. @cbeneke has the right idea. Since the pod only uses CPU sporadically, it always hits throttling when it is running. If you don't care about the response times of this pod or how long it takes to "gather metrics", I would leave requests where they are and increase the limit until you no longer see throttling. That way the pod would only be scheduled by the kernel when nothing else is able to run. This is similar to how we schedule our batch jobs: with low requests but a high limit. That way they rarely pre-empt latency-sensitive applications, but they are allowed to use a ton of CPU time that would otherwise be sacrificed to the idle-process gods.
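
As a concrete illustration of that pattern (a minimal sketch with made-up numbers and a hypothetical container name, expressed as a jsonnet fragment since that is what this repo's ecosystem tends to use; tune the values to your own workload):

```jsonnet
// Batch-style container: a low request so it rarely pre-empts latency-sensitive
// pods, and a generous limit so it is not throttled when it does get to run.
{
  name: 'batch-worker',  // hypothetical
  image: 'example.org/batch-worker:latest',  // hypothetical
  resources: {
    requests: { cpu: '50m', memory: '128Mi' },
    limits: { cpu: '2', memory: '512Mi' },
  },
}
```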

paulfantom commented 4 years ago

While this is a very useful alert to have, especially during debugging, it is also a very chatty one (as can be seen by the number of issues linked here). In many cases this alert is not actionable (apart from silencing it), because the application is not latency-sensitive and can work without problems even when throttled. Additionally, this alert is based on a cause and not a symptom. I propose reducing the alert severity to info.

brancz commented 4 years ago

Info level severity sounds good to me

metalmatze commented 4 years ago

Everybody, please leave a review on #453. Thanks!

brancz commented 4 years ago

FYI there is also already the cpuThrottlingSelector configuration that allows you to scope or exclude certain containers/namespaces/etc.
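
For example, something along these lines should work (a sketch; check config.libsonnet for how the selector string is spliced into the alert's PromQL - the matcher below is only an illustration):

```jsonnet
{
  _config+:: {
    // Drop containers we know are throttled harmlessly from the alert.
    cpuThrottlingSelector: 'container!="node-exporter"',
  },
}
```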

KlavsKlavsen commented 3 years ago

Issues as described in https://engineering.indeedblog.com/blog/2019/12/cpu-throttling-regression-fix/ seem to be what this alert shows.

alibo commented 3 years ago

I think the title is a little misleading; I don't think these are false positives. Even with an updated kernel, applications still suffer from throttled CPU periods and perform much more slowly when many processes or threads are running at the same time. (The situation is much worse for applications that handle each request in a separate thread or process, such as php-fpm based apps, or whose average response time is more than (available quota per period in ms) / (number of threads or processes running).)

For Golang apps such as node-exporter, you can set GOMAXPROCS to a value lower than the node's CPU core count, or use Uber's automaxprocs library, to mitigate the CPU throttling issue:

https://github.com/uber-go/automaxprocs

Benchmarks: https://github.com/uber-go/automaxprocs/issues/12
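
For the GOMAXPROCS route, a jsonnet-style container fragment could look roughly like this (a sketch; the env var is honoured by any Go binary without code changes, while automaxprocs derives the value from the container's cgroup quota at startup):

```jsonnet
// Hypothetical node-exporter container fragment: cap the Go scheduler so the
// runtime does not try to run more threads in parallel than the CPU limit allows.
{
  name: 'node-exporter',
  resources: {
    requests: { cpu: '100m' },
    limits: { cpu: '1' },
  },
  env: [
    { name: 'GOMAXPROCS', value: '1' },  // keep in step with the CPU limit above
  ],
}
```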

irizzant commented 3 years ago

I totally agree with @alibo; this alert is not misleading. I initially disabled the alert and then found myself hunting down the reason for extremely slow and/or failing pods!

CPU throttling is a serious issue in clusters, and also blindly removing limits can cause further problems.

A very nice-to-have feature in the dashboards would be a graph showing CPU waste, based on CPU requests.

chiluk commented 3 years ago

@irizzant and @alibo are correct. It's highly unlikely that you are receiving false positives. However, it is likely that you are getting positives for very short bursts. I don't know enough about the monitor, but it might be useful to put a threshold on it so that it only triggers if the application has been throttled for more than x% of recent periods. I'd expect most well-written applications to be throttled at some point in time. It also might be useful to be able to put such a threshold in the pod spec itself so it could be twiddled per pod. Alright, that's my attempt at thought leadering here. Hopefully cgroups v2 will make some of this mess "better" without creating a whole new range of issues.

If you'd rather not read the long blog post I wrote that @KlavsKlavsen linked, I also gave a talk on this subject a few years back: https://www.youtube.com/watch?v=UE7QX98-kO0

chiluk commented 3 years ago

Another possibility would be to create a kernel scheduler config such that runnable throttled applications would receive run time when the idle process would otherwise be run. That might really muddy the accounting metrics in the kernel, and would probably take a herculean effort to get scheduler dev approval.

alibo commented 3 years ago

> Another possibility would be to create a kernel scheduler config such that runnable throttled applications would receive run time when the idle process would otherwise be run. That might really muddy the accounting metrics in the kernel, and would probably take a herculean effort to get scheduler dev approval.

@chiluk The burstable CFS controller introduced in kernel 5.14 (not released yet!) can mitigate this issue a little bit; in particular, it improves P90+ response times a lot, based on the benchmarks provided:

However, it's not implemented in CRI-based container runtimes yet:

AndrewSav commented 3 years ago

I guess we need to define what we call a "false positive" here. IMO a false positive in this context is an alert that is not actionable, i.e. not indicative of a real problem that requires an action. So far I have not been able to deduce why these alerts randomly trigger and disappear many times a day, or how they help me.

paulfantom commented 3 years ago

> I guess we need to define what we call a "false positive" here. IMO a false positive in this context is an alert that is not actionable, i.e. not indicative of a real problem that requires an action.

In that context, and when the application is not experiencing any issues manifested by other alerts, it is a false positive. For exactly this reason CPUThrottlingHigh is shipped with severity: info and not warning nor critical. The idea is that with the latest Alertmanager you can configure an inhibition rule to prevent the alert from firing when there is no other alert firing for the specified label sets. The issue in kube-prometheus has a bit more detail: https://github.com/prometheus-operator/kube-prometheus/issues/861.
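
The inhibition setup kube-prometheus ended up with looks roughly like this when rendered into the Alertmanager configuration (a sketch expressed as jsonnet data; it relies on a helper InfoInhibitor alert described in the linked issue, so see there for the full mechanics):

```jsonnet
{
  inhibit_rules: [
    {
      // While the helper InfoInhibitor alert fires for a namespace (i.e. there is
      // no warning/critical alert there), suppress info-level alerts such as
      // CPUThrottlingHigh for that namespace.
      source_matchers: ['alertname = InfoInhibitor'],
      target_matchers: ['severity = info'],
      equal: ['namespace'],
    },
  ],
}
```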

aantn commented 2 years ago

This isn't a false alarm and it isn't due to CFS kernel bugs!

I've written a whole wiki page on this and how to respond to each subset of this alert

The gist of it is that processes are being deprived of CPU when they need it, and that can happen even when CPU is available. I know people consider it a best practice to set CPU limits, but if you use CPU requests for everything then the simple and safe action here is to remove the limits.

Unless this is happening on metrics-server in which case it's a whole different story...
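
To make the "requests without limits" suggestion concrete (a minimal sketch with illustrative numbers; whether dropping CPU limits is acceptable in your cluster is debated further down the thread):

```jsonnet
// Hypothetical container fragment: a CPU request sized near typical usage and no
// CPU limit, so bursts borrow idle CPU instead of hitting CFS throttling.
{
  name: 'web-server',  // hypothetical
  resources: {
    requests: { cpu: '250m', memory: '256Mi' },
    limits: { memory: '256Mi' },  // keep a memory limit; just omit the CPU limit
  },
}
```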

aantn commented 11 months ago

In case anyone is curious, I'll elaborate on the previous comment with a specific example:

Here's a numerical example of throttling when average CPU is far below the limit.

Assumptions

  1. An http server handles one http request per second, which takes 30 milliseconds to handle
  2. The server runs inside a container with request=limit=130m
  3. The node's kernel parameters are default - specifically, the CFS scheduling period is configured as 100ms

Outcome

  1. Average CPU usage of 30m, i.e. 3% of a core (the server runs for 30 milliseconds every second)
  2. The server is allowed to run for 13 consecutive milliseconds every 100 milliseconds (a limit of 130m is 13% of a CPU-second. With a CFS period of 100ms that means 13% of each CFS period - i.e. 13ms)
  3. When the server gets a request it needs to run for 30 milliseconds. In the first CFS period it runs for 13ms then waits 87 ms for the period to end. In the second CFS period, the same. In the third period it runs the remaining 4ms and finishes running.
  4. Hence, the server was throttled in 2 out of 3 CFS periods it ran in. Therefore it was throttled 66% of the time.
  5. The Prometheus alert CPUThrottlingHigh fires. (The metrics it uses are taken from kernel stats nr_throttled/nr_periods)

This is not a false-positive alert. There is a real user-facing impact. A server is getting one HTTP request per second. It should take only 30ms to handle it, yet that request takes 204ms instead! There was real latency introduced here. Performance got worse by 6.8x, despite the pod having a limit of 130m, which is far above the average CPU usage of 3%.
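
Spelling out where the 204 ms comes from (wall-clock time across the three CFS periods the request spans):

$$
\underbrace{13 + 87}_{\text{period 1: run, then wait}} + \underbrace{13 + 87}_{\text{period 2: run, then wait}} + \underbrace{4}_{\text{period 3: run}} = 204~\text{ms}
$$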

In short, as always, remove those darn limits if you can.

chiluk commented 11 months ago

@aantn understands.

However, removing the limits is not strictly "safe" if you have untrustworthy apps or poor developers.

levsha commented 11 months ago

> @aantn understands.
>
> However, removing the limits is not strictly "safe" if you have untrustworthy apps or poor developers.

How exactly is this unsafe?

chiluk commented 11 months ago

Without limits, a misbehaving or crashlooping application can theoretically eat all the available CPU, which would adversely affect the performance of other applications on the system. Even when requests operate correctly as a minimum guarantee, an application using 100% of all cores of a CPU can cause thermal throttling on the CPU itself, which can lead to lower performance for collocated, well-behaved applications. Additionally, it might delay the scheduling of a well-behaved application by a CPU time slice (~5ms).

For this reason, my recommendation for interactive/request-servicing applications is to set CPU limits large enough to avoid throttling, but not so large that a misbehaving application can eat an entire box.