nicolargo / glances

Glances an Eye on your system. A top/htop alternative for GNU/Linux, BSD, Mac OS and Windows operating systems.
http://nicolargo.github.io/glances/

WARNING (or CRITICAL) on CPU_IOWAIT #2330

Closed maravento closed 1 year ago

maravento commented 1 year ago

Describe the bug WARNING (or CRITICAL) on CPU_IOWAIT

To Reproduce Steps to reproduce the behavior:

sudo apt install glances

sudo /etc/init.d/glances status
[sudo] password for adminred: 
● glances.service - Glances
     Loaded: loaded (/lib/systemd/system/glances.service; enabled; vendor preset: enabled)
     Active: active (running) since Tue 2023-04-04 19:37:48 -05; 12h ago
       Docs: man:glances(1)
             https://github.com/nicolargo/glances
   Main PID: 1004035 (glances)
      Tasks: 1 (limit: 18893)
     Memory: 88.3M
        CPU: 20min 38.361s
     CGroup: /system.slice/glances.service
             └─1004035 /usr/bin/python3 /usr/bin/glances -w -B 127.0.0.1 -t 10
Apr 04 19:37:48 adminred systemd[1]: Started Glances.

Screenshots: (attached screenshot showing the CRITICAL alert)

Additional context

Can someone explain clearly what these alerts mean and how I can fix them? Thanks

RazCrimson commented 1 year ago

@maravento IO Wait - Percentage of time that the CPU or CPUs were idle during which the system had an outstanding disk I/O request.

IO Wait is a part of idle time (the CPU didn't do anything) due to outstanding IO. That is, while some IO transfer was in progress, the CPU had no tasks it could schedule/execute and sat idle. A more detailed explanation is here
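Concretely, on Linux these counters come from the aggregate `cpu` line of /proc/stat. A minimal sketch (field order assumed from the proc(5) layout: user, nice, system, idle, iowait, ...) of computing the iowait percentage between two samples, much like top/Glances do:

```python
def read_cpu_sample():
    """Return the list of jiffy counters from the aggregate 'cpu' line of /proc/stat (Linux)."""
    with open("/proc/stat") as f:
        fields = f.readline().split()
    assert fields[0] == "cpu"
    return [int(x) for x in fields[1:]]

def iowait_percent(sample_a, sample_b):
    """iowait share (%) of total CPU time elapsed between two samples.

    Field order after 'cpu': user nice system idle iowait irq softirq ...
    so index 4 is the iowait counter.
    """
    deltas = [b - a for a, b in zip(sample_a, sample_b)]
    total = sum(deltas)
    return 100.0 * deltas[4] / total if total else 0.0
```

Usage would be two calls to `read_cpu_sample()` a second or so apart, then `iowait_percent(a, b)`; an alert fires when that percentage crosses the configured threshold.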

Having a high IO wait could mean that the CPU is throttled by IO transfers. But this is not necessarily a "bad" situation on servers or PCs with HDDs; in those cases it is normal to have higher IO wait times.

Now, coming to the alerts: they mean that at some earlier points in time, more than 20% of CPU time was spent idle with some IO operation happening in the background. The values in the alerts are the IO wait percentages at the moments the spikes occurred.

To change the thresholds for the alerts, you can specify exact values in the config file. Ref: https://glances.readthedocs.io/en/latest/config.html

The exact values depend on what kind of workload runs on your system and whether that workload can cause heavy IO. Decide on the threshold values depending on your needs. The values mentioned in the docs are quite in line with heavy-IO systems too. The default value calculation is a bit complicated and is mentioned here

Here is a quick snippet that you can drop into ~/.config/glances/glances.conf:

[cpu]
iowait_careful=50
iowait_warning=70
iowait_critical=90
RazCrimson commented 1 year ago

On another note, we could probably apply some limit to the logic for the default IO wait threshold computation.

For example, for the IO wait critical default threshold, we could do max(30, 100/#cores) instead of just 100/#cores. So we would use the core-count-based logic up to a point, but cap the threshold at a fixed floor when 100/#cores becomes very small.
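As a sketch of that proposal (the function name and the 30% floor are just the values suggested in this comment, not current Glances code):

```python
def iowait_critical_default(core_count: int) -> float:
    """Proposed default CRITICAL threshold for iowait (%).

    Core-count based (100 / #cores), but floored at 30% so the
    threshold never becomes tiny on many-core machines.
    """
    return max(30.0, 100.0 / core_count)
```

For example, a 2-core box keeps its 50% threshold, while a 64-core box gets 30% instead of the ~1.6% the uncapped formula would produce.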

What do you think @nicolargo ?

Also this probably needs better documentation, rather than being mentioned in the config file example.

maravento commented 1 year ago

> @maravento IO Wait - Percentage of time that the CPU or CPUs were idle during which the system had an outstanding disk I/O request. [… full explanation and config snippet quoted from the previous comment …]

The drives are SSDs, not HDDs. Also, there are alerts related to network interfaces, which you don't mention in your answer

To be honest, your explanation is not very clear. But you don't need to explain further, because I don't see that adding something to your explanation will fix the problem. So, I think, if it's not a bad thing (as you claim) and if warnings can be ignored (as you claim), then it's better to turn them off. Please tell me where they are disabled. Thank you

RazCrimson commented 1 year ago

> To be honest, your explanation is not very clear. But you don't need to explain further, because I don't see that adding something to your explanation will fix the problem.

TLDR: The amount of time that the CPU was idle while some IO transfer was taking place is called the IO wait time.

I'm not sure how much better I can explain it. If you want a more detailed explanation, check the serverfault question linked above.

> According to what I understood you, this is not a problem and can be ignored. But, what I don't understand is that if these warnings are nothing bad and can be ignored, then why are they there? (to make the life of the sysadmin more difficult?)

The problem is that different workloads (tasks) can have different IO wait times, and the threshold needs to be varied to match your case.

An example that might help you understand why it's hard to set one exact value for the IO wait threshold:

Desktops and servers do different things, and even servers differ among themselves: a file-hosting server mostly does IO tasks (reading and transferring files), while a compute-intensive server does complex modelling/simulation or maybe runs some ML algos, etc.

On a file-hosting server, heavy IO tasks are normal; a higher IO wait time is expected and you don't really mind the IO wait times in this case.

On compute servers, by contrast, the objective is to finish the task as fast as possible, so we want a low IO wait time. In that case we are interested in the IO wait time, since IO might be the cause of slowdown/throttling. Consider a costly compute-optimized VM in the cloud: you don't want the VM wasting its precious CPU cycles waiting on some IO operation when it could be doing other compute, wasting $$$ (VMs can be shut down or scaled down depending on usage to save $$$). So the alerts are preferred in this case.

Concluding: defining a standard IO wait threshold suitable for all cases is not possible, so we just ship preset defaults that work for most cases* (the logic could be better, as explained in the previous comment).

Users (sysadmins in the current case) can change it according to their needs, depending on their workloads or what works for their case. Not all users have the same needs.

> So, I think, if it's not a bad thing (as you claim) and if warnings can be ignored (as you claim), then it's better to turn them off.

It's very case-specific, so I think it's better to defer the choice of disabling it to the user.

> Please tell me where they are disabled.

It's only possible to disable whole plugins, not specific alerts. The CPU plugin, which raises the IO wait alert, has other alerts you would probably not want to miss, so disabling the plugin is not a good idea.

If you just don't want the IO wait alerts to pop up, you can set their thresholds (as explained above) to 100. This will practically disable them.
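For example, dropping this into the [cpu] section of your glances.conf (the same file as the snippet above) sets every IO wait threshold to 100%, which in practice silences those alerts:

```ini
[cpu]
iowait_careful=100
iowait_warning=100
iowait_critical=100
```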

maravento commented 1 year ago

Thanks

PS: we have published a post about this application: https://www.maravento.com/2023/04/glances.html