prometheus-community / windows_exporter

Prometheus exporter for Windows machines
MIT License

Add "percent processor time" to CPU collector #53

Closed pokemane closed 7 years ago

pokemane commented 7 years ago

Currently, Percent Processor Time is in the Win32_PerfRawData_PerfOS_Processor struct but is not actually collected. Is that something that can be added? Judging by how the collector is laid out (I'm not super familiar with Go, so I might be wrong), it looks to be a pretty simple addition. Was it just omitted by accident in #26?

pokemane commented 7 years ago

Meant to add: according to the MSDN docs, Percent Processor Time is:

Percentage of time that the processor is executing a non-idle thread. This property was designed as a primary indicator of processor activity.

brian-brazil commented 7 years ago

wmi_cpu_time_total should meet that need.

pokemane commented 7 years ago

How so? You either include the metric (a percentage) or run a continuous query or set of queries to derive the percentage. In Prometheus both would take up the same disk space if you're using a recording rule. I understand the formula to do it, but leaving the metric out is hardly more efficient than having to write a recording rule to make the value human-readable. cpu_time_total is not exactly human-readable as it currently stands.

brian-brazil commented 7 years ago

You need queries either way to get to this data.

pokemane commented 7 years ago

I'm not sure I understand your point. I know they would both need queries; my question is about the relative cost on the collector vs on the Prometheus instance. Could you give an example of the queries I'd need either way, in case I'm missing something about the complexity of getting an instantaneous utilization percentage from at least five different series (idle, user, interrupt, dpc, privileged)?

brian-brazil commented 7 years ago

I don't have a Windows system to hand, but it'd be similar to Linux: https://www.robustperception.io/understanding-machine-cpu-usage/

pokemane commented 7 years ago

Alright, I followed that and am now getting negative CPU usage: image

Also, 100 - (avg (irate(wmi_cpu_time_total{mode="idle"}[5m])) * 100) vs avg (wmi_cpu_percentage) is what I was talking about w.r.t. relative complexity on both the user and server side vs the system call for the collector. The collector already gets the data when it executes the WMI query, and the relative cost of disk space or query execution on the Prometheus server vs the already-collected-but-not-published data is negligible.

Also also, the magnitude of the spikes in this screenshot does not match my procexp monitoring of the system in question: procexp reports a max of about 30%, while wmi_exporter shows 75%. That is probably due to using irate on the "idle" mode instead of its complement (hence the negative-positive swing); the fact that it crosses zero at all is due to it being a rate of change in counters.
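
For reference, the arithmetic behind 100 - (avg(irate(...{mode="idle"})) * 100) can be sketched in a few lines of Python. This is only an illustration of the formula, not exporter or Prometheus code: the function names, the sample layout, and the numbers are made up, and the counter-reset handling is a simplification of what Prometheus does.

```python
def irate(prev, curr):
    """Per-second rate between the last two (timestamp, value) counter samples."""
    (t0, v0), (t1, v1) = prev, curr
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    delta = v1 - v0
    if delta < 0:
        # Simplified counter-reset handling: treat the new value as the increase.
        delta = v1
    return delta / (t1 - t0)


def cpu_busy_percent(idle_samples):
    """idle_samples maps a core label to (previous_sample, latest_sample)."""
    rates = [irate(prev, curr) for prev, curr in idle_samples.values()]
    avg_idle = sum(rates) / len(rates)  # average idle seconds per second
    return 100 - avg_idle * 100


# Hypothetical idle counters (seconds) for two cores, scraped 15 s apart.
samples = {
    "0": ((0.0, 1000.0), (15.0, 1012.0)),  # idle for 12 of 15 seconds
    "1": ((0.0, 2000.0), (15.0, 2009.0)),  # idle for 9 of 15 seconds
}
busy = cpu_busy_percent(samples)
print(round(busy, 2))
```

With these made-up samples the cores are idle 80% and 60% of the interval, so the average busy figure comes out at roughly 30%.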

carlpett commented 7 years ago

One potential idea: Do you have more than one server with wmi_exporter installed? You are averaging over all instances, rather than a specific one. For comparison, I have 1 - avg(irate(wmi_cpu_time_total{instance="$Server", mode="idle"}[2m])) in our dashboards (with Grafana templating filling in $Server).

Could you check what data is in Prometheus for just wmi_cpu_time_total{mode="idle"} and share it?

pokemane commented 7 years ago

Nope, only one instance.

carlpett commented 7 years ago

Ok, thanks. What are the specs of the instance - number of cores and version of Windows?

pokemane commented 7 years ago

W7 enterprise, 40 cores. (I know.. I don't ask questions when I get something like this where I work)

carlpett commented 7 years ago

Hm, ok, I'll admit I haven't got anything like that to test on ;) But could you try to run the query on prometheus so we can look at the raw data?

pokemane commented 7 years ago

I would, if my "graph" panel worked. I updated recently, v1.4.1, and the "add graph" hasn't worked for me since. Are you just looking for the returned series and a quick snapshot of what they look like?

carlpett commented 7 years ago

Basically yes, just to see if the data seems unreasonable in itself (ie we are not interpreting the wmi output correctly). Doesn't need to be in the graph mode either if that is acting up, just raw query output might be enough.

pokemane commented 7 years ago

Yeah, both the graph and the table aren't working properly at all (they don't even have a query input), so here's the Grafana table equivalent; columns are cores:

image

complete with ascii-sorted columns because no zero padding :D

pokemane commented 7 years ago

Ah, makes sense now:

https://github.com/prometheus/prometheus/issues/2265

after a hard refresh: image

carlpett commented 7 years ago

Hm, nothing obvious there. Could you try modifying your Grafana query like this 100 - (avg (irate(wmi_cpu_time_total{mode="idle"}[5m])) by (core) * 100) and using bar mode instead of lines? I'm expecting to see that a single core is responsible for the weird jumps.
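
The per-core breakdown suggested here amounts to computing the same busy percentage without averaging across cores, then looking for cores whose value falls outside 0 to 100. A rough Python sketch of that check, with hypothetical core labels, sample values, and function names:

```python
def per_core_busy_percent(idle_samples):
    """Map each core label to 100 - idle_rate * 100, using its last two samples."""
    result = {}
    for core, ((t0, v0), (t1, v1)) in idle_samples.items():
        idle_rate = (v1 - v0) / (t1 - t0)  # idle seconds per second
        result[core] = 100 - idle_rate * 100
    return result


def out_of_range(percentages, tolerance=0.5):
    """Return the cores whose busy percentage is implausible (below 0 or above 100)."""
    return sorted(
        core
        for core, pct in percentages.items()
        if pct < -tolerance or pct > 100 + tolerance
    )


# Hypothetical samples: core "1" reports more idle time than wall-clock time.
samples = {
    "0": ((0.0, 100.0), (15.0, 112.0)),  # 12 s idle in 15 s -> about 20% busy
    "1": ((0.0, 100.0), (15.0, 116.5)),  # 16.5 s idle in 15 s -> about -10% busy
}
suspects = out_of_range(per_core_busy_percent(samples))
print(suspects)
```

If only one or two cores show up in such a list, the problem is likely specific to those series; if many do, as turned out to be the case later in this thread, it points at something common to the whole scrape.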

brian-brazil commented 7 years ago

> avg (wmi_cpu_percentage)

This also requires a rate, as that's how the raw data is exposed.

pokemane commented 7 years ago

@carlpett it's actually a bunch of them, weirdly enough: image

@brian-brazil how is it exposed as a rate? From the docs I was under the impression that it was a 0-100% value that depends on WMI's own sampling rate, using a process similar to the one you linked to above.

brian-brazil commented 7 years ago

All WMI stats come with equations to use to convert them into what their actual name is. So in the case of a CPU percentage that's going to mean taking a rate.

pokemane commented 7 years ago

So it's returning a cumulative percentage...? I'm sorry, I understand how the rate applies on the query side for cpu total time to convert to an instant value, but not how it applies for an instantaneous value to stay an instantaneous value. The total time returns a cumulative counter, which matches the name, why would the present utilization be different?

brian-brazil commented 7 years ago

> The total time returns a cumulative counter, which matches the name, why would the present utilization be different?

It's no different. The percent utilisation is exposed as a cumulative counter.
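
To make the "cumulative counter" point concrete: as far as I understand the Microsoft documentation, the raw PercentProcessorTime property is an inverted 100 ns timer (counter type PERF_100NSEC_TIMER_INV), so a percentage only appears after differencing two raw samples against the matching Timestamp_Sys100NS values. A minimal Python sketch under that assumption, with made-up sample values and a hypothetical function name:

```python
def cooked_percent_processor_time(n0, d0, n1, d1):
    """Cook two raw samples into a percentage.

    n0, n1: raw PercentProcessorTime values (assumed to accumulate idle 100 ns ticks)
    d0, d1: Timestamp_Sys100NS values taken with each sample (100 ns ticks)
    Assumed formula for an inverted timer: 100 * (1 - (n1 - n0) / (d1 - d0)).
    """
    elapsed = d1 - d0
    if elapsed <= 0:
        raise ValueError("second sample must be later than the first")
    return 100.0 * (1.0 - (n1 - n0) / elapsed)


# Hypothetical samples one second apart (10,000,000 ticks of 100 ns),
# during which the counter advanced by 0.75 s worth of ticks.
pct = cooked_percent_processor_time(
    n0=0, d0=0,
    n1=7_500_000, d1=10_000_000,
)
print(pct)
```

A single raw sample is therefore not a usable percentage on its own, which is why exposing it would still require a rate on the Prometheus side.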

pokemane commented 7 years ago

Oh, so it doesn't match its name where the former does. I understand now... Sorry for the confusion. In that case, yeah, it's a pretty useless stat and only saves a single math operation and not even an aggregator.

carlpett commented 7 years ago

Regarding the graph, I'm unsure how to interpret this. @brian-brazil - could this be an artifact of irate of some sort? I've seen similar things when using deriv and predict_linear when there are very large changes in a value between two scrapes, which would be expected, but I'm not sure how irate could cause this, especially the negative spike preceding the positive one.

brian-brazil commented 7 years ago

> could this be an artifact from the irate of some sort?

It's not unusual to see oddness of this sort due to various races.
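
One possible race of this kind, offered as an illustration rather than a diagnosis of the graphs in this thread: if the counter is actually read a little later than the timestamp the sample is stored with, one interval contains more than its share of idle time and the next contains less. The Python sketch below simulates that for a fully idle core; all names and numbers are hypothetical.

```python
def busy_percent_series(scrape_times, read_times, idle_fraction=1.0):
    """Busy % per interval when samples are stamped with scrape_times
    but the counter was really read at read_times."""
    # Counter value at the moment it was actually read.
    values = [idle_fraction * t for t in read_times]
    series = []
    for i in range(1, len(scrape_times)):
        elapsed = scrape_times[i] - scrape_times[i - 1]
        idle_rate = (values[i] - values[i - 1]) / elapsed
        series.append(100 - idle_rate * 100)
    return series


# Samples are stamped at 0 s, 15 s and 30 s, but the middle read happened 0.6 s late.
series = busy_percent_series(
    scrape_times=[0.0, 15.0, 30.0],
    read_times=[0.0, 15.6, 30.0],
)
print([round(p, 2) for p in series])
```

In this toy case the first interval reports about -4% busy and the second about +4%, that is, a negative spike followed by a compensating positive one, even though the simulated core never did any work.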

pokemane commented 7 years ago

If the entirety of the CPU raw data collector is basically cumulative counters, then this issue doesn't propose anything new except what amounts to a macro for some simple math.

Krishna1408 commented 5 years ago

Hi @pokemane, I am new to Windows, and because of an MSSQL dependency I have to set up wmi_exporter. How did you set up the rules? I am not able to work out what to do from this discussion. Can you please help, or point me to example rules for WMI or MSSQL?