Closed: pokemane closed this issue 7 years ago
Meant to add: according to the MSDN docs, Percent Processor Time is:
Percentage of time that the processor is executing a non-idle thread. This property was designed as a primary indicator of processor activity.
wmi_cpu_time_total should meet that need.
How so? You either include the metric (a percentage), or you run a continuous query or set of queries to derive the percentage instead. In Prometheus they'd both take up the same disk space if you're using a recording rule. I understand the formula, but leaving the metric out and requiring a recording rule just to make the value human-readable is hardly more efficient than including it. cpu_time_total is not exactly human-readable as it currently stands.
You need queries either way to get to this data.
I'm not sure I understand your point. I know they would both need queries; my question is about the relative cost on the collector vs on the Prometheus instance. Could you give an example of the queries I'd need either way, in case I'm missing something about the complexity of getting an instantaneous utilization percentage from at least 4 different series (idle, user, interrupt, dpc, privileged)?
I don't have a Windows system to hand, but it'd be similar to Linux: https://www.robustperception.io/understanding-machine-cpu-usage/
Alright, I followed that and am now getting negative CPU usage:
Also, 100 - (avg (irate(wmi_cpu_time_total{mode="idle"}[5m])) * 100) vs avg (wmi_cpu_percentage) is what I was talking about w.r.t. relative complexity on both the user and server side vs the system call for the collector. The collector already gets the data when it executes the WMI query, and the relative cost of disk space or query execution on the Prometheus server vs publishing the already-collected data is negligible.
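For readers following along, here is a minimal Python sketch of the arithmetic behind the first query. It is an illustration only: the sample values and the helper names `irate` and `cpu_utilisation_percent` are hypothetical, and the `irate` helper only approximates what the PromQL function does (rate over the last two samples in the window).

```python
from statistics import mean

# Hypothetical samples per core: (timestamp_seconds, idle_counter_seconds).
# The idle counter is cumulative, like wmi_cpu_time_total{mode="idle"}.
samples = {
    "0": [(0.0, 100.0), (15.0, 113.5), (30.0, 125.5)],
    "1": [(0.0, 200.0), (15.0, 214.0), (30.0, 228.25)],
}


def irate(points):
    """Per-second rate from the last two samples (rough stand-in for irate())."""
    (t0, v0), (t1, v1) = points[-2], points[-1]
    return (v1 - v0) / (t1 - t0)


def cpu_utilisation_percent(per_core):
    """Mirror of: 100 - (avg(irate(idle[5m])) * 100)."""
    idle_fraction = mean(irate(p) for p in per_core.values())
    return 100 - idle_fraction * 100


# Core 0 was idle 80% of the last interval, core 1 was idle 95%,
# so the averaged utilisation comes out at roughly 12.5%.
print(round(cpu_utilisation_percent(samples), 2))
```

The point of the sketch is that the counter itself never holds a percentage; the percentage only exists once two samples are differenced and divided by the elapsed time.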
Also also, the magnitude of the spikes in this screenshot does not match my procexp monitoring of the system in question. procexp reports a max of about 30%, wmi exporter reports 75%. That's probably due to using irate with the "idle" mode instead of its complement (for the negative-positive swing), and the fact that it crosses zero at all is due to it being a rate of change in counters.
One potential idea: Do you have more than one server with wmi_exporter installed? You are averaging over all instances, rather than a specific one. For comparison, I have 1 - avg(irate(wmi_cpu_time_total{instance="$Server", mode="idle"}[2m])) in our dashboards (with Grafana templating filling in $Server).
Could you check what data is in Prometheus for just wmi_cpu_time_total{mode="idle"} and share?
Nope, only one instance.
Ok, thanks. What are the specs of the instance - number of cores and version of Windows?
W7 enterprise, 40 cores. (I know.. I don't ask questions when I get something like this where I work)
Hm, ok, I'll admit I haven't got anything like that to test on ;) But could you try to run the query on prometheus so we can look at the raw data?
I would, if my "graph" panel worked. I updated recently, v1.4.1, and the "add graph" hasn't worked for me since. Are you just looking for the returned series and a quick snapshot of what they look like?
Basically yes, just to see if the data seems unreasonable in itself (ie we are not interpreting the wmi output correctly). Doesn't need to be in the graph mode either if that is acting up, just raw query output might be enough.
yeah both the graphs and table aren't working properly at all (don't even have a query input) so here's the grafana table equivalent, columns are cores:
complete with ascii-sorted columns because no zero padding :D
Hm, nothing obvious there. Could you try modifying your Grafana query like this: 100 - (avg (irate(wmi_cpu_time_total{mode="idle"}[5m])) by (core) * 100) and using bar mode instead of lines? I'm expecting to see that a single core is responsible for the weird jumps.
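A small Python sketch of what the per-core grouping is meant to reveal. The counter values and the `utilisation_by_core` helper are hypothetical; the idea is simply that any core whose computed utilisation falls outside 0..100 is the one to inspect in the raw data.

```python
# Hypothetical idle counters per core: (timestamp_seconds, idle_seconds).
samples = {
    "0": [(0.0, 50.0), (15.0, 64.25)],
    "1": [(0.0, 70.0), (15.0, 73.0)],
    "10": [(0.0, 20.0), (15.0, 35.75)],
}


def utilisation_by_core(per_core):
    """Mirror of: 100 - (avg by (core) (irate(idle[5m])) * 100)."""
    result = {}
    for core, points in per_core.items():
        (t0, v0), (t1, v1) = points[-2], points[-1]
        result[core] = 100 - (v1 - v0) / (t1 - t0) * 100
    return result


by_core = utilisation_by_core(samples)

# Sort numerically so that core "10" does not land between "1" and "2".
for core in sorted(by_core, key=int):
    print(core, round(by_core[core], 1))

# A value outside 0..100 points at odd raw samples rather than real load.
suspects = [core for core, pct in by_core.items() if not 0 <= pct <= 100]
print("suspect cores:", suspects)
```

In this made-up data, core 10 appears to have accumulated more idle time than wall-clock time elapsed, which is what produces a negative utilisation.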
avg (wmi_cpu_percentage)
This also requires a rate, as that's how the raw data is exposed.
@carlpett it's actually a bunch of them, weirdly enough:
@brian-brazil how is it exposed as a rate? From the docs I was under the impression that it was a 0-100% value based on WMI's own sampling rate, using a similar process to the one you linked to above.
All WMI stats come with equations to use to convert them into what their actual name is. So in the case of a CPU percentage that's going to mean taking a rate.
So it's returning a cumulative percentage...? I'm sorry, I understand how the rate applies on the query side for cpu total time to convert to an instant value, but not how it applies for an instantaneous value to stay an instantaneous value. The total time returns a cumulative counter, which matches the name, why would the present utilization be different?
The total time returns a cumulative counter, which matches the name, why would the present utilization be different?
It's no different. The percent utilisation is exposed as a cumulative counter.
Oh, so it doesn't match its name where the former does. I understand now... Sorry for the confusion. In that case, yeah, it's a pretty useless stat and only saves a single math operation and not even an aggregator.
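To make the "cumulative counter" point concrete, here is a hedged Python sketch of how a raw PercentProcessorTime pair could be turned into a percentage. It assumes the field is an inverse 100 ns timer (the PERF_100NSEC_TIMER_INV counter type) paired with Timestamp_Sys100NS, as I understand the Windows performance counter documentation; the `RawSample` type, `cooked_percent` helper, and sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RawSample:
    percent_processor_time: int  # cumulative idle time, in 100 ns ticks (assumed)
    timestamp_sys100ns: int      # sample time, in 100 ns ticks


def cooked_percent(prev: RawSample, curr: RawSample) -> float:
    """Assumed PERF_100NSEC_TIMER_INV formula: 100 * (1 - dN / dD)."""
    d_idle = curr.percent_processor_time - prev.percent_processor_time
    d_time = curr.timestamp_sys100ns - prev.timestamp_sys100ns
    return 100.0 * (1 - d_idle / d_time)


# Hypothetical samples taken 1 second (10_000_000 ticks) apart.
# The core was idle for 0.7 s of that second, so it was roughly 30% busy.
prev = RawSample(5_000_000_000, 90_000_000_000)
curr = RawSample(5_007_000_000, 90_010_000_000)
print(round(cooked_percent(prev, curr), 2))
```

Under that assumption a single raw reading says nothing on its own, which is why the rate (a difference between two scrapes) is needed on the Prometheus side either way.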
Regarding the graph, I'm unsure how to interpret this.
@brian-brazil - could this be an artifact from the irate of some sort? I've seen similar things when using deriv and predict_linear when there are very large changes in a value between two scrapes, which would be expected, but I'm not sure how irate could cause this? Especially the negative spike preceding the positive one.
could this be an artifact from the irate of some sort?
It's not unusual to see oddness of this sort due to various races.
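One possible mechanism for such a race, sketched in Python. This is an assumption for illustration, not a diagnosis of the graph above: if a counter value is read slightly later than the timestamp attached to it, the idle rate for that interval exceeds 1 and the next interval falls short by the same amount. The `utilisation` helper and the sample values are hypothetical.

```python
def utilisation(points):
    """100 - irate(idle) * 100 for each pair of consecutive samples."""
    out = []
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        out.append(100 - (v1 - v0) / (t1 - t0) * 100)
    return out


# A fully idle core: the idle counter should grow by 1 s per second.
# Timestamps are the nominal scrape times. The second value is assumed to
# have been read 0.75 s late, so it contains 0.75 s of extra idle time.
points = [(0.0, 0.0), (15.0, 15.75), (30.0, 30.0), (45.0, 45.0)]

values = utilisation(points)
print([round(v, 1) for v in values])
```

With these numbers the first interval comes out negative, the second comes out positive by the same magnitude, and the third returns to zero, which resembles a negative spike followed by a positive one.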
If the entirety of the CPU raw data collector is basically cumulative counters, then this issue doesn't propose anything new except what amounts to a macro for some simple math.
Hi @pokemane, I am new to Windows, and due to an MSSQL dependency I have to set up wmi exporter. How did you set up the rules? I am not able to understand what to do from this discussion. Can you please help, or point me to example rules for WMI or MSSQL?
Currently, Percent Processor Time is in the Win32_PerfRawData_PerfOS_Processor struct but is not actually collected. Is that something that can be added? Judging by how the collector is laid out (I'm not super familiar with Go so I might be wrong) it looks to be a pretty simple addition. Was it just omitted by accident in #26?