ncabatoff / process-exporter

Prometheus exporter that mines /proc to report on selected processes
MIT License
1.72k stars, 270 forks

namedprocess_namegroup_context_switches_total counter is decreasing #193

Open ngosang opened 3 years ago

ngosang commented 3 years ago

v0.7.5: The metric namedprocess_namegroup_context_switches_total is declared as a counter, as it should be. Most of the time the value increases, but not always. This has been causing me a lot of issues.

In this image you can see how the value increases and decreases. I think this only happens with some processes that have many context switches. In this case I'm able to reproduce it with 2 Mono apps on Linux (Sonarr and Radarr). [image]

When I apply the rate function, the graph is a mess due to negative values in the vector. [image]

For now I've worked around it by using the deriv function instead of rate. This graph is mostly accurate. [image]

How are you getting the context switches? How is it possible that the value decreases? How can I help?

lawsontyler commented 3 years ago

I'm also seeing this issue. I assume it's because the exporter does a straight sum() over all the matching processes without keeping any history.

For example, let's assume we have a process that accepts network connections. The main process spawns 2 subprocesses. Each subprocess handles 1000 requests and then terminates itself, causing the main process to spawn a new subprocess to replace it.

In the beginning you might have 3 PIDs: 10, 20, 30. At time T0, they all start at 0 context switches.

@ T1
PID 10 - 100 switches
PID 20 -  10 switches
PID 30 -  10 switches
SUM    = 120 switches

@ T2
PID 10 -  150 switches
PID 20 - 1000 switches
PID 30 - 2000 switches
SUM    = 3150 switches
...etc.

Now, what happens when one of the processes dies and is replaced?

@ TN
PID 10 -  160 switches
PID 20 - 1200 switches
PID 40 -    0 switches
SUM    = 1360 switches

Oops...the number of context switches went down!
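The scenario above can be replayed in a few lines of Python. This is only a sketch of the suspected behavior (a plain sum over whichever processes are alive at scrape time), not the exporter's actual code. The snapshot data mirrors the example PIDs, and naive_group_total is a hypothetical name.

```python
# Hypothetical snapshots: {pid: cumulative context switches} at each scrape.
snapshots = [
    {10: 100, 20: 10, 30: 10},      # T1
    {10: 150, 20: 1000, 30: 2000},  # T2
    {10: 160, 20: 1200, 40: 0},     # TN: PID 30 exited, PID 40 replaced it
]


def naive_group_total(snapshot):
    """Sum the per-process counters of the processes alive right now."""
    return sum(snapshot.values())


totals = [naive_group_total(s) for s in snapshots]
print(totals)

# A counter must never decrease; this flags every scrape where the sum did.
drops = [i for i in range(1, len(totals)) if totals[i] < totals[i - 1]]
print("scrapes where the sum went down:", drops)
```

The last total is lower than the one before it, because everything PID 30 had accumulated disappears from the sum the moment it exits.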

This has produced an interesting result for us: context switching appears to be constantly accelerating for our long-running processes, because PID 10's count is constantly increasing and the rate() function in Prometheus thinks the counter is resetting all the time.
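A rough model of why this looks like acceleration: Prometheus treats any decrease in a counter as a reset and assumes the counter restarted from 0, so the whole post-drop value is counted as new increase. The sketch below is a simplification that ignores range windows and extrapolation. The series is made up for illustration: one long-lived process gaining 10 switches per scrape, plus a worker that has 50 switches on odd scrapes and has just been replaced (0 switches) on even scrapes. apparent_increases is a hypothetical helper, not a Prometheus API.

```python
def apparent_increases(samples):
    """Per-interval increase as seen by a reset-aware counter function."""
    out = []
    for prev, cur in zip(samples, samples[1:]):
        if cur < prev:
            # A drop is assumed to be a counter reset: the counter is taken
            # to have restarted from 0, so all of `cur` counts as increase.
            out.append(cur)
        else:
            out.append(cur - prev)
    return out


# Hypothetical group sum over six scrapes (t = 1..6).
group_sum = [10 * t + (50 if t % 2 else 0) for t in range(1, 7)]
print(group_sum)                      # [60, 20, 80, 40, 100, 60]
print(apparent_increases(group_sum))  # [20, 60, 40, 60, 60]
```

On the intervals where the sum drops (the first, third, and fifth), the apparent increase is 20, then 40, then 60, even though the long-lived process only gained 10 switches each time. Its entire accumulated count is re-added on every false reset, so the error grows as the process ages.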

I'm not sure how this should be solved, however; adding the PID as a label would generate high cardinality.
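One possible direction, sketched here purely as an assumption and not as how the exporter works or should work: keep the last seen value per PID inside the exporter and add only the per-PID deltas to a single group counter. The PID never becomes a label, so cardinality is unchanged. GroupCounter and observe are hypothetical names.

```python
class GroupCounter:
    """Hypothetical monotonic group counter built from per-PID deltas."""

    def __init__(self):
        self.total = 0
        self._last = {}  # pid -> last seen cumulative value

    def observe(self, snapshot):
        """Fold one scrape ({pid: cumulative switches}) into the total."""
        for pid, value in snapshot.items():
            prev = self._last.get(pid, 0)  # an unseen PID counts from 0
            # A lower value for a known PID means the PID was reused by a
            # new process, so its whole value is new work.
            self.total += value - prev if value >= prev else value
        # Forget exited PIDs so the map does not grow without bound.
        self._last = dict(snapshot)
        return self.total


counter = GroupCounter()
scrapes = [
    {10: 100, 20: 10, 30: 10},      # T1
    {10: 150, 20: 1000, 30: 2000},  # T2
    {10: 160, 20: 1200, 40: 0},     # TN: PID 30 replaced by PID 40
]
results = [counter.observe(s) for s in scrapes]
print(results)  # [120, 3150, 3360]
```

The total never decreases when a process exits. The trade-off is that any switches a process made between its last scrape and its exit are lost, so the counter can undercount slightly.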