Open ngosang opened 3 years ago
I'm also seeing this issue, I assume it's because the exporter is doing a straight sum() of all the matching processes without any history.
For example, let's assume we have a process that accepts network connections. The main process spawns 2 sub-processes. Each subprocess handles 1000 requests and then terminates itself, causing the main process to spawn a new process to replace it.
In the beginning you might have 3 PIDs: 10, 20, 30. At time T0, they all start at 0 context switches.
@ T1
PID 10 - 100 switches
PID 20 - 10 switches
PID 30 - 10 switches
SUM = 120 switches
@ T2
PID 10 - 150 switches
PID 20 - 1000 switches
PID 30 - 2000 switches
SUM = 3150 switches
...etc.
Now, what happens when one of the processes die and is replaced?
@ TN
PID 10 - 160 switches
PID 20 - 1200 switches
PID 40 - 0 switches
SUM = 1360 switches
Oops...the number of context switches went down!
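The walkthrough above can be reproduced with a tiny simulation (the snapshots are hypothetical, mirroring the numbers in the example): summing per-PID counters is not monotonic once a PID dies and is replaced.

```python
# Each snapshot maps PID -> cumulative context switches at that scrape.
snapshots = [
    {10: 100, 20: 10,   30: 10},    # T1
    {10: 150, 20: 1000, 30: 2000},  # T2
    {10: 160, 20: 1200, 40: 0},     # TN: PID 30 died, PID 40 replaced it
]

# A naive exporter exposes the plain sum of whatever PIDs exist right now.
sums = [sum(s.values()) for s in snapshots]
print(sums)  # the last sum drops below the previous one
```

The sum falls at TN even though every individual process only ever counted upward, because PID 30's accumulated value silently left the total.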
This has produced an interesting result for us, where it looks like context switching is constantly accelerating for our long-running processes: PID 10 keeps increasing, while the `rate()` function in Prometheus thinks the series is resetting all the time.
I'm not sure how this should be solved, however: adding the PID as a label would generate high cardinality.
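One way to avoid both the non-monotonic sum and the per-PID cardinality is for the exporter itself to keep a running total and only add non-negative per-PID deltas to it. This is a hedged sketch of that idea, not the exporter's actual code:

```python
class GroupCounter:
    """Monotonic group total over a churning set of PIDs (illustrative only)."""

    def __init__(self):
        self.total = 0
        self.last = {}  # pid -> last observed per-PID cumulative value

    def observe(self, snapshot):
        # snapshot maps pid -> cumulative context switches for live PIDs.
        for pid, value in snapshot.items():
            # A PID we have never seen defaults to 0, so a fresh process
            # contributes its full count; a vanished PID simply stops
            # contributing, and the total it already added is kept.
            prev = self.last.get(pid, 0)
            self.total += max(0, value - prev)
        self.last = dict(snapshot)
        return self.total


g = GroupCounter()
g.observe({10: 100, 20: 10, 30: 10})     # -> 120
g.observe({10: 150, 20: 1000, 30: 2000}) # -> 3150
g.observe({10: 160, 20: 1200, 40: 0})    # -> 3360, still monotonic
```

The exported metric then behaves like a true counter: when PID 30 dies its contribution is frozen in the total rather than subtracted, so `rate()` sees no false reset.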
v0.7.5

The metric `namedprocess_namegroup_context_switches_total` is declared as a counter, as it should be. Most of the time the value increases, but not always. This was causing me a lot of issues. In this image you can see how the value increases and decreases. I think this only happens in some processes with many context switches. In this case I'm able to reproduce it in 2 Mono apps on Linux (Sonarr and Radarr).

When I apply the `rate` function the graph is a mess due to negative values in the vector. For now I fixed it using the `deriv` function instead of `rate`. This graph is mostly accurate.

How are you getting the context switches? How is it possible that the value decreases? How can I help?
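The reason `rate` makes the graph a mess is its counter-reset heuristic: whenever a sample is lower than its predecessor, Prometheus assumes the counter restarted from zero and counts the entire new value as fresh increase. This simplified sketch (ignoring Prometheus's range-window extrapolation) shows how a sawtoothing sum inflates the computed increase:

```python
def prom_increase(samples):
    """Simplified model of how Prometheus computes a counter's increase.

    Whenever a sample is lower than the previous one, Prometheus assumes
    the counter reset to 0, so the full new value is added as increase.
    (The real implementation also extrapolates over the range window.)
    """
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        total += cur - prev if cur >= prev else cur
    return total


# Using the sums from the earlier example: the true aggregate work only
# shrinks because a PID was recycled, but the reset heuristic counts the
# entire post-drop value again, overstating the increase.
print(prom_increase([120, 3150, 1360]))
```

`deriv`, by contrast, fits a simple linear regression through the samples with no reset logic, which is why it tolerates the dips, though it is only defined for gauges and can go negative.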