Report expected vs. actual samples count to highlight low CPU usage/availability to application

jvm-profiling-tools / honest-profiler

A sampling JVM profiler without the safepoint sample bias

https://github.com/RichardWarburton/honest-profiler/wiki

MIT License

1.25k stars 146 forks source link

Report expected vs. actual samples count to highlight low CPU usage/availability to application #126

Open ceeaspb opened 8 years ago

ceeaspb commented 8 years ago

Low number of samples likely reflects low cpu usage or a short profiling duration, etc.

But by reporting the cpu cost as a percentage we could lose sight of this (low sample number).

This warning is presented in other sampling profilers (ie. typical / covers off a profiling risk).

How many samples is low? how long is a piece of string? TBC

nitsanw commented 8 years ago

There's 2 problems here:

Missing samples: The timer fires an interrupt but the JVM is not actually on CPU, so is not interrupted. I'm not sure where this interrupts go. We currently have no expectation on the number of samples, but we could start by logging the intended sampling rate and compare to that.
Failed samples: The JVM is running, a thread is interrupted and for any reason sampling fails. I feel this is covered by #97 .

ceeaspb commented 8 years ago

@nitsanw 1) it looks like SIGPROF will only ever fire when a thread is on cpu. It counts down as user and system cpu time elapses. So if the process has no threads on cpu then the timer will not expire and a signal not sent when off cpu. So I don't think we can "miss" a SIGPROF interrupt so to speak (would be good to check). They may just never occur if the whole jvm is off cpu forever.

Other timers: SIGALRM - counts down real time SIGVTALRM - counts down user cpu time and not system cpu time. Which makes me think of something else but I think my ticket quota today is exceeded.

2) agree this ticket isn't about failed samples.

probably needs more thought. the risk is you are led to optimise for cpu when you should be looking at off cpu time (in which case you need a different tool).

nitsanw commented 8 years ago

From what you say I take that the time will expire, but the SIGPROF will get swallowed. This matches my observations looking at HP logs. But I would need to experiment more to validate. Discovering the number of actual samples vs. wanted samples is easy by logging the timer thread counter of signals emitted and comparing to number of log entries.

nitsanw commented 8 years ago

This implies profiler stop sequence should be terminate timer, drain logger, log number of timer emitted signals.

nitsanw commented 8 years ago

Rephrasing title to reflect discussion

nitsanw commented 8 years ago

@ceeaspb reading the code more closely it looks like I misunderstood how the notification timer hangs together. In particular we rely on: http://man7.org/linux/man-pages/man2/setitimer.2.html

ITIMER_VIRTUAL decrements only when the process is executing, and
                  delivers SIGVTALRM upon expiration.
ITIMER_PROF    decrements both when the process executes and when the
                  system is executing on behalf of the process.  Coupled
                  with ITIMER_VIRTUAL, this timer is usually used to
                  profile the time spent by the application in user and
                  kernel space.  SIGPROF is delivered upon expiration.

So if no CPU is used (user or system) no signal will be delivered. This does not mean we cannot follow through on this feature. An estimate of intended sample rate should be sufficient as an indicator.