jaj2276 opened this issue 6 years ago
The use of clocks can be seen in the source (e.g. https://github.com/giltene/jHiccup/blob/master/src/main/java/org/jhiccup/HiccupMeter.java#L470). It would be useful if you could include a few lines of the .hlog files around the problem area here (including a few lines before and after the 35s delta you mention). While hiccup-observation coverage is pretty good, it is not 100%. It is certainly possible (but unlikely) for a large freeze in execution to occur between the end of one measurement and the start of another, in a way that would not be captured as a hiccup, and cause a large gap in reported times without an associated hiccup of similar magnitude being logged, e.g. at any point between line https://github.com/giltene/jHiccup/blob/master/src/main/java/org/jhiccup/HiccupMeter.java#L471 and the end of the loop.
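To illustrate the coverage gap described above, here is a simplified sketch of the general shape of such a measurement loop. This is not the actual HiccupMeter code; the class name, method names, and the 10 ms resolution are made up for the example.

```java
// Simplified sketch of a hiccup-measurement loop (NOT the actual
// HiccupMeter code): each iteration times a sleep and records any
// excess over the expected resolution as a "hiccup".
public class HiccupLoopSketch {
    // One iteration: returns the measured hiccup in nanoseconds.
    static long measureOnce(long resolutionMs) throws InterruptedException {
        long before = System.nanoTime();
        Thread.sleep(resolutionMs); // expected to take ~resolutionMs
        long hiccupNanos = (System.nanoTime() - before) - resolutionMs * 1_000_000L;
        return Math.max(hiccupNanos, 0);
    }

    public static void main(String[] args) throws InterruptedException {
        for (int i = 0; i < 3; i++) {
            System.out.println("hiccup(ns)=" + measureOnce(10));
            // A freeze landing right here, after this iteration's final
            // timestamp and before the next iteration's first timestamp,
            // is outside the timed region and would escape measurement.
        }
    }
}
```

The point of the sketch is the comment inside the loop: only the span between the two `nanoTime()` calls is covered, so widening that span to cover the whole loop body is what closes the gap.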
The reasoning for originally preferring the current logic rather than a more conservative one was that [originally] some potentially blocking synchronization was used between the hiccup recorder and the logging mechanism, and we did not want to incur potentially false-positive hiccup recordings due to synchronization (as opposed to an actual hiccup). However, since we've shifted to using HdrHistogram's recorder mechanism (which guarantees wait-free behavior for the recording calls), we can probably expand timing coverage for the loop such that no hiccups could escape, without risking false positives due to synchronization. I'll consider making that change in future jHiccup versions.
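To make the recorder/logger split concrete, here is a minimal sketch of the pattern that HdrHistogram's `Recorder` enables. This is a simplified stand-in built on a plain atomic counter array, not the real `Recorder` (which additionally uses a writer-reader phaser to fully close the race around the swap); it only shows why a stalled logging thread cannot stall the recording thread.

```java
import java.util.concurrent.atomic.AtomicLongArray;
import java.util.concurrent.atomic.AtomicReference;

// Simplified stand-in for HdrHistogram's Recorder: the recording thread
// increments counters in the "active" array without ever blocking, while
// the logging thread atomically swaps in a fresh array and reads the
// retired one. (The real Recorder also uses a writer-reader phaser to
// close the race where a recorder grabs the old array just before the
// swap; this sketch omits that for brevity.)
class IntervalRecorder {
    private final AtomicReference<AtomicLongArray> active =
            new AtomicReference<>(new AtomicLongArray(64));

    // Wait-free for the recorder: a single atomic increment, no locks.
    void recordValue(int bucket) {
        active.get().incrementAndGet(bucket);
    }

    // Called by the logging thread once per interval. The swap is the only
    // coordination point, so a stalled logger cannot stall the recorder.
    AtomicLongArray getIntervalCounts() {
        return active.getAndSet(new AtomicLongArray(64));
    }

    public static void main(String[] args) {
        IntervalRecorder r = new IntervalRecorder();
        r.recordValue(3);
        r.recordValue(3);
        System.out.println("bucket 3 count: " + r.getIntervalCounts().get(3));
    }
}
```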
Hi Gil,
Thanks for the response and apologies for my delayed reply. Holger mentioned he had lunch with you and gave you the additional details I didn't provide in the original issue. I've attached both the application's jHiccup output as well as the control's (i.e. -c) jHiccup output (the raw files, before running them through the jHiccupLogProcessor).
Holger mentioned the thinking that it's likely a disk/file system issue. We're starting to look into that but we don't see any smoking guns at the moment. The interesting thing in this case is that the jHiccup control log doesn't suffer from the same disk/file system issue. Assuming that the jHiccup log writer isn't sharing some resource with the various threads we have writing out to disk, I'm unsure why the disk/file system would block this JVM but not the other JVMs/processes that are writing. jHiccup.21404.171006.1457.log
jaj2276: Thanks for pointing out the control hiccup file. I hadn't forwarded that to Gil from your first message in our ticket system.
Since the issue is not seen in the control hiccup log, I could imagine a slowed-down or halted thread at the kernel level in the Linux filesystem implementation.
Do you see anything in the Linux kernel log, i.e. with "dmesg" indicating a stuck thread on the CPU?
Are you starting the JVM process from another process like taskset, cset or others, or with special libraries which might not be active when the jHiccup control process is launched as a child process?
Holger
The fact that the long recording interval (~35 seconds at time offset 4953.302 in the file) includes over 33K recorded latencies all of which are shorter than (the max of) 0.688msec is pretty clear evidence that the recording thread was not stalled during the long recording interval, and that it's the logging thread that got stalled for some reason (likely some I/O reason, since the scheduler was clearly happy to run threads when they needed to).
As for the control hiccup log: jHiccup is basically watching for and reporting on the ability of a runnable thread to run. It's not trying to identify or report on any I/O or blocking issues. As such, it's hard to tell what would cause the jHiccup logging thread to block (most likely on I/O) in one process and not in another. But that sort of thing isn't "strange". There can be many normal situations that would delay writes to one file while another keeps writing smoothly. E.g. various memory-pressure and flushing behaviors can explain such a thing (one of the files was unlucky enough to have its cached pages evicted from the page cache, while the other was not). So can sector-specific I/O delays (e.g. a bad sector on a disk, or activity blocked behind background GC inside an SSD).
Hey Holger,
I did a dmesg -T > dmesg.log and I see something about stalls.
--- [snip] ---
[Thu Oct 19 02:11:44 2017] INFO: rcu_sched detected stalls on CPUs/tasks: { 11 25 26 27} (detected by 21, t=60002 jiffies, g=120176517, c=120176516, q=0)
--- [/snip] ---
It then proceeds to print out the backtraces of each CPU. All the CPUs' backtraces are identical except for 21 (the detector).
--- [snip] ---
[Thu Oct 19 02:11:44 2017] NMI backtrace for cpu 11
[Thu Oct 19 02:11:44 2017] CPU: 11 PID: 0 Comm: swapper/11 Tainted: G OE ------------ 3.10.0-327.22.2.el7.x86_64 #1
[Thu Oct 19 02:11:44 2017] Hardware name: Dell Inc. PowerEdge R620/0GFKVD, BIOS 2.0.19 08/29/2013
[Thu Oct 19 02:11:44 2017] task: ffff880fe8e24500 ti: ffff880fe8e48000 task.ti: ffff880fe8e48000
[Thu Oct 19 02:11:44 2017] RIP: 0010:[
--- [snip] ---
[Thu Oct 19 02:11:44 2017] NMI backtrace for cpu 21
[Thu Oct 19 02:11:44 2017] CPU: 21 PID: 25511 Comm: redline-inrush. Tainted: G OE ------------ 3.10.0-327.22.2.el7.x86_64 #1
[Thu Oct 19 02:11:44 2017] Hardware name: Dell Inc. PowerEdge R620/0GFKVD, BIOS 2.0.19 08/29/2013
[Thu Oct 19 02:11:44 2017] task: ffff880e82fc3980 ti: ffff880cee00c000 task.ti: ffff880cee00c000
[Thu Oct 19 02:11:44 2017] RIP: 0010:[
[Thu Oct 19 02:11:44 2017] [
--- [/snip] ---
It seems I only have dmesg data going back to Oct 16th (dmesg uses a ring buffer for its data and it's not written out to a /var/log/* file per se), so I can't say for sure whether there was something like this when we saw the issue we've been discussing. Next time we see this happen we'll be sure to look at the dmesg log. Interestingly enough, it seems we get a few of these each day and they all happen overnight (when no processes are running). Not sure if that in and of itself is a clue.
I realize this isn't an Azul Zing/jHiccup issue anymore so no worries if you've got better things to do. Appreciate your responses so far.
Thanks, all makes sense. We'll try to look for I/O tools to help us diagnose this. Thanks for the great tool!
After chatting with an Azul engineer, it was suggested I create this issue to ask for a clarification in the documentation (if no code change is needed) describing which clock is used when printing out the log line and which clock is used to measure pauses (System.currentTimeMillis() and System.nanoTime(), respectively).
We have an issue (still unsolved at this point) where there was a 35s delta between two log lines (the lines before and after are 1s apart). The maximum pause seen during this iteration was only 0.688 msec, which wouldn't seem to be possible if the log line's clock showed a diff of 35s.
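For reference, a small sketch of the two clocks in question (hypothetical class and method names, not jHiccup code): log-line timestamps come from the wall clock (System.currentTimeMillis()), while pause measurement uses the monotonic clock (System.nanoTime()). Because the two are sampled by different threads in jHiccup, a stalled logging thread can produce a large timestamp gap between log lines while every recorded pause stays tiny.

```java
// Demonstrates the two clocks: wall-clock time (used for log-line
// timestamps) vs. the monotonic clock (used for pause measurement).
// On a healthy run the two deltas agree to within a few ms; in the
// reported issue the log timestamps jumped ~35s while recorded pauses
// stayed under 0.688 msec, implying the *logger*, not the recorder,
// was stalled.
public class TwoClocksDemo {
    // Returns {wallClockDeltaMs, monotonicDeltaMs} across one sleep.
    static long[] deltas(long sleepMs) throws InterruptedException {
        long wallStart = System.currentTimeMillis(); // log-line clock
        long monoStart = System.nanoTime();          // pause-measurement clock
        Thread.sleep(sleepMs);
        return new long[] {
            System.currentTimeMillis() - wallStart,
            (System.nanoTime() - monoStart) / 1_000_000
        };
    }

    public static void main(String[] args) throws InterruptedException {
        long[] d = deltas(50);
        System.out.println("wall=" + d[0] + "ms mono=" + d[1] + "ms");
    }
}
```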