I've replaced /usr/bin/naemon and /usr/lib/naemon/libnaemon.so.0.0.0 with the 1.0.8 versions to be sure it's not some configuration setting that was overwritten during the upgrade.
I reverted at 13:25 today (rightmost part of the graph). I'm inclined to think there is indeed something going on here?
To give you an idea of the monitoring load: # Active Host / Service Checks: 4629 / 70016. Most of these are done every 5 minutes. There are a bunch of other workers in different locations, but these 4 are the busiest ones.
Greetings, Pascal
That's strange, the only change in the scheduling logic is during reloads/restarts. I'll try to reproduce that.
To make sure I'm not chasing ghosts, I restored the 1.0.9 binaries at 15:43.
It looks more reasonable now (except for worker 1 apparently, but that's the only one where I also did a full apt upgrade while troubleshooting; I'll wait some more to see if it normalizes).
I'm starting to wonder if it's related to the normal apt-get upgrade procedure, as I manually stopped and started naemon each time to swap the binaries.
Which reminds me that I couldn't replace the naemon binary while it was running; I got an error like
cp: cannot create regular file '/usr/bin/naemon': Text file busy
But I guess the apt upgrade should have cleanly stopped naemon as well. I need to upgrade a second (somewhat smaller) environment, I'll see if I get the same behavior there.
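For completeness, the manual swap I'm describing above was roughly the following (a rough sketch; assuming a systemd-managed naemon service, and the source directory with the unpacked 1.0.8 files is just a placeholder):

systemctl stop naemon
cp ./naemon-1.0.8/usr/bin/naemon /usr/bin/naemon                                          # fails with "Text file busy" while naemon is still running
cp ./naemon-1.0.8/usr/lib/naemon/libnaemon.so.0.0.0 /usr/lib/naemon/libnaemon.so.0.0.0
systemctl start naemon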
I did some tests yesterday but without any interesting results. Both 1.0.8 and 1.0.9 did exactly the same number of checks, and it matched the calculated estimate. I moved 50% of the checks to mod-gearman and the other half to naemon's workers directly, with no difference. I noticed slightly increased CPU usage for 1.0.9 compared to 1.0.8, but nothing significant.
So to move forward, try to find the source of the load. Is it CPU usage, network connections, or disk I/O? Unfortunately, just looking at the load does not help much.
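A few generic commands are usually enough for a first impression (nothing naemon-specific; iostat comes from the sysstat package):

vmstat 5          # run queue, CPU user/system/iowait
iostat -x 5       # per-device disk utilisation and wait times
ss -s             # socket summary, e.g. piles of TIME-WAIT/CLOSE-WAIT connections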
It's kinda hard to troubleshoot load ad hoc as it's far from a synthetic environment; it depends on the state of the network, the hosts, and the different workloads in different datacenters. Some checks take a lot of memory and quite some time to run, and often these same checks run in parallel, so load creeps up for a few minutes.
I've been trying to switch between 1.0.8 and 1.0.9 over the last few days to find a pattern, but sometimes everything is normal with 1.0.9 as well, so...
To be clear, my workers serve two masters. I was holding off on upgrading the second one, but it's done now as well for easier comparison (same version on both).
As I can't wrap my head around it, I've decided to switch versions every two hours during the night and see if something pops up.
In the meantime I was wondering if there's an easy way to see how many checks have been executed by a worker during some timespan?
> In the meantime I was wondering if there's an easy way to see how many checks have been executed by a worker during some timespan?

You could use the check_gearman plugin. It returns a job counter like this:
..../plugins/check_gearman -H localhost -q worker_$(hostname) -t 10 -s check
check_gearman OK - hostname has 3 worker and is working on 1 jobs. Version: 1.1.0|worker=3;;;2;50 jobs=521729c
It's good advice to have that checked anyway :-) And as a side effect you get nice graphs.
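A minimal sketch of how that could be wired into naemon itself, assuming your own plugin path ($USER1$) and a generic-service template; the command name, host name and service description below are made up:

define command {
    command_name    check_gearman_worker
    command_line    $USER1$/check_gearman -H localhost -q worker_$ARG1$ -t 10 -s check
}

define service {
    use                     generic-service
    host_name               worker1
    service_description     gearman_worker_jobs
    check_command           check_gearman_worker!worker1
}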
Hi Sven, I don't have such beautiful graphs, but it seems I'm running into the same problem. I have many backends of different versions using naemon/gearmand plus many workers. I built 1 naemon container and 1 mod_gearman_worker container on two separate servers. When naemon starts, all checks seem to be scheduled simultaneously, and my worker's load goes up to 200. It becomes quieter when I reschedule via Thruk's core_scheduling.cgi and rebalance all hosts and services. 126 hosts and 1300 services => load goes up to 200.
I already use container workers of the same version with more checks; only the backend version is different (Icinga 1.4, Naemon 1.0.8): 540 hosts and 15000 services on 6 container workers, and the load stays under 0.3.
I hope you can understand my English and that this post helps.
Versions affected by the problem: naemon-devel-1.0.10-0.x86_64 naemon-core-1.0.10-0.x86_64 naemon-livestatus-1.0.10-0.x86_64 libnaemon-1.0.10-0.x86_64
You can enable the core_scheduling Thruk plugin to get some insights:
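In case it helps someone else: on a packaged Thruk installation the plugin is usually enabled by symlinking it from plugins-available to plugins-enabled and reloading the web server (the paths below are Debian-style defaults and may differ on your system):

ln -s /usr/share/thruk/plugins/plugins-available/core_scheduling /etc/thruk/plugins/plugins-enabled/core_scheduling
systemctl reload apache2     # or whichever web server serves Thruk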
It's been a while, and we've been running 1.0.8 all this time to avoid this issue. Because of (yet unexplained) naemon cpu usage and crashes last Monday, I decided to upgrade to 1.2.0 hoping it would at least keep running.
Unfortunately, since then we have been experiencing higher load on the worker nodes again. Last Saturday I reverted to verify that the load would return to normal.
As suggested by you I added graphing of the number of checks and workers in each queue, and this is the result.
Number of checks running in the queue:
Number of workers in the queue: What I can tell is that the number of workers used to be constant, was a lot higher with 1.2.0, went back to normal when reverting to 1.0.8, but for some reason sometimes jumped back up to the 1.2.0 level for many hours :/ With 1.2.0 it's constant, i.e. always high; 1.0.8 used to be always low until now.
When 1.0.8 gets back to normal, it's usually after a reload, I believe. I also noticed that hardly anything was logged to naemon.log right before the high-load period started, for about two minutes. As if it had been busy and then tried to catch up, causing a spike and then getting into this bad state.
Load on 1 of the 4 workers in this queue:
Looking more in detail at the difference in number of checks running in this queue today:
I have no idea what might be the cause. I hope you do, but I'm afraid this type of issue is not the easiest to track down.
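For anyone who wants to reproduce these graphs: as an alternative to check_gearman, gearmand itself ships a gearadmin tool whose --status output lists, per queue, the total jobs, currently running jobs and registered workers. A rough sketch, sampled from cron against the local gearmand and using mod-gearman's default "service" queue name:

date +"%F %T"                    # timestamp for the sample
gearadmin --status | awk '$1 == "service" { print "jobs=" $2, "running=" $3, "workers=" $4 }'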
Slowly testing different things: 1.0.8 started crashing a lot again, but it now seems fine after increasing livestatus' thread_stack_size. The sudden high-worker-load periods remain, however. Now that it's more or less stable, I noticed that the high load always starts a bit after 22:00 in the evening, and that's when I see this in syslog:
May 18 22:05:24 myhost kernel: [603938.559272] INFO: rcu_sched detected stalls on CPUs/tasks:
May 18 22:05:24 myhost kernel: [603938.563127] 4-...: (0 ticks this GP) idle=0e1/140000000000000/0 softirq=47085130/47085130 fqs=11
May 18 22:05:24 myhost kernel: [603938.564180] (detected by 2, t=10072 jiffies, g=27657668, c=27657667, q=149)
May 18 22:05:24 myhost kernel: [603938.564607] Task dump for CPU 0:
May 18 22:05:24 myhost kernel: [603938.564609] swapper/0 R running task 0 0 0 0x00000000
May 18 22:05:24 myhost kernel: [603938.564612] ffffffff99611500 ffff881ff643b400 ffffffff99611500 ffff8820f6618980
May 18 22:05:24 myhost kernel: [603938.564614] ffff8820f2a37100 ffffffff99603ef0 ffffffff99015b89 0000000000000000
May 18 22:05:24 myhost kernel: [603938.564616] 0000000000000000 ffff8820f6618980 ffffffffffffff02 ffffffff99611500
May 18 22:05:24 myhost kernel: [603938.564618] Call Trace:
May 18 22:05:24 myhost kernel: [603938.564625] [<ffffffff99015b89>] ? __schedule+0x239/0x6f0
May 18 22:05:24 myhost kernel: [603938.564627] [<ffffffff99016072>] ? schedule+0x32/0x80
May 18 22:05:24 myhost kernel: [603938.564631] [<ffffffff98abdf4a>] ? cpu_startup_entry+0x1ca/0x240
May 18 22:05:24 myhost kernel: [603938.564634] [<ffffffff9973ef5e>] ? start_kernel+0x447/0x467
May 18 22:05:24 myhost kernel: [603938.564636] [<ffffffff9973e120>] ? early_idt_handler_array+0x120/0x120
May 18 22:05:24 myhost kernel: [603938.564637] [<ffffffff9973e408>] ? x86_64_start_kernel+0x14c/0x170
May 18 22:05:24 myhost kernel: [603938.564637] Task dump for CPU 1:
May 18 22:05:24 myhost kernel: [603938.564638] swapper/1 R running task 0 0 1 0x00000000
May 18 22:05:24 myhost kernel: [603938.564639] 0000000000000400 0000000000000410 0000000000000001 0100000000000000
May 18 22:05:24 myhost kernel: [603938.564641] ffffffff99019ed0 0000000000000000 0000000000000000 0000000000000000
May 18 22:05:24 myhost kernel: [603938.564642] 0000000000000000 ffffffffffffff58 ffffffff9901a1be 0000000000000010
May 18 22:05:24 myhost kernel: [603938.564643] Call Trace:
May 18 22:05:24 myhost kernel: [603938.564646] [<ffffffff99019ed0>] ? __cpuidle_text_start+0x8/0x8
May 18 22:05:24 myhost kernel: [603938.564648] [<ffffffff9901a1be>] ? native_safe_halt+0xe/0x10
May 18 22:05:24 myhost kernel: [603938.564649] [<ffffffff99019eea>] ? default_idle+0x1a/0xd0
May 18 22:05:24 myhost kernel: [603938.564650] [<ffffffff98abdf4a>] ? cpu_startup_entry+0x1ca/0x240
May 18 22:05:24 myhost kernel: [603938.564655] [<ffffffff98a4aa50>] ? start_secondary+0x170/0x1b0
May 18 22:05:24 myhost kernel: [603938.564655] Task dump for CPU 2:
May 18 22:05:24 myhost kernel: [603938.564656] swapper/2 R running task 0 0 1 0x00000000
May 18 22:05:24 myhost kernel: [603938.564657] ffffffff99719900 ffffffff98aa822b 0000000000000002 ffffffff99719900
May 18 22:05:24 myhost kernel: [603938.564658] ffffffff98b81a3e ffff8820f66996c0 ffffffff9964fd40 0000000000000000
May 18 22:05:24 myhost kernel: [603938.564659] ffffffff99719900 00000000ffffffff ffffffff98ae3db4 0000000000000000
May 18 22:05:24 myhost kernel: [603938.564660] Call Trace:
May 18 22:05:24 myhost kernel: [603938.564661] <IRQ>
May 18 22:05:24 myhost kernel: [603938.564664] [<ffffffff98aa822b>] ? sched_show_task+0xcb/0x130
May 18 22:05:24 myhost kernel: [603938.564668] [<ffffffff98b81a3e>] ? rcu_dump_cpu_stacks+0x92/0xb2
May 18 22:05:24 myhost kernel: [603938.564673] [<ffffffff98ae3db4>] ? rcu_check_callbacks+0x854/0x8b0
May 18 22:05:24 myhost kernel: [603938.564677] [<ffffffff98afa260>] ? tick_sched_do_timer+0x30/0x30
May 18 22:05:24 myhost kernel: [603938.564678] [<ffffffff98aea8a8>] ? update_process_times+0x28/0x50
May 18 22:05:24 myhost kernel: [603938.564679] [<ffffffff98af9c60>] ? tick_sched_handle.isra.12+0x20/0x50
May 18 22:05:24 myhost kernel: [603938.564681] [<ffffffff98afa298>] ? tick_sched_timer+0x38/0x70
May 18 22:05:24 myhost kernel: [603938.564682] [<ffffffff98aeb37e>] ? __hrtimer_run_queues+0xde/0x250
May 18 22:05:24 myhost kernel: [603938.564683] [<ffffffff98aeba5c>] ? hrtimer_interrupt+0x9c/0x1a0
May 18 22:05:24 myhost kernel: [603938.564684] [<ffffffff9901ded7>] ? smp_apic_timer_interrupt+0x47/0x60
May 18 22:05:24 myhost kernel: [603938.564686] [<ffffffff9901c766>] ? apic_timer_interrupt+0x96/0xa0
May 18 22:05:24 myhost kernel: [603938.564686] <EOI>
May 18 22:05:24 myhost kernel: [603938.564687] [<ffffffff99019ed0>] ? __cpuidle_text_start+0x8/0x8
May 18 22:05:24 myhost kernel: [603938.564688] [<ffffffff9901a1be>] ? native_safe_halt+0xe/0x10
May 18 22:05:24 myhost kernel: [603938.564689] [<ffffffff99019eea>] ? default_idle+0x1a/0xd0
May 18 22:05:24 myhost kernel: [603938.564690] [<ffffffff98abdf4a>] ? cpu_startup_entry+0x1ca/0x240
May 18 22:05:24 myhost kernel: [603938.564691] [<ffffffff98a4aa50>] ? start_secondary+0x170/0x1b0
May 18 22:05:24 myhost kernel: [603938.564691] Task dump for CPU 3:
May 18 22:05:24 myhost kernel: [603938.564692] swapper/3 R running task 0 0 1 0x00000000
May 18 22:05:24 myhost kernel: [603938.564693] 000225530c0b4600 ffff8820f66d26c0 0000000108fee39f ffffffff98aea75f
May 18 22:05:24 myhost kernel: [603938.564694] 000225530c0b4600 0000000000000003 000225530c1cd7e2 ffff8820f66cfe80
May 18 22:05:24 myhost kernel: [603938.564695] ffffffff98af9e00 00000000f66d8980 00000000003d0900 7fffffffffffffff
May 18 22:05:24 myhost kernel: [603938.564697] Call Trace:
May 18 22:05:24 myhost kernel: [603938.564698] [<ffffffff98aea75f>] ? get_next_timer_interrupt+0x6f/0x120
May 18 22:05:24 myhost kernel: [603938.564699] [<ffffffff98af9e00>] ? __tick_nohz_idle_enter+0x100/0x3e0
May 18 22:05:24 myhost kernel: [603938.564700] [<ffffffff99019eea>] ? default_idle+0x1a/0xd0
May 18 22:05:24 myhost kernel: [603938.564701] [<ffffffff98abdf4a>] ? cpu_startup_entry+0x1ca/0x240
May 18 22:05:24 myhost kernel: [603938.564702] [<ffffffff98a4aa50>] ? start_secondary+0x170/0x1b0
May 18 22:05:24 myhost kernel: [603938.564702] Task dump for CPU 4:
May 18 22:05:24 myhost kernel: [603938.564703] swapper/4 R running task 0 0 1 0x00000000
May 18 22:05:24 myhost kernel: [603938.564704] ffff8820efbcf080 ffff8820f6718980 ffffffff98a2f240 ffffffff98af14fe
May 18 22:05:24 myhost kernel: [603938.564705] ffff8820f4d260c0 0000000000000004 ffffffff997198c0 ffffffff98a2f8f5
May 18 22:05:24 myhost kernel: [603938.564706] ffffffff98af9e00 00000000f6718980 00000000003d0900 7fffffffffffffff
May 18 22:05:24 myhost kernel: [603938.564707] Call Trace:
May 18 22:05:24 myhost kernel: [603938.564710] [<ffffffff98a2f240>] ? recalibrate_cpu_khz+0x10/0x10
May 18 22:05:24 myhost kernel: [603938.564711] [<ffffffff98af14fe>] ? ktime_get+0x3e/0xb0
May 18 22:05:24 myhost kernel: [603938.564712] [<ffffffff98a2f8f5>] ? sched_clock+0x5/0x10
May 18 22:05:24 myhost kernel: [603938.564713] [<ffffffff98af9e00>] ? __tick_nohz_idle_enter+0x100/0x3e0
May 18 22:05:24 myhost kernel: [603938.564714] [<ffffffff99019eea>] ? default_idle+0x1a/0xd0
May 18 22:05:24 myhost kernel: [603938.564715] [<ffffffff98abdf4a>] ? cpu_startup_entry+0x1ca/0x240
May 18 22:05:24 myhost kernel: [603938.564716] [<ffffffff98a4aa50>] ? start_secondary+0x170/0x1b0
May 18 22:05:24 myhost kernel: [603938.564717] Task dump for CPU 5:
May 18 22:05:24 myhost kernel: [603938.564718] naemon R running task 0 17163 1 0x00000000
May 18 22:05:24 myhost kernel: [603938.564719] ffff8820f4d2b100 ffff8820f6758980 0000000000000246 ffff8820f2f30910
May 18 22:05:24 myhost kernel: [603938.564720] ffffffffc05dc4cc ffff9db2819ffa30 ffffffff99015b91 ffff8820f4150000
May 18 22:05:24 myhost kernel: [603938.564721] 00ff8820f03ea0e8 0000000300758980 0000000000000003 0000000000000246
May 18 22:05:24 myhost kernel: [603938.564722] Call Trace:
May 18 22:05:24 myhost kernel: [603938.564729] [<ffffffff99015b89>] ? __schedule+0x239/0x6f0
May 18 22:05:24 myhost kernel: [603938.564731] [<ffffffff99016072>] ? schedule+0x32/0x80
May 18 22:05:24 myhost kernel: [603938.564732] [<ffffffff9901998f>] ? schedule_hrtimeout_range_clock+0x18f/0x1a0
May 18 22:05:24 myhost kernel: [603938.564733] [<ffffffff98abd30b>] ? add_wait_queue+0x2b/0x40
May 18 22:05:24 myhost kernel: [603938.564736] [<ffffffff98f75cb0>] ? tcp_init_xmit_timers+0x20/0x20
May 18 22:05:24 myhost kernel: [603938.564740] [<ffffffff98c213e1>] ? poll_schedule_timeout+0x41/0x70
May 18 22:05:24 myhost kernel: [603938.564741] [<ffffffff98c22a6d>] ? do_sys_poll+0x4ad/0x560
May 18 22:05:24 myhost kernel: [603938.564745] [<ffffffff98a804cf>] ? __local_bh_enable_ip+0x8f/0xa0
May 18 22:05:24 myhost kernel: [603938.564748] [<ffffffff98f57d7a>] ? ip_finish_output2+0x19a/0x430
May 18 22:05:24 myhost kernel: [603938.564753] [<ffffffff98f4b9e4>] ? nf_iterate+0x54/0x60
May 18 22:05:24 myhost kernel: [603938.564755] [<ffffffff98f58c3a>] ? ip_output+0x6a/0xf0
May 18 22:05:24 myhost kernel: [603938.564756] [<ffffffff98f58090>] ? __ip_flush_pending_frames.isra.43+0x80/0x80
May 18 22:05:24 myhost kernel: [603938.564758] [<ffffffff98c215b0>] ? poll_select_copy_remaining+0x150/0x150
May 18 22:05:24 myhost kernel: [603938.564761] [<ffffffff98ef7524>] ? sk_reset_timer+0x14/0x30
May 18 22:05:24 myhost kernel: [603938.564762] [<ffffffff98f70935>] ? tcp_schedule_loss_probe+0x115/0x1a0
May 18 22:05:24 myhost kernel: [603938.564763] [<ffffffff98f71940>] ? tcp_write_xmit+0x440/0xf90
May 18 22:05:24 myhost kernel: [603938.564764] [<ffffffff98ef8003>] ? release_sock+0x43/0x90
May 18 22:05:24 myhost kernel: [603938.564767] [<ffffffff98f6408f>] ? tcp_sendmsg+0x65f/0xc40
May 18 22:05:24 myhost kernel: [603938.564770] [<ffffffff98f904c3>] ? inet_recvmsg+0x83/0xb0
May 18 22:05:24 myhost kernel: [603938.564773] [<ffffffff98ef383d>] ? SYSC_recvfrom+0x11d/0x130
May 18 22:05:24 myhost kernel: [603938.564775] [<ffffffff98c5522f>] ? ep_poll+0x32f/0x350
May 18 22:05:24 myhost kernel: [603938.564777] [<ffffffff98a8bd39>] ? do_sigaction+0x199/0x1e0
May 18 22:05:24 myhost kernel: [603938.564778] [<ffffffff98c569fc>] ? SyS_epoll_wait+0x9c/0xe0
May 18 22:05:24 myhost kernel: [603938.564780] [<ffffffff98a03b7d>] ? do_syscall_64+0x8d/0x100
May 18 22:05:24 myhost kernel: [603938.564780] Task dump for CPU 6:
May 18 22:05:24 myhost kernel: [603938.564781] swapper/6 R running task 0 0 1 0x00000000
May 18 22:05:24 myhost kernel: [603938.564782] 0000000000000400 0000000000000410 0000000000000010 0100000000000000
May 18 22:05:24 myhost kernel: [603938.564783] ffffffff99019ed0 0000000000000000 0000000000000000 0000000000000000
May 18 22:05:24 myhost kernel: [603938.564784] 0000000000000000 ffffffffffffff90 ffffffff9901a1be 0000000000000010
May 18 22:05:24 myhost kernel: [603938.564785] Call Trace:
May 18 22:05:24 myhost kernel: [603938.564786] [<ffffffff99019ed0>] ? __cpuidle_text_start+0x8/0x8
May 18 22:05:24 myhost kernel: [603938.564787] [<ffffffff9901a1be>] ? native_safe_halt+0xe/0x10
May 18 22:05:24 myhost kernel: [603938.564788] [<ffffffff99019eea>] ? default_idle+0x1a/0xd0
May 18 22:05:24 myhost kernel: [603938.564789] [<ffffffff98abdf4a>] ? cpu_startup_entry+0x1ca/0x240
May 18 22:05:24 myhost kernel: [603938.564790] [<ffffffff98a4aa50>] ? start_secondary+0x170/0x1b0
May 18 22:05:24 myhost kernel: [603938.564790] Task dump for CPU 7:
May 18 22:05:24 myhost kernel: [603938.564791] swapper/7 R running task 0 0 1 0x00000000
May 18 22:05:24 myhost kernel: [603938.564792] 0000000000000400 0000000000000000 0000000000000000 0100000000000000
May 18 22:05:24 myhost kernel: [603938.564793] ffffffff99019ed0 0000000000000000 0000000000000000 0000000000000000
May 18 22:05:24 myhost kernel: [603938.564794] 0000000000000000 ffffffffffffff61 ffffffff9901a1be 0000000000000010
May 18 22:05:24 myhost kernel: [603938.564795] Call Trace:
May 18 22:05:24 myhost kernel: [603938.564796] [<ffffffff99019ed0>] ? __cpuidle_text_start+0x8/0x8
May 18 22:05:24 myhost kernel: [603938.564797] [<ffffffff9901a1be>] ? native_safe_halt+0xe/0x10
May 18 22:05:24 myhost kernel: [603938.564797] [<ffffffff99019eea>] ? default_idle+0x1a/0xd0
May 18 22:05:24 myhost kernel: [603938.564798] [<ffffffff98abdf4a>] ? cpu_startup_entry+0x1ca/0x240
May 18 22:05:24 myhost kernel: [603938.564799] [<ffffffff98a4aa50>] ? start_secondary+0x170/0x1b0
Never mind, I found the cause of that CPU stall: Veeam snapshotting, which froze the naemon VM for too long, after which naemon probably tried to schedule too much too quickly, causing a monworker load spike. I'm still unsure why that would go on for hours though, but I've seen many cases where gearmand or the workers act oddly after communication issues.
Fixed by deleting a regular VM snapshot that caused Veeam to take much longer than normal.
As I'm now in a stable situation again, I'll try 1.2.0 once more :-)
That's one of the reasons why we rewrote the worker in Go in the meantime: https://github.com/ConSol/mod-gearman-worker-go/. Give it a try. It uses fewer resources than the C worker and should be more reliable on connection failures.
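If anyone wants to try it: the Go worker is intended as a drop-in replacement for the C worker and, as far as I can tell, reads a mod_gearman-style worker config, so a minimal config could look roughly like this (the server address, key file path and worker counts are placeholders):

debug=0
server=naemon-master.example.com:4730
hosts=yes
services=yes
eventhandler=no
encryption=yes
keyfile=/etc/mod-gearman/secret.key
min-worker=10
max-worker=100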
Thanks for the tip, I definitely will!
FYI, I'm in the process of upgrading our workers to Debian bookworm, with the Go mod-gearman-worker. During the upgrade, my staging gearman-job-server often got into a bad state where it wouldn't handle connections properly anymore (piling up CLOSE_WAIT connections etc., cf. https://bugs.launchpad.net/gearmand/+bug/1176324/).
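A quick way to watch for that state is listing the TCP sockets stuck in CLOSE_WAIT on the gearmand port (4730 is the default; adjust if yours differs):

ss -tn state close-wait '( sport = :4730 )'     # each output line is one stuck connection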
I even started running gearmand 0.33 in a container for a short while :-) but this didn't help either.
Digging a bit further, it turned out that the issue was triggered just by shutting down the old VM with the legacy mod-gearman-worker.
I tried to reproduce this with the new Go worker, but this seemed to behave much better.
Closing this issue. :+1:
Are there any known performance regressions from naemon 1.0.8 => 1.0.9 or mod-gearman 3.0.6 => 3.0.7? I upgraded last night and noticed that the load on my mod-gearman workers is clearly a lot higher.
Used versions on the naemon/gearmand server:
Some graphs:
The load on the mod-gearman worker VMs has almost tripled:
Command execution time has increased slightly: