naemon / naemon-core

Networks, Applications and Event Monitor
http://www.naemon.io/
GNU General Public License v2.0

high mod-gearman worker load after upgrading to naemon 1.0.9 #281

Closed (pvdputte closed this issue 1 year ago)

pvdputte commented 5 years ago

Are there any known performance regressions from naemon 1.0.8 => 1.0.9 or mod-gearman 3.0.6 => 3.0.7? I upgraded last night and I notice that my mod-gearman workers' load is clearly a lot higher.

2019-02-22 01:38:22 upgrade gearman-job-server:amd64 0.33-6 0.33-7
2019-02-22 01:39:45 upgrade gearman-tools:amd64 0.33-6 0.33-7
2019-02-22 01:39:45 upgrade libgearman7:amd64 0.33-6 0.33-7
2019-02-22 01:39:45 upgrade libnaemon:amd64 1.0.7 1.0.9
2019-02-22 01:39:53 upgrade naemon-core:amd64 1.0.8 1.0.9
2019-02-22 01:39:55 upgrade naemon-livestatus:amd64 1.0.8 1.0.9
2019-02-22 01:39:58 upgrade mod-gearman-module:amd64 3.0.6 3.0.7
2019-02-22 01:39:58 upgrade mod-gearman-tools:amd64 3.0.5 3.0.7

Used versions on the naemon/gearmand server:

ii  gearman-job-server                   0.33-7                         amd64        Job server for the Gearman distributed job queue
ii  gearman-tools                        0.33-7                         amd64        Tools for the Gearman distributed job queue
ii  libgearman7                          0.33-7                         amd64        Library providing Gearman client and worker functions
ii  libnaemon                            1.0.9                          amd64        Library for Naemon - common data files
ii  mod-gearman-module                   3.0.7                          amd64        Event broker module to distribute service checks.
ii  mod-gearman-tools                    3.0.7                          amd64        Tools for mod-gearman
ii  naemon-core                          1.0.9                          amd64        host/service/network monitoring and management system
ii  naemon-livestatus                    1.0.9                          amd64        contains the Naemon livestatus eventbroker module

Some graphs:

The load on the mod-gearman worker VMs has almost tripled (graph: load).

Command execution time has increased slightly (graph: exectime).

pvdputte commented 5 years ago

I've replaced /usr/bin/naemon and /usr/lib/naemon/libnaemon.so.0.0.0 with the 1.0.8 versions to be sure it's not some configuration setting that was overwritten during the upgrade.

I reverted at 13:25 today (rightmost part of the graph). I'm inclined to think there is indeed something going on here?

(screenshot: 2019-02-25_14-24-56)

To give you an idea of the monitoring load: # Active Host / Service Checks: 4629 / 70016. Most of these are done every 5 minutes. There are a bunch of other workers in different locations, but these 4 are the busiest ones.
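
(For scale: 70016 service checks on a 5-minute interval comes to roughly 70016 / 300 ≈ 233 check executions per second; assuming most of them land on these four workers, that is on the order of 60 checks per second each.)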

Greetings, Pascal

sni commented 5 years ago

That's strange, the only change in the scheduling logic is during reloads/restarts. I'll try to reproduce that.

pvdputte commented 5 years ago

To make sure I'm not chasing ghosts, I restored the 1.0.9 binaries at 15:43.

(screenshot: 2019-02-25_17-05-24)

It looks more reasonable now (except for worker 1 apparently, but that's the only one where I also did a full apt upgrade while troubleshooting, I'll wait some more to see if it normalizes).

I'm starting to wonder if it's related to the normal apt-get upgrade procedure, as I manually stopped and started naemon each time I swapped the binaries.

Which reminds me: I couldn't replace the naemon binary while it was running, I got an error like

cp: cannot create regular file '/usr/bin/naemon': Text file busy

But I guess the apt upgrade should have cleanly stopped naemon as well. I need to upgrade a second (somewhat smaller) environment, I'll see if I get the same behavior there.
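
One generic way to confirm which binary a running daemon actually has in memory after files were swapped on disk (a sketch, not specific to naemon's packaging):

# If /usr/bin/naemon was replaced on disk while the daemon kept running,
# the old mapping shows up with a "(deleted)" suffix; pgrep -o picks the
# oldest (main) naemon process
ls -l /proc/$(pgrep -o naemon)/exe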

sni commented 5 years ago

I did some tests yesterday, but without any interesting results. Both 1.0.8 and 1.0.9 executed exactly the same number of checks, and it matched the calculated estimate exactly. I moved 50% of the checks to mod-gearman and the other half directly to Naemon's own workers, with no difference. I noticed slightly increased CPU usage for 1.0.9 compared to 1.0.8, but nothing significant.

So to move forward, try to find the source of the load. Is it CPU usage, network connections, or disk I/O? Just looking at the load value does not help much, unfortunately.

pvdputte commented 5 years ago

It's kind of hard to troubleshoot load ad hoc, as this is far from a synthetic environment: it depends on the state of the network, the hosts, and the different workloads in the different datacenters. Some checks take a lot of memory and quite some time to run, and often these same checks run in parallel, so the load creeps up for a few minutes.

I've been switching between 1.0.8 and 1.0.9 over the last few days to find a pattern, but sometimes everything is normal with 1.0.9 as well, so...

To be clear, my workers serve two masters. I was holding off on upgrading the second one, but it's done now as well for easier comparison (same version on both).

(screenshot: 2019-02-28_13-29-38)

As I can't wrap my head around it, I've decided to switch versions every two hours during the night and see if something pops up.

In the meantime I was wondering: is there an easy way to see how many checks have been executed by a worker during some timespan?

sni commented 5 years ago

> In the meantime I was wondering if there's an easy way to see how many checks have been executed by a worker during some timespan?

You could use the check_gearman plugin. It returns a job counter like this:

..../plugins/check_gearman -H localhost -q worker_$(hostname) -t 10 -s check
check_gearman OK - hostname has 3 worker and is working on 1 jobs. Version: 1.1.0|worker=3;;;2;50 jobs=521729c

It is good advice to have that checked anyway :-) And as a side effect you get nice graphs.
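
Since the jobs value is a cumulative counter, sampling it twice and taking the difference answers the "checks executed during some timespan" question directly. A rough shell sketch (plugin path, queue name and interval are assumptions, reusing the flags from the example above):

# Sample the cumulative "jobs" counter twice and diff it
Q="worker_$(hostname)"
J1=$(check_gearman -H localhost -q "$Q" -t 10 -s check | grep -o 'jobs=[0-9]*' | cut -d= -f2)
sleep 300
J2=$(check_gearman -H localhost -q "$Q" -t 10 -s check | grep -o 'jobs=[0-9]*' | cut -d= -f2)
echo "checks executed by this worker in the last 5 minutes: $((J2 - J1))"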

sdouce commented 5 years ago

Hi Sven, I don't have such beautiful graphs, but it seems I am running into the same problem. I have many backends of different versions using naemon/gearmand plus many workers. I built 1 naemon container + 1 mod_gearman_worker container on two separate servers. When naemon starts, all checks seem to be scheduled to run at the same moment, and my worker load goes up to 200. It becomes quieter when I reschedule via Thruk's core_scheduling.cgi, which rebalances all hosts and services. With 126 hosts and 1300 services, the load goes up to 200.

I already use container workers of the same version with more checks; only the backend version is different (Icinga 1.4, Naemon 1.0.8): 540 hosts and 15000 services on 6 container workers, and the load stays under 0.3.

I hope you will understand my bad English and that this post can help.

Versions affected by the problem: naemon-devel-1.0.10-0.x86_64, naemon-core-1.0.10-0.x86_64, naemon-livestatus-1.0.10-0.x86_64, libnaemon-1.0.10-0.x86_64

sni commented 5 years ago

You can enable the core_scheduling Thruk plugin to get some insights:

(screenshot: 2019-08-18 at 09 39 19)
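
Where the Thruk plugin is not available, a rough alternative for spotting bunched-up scheduling is to look at how the next_check timestamps cluster via Livestatus. A minimal sketch, assuming the default socket path /var/cache/naemon/live and socat as the UNIX-socket client:

# Histogram of service next_check times, bucketed per minute; one huge bucket
# means most checks are scheduled for the same moment
printf 'GET services\nColumns: next_check\n\n' \
  | socat -t 10 - UNIX-CONNECT:/var/cache/naemon/live \
  | awk '{ print int($1/60)*60 }' \
  | sort | uniq -c | sort -rn | head
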
pvdputte commented 4 years ago

It's been a while, and we've been running 1.0.8 all this time to avoid this issue. Because of (yet unexplained) naemon cpu usage and crashes last Monday, I decided to upgrade to 1.2.0 hoping it would at least keep running.

Unfortunately, since then we have been experiencing higher load on the worker nodes again. Last Saturday I reverted to verify that the load would return to normal.

As you suggested, I added graphs of the number of checks and the number of workers in each queue, and this is the result.

Number of checks running in the queue: (image)

Number of workers in the queue: (image)

What I can tell is that the number of workers used to be constant. It was a lot higher with 1.2.0, back to normal when reverting to 1.0.8, but for some reason it sometimes jumped up to the 1.2.0 level again for many hours :/ With 1.2.0 it's constant, i.e. always high; 1.0.8 used to be always low until now.

When 1.0.8 gets back to normal, it's usually after a reload, I believe. I also noticed that hardly anything was logged to naemon.log right before the high-load period started, for about two minutes. As if it had been busy and then tried to catch up, causing a spike and then getting into this bad state. (image)

Load on 1 of the 4 workers in this queue: (image)

Looking more in detail at the difference in the number of checks running in this queue today: (image)

I have no idea what might be the cause. I hope you do, but I'm afraid this type of issue is not the easiest to track down.
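
For reference, the per-queue job and worker counts graphed above can also be sampled directly from gearmand with gearadmin (shipped with gearman-tools). A minimal sketch, assuming the default port and mod-gearman's standard queue names:

# One line per queue: name, total jobs, running jobs, connected workers
gearadmin --host localhost --port 4730 --status | column -t
# Only the main service-check queue:
gearadmin --status | awk '$1 == "service" { print "jobs=" $2, "running=" $3, "workers=" $4 }'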

pvdputte commented 4 years ago

I've slowly been testing different things. 1.0.8 started crashing a lot again, but it now seems fine after increasing livestatus' thread_stack_size. The sudden high worker load periods remain, however. Now that things are more or less stable, I noticed that the high load always starts a bit after 22:00 in the evening, and that's when I see this in syslog:

May 18 22:05:24 myhost kernel: [603938.559272] INFO: rcu_sched detected stalls on CPUs/tasks:
May 18 22:05:24 myhost kernel: [603938.563127]    4-...: (0 ticks this GP) idle=0e1/140000000000000/0 softirq=47085130/47085130 fqs=11 
May 18 22:05:24 myhost kernel: [603938.564180]    (detected by 2, t=10072 jiffies, g=27657668, c=27657667, q=149)
May 18 22:05:24 myhost kernel: [603938.564607] Task dump for CPU 0:
May 18 22:05:24 myhost kernel: [603938.564609] swapper/0       R  running task        0     0      0 0x00000000
May 18 22:05:24 myhost kernel: [603938.564612]  ffffffff99611500 ffff881ff643b400 ffffffff99611500 ffff8820f6618980
May 18 22:05:24 myhost kernel: [603938.564614]  ffff8820f2a37100 ffffffff99603ef0 ffffffff99015b89 0000000000000000
May 18 22:05:24 myhost kernel: [603938.564616]  0000000000000000 ffff8820f6618980 ffffffffffffff02 ffffffff99611500
May 18 22:05:24 myhost kernel: [603938.564618] Call Trace:
May 18 22:05:24 myhost kernel: [603938.564625]  [<ffffffff99015b89>] ? __schedule+0x239/0x6f0
May 18 22:05:24 myhost kernel: [603938.564627]  [<ffffffff99016072>] ? schedule+0x32/0x80
May 18 22:05:24 myhost kernel: [603938.564631]  [<ffffffff98abdf4a>] ? cpu_startup_entry+0x1ca/0x240
May 18 22:05:24 myhost kernel: [603938.564634]  [<ffffffff9973ef5e>] ? start_kernel+0x447/0x467
May 18 22:05:24 myhost kernel: [603938.564636]  [<ffffffff9973e120>] ? early_idt_handler_array+0x120/0x120
May 18 22:05:24 myhost kernel: [603938.564637]  [<ffffffff9973e408>] ? x86_64_start_kernel+0x14c/0x170
May 18 22:05:24 myhost kernel: [603938.564637] Task dump for CPU 1:
May 18 22:05:24 myhost kernel: [603938.564638] swapper/1       R  running task        0     0      1 0x00000000
May 18 22:05:24 myhost kernel: [603938.564639]  0000000000000400 0000000000000410 0000000000000001 0100000000000000
May 18 22:05:24 myhost kernel: [603938.564641]  ffffffff99019ed0 0000000000000000 0000000000000000 0000000000000000
May 18 22:05:24 myhost kernel: [603938.564642]  0000000000000000 ffffffffffffff58 ffffffff9901a1be 0000000000000010
May 18 22:05:24 myhost kernel: [603938.564643] Call Trace:
May 18 22:05:24 myhost kernel: [603938.564646]  [<ffffffff99019ed0>] ? __cpuidle_text_start+0x8/0x8
May 18 22:05:24 myhost kernel: [603938.564648]  [<ffffffff9901a1be>] ? native_safe_halt+0xe/0x10
May 18 22:05:24 myhost kernel: [603938.564649]  [<ffffffff99019eea>] ? default_idle+0x1a/0xd0
May 18 22:05:24 myhost kernel: [603938.564650]  [<ffffffff98abdf4a>] ? cpu_startup_entry+0x1ca/0x240
May 18 22:05:24 myhost kernel: [603938.564655]  [<ffffffff98a4aa50>] ? start_secondary+0x170/0x1b0
May 18 22:05:24 myhost kernel: [603938.564655] Task dump for CPU 2:
May 18 22:05:24 myhost kernel: [603938.564656] swapper/2       R  running task        0     0      1 0x00000000
May 18 22:05:24 myhost kernel: [603938.564657]  ffffffff99719900 ffffffff98aa822b 0000000000000002 ffffffff99719900
May 18 22:05:24 myhost kernel: [603938.564658]  ffffffff98b81a3e ffff8820f66996c0 ffffffff9964fd40 0000000000000000
May 18 22:05:24 myhost kernel: [603938.564659]  ffffffff99719900 00000000ffffffff ffffffff98ae3db4 0000000000000000
May 18 22:05:24 myhost kernel: [603938.564660] Call Trace:
May 18 22:05:24 myhost kernel: [603938.564661]  <IRQ> 
May 18 22:05:24 myhost kernel: [603938.564664]  [<ffffffff98aa822b>] ? sched_show_task+0xcb/0x130
May 18 22:05:24 myhost kernel: [603938.564668]  [<ffffffff98b81a3e>] ? rcu_dump_cpu_stacks+0x92/0xb2
May 18 22:05:24 myhost kernel: [603938.564673]  [<ffffffff98ae3db4>] ? rcu_check_callbacks+0x854/0x8b0
May 18 22:05:24 myhost kernel: [603938.564677]  [<ffffffff98afa260>] ? tick_sched_do_timer+0x30/0x30
May 18 22:05:24 myhost kernel: [603938.564678]  [<ffffffff98aea8a8>] ? update_process_times+0x28/0x50
May 18 22:05:24 myhost kernel: [603938.564679]  [<ffffffff98af9c60>] ? tick_sched_handle.isra.12+0x20/0x50
May 18 22:05:24 myhost kernel: [603938.564681]  [<ffffffff98afa298>] ? tick_sched_timer+0x38/0x70
May 18 22:05:24 myhost kernel: [603938.564682]  [<ffffffff98aeb37e>] ? __hrtimer_run_queues+0xde/0x250
May 18 22:05:24 myhost kernel: [603938.564683]  [<ffffffff98aeba5c>] ? hrtimer_interrupt+0x9c/0x1a0
May 18 22:05:24 myhost kernel: [603938.564684]  [<ffffffff9901ded7>] ? smp_apic_timer_interrupt+0x47/0x60
May 18 22:05:24 myhost kernel: [603938.564686]  [<ffffffff9901c766>] ? apic_timer_interrupt+0x96/0xa0
May 18 22:05:24 myhost kernel: [603938.564686]  <EOI> 
May 18 22:05:24 myhost kernel: [603938.564687]  [<ffffffff99019ed0>] ? __cpuidle_text_start+0x8/0x8
May 18 22:05:24 myhost kernel: [603938.564688]  [<ffffffff9901a1be>] ? native_safe_halt+0xe/0x10
May 18 22:05:24 myhost kernel: [603938.564689]  [<ffffffff99019eea>] ? default_idle+0x1a/0xd0
May 18 22:05:24 myhost kernel: [603938.564690]  [<ffffffff98abdf4a>] ? cpu_startup_entry+0x1ca/0x240
May 18 22:05:24 myhost kernel: [603938.564691]  [<ffffffff98a4aa50>] ? start_secondary+0x170/0x1b0
May 18 22:05:24 myhost kernel: [603938.564691] Task dump for CPU 3:
May 18 22:05:24 myhost kernel: [603938.564692] swapper/3       R  running task        0     0      1 0x00000000
May 18 22:05:24 myhost kernel: [603938.564693]  000225530c0b4600 ffff8820f66d26c0 0000000108fee39f ffffffff98aea75f
May 18 22:05:24 myhost kernel: [603938.564694]  000225530c0b4600 0000000000000003 000225530c1cd7e2 ffff8820f66cfe80
May 18 22:05:24 myhost kernel: [603938.564695]  ffffffff98af9e00 00000000f66d8980 00000000003d0900 7fffffffffffffff
May 18 22:05:24 myhost kernel: [603938.564697] Call Trace:
May 18 22:05:24 myhost kernel: [603938.564698]  [<ffffffff98aea75f>] ? get_next_timer_interrupt+0x6f/0x120
May 18 22:05:24 myhost kernel: [603938.564699]  [<ffffffff98af9e00>] ? __tick_nohz_idle_enter+0x100/0x3e0
May 18 22:05:24 myhost kernel: [603938.564700]  [<ffffffff99019eea>] ? default_idle+0x1a/0xd0
May 18 22:05:24 myhost kernel: [603938.564701]  [<ffffffff98abdf4a>] ? cpu_startup_entry+0x1ca/0x240
May 18 22:05:24 myhost kernel: [603938.564702]  [<ffffffff98a4aa50>] ? start_secondary+0x170/0x1b0
May 18 22:05:24 myhost kernel: [603938.564702] Task dump for CPU 4:
May 18 22:05:24 myhost kernel: [603938.564703] swapper/4       R  running task        0     0      1 0x00000000
May 18 22:05:24 myhost kernel: [603938.564704]  ffff8820efbcf080 ffff8820f6718980 ffffffff98a2f240 ffffffff98af14fe
May 18 22:05:24 myhost kernel: [603938.564705]  ffff8820f4d260c0 0000000000000004 ffffffff997198c0 ffffffff98a2f8f5
May 18 22:05:24 myhost kernel: [603938.564706]  ffffffff98af9e00 00000000f6718980 00000000003d0900 7fffffffffffffff
May 18 22:05:24 myhost kernel: [603938.564707] Call Trace:
May 18 22:05:24 myhost kernel: [603938.564710]  [<ffffffff98a2f240>] ? recalibrate_cpu_khz+0x10/0x10
May 18 22:05:24 myhost kernel: [603938.564711]  [<ffffffff98af14fe>] ? ktime_get+0x3e/0xb0
May 18 22:05:24 myhost kernel: [603938.564712]  [<ffffffff98a2f8f5>] ? sched_clock+0x5/0x10
May 18 22:05:24 myhost kernel: [603938.564713]  [<ffffffff98af9e00>] ? __tick_nohz_idle_enter+0x100/0x3e0
May 18 22:05:24 myhost kernel: [603938.564714]  [<ffffffff99019eea>] ? default_idle+0x1a/0xd0
May 18 22:05:24 myhost kernel: [603938.564715]  [<ffffffff98abdf4a>] ? cpu_startup_entry+0x1ca/0x240
May 18 22:05:24 myhost kernel: [603938.564716]  [<ffffffff98a4aa50>] ? start_secondary+0x170/0x1b0
May 18 22:05:24 myhost kernel: [603938.564717] Task dump for CPU 5:
May 18 22:05:24 myhost kernel: [603938.564718] naemon          R  running task        0 17163      1 0x00000000
May 18 22:05:24 myhost kernel: [603938.564719]  ffff8820f4d2b100 ffff8820f6758980 0000000000000246 ffff8820f2f30910
May 18 22:05:24 myhost kernel: [603938.564720]  ffffffffc05dc4cc ffff9db2819ffa30 ffffffff99015b91 ffff8820f4150000
May 18 22:05:24 myhost kernel: [603938.564721]  00ff8820f03ea0e8 0000000300758980 0000000000000003 0000000000000246
May 18 22:05:24 myhost kernel: [603938.564722] Call Trace:
May 18 22:05:24 myhost kernel: [603938.564729]  [<ffffffff99015b89>] ? __schedule+0x239/0x6f0
May 18 22:05:24 myhost kernel: [603938.564731]  [<ffffffff99016072>] ? schedule+0x32/0x80
May 18 22:05:24 myhost kernel: [603938.564732]  [<ffffffff9901998f>] ? schedule_hrtimeout_range_clock+0x18f/0x1a0
May 18 22:05:24 myhost kernel: [603938.564733]  [<ffffffff98abd30b>] ? add_wait_queue+0x2b/0x40
May 18 22:05:24 myhost kernel: [603938.564736]  [<ffffffff98f75cb0>] ? tcp_init_xmit_timers+0x20/0x20
May 18 22:05:24 myhost kernel: [603938.564740]  [<ffffffff98c213e1>] ? poll_schedule_timeout+0x41/0x70
May 18 22:05:24 myhost kernel: [603938.564741]  [<ffffffff98c22a6d>] ? do_sys_poll+0x4ad/0x560
May 18 22:05:24 myhost kernel: [603938.564745]  [<ffffffff98a804cf>] ? __local_bh_enable_ip+0x8f/0xa0
May 18 22:05:24 myhost kernel: [603938.564748]  [<ffffffff98f57d7a>] ? ip_finish_output2+0x19a/0x430
May 18 22:05:24 myhost kernel: [603938.564753]  [<ffffffff98f4b9e4>] ? nf_iterate+0x54/0x60
May 18 22:05:24 myhost kernel: [603938.564755]  [<ffffffff98f58c3a>] ? ip_output+0x6a/0xf0
May 18 22:05:24 myhost kernel: [603938.564756]  [<ffffffff98f58090>] ? __ip_flush_pending_frames.isra.43+0x80/0x80
May 18 22:05:24 myhost kernel: [603938.564758]  [<ffffffff98c215b0>] ? poll_select_copy_remaining+0x150/0x150
May 18 22:05:24 myhost kernel: [603938.564761]  [<ffffffff98ef7524>] ? sk_reset_timer+0x14/0x30
May 18 22:05:24 myhost kernel: [603938.564762]  [<ffffffff98f70935>] ? tcp_schedule_loss_probe+0x115/0x1a0
May 18 22:05:24 myhost kernel: [603938.564763]  [<ffffffff98f71940>] ? tcp_write_xmit+0x440/0xf90
May 18 22:05:24 myhost kernel: [603938.564764]  [<ffffffff98ef8003>] ? release_sock+0x43/0x90
May 18 22:05:24 myhost kernel: [603938.564767]  [<ffffffff98f6408f>] ? tcp_sendmsg+0x65f/0xc40
May 18 22:05:24 myhost kernel: [603938.564770]  [<ffffffff98f904c3>] ? inet_recvmsg+0x83/0xb0
May 18 22:05:24 myhost kernel: [603938.564773]  [<ffffffff98ef383d>] ? SYSC_recvfrom+0x11d/0x130
May 18 22:05:24 myhost kernel: [603938.564775]  [<ffffffff98c5522f>] ? ep_poll+0x32f/0x350
May 18 22:05:24 myhost kernel: [603938.564777]  [<ffffffff98a8bd39>] ? do_sigaction+0x199/0x1e0
May 18 22:05:24 myhost kernel: [603938.564778]  [<ffffffff98c569fc>] ? SyS_epoll_wait+0x9c/0xe0
May 18 22:05:24 myhost kernel: [603938.564780]  [<ffffffff98a03b7d>] ? do_syscall_64+0x8d/0x100
May 18 22:05:24 myhost kernel: [603938.564780] Task dump for CPU 6:
May 18 22:05:24 myhost kernel: [603938.564781] swapper/6       R  running task        0     0      1 0x00000000
May 18 22:05:24 myhost kernel: [603938.564782]  0000000000000400 0000000000000410 0000000000000010 0100000000000000
May 18 22:05:24 myhost kernel: [603938.564783]  ffffffff99019ed0 0000000000000000 0000000000000000 0000000000000000
May 18 22:05:24 myhost kernel: [603938.564784]  0000000000000000 ffffffffffffff90 ffffffff9901a1be 0000000000000010
May 18 22:05:24 myhost kernel: [603938.564785] Call Trace:
May 18 22:05:24 myhost kernel: [603938.564786]  [<ffffffff99019ed0>] ? __cpuidle_text_start+0x8/0x8
May 18 22:05:24 myhost kernel: [603938.564787]  [<ffffffff9901a1be>] ? native_safe_halt+0xe/0x10
May 18 22:05:24 myhost kernel: [603938.564788]  [<ffffffff99019eea>] ? default_idle+0x1a/0xd0
May 18 22:05:24 myhost kernel: [603938.564789]  [<ffffffff98abdf4a>] ? cpu_startup_entry+0x1ca/0x240
May 18 22:05:24 myhost kernel: [603938.564790]  [<ffffffff98a4aa50>] ? start_secondary+0x170/0x1b0
May 18 22:05:24 myhost kernel: [603938.564790] Task dump for CPU 7:
May 18 22:05:24 myhost kernel: [603938.564791] swapper/7       R  running task        0     0      1 0x00000000
May 18 22:05:24 myhost kernel: [603938.564792]  0000000000000400 0000000000000000 0000000000000000 0100000000000000
May 18 22:05:24 myhost kernel: [603938.564793]  ffffffff99019ed0 0000000000000000 0000000000000000 0000000000000000
May 18 22:05:24 myhost kernel: [603938.564794]  0000000000000000 ffffffffffffff61 ffffffff9901a1be 0000000000000010
May 18 22:05:24 myhost kernel: [603938.564795] Call Trace:
May 18 22:05:24 myhost kernel: [603938.564796]  [<ffffffff99019ed0>] ? __cpuidle_text_start+0x8/0x8
May 18 22:05:24 myhost kernel: [603938.564797]  [<ffffffff9901a1be>] ? native_safe_halt+0xe/0x10
May 18 22:05:24 myhost kernel: [603938.564797]  [<ffffffff99019eea>] ? default_idle+0x1a/0xd0
May 18 22:05:24 myhost kernel: [603938.564798]  [<ffffffff98abdf4a>] ? cpu_startup_entry+0x1ca/0x240
May 18 22:05:24 myhost kernel: [603938.564799]  [<ffffffff98a4aa50>] ? start_secondary+0x170/0x1b0
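
For reference, the thread_stack_size mentioned at the top of this comment is passed as an option on the livestatus broker_module line in naemon.cfg. A sketch with an assumed module path, socket path and value (not the author's exact settings, adjust to your installation):

# naemon.cfg
broker_module=/usr/lib/naemon/naemon-livestatus/livestatus.so /var/cache/naemon/live thread_stack_size=1048576
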
pvdputte commented 4 years ago

Never mind, I found the cause of that CPU stall: Veeam snapshotting. It froze the naemon VM for too long, after which naemon probably wanted to schedule too much too quickly, causing a load spike on the mod-gearman workers. I'm still unsure why that would go on for hours, though, but I've seen many cases where gearmand or the workers act odd after communication issues.

Fixed by deleting a regular VM snapshot that caused Veeam to take much longer than normal.

As I'm now in a stable situation again, I'll try 1.2.0 once more :-)

sni commented 4 years ago

That's one of the reasons why we have since rewritten the worker in Go: https://github.com/ConSol/mod-gearman-worker-go/. Give it a try. It uses fewer resources than the C worker and should be more reliable on connection failures.
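
A minimal worker configuration sketch for trying it out; the option names below follow the classic mod-gearman worker.cfg, which the Go worker is advertised to read as a drop-in replacement (all values are placeholders, check the project README for specifics):

# worker.cfg-style settings (placeholder values)
server=gearmand.example.com:4730
encryption=yes
key=should_be_changed
hosts=yes
services=yes
eventhandler=no
min-worker=5
max-worker=50
job_timeout=60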

pvdputte commented 4 years ago

Thanks for the tip, I definitely will!

pvdputte commented 1 year ago

FYI, I'm in the process of upgrading our workers to Debian bookworm, with the Go mod-gearman-worker. During the upgrade, my staging gearman-job-server often got into the bad state where it wouldn't handle connections properly anymore (piling up CLOSE_WAIT connections etc., cf. https://bugs.launchpad.net/gearmand/+bug/1176324/).
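
A quick way to spot that state is to watch the CLOSE_WAIT count on the gearmand listener (port 4730 by default); a small sketch:

# Count connections stuck in CLOSE_WAIT on the gearmand port
ss -tan state close-wait '( sport = :4730 )' | tail -n +2 | wc -l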

I even started running gearmand 0.33 in a container for a short while :slightly_smiling_face: but this didn't help at all.

Digging a bit further, it turned out that the issue was triggered just by shutting down the old VM with the legacy mod-gearman-worker.

I tried to reproduce this with the new Go worker, which seemed to behave much better.

Closing this issue. :+1: