sni / mod_gearman

Distribute Naemon Host/Service Checks & Eventhandler with Gearman Queues. Host/Servicegroups affinity included.
http://www.mod-gearman.org
GNU General Public License v3.0

Remote Memory Leak in rev 1868be43 #167

Closed: r-lindner closed this issue 1 year ago

r-lindner commented 1 year ago

Hello, I upgraded mod-gearman-module from 4.0.2 to 4.0.4 a few days ago and noticed that swap usage on my gearman-job-server node kept growing until the machine went OOM. It grows by around 120 MB per hour in my case. I tested the commits between 4.0.2 and 4.0.3:

[OK ] Naemon 1.3.1 + Mod-Gearman 4.0.2 (18. 18:00 - 19. 10:00)
[ERR] Naemon 1.3.1 + Mod-Gearman 4.0.3 (19. 10:00 - 19. 12:40, +100MB/h)
[OK ] Naemon 1.4.0 + Mod-Gearman 4.0.2 (19. 12:40 - 20. 11:00)
[ERR] Naemon 1.4.0 + Mod-Gearman 4.0.3 (20. 11:00 - 20. 18:00, +1000MB/7h)
[ERR] Naemon 1.4.0 + Mod-Gearman 4.0.2+7 1868be43 (21. 07:30 - 21. 10:30, +100MB/h)
[OK ] Naemon 1.4.0 + Mod-Gearman 4.0.2+4 e3fa5795 (21. 10:30 - 21. 13:30)
[OK ] Naemon 1.4.0 + Mod-Gearman 4.0.2+5 efee02a5 (21. 13:30 - 21. 15:10)
[OK ] Naemon 1.4.0 + Mod-Gearman 4.0.2+6 adc47f0c (21. 15:10 - 21. 18:00)
[ERR] Naemon 1.4.0 + Mod-Gearman 4.0.2+7 1868be43 (21. 18:00 - 22. 03:40)
[OK ] Naemon 1.4.0 + Mod-Gearman 4.0.2+6 adc47f0c (22. 03:40 -

I also tried disabling swap (visible in the first part of the graph below), but then RAM usage grows instead. RAM/swap usage on the Naemon node where mod-gearman-module is installed does not change.

(graph: RAM and swap usage on the gearman-job-server node)

sni commented 1 year ago

I'll have a look.

dlware commented 1 year ago

I'm seeing the same thing. After upgrading to mod_gearman-4.0.4, RAM usage grows until it is exhausted. A restart clears it up, but the growth starts again; rinse and repeat.

RHEL7 server.

sni commented 1 year ago

it's probably this one: https://github.com/naemon/naemon-core/pull/404

sni commented 1 year ago

Could you try the latest nightly Naemon build to see if that helps? I don't see any leaks anymore, regardless of whether mod-gearman is used or not.

r-lindner commented 1 year ago

Sorry, I was out of office for some time... The nightly Naemon from 2022-12-17 had no problems in the first two hours and looked good, but I was wrong: it went OOM again. I had been looking at the memory consumption of the wrong server. :-( I am rolling back the module to adc47f0c again.

lamaral commented 1 year ago

We have also been affected by this after upgrading mod-gearman to 4.0.3. We experienced some OOM kills on our instance, and upon checking, gearman-job-server was the culprit.

I think the issue in https://github.com/naemon/naemon-core/pull/404 is unrelated to this, as the leak is not in Naemon itself, but in the gearman-job-server process.

r-lindner commented 1 year ago

The current mod-gearman and Naemon still have the memory leak :-(

Tested and not working:
- mod-gearman 5.0.1 + Naemon 1.4.0
- mod-gearman 5.0.1 + Naemon 1.4.0-1 (nightly Naemon 2022-12-17)
- mod-gearman 5.0.1 + Naemon 1.4.1

My last working version is still mod-gearman 4.0.2 adc47f0c, no matter which Naemon version I use.

sni commented 1 year ago

Cannot reproduce this so far. Is it the naemon process which is growing?

r-lindner commented 1 year ago

I have gearman-job-server (and nothing else) on a separate server that the other hosts (mod-gearman-worker, pnp-gearman, mod-gearman-module) connect to. As soon as I install a mod-gearman-module newer than 4.0.2 adc47f0 and restart the Naemon process, RAM + swap usage on the gearman-job-server host keeps climbing.

sni commented 1 year ago

and which gearmand version is that?

r-lindner commented 1 year ago

I tried 1.1.19.1+ds-2+b2 (Debian 11) and 0.33-8 (Consol)

lamaral commented 1 year ago

I ran into the same issue with 1.1.18+ds-3+b3 on Debian 10.

sni commented 1 year ago

I've seen machines with high gearmand memory usage, but if I restart the service, memory usage is stable again (at least for as long as I watched), and I still couldn't reproduce this behaviour in a lab. Does the memory usage increase linearly, starting directly after restarting gearmand?

jframeau commented 1 year ago

Last week: (graph: gearmand memory usage over the past week)

We run omd reload from crontab each day at 1 pm. Without the reload, memory usage rises by up to 2 GB per week.

Focus just after 1 pm:

(graph: memory usage in the hours following the 1 pm reload)

So right after 1 pm, and for about two hours, memory usage stays roughly flat.

gearmand 1.1.20, omd 5.10.

jfr

ghost commented 1 year ago

Same here.
Gearmand: v1.1.19.1
Package: 5.00-labs-edition
OS: Debian 11
For me it is 10 GB in two days, and the gearmand service becomes unresponsive until it is restarted (in most cases the restart does not happen cleanly, so I have to kill it forcefully).

sni commented 1 year ago

I see, that's a good point. So I was too impatient... I ran valgrind massif here and got similar results now: (screenshot of the massif profile, 2023-06-19)

sni commented 1 year ago

Indeed, it seems like 1868be43e61fd12ade8221ea8ad19a8df83df742 introduced this issue. I guess it's the call to gearman_client_add_options(client, GEARMAN_CLIENT_NON_BLOCKING|GEARMAN_CLIENT_FREE_TASKS|GEARMAN_CLIENT_UNBUFFERED_RESULT); that makes gearmand misbehave.

Let's see how this can be solved...
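For reference, a minimal sketch of a libgearman client submitting a background job with that option combination. This is not the actual mod-gearman source; the server address, queue name, payload handling and the helper name are placeholders chosen for illustration:

```c
/*
 * Minimal sketch, NOT the actual mod-gearman code: a libgearman client
 * configured with the option combination introduced in 1868be43 before
 * submitting a background job. Server, queue and helper name are placeholders.
 */
#include <libgearman/gearman.h>
#include <string.h>

static int submit_background_job(const char *queue, const char *payload)
{
    gearman_client_st client;
    gearman_job_handle_t handle;

    if (gearman_client_create(&client) == NULL)
        return -1;

    gearman_client_add_server(&client, "localhost", 4730);

    /* the option set suspected of triggering the memory growth in gearmand */
    gearman_client_add_options(&client,
                               GEARMAN_CLIENT_NON_BLOCKING |
                               GEARMAN_CLIENT_FREE_TASKS |
                               GEARMAN_CLIENT_UNBUFFERED_RESULT);

    /* in non-blocking mode this call may return GEARMAN_IO_WAIT and would
     * need to be retried; a real submission path has to handle that */
    gearman_return_t rc = gearman_client_do_background(
        &client, queue, NULL, payload, strlen(payload), handle);

    gearman_client_free(&client);
    return (rc == GEARMAN_SUCCESS) ? 0 : -1;
}
```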

sni commented 1 year ago

I switched back to blocking I/O. This seems to fix the memory issue in gearmand. I'll run some tests to see if it has any performance impact. So far it looks promising.
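
A minimal sketch of what the switch back to blocking I/O could look like on the client side; this is an assumption about the fix rather than a copy of the actual commit, and the helper name is made up. The key point is that GEARMAN_CLIENT_NON_BLOCKING is no longer set:

```c
#include <libgearman/gearman.h>

/*
 * Hypothetical helper, not the actual mod-gearman change: reconfigure an
 * existing client for blocking submissions by dropping the non-blocking flag.
 * Whether the other flags were kept is an assumption.
 */
static void use_blocking_client(gearman_client_st *client)
{
    /* back to blocking I/O: submissions wait until the job is queued */
    gearman_client_remove_options(client, GEARMAN_CLIENT_NON_BLOCKING);

    /* still free finished tasks so the client side does not accumulate them */
    gearman_client_add_options(client, GEARMAN_CLIENT_FREE_TASKS);
}
```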