sni / mod_gearman

Distribute Naemon Host/Service Checks & Eventhandler with Gearman Queues. Host/Servicegroups affinity included.
http://www.mod-gearman.org
GNU General Public License v3.0
122 stars 42 forks source link

odd issue with corrupted jobs #67

Closed jmcook closed 9 years ago

jmcook commented 9 years ago

Sven,

We're noticing this on occasion on our gearman workers, and noticing a small amount of services aren't updating as we expect.. It sort of appears Naemon believes they've run, but their last_update never gets updated, and no result ever appears to be received by Naemon.. This is what we're seeing in the logs:

Sep 30 19:06:08 naemon-worker2 mod_gearman_worker: [12918][TRACE] set_state(0) Sep 30 19:06:08 naemon-worker2 mod_gearman_worker: [12918][TRACE] get_job() Sep 30 19:06:08 naemon-worker2 mod_gearman_worker: [12918][TRACE] got new job H:naemon-server1:817921349 Sep 30 19:06:08 naemon-worker2 mod_gearman_worker: [12918][TRACE] 5 +++>#012check#012<+++ Sep 30 19:06:08 naemon-worker2 mod_gearman_worker: [12918][TRACE] 16 --->#012üâûØ<82>ŤØ<8b>Ý<97>èM#007Ûê#012<--- Sep 30 19:06:08 naemon-worker2 mod_gearman_worker: [12918][ERROR] discarded invalid job (H:naemon-server1:817921349), check your encryption settings Sep 30 19:06:08 naemon-worker2 mod_gearman_worker: [12918][TRACE] set_state(1)

It appears like an empty/badly crypted job - but I can't verify it as I can't find the matching job id in the NEB debug log... it appears to not capture it when it submits the job to gearman. This does happen on all my workers, but only in something like 1 in 250 jobs. The others are processed as expected.

It'd odd, because I know the encryption settings are correct - on the server and the workers.

Any ideas where I can start looking?

sni commented 9 years ago

The jobs seems way to small for a valid job and it seems to be not encrypted correctly. Since you know which host/service is not updated correctly, you might find the corresponding entry in the NEB log for further details, maybe not by the job id, but by the host/service name.

jmcook commented 9 years ago

oh, I'm sure I can trace down which service is - as their last check time data doesn't get updated. It's like there's a few services which just don't get jobs inserted correctly, then either on the next run or the run after that they'll go through and all is well.

jmcook commented 9 years ago

So I guess my point is, at some arbitrary point the job will get added to gearman incorrectly it seems. Sometime later (perhaps on the next run, perhaps several runs later), the job will go into the queue correctly and be executed, but one or more polling intervals will have been missed.

jmcook commented 9 years ago

Also, I can reproduce this with or without encryption

jmcook commented 9 years ago

Here's some more info.. Even when Naemon shouldn't be sending anything to mod_gearman (everything is excluded using localhostgroup), I'm still getting a single host and service check every minute or so.. It doesn't really seem to be doing anything, as you can see from the gearmand debug log..

Oct 1 14:58:06 naemon-test1 gearmand[12005]: NOTICE 2014-09-01 14:58:06.818946 [ proc ] accepted,host,check,0 -> libgearman-server/server.c:274 Oct 1 14:58:07 naemon-test1 gearmand[12005]: NOTICE 2014-09-01 14:58:07.796995 [ proc ] accepted,service,check,0 -> libgearman-server/server.c:274 Oct 1 14:59:06 naemon-test1 gearmand[12005]: NOTICE 2014-09-01 14:59:06.903890 [ proc ] accepted,host,check,0 -> libgearman-server/server.c:274 Oct 1 14:59:07 naemon-test1 gearmand[12005]: NOTICE 2014-09-01 14:59:07.878163 [ proc ] accepted,service,check,0 -> libgearman-server/server.c:274 Oct 1 15:00:06 naemon-test1 gearmand[12005]: NOTICE 2014-09-01 15:00:06.975419 [ proc ] accepted,host,check,0 -> libgearman-server/server.c:274 Oct 1 15:00:07 naemon-test1 gearmand[12005]: NOTICE 2014-09-01 15:00:07.960394 [ proc ] accepted,service,check,0 -> libgearman-server/server.c:274 Oct 1 15:01:06 naemon-test1 gearmand[12005]: NOTICE 2014-09-01 15:01:06.132345 [ proc ] accepted,host,check,0 -> libgearman-server/server.c:274 Oct 1 15:01:07 naemon-test1 gearmand[12005]: NOTICE 2014-09-01 15:01:07.032506 [ proc ] accepted,service,check,0 -> libgearman-server/server.c:274

Notice that the UID for the checks appear to be 'check' rather than something like:

Oct 1 15:01:24 naemon-test2 gearmand[12005]: NOTICE 2014-09-01 15:01:24.198577 [ proc ] accepted,host,216d29a0-32a4-4a12-8e6f-e999b70aca7c,0 -> libgearman-server/server.c:274 Oct 1 15:01:24 naemon-test2 gearmand[12005]: NOTICE 2014-09-01 15:01:24.240088 [ proc ] accepted,naemon_results,221a5493-7e0c-4bb0-91a7-5092a727276f,0 -> libgearman-server/server.c:274

jmcook commented 9 years ago

Okay may need to disregard this, it looks like maybe this is coming from check_gearman - will let you know!

jmcook commented 9 years ago

Confirmed, it's just a misconfiguration of check_gearman - it was driving me crazy!

sni commented 9 years ago

great. thanks for the update