timja / jenkins-gh-issues-poc-06-18

0 stars 0 forks source link

[JENKINS-68707] Cron Triggers Being Missed During Quiet-down Mode #6055

Open timja opened 2 years ago

timja commented 2 years ago

We're using Jenkins as something like a central scheduler. On a busy night, it might queue and execute around 800 jobs per hour. During maintenance, we put Jenkins into quiet-down and let the jobs queue up while we complete the maintenance.

Unfortunately, it seems like we're starting to miss dozens of jobs during those quiet-down periods. Some jobs get queued up and placed in queue.xml, but an unknown number of other jobs never make it into the queue and are missed entirely. In the jenkins.log, we see it littered with Young GCs and messages like this:

2022-05-13T19:39:05.214-0500: [GC (Allocation Failure) [PSYoungGen: 8368320K->15447K(8371200K)] 24305413K->15952724K(31439872K), 0.0728277 secs] [Times: user=0.51 sys=1.16, real=0.07 secs] 
2022-05-13T19:39:35.244-0500: [GC (Allocation Failure) [PSYoungGen: 8369239K->14256K(8371200K)] 24306516K->15951893K(31439872K), 0.0919080 secs] [Times: user=0.70 sys=1.43, real=0.09 secs] 
2022-05-13T19:40:04.589-0500: [GC (Allocation Failure) [PSYoungGen: 8368048K->14064K(8371200K)] 24305685K->15952197K(31439872K), 0.1147281 secs] [Times: user=1.54 sys=1.07, real=0.12 secs] 
2022-05-14 00:40:27.195+0000 [id=123]    WARNING    hudson.triggers.Trigger#checkTriggers: Trigger hudson.triggers.TimerTrigger.run() triggered by hudson.model.FreeStyleProject@560953a3[XXX-REDACTED-XXX] spent too much time (31 sec) in its execution, other timers can be affected
2022-05-13T19:40:33.942-0500: [GC (Allocation Failure) [PSYoungGen: 8367856K->13920K(8371712K)] 24305989K->15952533K(31440384K), 0.0810411 secs] [Times: user=0.49 sys=1.36, real=0.08 secs] 
2022-05-13T19:41:03.582-0500: [GC (Allocation Failure) [PSYoungGen: 8368224K->14048K(8371200K)] 24306837K->15952837K(31439872K), 0.0920661 secs] [Times: user=0.45 sys=1.52, real=0.10 secs] 
2022-05-14 00:41:22.769+0000 [id=123]    WARNING    hudson.triggers.Trigger#checkTriggers: Trigger hudson.triggers.TimerTrigger.run() triggered by hudson.model.FreeStyleProject@7a59040a[XXX-REDACTED-XXX] spent too much time (33 sec) in its execution, other timers can be affected
2022-05-13T19:41:32.099-0500: [GC (Allocation Failure) [PSYoungGen: 8368352K->14000K(8372224K)] 24307141K->15952989K(31440896K), 0.0732050 secs] [Times: user=0.47 sys=1.19, real=0.08 secs] 
2022-05-14 00:41:56.047+0000 [id=123]    WARNING    hudson.triggers.Trigger#checkTriggers: Trigger hudson.triggers.TimerTrigger.run() triggered by hudson.model.FreeStyleProject@6651d4a9[XXX-REDACTED-XXX] spent too much time (33 sec) in its execution, other timers can be affected
2022-05-13T19:42:02.710-0500: [GC (Allocation Failure) [PSYoungGen: 8369328K->15110K(8371712K)] 24308317K->15954251K(31440384K), 0.0830675 secs] [Times: user=0.61 sys=1.30, real=0.09 secs] 
2022-05-13T19:42:31.207-0500: [GC (Allocation Failure) [PSYoungGen: 8370438K->14624K(8371712K)] 24309579K->15953853K(31440384K), 0.0494647 secs] [Times: user=0.30 sys=0.80, real=0.05 secs] 
2022-05-13T19:43:00.400-0500: [GC (Allocation Failure) [PSYoungGen: 8369440K->14864K(8371712K)] 24308669K->15954245K(31440384K), 0.0970107 secs] [Times: user=0.50 sys=1.61, real=0.10 secs] 
2022-05-13T19:43:30.702-0500: [GC (Allocation Failure) [PSYoungGen: 8369680K->15424K(8371712K)] 24309061K->15954949K(31440384K), 0.1055369 secs] [Times: user=0.48 sys=1.96, real=0.10 secs] 
2022-05-13T19:44:02.834-0500: [GC (Allocation Failure) [PSYoungGen: 8370240K->14368K(8371712K)] 24309765K->15954333K(31440384K), 0.0893574 secs] [Times: user=0.47 sys=1.55, real=0.09 secs] 
2022-05-13T19:44:34.046-0500: [GC (Allocation Failure) [PSYoungGen: 8369184K->14592K(8371712K)] 24309149K->15954790K(31440384K), 0.0817069 secs] [Times: user=0.58 sys=1.29, real=0.08 secs] 
2022-05-13T19:45:02.639-0500: [GC (Allocation Failure) [PSYoungGen: 8369408K->12896K(8371712K)] 24309606K->15954397K(31440384K), 0.0787259 secs] [Times: user=0.48 sys=1.28, real=0.08 secs] 
2022-05-13T19:45:32.443-0500: [GC (Allocation Failure) [PSYoungGen: 8367712K->13334K(8371712K)] 24309213K->15955131K(31440384K), 0.1156263 secs] [Times: user=1.06 sys=1.61, real=0.11 secs] 

This started happening after only 40 minutes or so in quiet-down and about 600 jobs in the queue.

I wish I had more logging for you on this, but it's the lack of any message in this situation that is the tricky part – we simply miss cron timers and never receive job results.

Any idea on what might be going on here?


Originally reported by esnewmanium, imported from: Cron Triggers Being Missed During Quiet-down Mode
  • status: Open
  • priority: Major
  • resolution: Unresolved
  • imported: 2022/01/10
timja commented 2 years ago

basil:

Possibly relates to JENKINS-68319. There is something nasty going on with queue persistence losing entries in a very small percentage of test runs, unreproducible on my system but occasionally cropping up on automated builds. Whatever is eating up the queue entries there is probably eating up your queue entries, too. Unfortunately I am not actively working on this, but if you want to try your hand at a fix you could try running the test case from JENKINS-68319 in a loop, and if you can get it to reproduce easily try to figure out the cause.