rhettg / Tron

Next generation batch process scheduling and management

Strange behavior for stuck jobs after reconfig #72

Open rhettg opened 12 years ago

rhettg commented 12 years ago

More fun with reconfig issues.

On Friday at 18:06, a configuration change moved two interval-scheduled jobs between nodes.

The change:

commit 79aa7ffd4dfa0a95335de09748a1625719acd8c7
Author: bryce 
Date:   Fri Jul 1 18:06:37 2011 -0700

    Move authentify and biz_self_serve_setup to batch4

diff --git a/tron_config.yaml b/tron_config.yaml
index 5c8dab7..1d291eb 100644
--- a/tron_config.yaml
+++ b/tron_config.yaml
@@ -570,7 +570,7 @@ jobs:
     # Other non-gearman worker services
     - !Job
         name: "authentify_session"
-        node: *batch3
+        node: *batch4
         run_limit: 5
         schedule: !IntervalScheduler
             interval: "2 mins"
@@ -581,7 +581,7 @@ jobs:

     - !Job
         name: "biz_self_serve_setup"
-        node: *batch3
+        node: *batch4
         run_limit: 5
         schedule: !IntervalScheduler
             interval: "2 mins"

Logs:

2011-07-01 18:06:36,825 tron.www INFO Handling reconfig request
2011-07-01 18:06:36,826 tron.mcp INFO Loading configuration from /nail/tron/tron_config.yaml
2011-07-01 18:06:37,563 tron.mcp INFO re-adding job authentify_session
2011-07-01 18:06:37,564 tron.mcp INFO re-adding job biz_self_serve_setup
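A plausible reading of the symptom, sketched below purely as an illustration (this is not Tron's actual code, and the class/attribute names are invented): the reconfig "re-adds" the job as a fresh object on the new node, but whatever watcher normally dequeues queued runs is still attached to the old object, so a manually built run sits in QUE forever until something re-attaches the watcher.

```python
# Illustrative sketch only -- hypothetical names, not Tron internals.
class JobRun:
    def __init__(self, run_id):
        self.run_id = run_id
        self.state = "SCHE"

class Job:
    def __init__(self, name, node):
        self.name = name
        self.node = node
        self.runs = []
        self.watcher_attached = False  # set when the scheduler hooks this job up

    def build_run(self, run_id):
        run = JobRun(run_id)
        run.state = "QUE"              # queued until the watcher starts it
        self.runs.append(run)
        if self.watcher_attached:
            self.start_next()          # normally fires right away
        return run

    def start_next(self):
        for run in self.runs:
            if run.state == "QUE":
                run.state = "RUNN"

def reconfig(old_job, new_node):
    # Hypothetical bug: a fresh Job replaces the old one, but the
    # scheduler watcher is never re-attached to the new instance.
    return Job(old_job.name, new_node)

old = Job("biz_self_serve_setup", "batch3")
old.watcher_attached = True

new = reconfig(old, "batch4")     # "re-adding job biz_self_serve_setup"
run = new.build_run(148981)       # the manual 'start' request
print(run.state)                  # stuck in QUE: nothing dequeues it

new.watcher_attached = True       # disable/enable re-attaches the watcher
new.start_next()
print(run.state)                  # now RUNN
```

Under this model, the enable/disable workaround works because it forces the job to be torn down and re-registered, restoring the missing hookup.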

A minute later, it looks like Bryce manually issued a start:

2011-07-01 18:07:10,790 tron.www INFO Handling 'start' request for job run biz_self_serve_setup
2011-07-01 18:07:10,790 tron.job INFO Built run biz_self_serve_setup.148981

The job was then stuck in the "UNKNOWN" state for the rest of the weekend. tronview looked like:

Run History: (5 total)
Run ID  State  Node                 Scheduled Time      
.148981 QUE    batch4               2011-07-01 18:07:10 
        Start: -  End: -  (-)
.148979 SUCC   batch3               2011-07-01 18:05:41 
        Start: 2011-07-01 18:05:41  End: 2011-07-01 18:05:45  (0:00:04)

Disabling and re-enabling the job cleared the issue, and the job started normally.
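For reference, the workaround as it would look from the command line. This is a hedged sketch: the exact `tronctl`/`tronview` syntax for this era of Tron is an assumption and may differ.

```shell
# Toggle the stuck job so it gets torn down and re-registered
# (command syntax assumed; check `tronctl --help` on your install).
tronctl disable biz_self_serve_setup
tronctl enable biz_self_serve_setup

# Confirm the queued run is no longer stuck in QUE.
tronview biz_self_serve_setup
```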