Open aphyr opened 9 years ago
Hi - you're referring to a statement that doesn't represent the design (it wasn't expressed carefully enough). Please disregard it and refer to the clarification in the thread. Make sense?
To help us get a better statement on the behaviour we'll check out the logs and see if Chronos is taking both approaches here (self-terminating in some cases and retrying in others). Thanks!
Please disregard it and refer to the clarification in the thread. Make sense?
No, not really. If you're trying to "take a highly conservative approach, make the fewest assumptions and exit," and "avoid a class of faults by dropping all possibly-outdated state," then you should, you know, actually exit reliably, instead of crashing some but not all of the time. Choosing both failure modes is silly.
Yep I get you - we'll check it out as I described.
I confirm that in the test case, the Chronos leader running on n3 detects connection errors with ZK and, instead of crashing right away, goes into a loop for a little less than 3 minutes trying to reconnect to ZK:
Aug 11 22:13:12 n3 chronos[47085]: [2015-08-11 22:13:12,909] INFO Calling next for stream: R23/2015-08-12T05:13:13.356Z/PT59S, jobname: 8 (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:480)
Aug 11 22:13:12 n3 chronos[47085]: [2015-08-11 22:13:12,909] INFO Task ready for scheduling: 2015-08-12T05:13:13.356Z (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:513)
Aug 11 22:13:12 n3 chronos[47085]: [2015-08-11 22:13:12,910] INFO Scheduling:8 (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:635)
Aug 11 22:13:12 n3 chronos[47085]: [2015-08-11 22:13:12,910] INFO Scheduling task 'ct:1439356393356:0:8:' with delay: '446' (org.apache.mesos.chronos.scheduler.jobs.TaskManager:183)
Aug 11 22:13:13 n3 chronos[47085]: 2015-08-11 22:13:13,425:47085(0x7f0bea7fc700):ZOO_ERROR@handle_socket_error_msg@1643: Socket [192.168.122.11:2181] zk retcode=-7, errno=110(Connection timed out): connection to 192.168.122.11:2181 timed out (exceeded timeout by 4ms)
Aug 11 22:13:13 n3 chronos[47085]: I0811 22:13:13.425292 47194 group.cpp:418] Lost connection to ZooKeeper, attempting to reconnect ...
Aug 11 22:13:13 n3 chronos[47085]: 2015-08-11 22:13:13,908:47085(0x7f0beb7fe700):ZOO_ERROR@handle_socket_error_msg@1643: Socket [192.168.122.11:2181] zk retcode=-7, errno=110(Connection timed out): connection to 192.168.122.11:2181 timed out (exceeded timeout by 1ms)
Aug 11 22:13:20 n3 chronos[47085]: 2015-08-11 22:13:20,094:47085(0x7f0bea7fc700):ZOO_ERROR@handle_socket_error_msg@1643: Socket [192.168.122.15:2181] zk retcode=-7, errno=110(Connection timed out): connection to 192.168.122.15:2181 timed out (exceeded timeout by 4ms)
Aug 11 22:13:20 n3 chronos[47085]: 2015-08-11 22:13:20,095:47085(0x7f0bea7fc700):ZOO_INFO@check_events@1703: initiated connection to server [192.168.122.12:2181]
Aug 11 22:13:20 n3 chronos[47085]: 2015-08-11 22:13:20,096:47085(0x7f0bea7fc700):ZOO_ERROR@handle_socket_error_msg@1721: Socket [192.168.122.12:2181] zk retcode=-4, errno=112(Host is down): failed while receiving a server response
Aug 11 22:13:20 n3 chronos[47085]: 2015-08-11 22:13:20,096:47085(0x7f0bea7fc700):ZOO_INFO@check_events@1703: initiated connection to server [192.168.122.13:2181]
Aug 11 22:13:20 n3 chronos[47085]: 2015-08-11 22:13:20,098:47085(0x7f0bea7fc700):ZOO_ERROR@handle_socket_error_msg@1721: Socket [192.168.122.13:2181] zk retcode=-4, errno=112(Host is down): failed while receiving a server response
Aug 11 22:13:20 n3 chronos[47085]: 2015-08-11 22:13:20,577:47085(0x7f0beb7fe700):ZOO_ERROR@handle_socket_error_msg@1643: Socket [192.168.122.14:2181] zk retcode=-7, errno=110(Connection timed out): connection to 192.168.122.14:2181 timed out (exceeded timeout by 3ms)
Aug 11 22:13:23 n3 chronos[47085]: W0811 22:13:23.435765 47177 group.cpp:456] Timed out waiting to reconnect to ZooKeeper. Forcing ZooKeeper session (sessionId=54f205059630005) expiration
Aug 11 22:13:23 n3 chronos[47085]: I0811 22:13:23.436470 47177 group.cpp:472] ZooKeeper session expired
Aug 11 22:13:23 n3 chronos[47085]: I0811 22:13:23.436594 47173 detector.cpp:138] Detected a new leader: None
Aug 11 22:13:23 n3 chronos[47085]: 2015-08-11 22:13:23,436:47085(0x7f0bfb7fe700):ZOO_INFO@zookeeper_close@2522: Freeing zookeeper resources for sessionId=0x54f205059630005
Aug 11 22:13:23 n3 chronos[47085]:
Aug 11 22:13:23 n3 chronos[47085]: 2015-08-11 22:13:23,437:47085(0x7f0c22f8f700):ZOO_INFO@log_env@712: Client environment:zookeeper.version=zookeeper C client 3.4.5
Aug 11 22:13:23 n3 chronos[47085]: 2015-08-11 22:13:23,437:47085(0x7f0c22f8f700):ZOO_INFO@log_env@716: Client environment:host.name=n3
Aug 11 22:13:23 n3 chronos[47085]: 2015-08-11 22:13:23,437:47085(0x7f0c22f8f700):ZOO_INFO@log_env@723: Client environment:os.name=Linux
Aug 11 22:13:23 n3 chronos[47085]: 2015-08-11 22:13:23,437:47085(0x7f0c22f8f700):ZOO_INFO@log_env@724: Client environment:os.arch=3.16.0-4-amd64
Aug 11 22:13:23 n3 chronos[47085]: 2015-08-11 22:13:23,437:47085(0x7f0c22f8f700):ZOO_INFO@log_env@725: Client environment:os.version=#1 SMP Debian 3.16.7-ckt11-1+deb8u2 (2015-07-17)
Aug 11 22:13:23 n3 chronos[47085]: 2015-08-11 22:13:23,437:47085(0x7f0c22f8f700):ZOO_INFO@log_env@733: Client environment:user.name=root
Aug 11 22:13:23 n3 chronos[47085]: 2015-08-11 22:13:23,437:47085(0x7f0c22f8f700):ZOO_INFO@log_env@741: Client environment:user.home=/root
Aug 11 22:13:23 n3 chronos[47085]: 2015-08-11 22:13:23,437:47085(0x7f0c22f8f700):ZOO_INFO@log_env@753: Client environment:user.dir=/
Aug 11 22:13:23 n3 chronos[47085]: 2015-08-11 22:13:23,437:47085(0x7f0c22f8f700):ZOO_INFO@zookeeper_init@786: Initiating client connection, host=n1:2181,n2:2181,n3:2181,n4:2181,n5:2181 sessionTimeout=10000 watcher=0x7f0c30cfb670 sessionId=0 sessionPasswd=<null> context=0x7f0ab4001410 flags=0
Aug 11 22:13:23 n3 chronos[47085]: [2015-08-11 22:13:23,437] WARN Disconnected (org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework:59)
Aug 11 22:13:23 n3 chronos[47085]: I0811 22:13:23.437671 47175 sched.cpp:276] No master detected
Aug 11 22:13:27 n3 chronos[47085]: 2015-08-11 22:13:27,247:47085(0x7f0beb7fe700):ZOO_ERROR@handle_socket_error_msg@1643: Socket [192.168.122.15:2181] zk retcode=-7, errno=110(Connection timed out): connection to 192.168.122.15:2181 timed out (exceeded timeout by 4ms)
Aug 11 22:13:27 n3 chronos[47085]: 2015-08-11 22:13:27,247:47085(0x7f0beb7fe700):ZOO_INFO@check_events@1703: initiated connection to server [192.168.122.13:2181]
Aug 11 22:13:27 n3 chronos[47085]: 2015-08-11 22:13:27,248:47085(0x7f0beb7fe700):ZOO_ERROR@handle_socket_error_msg@1721: Socket [192.168.122.13:2181] zk retcode=-4, errno=112(Host is down): failed while receiving a server response
Aug 11 22:13:27 n3 chronos[47085]: 2015-08-11 22:13:27,248:47085(0x7f0beb7fe700):ZOO_INFO@check_events@1703: initiated connection to server [192.168.122.12:2181]
Aug 11 22:13:27 n3 chronos[47085]: 2015-08-11 22:13:27,250:47085(0x7f0beb7fe700):ZOO_ERROR@handle_socket_error_msg@1721: Socket [192.168.122.12:2181] zk retcode=-4, errno=112(Host is down): failed while receiving a server response
Aug 11 22:13:30 n3 chronos[47085]: 2015-08-11 22:13:30,243:47085(0x7f0be9ffb700):ZOO_ERROR@handle_socket_error_msg@1643: Socket [192.168.122.15:2181] zk retcode=-7, errno=110(Connection timed out): connection to 192.168.122.15:2181 timed out (exceeded timeout by 4ms)
Aug 11 22:13:30 n3 chronos[47085]: 2015-08-11 22:13:30,243:47085(0x7f0be9ffb700):ZOO_INFO@check_events@1703: initiated connection to server [192.168.122.13:2181]
Aug 11 22:13:30 n3 chronos[47085]: 2015-08-11 22:13:30,244:47085(0x7f0be9ffb700):ZOO_ERROR@handle_socket_error_msg@1721: Socket [192.168.122.13:2181] zk retcode=-4, errno=112(Host is down): failed while receiving a server response
Aug 11 22:13:30 n3 chronos[47085]: 2015-08-11 22:13:30,245:47085(0x7f0be9ffb700):ZOO_INFO@check_events@1703: initiated connection to server [192.168.122.12:2181]
Aug 11 22:13:30 n3 chronos[47085]: 2015-08-11 22:13:30,246:47085(0x7f0be9ffb700):ZOO_ERROR@handle_socket_error_msg@1721: Socket [192.168.122.12:2181] zk retcode=-4, errno=112(Host is down): failed while receiving a server response
Aug 11 22:13:30 n3 chronos[47085]: [2015-08-11 22:13:30,721] INFO Client session timed out, have not heard from server in 26680ms for sessionid 0x54f205059630002, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn:1096)
Aug 11 22:13:30 n3 chronos[47085]: [2015-08-11 22:13:30,822] INFO State change: SUSPENDED (org.apache.curator.framework.state.ConnectionStateManager:228)
Aug 11 22:13:30 n3 chronos[47085]: I0811 22:13:30.825243 47219 sched.cpp:1591] Asked to stop the driver
Aug 11 22:13:30 n3 chronos[47085]: I0811 22:13:30.825429 47175 sched.cpp:835] Stopping framework '20150811-220940-192587968-5050-663-0000'
Aug 11 22:13:30 n3 chronos[47085]: [2015-08-11 22:13:30,825] INFO Defeated. Not the current leader. (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:583)
Aug 11 22:13:31 n3 chronos[47085]: [2015-08-11 22:13:31,610] INFO Opening socket connection to server n5/192.168.122.15:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn:975)
Aug 11 22:13:36 n3 chronos[47085]: 2015-08-11 22:13:36,916:47085(0x7f0be9ffb700):ZOO_ERROR@handle_socket_error_msg@1643: Socket [192.168.122.11:2181] zk retcode=-7, errno=110(Connection timed out): connection to 192.168.122.11:2181 timed out (exceeded timeout by 3ms)
Aug 11 22:13:37 n3 chronos[47085]: 2015-08-11 22:13:37,256:47085(0x7f0beb7fe700):ZOO_ERROR@handle_socket_error_msg@1643: Socket [192.168.122.11:2181] zk retcode=-7, errno=110(Connection timed out): connection to 192.168.122.11:2181 timed out (exceeded timeout by 4ms)
Aug 11 22:13:38 n3 chronos[47085]: [2015-08-11 22:13:38,829] INFO Client session timed out, have not heard from server in 8007ms for sessionid 0x54f205059630002, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn:1096)
Aug 11 22:13:40 n3 chronos[47085]: [2015-08-11 22:13:40,162] INFO Opening socket connection to server n2/192.168.122.12:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn:975)
Aug 11 22:13:40 n3 chronos[47085]: [2015-08-11 22:13:40,162] INFO Socket connection established to n2/192.168.122.12:2181, initiating session (org.apache.zookeeper.ClientCnxn:852)
Aug 11 22:13:40 n3 chronos[47085]: [2015-08-11 22:13:40,164] INFO Unable to read additional data from server sessionid 0x54f205059630002, likely server has closed socket, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn:1098)
Aug 11 22:13:40 n3 chronos[47085]: [2015-08-11 22:13:40,825] INFO Opening socket connection to server n4/192.168.122.14:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn:975)
Aug 11 22:13:40 n3 chronos[47085]: [2015-08-11 22:13:40,939] ERROR Connection timed out for connection string (n1:2181,n2:2181,n3:2181,n4:2181,n5:2181) and timeout (10000) / elapsed (10106) (org.apache.curator.ConnectionState:201)
Aug 11 22:13:40 n3 chronos[47085]: org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
Aug 11 22:13:40 n3 chronos[47085]: #011at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:198)
Aug 11 22:13:40 n3 chronos[47085]: #011at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:88)
Aug 11 22:13:40 n3 chronos[47085]: #011at org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:115)
Aug 11 22:13:40 n3 chronos[47085]: #011at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:793)
Aug 11 22:13:40 n3 chronos[47085]: #011at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:779)
Aug 11 22:13:40 n3 chronos[47085]: #011at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$400(CuratorFrameworkImpl.java:58)
Aug 11 22:13:40 n3 chronos[47085]: #011at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:265)
Aug 11 22:13:40 n3 chronos[47085]: #011at java.util.concurrent.FutureTask.run(FutureTask.java:262)
Aug 11 22:13:40 n3 chronos[47085]: #011at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
Aug 11 22:13:40 n3 chronos[47085]: #011at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
Aug 11 22:13:40 n3 chronos[47085]: #011at java.lang.Thread.run(Thread.java:745)
[...]
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,388] INFO State T_ct:1439356393356:0:8: does not exist yet. Adding to state (org.apache.mesos.chronos.scheduler.state.MesosStatePersistenceStore:77)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,418] INFO State update successful: true (org.apache.mesos.chronos.scheduler.state.MesosStatePersistenceStore:92)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,419] INFO Saving updated job:ScheduleBasedJob(R22/2015-08-12T05:14:12.356Z/PT59S,8,MEW=$(mktemp -p /tmp/chronos-test/); echo "8" >> $MEW; date -u -Ins >> $MEW; sleep 5; date -u -Ins >> $MEW;,PT22S,2,0,,,2,jepsen@jepsen.io,,,2015-08-12T05:12:20.100Z,,false,0.001,1.0,1.0,false,0,ListBuffer(),false,root,null,UTC,ListBuffer(),true,ListBuffer(),false,false,ListBuffer()) (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:666)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,419] INFO Persisting job '8' with data 'ScheduleBasedJob(R22/2015-08-12T05:14:12.356Z/PT59S,8,MEW=$(mktemp -p /tmp/chronos-test/); echo "8" >> $MEW; date -u -Ins >> $MEW; sleep 5; date -u -Ins >> $MEW;,PT22S,2,0,,,2,jepsen@jepsen.io,,,2015-08-12T05:12:20.100Z,,false,0.001,1.0,1.0,false,0,ListBuffer(),false,root,null,UTC,ListBuffer(),true,ListBuffer(),false,false,ListBuffer())' (org.apache.mesos.chronos.scheduler.state.MesosStatePersistenceStore:63)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,422] INFO Key for state exists already: J_8 (org.apache.mesos.chronos.scheduler.state.MesosStatePersistenceStore:79)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,437] INFO State update successful: true (org.apache.mesos.chronos.scheduler.state.MesosStatePersistenceStore:92)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,437] INFO Calling next for stream: R22/2015-08-12T05:14:12.356Z/PT59S, jobname: 8 (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:480)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,438] INFO Calling next for stream: R23/2015-08-12T05:13:48.719Z/PT56S, jobname: 7 (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:480)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,438] WARN ----------------------------------- (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:489)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,438] WARN Warning, no job found in graph for:7 (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:490)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,438] WARN ----------------------------------- (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:491)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,438] INFO Calling next for stream: R1/2015-08-12T05:13:48.447Z/PT42S, jobname: 6 (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:480)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,438] WARN ----------------------------------- (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:489)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,439] WARN Warning, no job found in graph for:6 (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:490)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,439] WARN ----------------------------------- (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:491)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,439] INFO Calling next for stream: R13/2015-08-12T05:13:28.555Z/PT38S, jobname: 5 (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:480)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,439] WARN ----------------------------------- (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:489)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,439] WARN Warning, no job found in graph for:5 (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:490)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,439] WARN ----------------------------------- (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:491)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,440] INFO Calling next for stream: R20/2015-08-12T05:13:24.216Z/PT30S, jobname: 4 (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:480)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,440] WARN ----------------------------------- (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:489)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,440] WARN Warning, no job found in graph for:4 (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:490)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,440] WARN ----------------------------------- (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:491)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,440] INFO Calling next for stream: R59/2015-08-12T05:13:45.689Z/PT60S, jobname: 3 (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:480)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,441] WARN ----------------------------------- (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:489)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,441] WARN Warning, no job found in graph for:3 (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:490)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,441] WARN ----------------------------------- (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:491)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,441] INFO Calling next for stream: R1/2015-08-12T05:13:41.396Z/PT47S, jobname: 2 (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:480)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,441] WARN ----------------------------------- (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:489)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,441] WARN Warning, no job found in graph for:2 (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:490)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,441] WARN ----------------------------------- (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:491)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,442] INFO Calling next for stream: R2/2015-08-12T05:13:15.311Z/PT31S, jobname: 1 (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:480)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,442] WARN ----------------------------------- (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:489)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,442] WARN Warning, no job found in graph for:1 (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:490)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,442] WARN ----------------------------------- (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:491)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,865] INFO Triggering: '8' (org.apache.mesos.chronos.scheduler.jobs.TaskManager:20)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,865] INFO Removing task mapping (org.apache.mesos.chronos.scheduler.jobs.TaskManager:118)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,865] INFO JobNotificationObserver does not handle JobQueued(ScheduleBasedJob(R22/2015-08-12T05:14:12.356Z/PT59S,8,MEW=$(mktemp -p /tmp/chronos-test/); echo "8" >> $MEW; date -u -Ins >> $MEW; sleep 5; date -u -Ins >> $MEW;,PT22S,2,0,,,2,jepsen@jepsen.io,,,2015-08-12T05:12:20.100Z,,false,0.001,1.0,1.0,false,0,ListBuffer(),false,root,null,UTC,ListBuffer(),true,ListBuffer(),false,false,ListBuffer()),ct:1439356393356:0:8:,0) (org.apache.mesos.chronos.scheduler.jobs.JobsObserver$:27)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,866] INFO Updating state for job (8) to queued (org.apache.mesos.chronos.scheduler.jobs.stats.JobStats:62)
Aug 11 22:17:31 n3 chronos[47085]: [2015-08-11 22:17:31,276] INFO Proxying request to n5:4400 . (org.apache.mesos.chronos.scheduler.api.RedirectFilter:37)
Aug 11 22:17:31 n3 chronos[47085]: [2015-08-11 22:17:31,353] INFO 192.168.122.1 - - [12/Aug/2015:05:17:31 +0000] "POST /scheduler/iso8601 HTTP/1.1" 204 0 "-" "Apache-HttpClient/4.5 (Java/1.8.0_40)" (mesosphere.chaos.http.ChaosRequestLog:15)
Aug 11 22:17:57 n3 chronos[47085]: [2015-08-11 22:17:57,451] INFO Proxying request to n5:4400 . (org.apache.mesos.chronos.scheduler.api.RedirectFilter:37)
Aug 11 22:17:57 n3 chronos[47085]: [2015-08-11 22:17:57,502] INFO 192.168.122.1 - - [12/Aug/2015:05:17:57 +0000] "POST /scheduler/iso8601 HTTP/1.1" 204 0 "-" "Apache-HttpClient/4.5 (Java/1.8.0_40)" (mesosphere.chaos.http.ChaosRequestLog:15)
Aug 11 22:18:14 n3 chronos[47085]: [2015-08-11 22:18:14,737] INFO Proxying request to n5:4400 . (org.apache.mesos.chronos.scheduler.api.RedirectFilter:37)
Aug 11 22:18:14 n3 chronos[47085]: [2015-08-11 22:18:14,787] INFO 192.168.122.1 - - [12/Aug/2015:05:18:14 +0000] "POST /scheduler/iso8601 HTTP/1.1" 204 0 "-" "Apache-HttpClient/4.5 (Java/1.8.0_40)" (mesosphere.chaos.http.ChaosRequestLog:15)
Is this bug resolved? Can someone clarify the status of this thread?
It's open so the current behaviour remains. This bug isn't critically impacting but is inconsistent with other failure modes. We'd welcome any community eyes to help resolve it.
Per #513, Chronos is expected to crash when a leader loses its Zookeeper connection. In this test case, Chronos detects the loss of its Zookeeper connection and, instead of crashing, sleeps quietly and reconnects when the partition heals. #513 argues that to keep running would violate unspecified correctness constraints. To preserve safety, should Chronos also crash here?