mesos / chronos

Fault tolerant job scheduler for Mesos which handles dependencies and ISO8601 based schedules
http://mesos.github.io/chronos/
Apache License 2.0

A partition isolating Chronos from the ZK leader can *not* cause a crash #522

Open aphyr opened 9 years ago

aphyr commented 9 years ago

Per #513, Chronos is expected to crash when a leader loses its ZooKeeper connection. In this test case, Chronos detects the loss of its ZooKeeper connection and, instead of crashing, sleeps quietly and reconnects when the partition heals. #513 argues that continuing to run would violate unspecified correctness constraints. To preserve safety, should Chronos also crash here?
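The two behaviours at issue can be modelled as a small policy over ZooKeeper connection states. This is an illustrative sketch only, not Chronos code: the state names mirror Curator's SUSPENDED/LOST distinction, and `fail_fast` is a hypothetical switch between the two behaviours being debated, not an actual Chronos option.

```python
# Illustrative sketch only -- not Chronos code. Models the two failure
# policies under discussion: retrying quietly on a connection blip,
# versus exiting the process the moment the ZK session is gone.

# Curator-style connection states: SUSPENDED means the connection dropped
# but the session may still be alive; LOST means the session expired.
SUSPENDED = "SUSPENDED"
LOST = "LOST"
RECONNECTED = "RECONNECTED"

def on_zk_state_change(state, fail_fast):
    """Return the action a scheduler leader takes for a ZK state change.

    fail_fast=True is the behaviour #513 described (crash on losing ZK);
    fail_fast=False is what this test observed (sleep and reconnect).
    """
    if state == SUSPENDED:
        # Session may survive; both policies can safely wait it out,
        # provided no new leader-only work starts in the meantime.
        return "pause-scheduling"
    if state == LOST:
        # Session expired: another node may already be leader.
        return "exit" if fail_fast else "rejoin-election"
    if state == RECONNECTED:
        return "resume-scheduling"
    raise ValueError(f"unknown state: {state}")
```

The complaint in this thread is that Chronos effectively implements both branches of the LOST case, exiting in some runs and rejoining in others.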

air commented 9 years ago

Hi - you're referring to a statement that doesn't represent the design (it wasn't expressed carefully enough). Please disregard it and refer to the clarification in the thread. Make sense?

air commented 9 years ago

To help us get a better statement on the behaviour, we'll review the logs and see whether Chronos is taking both approaches here (self-terminating in some cases and retrying in others). Thanks!

aphyr commented 9 years ago

> Please disregard it and refer to the clarification in the thread. Make sense?

No, not really. If you're trying to "take a highly conservative approach, make the fewest assumptions and exit," and "avoid a class of faults by dropping all possibly-outdated state," then you should, you know, actually exit reliably, instead of crashing some but not all of the time. Choosing both failure modes is silly.

air commented 9 years ago

Yep I get you - we'll check it out as I described.

gkleiman commented 8 years ago

I can confirm that in this test case, the Chronos leader running on n3 detects the connection errors with ZK and, instead of crashing right away, goes into a loop for roughly 3 minutes trying to reconnect to ZK:

Aug 11 22:13:12 n3 chronos[47085]: [2015-08-11 22:13:12,909] INFO Calling next for stream: R23/2015-08-12T05:13:13.356Z/PT59S, jobname: 8 (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:480)
Aug 11 22:13:12 n3 chronos[47085]: [2015-08-11 22:13:12,909] INFO Task ready for scheduling: 2015-08-12T05:13:13.356Z (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:513)
Aug 11 22:13:12 n3 chronos[47085]: [2015-08-11 22:13:12,910] INFO Scheduling:8 (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:635)
Aug 11 22:13:12 n3 chronos[47085]: [2015-08-11 22:13:12,910] INFO Scheduling task 'ct:1439356393356:0:8:' with delay: '446' (org.apache.mesos.chronos.scheduler.jobs.TaskManager:183)
Aug 11 22:13:13 n3 chronos[47085]: 2015-08-11 22:13:13,425:47085(0x7f0bea7fc700):ZOO_ERROR@handle_socket_error_msg@1643: Socket [192.168.122.11:2181] zk retcode=-7, errno=110(Connection timed out): connection to 192.168.122.11:2181 timed out (exceeded timeout by 4ms)
Aug 11 22:13:13 n3 chronos[47085]: I0811 22:13:13.425292 47194 group.cpp:418] Lost connection to ZooKeeper, attempting to reconnect ...
Aug 11 22:13:13 n3 chronos[47085]: 2015-08-11 22:13:13,908:47085(0x7f0beb7fe700):ZOO_ERROR@handle_socket_error_msg@1643: Socket [192.168.122.11:2181] zk retcode=-7, errno=110(Connection timed out): connection to 192.168.122.11:2181 timed out (exceeded timeout by 1ms)
Aug 11 22:13:20 n3 chronos[47085]: 2015-08-11 22:13:20,094:47085(0x7f0bea7fc700):ZOO_ERROR@handle_socket_error_msg@1643: Socket [192.168.122.15:2181] zk retcode=-7, errno=110(Connection timed out): connection to 192.168.122.15:2181 timed out (exceeded timeout by 4ms)
Aug 11 22:13:20 n3 chronos[47085]: 2015-08-11 22:13:20,095:47085(0x7f0bea7fc700):ZOO_INFO@check_events@1703: initiated connection to server [192.168.122.12:2181]
Aug 11 22:13:20 n3 chronos[47085]: 2015-08-11 22:13:20,096:47085(0x7f0bea7fc700):ZOO_ERROR@handle_socket_error_msg@1721: Socket [192.168.122.12:2181] zk retcode=-4, errno=112(Host is down): failed while receiving a server response
Aug 11 22:13:20 n3 chronos[47085]: 2015-08-11 22:13:20,096:47085(0x7f0bea7fc700):ZOO_INFO@check_events@1703: initiated connection to server [192.168.122.13:2181]
Aug 11 22:13:20 n3 chronos[47085]: 2015-08-11 22:13:20,098:47085(0x7f0bea7fc700):ZOO_ERROR@handle_socket_error_msg@1721: Socket [192.168.122.13:2181] zk retcode=-4, errno=112(Host is down): failed while receiving a server response
Aug 11 22:13:20 n3 chronos[47085]: 2015-08-11 22:13:20,577:47085(0x7f0beb7fe700):ZOO_ERROR@handle_socket_error_msg@1643: Socket [192.168.122.14:2181] zk retcode=-7, errno=110(Connection timed out): connection to 192.168.122.14:2181 timed out (exceeded timeout by 3ms)
Aug 11 22:13:23 n3 chronos[47085]: W0811 22:13:23.435765 47177 group.cpp:456] Timed out waiting to reconnect to ZooKeeper. Forcing ZooKeeper session (sessionId=54f205059630005) expiration
Aug 11 22:13:23 n3 chronos[47085]: I0811 22:13:23.436470 47177 group.cpp:472] ZooKeeper session expired
Aug 11 22:13:23 n3 chronos[47085]: I0811 22:13:23.436594 47173 detector.cpp:138] Detected a new leader: None
Aug 11 22:13:23 n3 chronos[47085]: 2015-08-11 22:13:23,436:47085(0x7f0bfb7fe700):ZOO_INFO@zookeeper_close@2522: Freeing zookeeper resources for sessionId=0x54f205059630005
Aug 11 22:13:23 n3 chronos[47085]: 
Aug 11 22:13:23 n3 chronos[47085]: 2015-08-11 22:13:23,437:47085(0x7f0c22f8f700):ZOO_INFO@log_env@712: Client environment:zookeeper.version=zookeeper C client 3.4.5
Aug 11 22:13:23 n3 chronos[47085]: 2015-08-11 22:13:23,437:47085(0x7f0c22f8f700):ZOO_INFO@log_env@716: Client environment:host.name=n3
Aug 11 22:13:23 n3 chronos[47085]: 2015-08-11 22:13:23,437:47085(0x7f0c22f8f700):ZOO_INFO@log_env@723: Client environment:os.name=Linux
Aug 11 22:13:23 n3 chronos[47085]: 2015-08-11 22:13:23,437:47085(0x7f0c22f8f700):ZOO_INFO@log_env@724: Client environment:os.arch=3.16.0-4-amd64
Aug 11 22:13:23 n3 chronos[47085]: 2015-08-11 22:13:23,437:47085(0x7f0c22f8f700):ZOO_INFO@log_env@725: Client environment:os.version=#1 SMP Debian 3.16.7-ckt11-1+deb8u2 (2015-07-17)
Aug 11 22:13:23 n3 chronos[47085]: 2015-08-11 22:13:23,437:47085(0x7f0c22f8f700):ZOO_INFO@log_env@733: Client environment:user.name=root
Aug 11 22:13:23 n3 chronos[47085]: 2015-08-11 22:13:23,437:47085(0x7f0c22f8f700):ZOO_INFO@log_env@741: Client environment:user.home=/root
Aug 11 22:13:23 n3 chronos[47085]: 2015-08-11 22:13:23,437:47085(0x7f0c22f8f700):ZOO_INFO@log_env@753: Client environment:user.dir=/
Aug 11 22:13:23 n3 chronos[47085]: 2015-08-11 22:13:23,437:47085(0x7f0c22f8f700):ZOO_INFO@zookeeper_init@786: Initiating client connection, host=n1:2181,n2:2181,n3:2181,n4:2181,n5:2181 sessionTimeout=10000 watcher=0x7f0c30cfb670 sessionId=0 sessionPasswd=<null> context=0x7f0ab4001410 flags=0
Aug 11 22:13:23 n3 chronos[47085]: [2015-08-11 22:13:23,437] WARN Disconnected (org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework:59)
Aug 11 22:13:23 n3 chronos[47085]: I0811 22:13:23.437671 47175 sched.cpp:276] No master detected
Aug 11 22:13:27 n3 chronos[47085]: 2015-08-11 22:13:27,247:47085(0x7f0beb7fe700):ZOO_ERROR@handle_socket_error_msg@1643: Socket [192.168.122.15:2181] zk retcode=-7, errno=110(Connection timed out): connection to 192.168.122.15:2181 timed out (exceeded timeout by 4ms)
Aug 11 22:13:27 n3 chronos[47085]: 2015-08-11 22:13:27,247:47085(0x7f0beb7fe700):ZOO_INFO@check_events@1703: initiated connection to server [192.168.122.13:2181]
Aug 11 22:13:27 n3 chronos[47085]: 2015-08-11 22:13:27,248:47085(0x7f0beb7fe700):ZOO_ERROR@handle_socket_error_msg@1721: Socket [192.168.122.13:2181] zk retcode=-4, errno=112(Host is down): failed while receiving a server response
Aug 11 22:13:27 n3 chronos[47085]: 2015-08-11 22:13:27,248:47085(0x7f0beb7fe700):ZOO_INFO@check_events@1703: initiated connection to server [192.168.122.12:2181]
Aug 11 22:13:27 n3 chronos[47085]: 2015-08-11 22:13:27,250:47085(0x7f0beb7fe700):ZOO_ERROR@handle_socket_error_msg@1721: Socket [192.168.122.12:2181] zk retcode=-4, errno=112(Host is down): failed while receiving a server response
Aug 11 22:13:30 n3 chronos[47085]: 2015-08-11 22:13:30,243:47085(0x7f0be9ffb700):ZOO_ERROR@handle_socket_error_msg@1643: Socket [192.168.122.15:2181] zk retcode=-7, errno=110(Connection timed out): connection to 192.168.122.15:2181 timed out (exceeded timeout by 4ms)
Aug 11 22:13:30 n3 chronos[47085]: 2015-08-11 22:13:30,243:47085(0x7f0be9ffb700):ZOO_INFO@check_events@1703: initiated connection to server [192.168.122.13:2181]
Aug 11 22:13:30 n3 chronos[47085]: 2015-08-11 22:13:30,244:47085(0x7f0be9ffb700):ZOO_ERROR@handle_socket_error_msg@1721: Socket [192.168.122.13:2181] zk retcode=-4, errno=112(Host is down): failed while receiving a server response
Aug 11 22:13:30 n3 chronos[47085]: 2015-08-11 22:13:30,245:47085(0x7f0be9ffb700):ZOO_INFO@check_events@1703: initiated connection to server [192.168.122.12:2181]
Aug 11 22:13:30 n3 chronos[47085]: 2015-08-11 22:13:30,246:47085(0x7f0be9ffb700):ZOO_ERROR@handle_socket_error_msg@1721: Socket [192.168.122.12:2181] zk retcode=-4, errno=112(Host is down): failed while receiving a server response
Aug 11 22:13:30 n3 chronos[47085]: [2015-08-11 22:13:30,721] INFO Client session timed out, have not heard from server in 26680ms for sessionid 0x54f205059630002, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn:1096)
Aug 11 22:13:30 n3 chronos[47085]: [2015-08-11 22:13:30,822] INFO State change: SUSPENDED (org.apache.curator.framework.state.ConnectionStateManager:228)
Aug 11 22:13:30 n3 chronos[47085]: I0811 22:13:30.825243 47219 sched.cpp:1591] Asked to stop the driver
Aug 11 22:13:30 n3 chronos[47085]: I0811 22:13:30.825429 47175 sched.cpp:835] Stopping framework '20150811-220940-192587968-5050-663-0000'
Aug 11 22:13:30 n3 chronos[47085]: [2015-08-11 22:13:30,825] INFO Defeated. Not the current leader. (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:583)
Aug 11 22:13:31 n3 chronos[47085]: [2015-08-11 22:13:31,610] INFO Opening socket connection to server n5/192.168.122.15:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn:975)
Aug 11 22:13:36 n3 chronos[47085]: 2015-08-11 22:13:36,916:47085(0x7f0be9ffb700):ZOO_ERROR@handle_socket_error_msg@1643: Socket [192.168.122.11:2181] zk retcode=-7, errno=110(Connection timed out): connection to 192.168.122.11:2181 timed out (exceeded timeout by 3ms)
Aug 11 22:13:37 n3 chronos[47085]: 2015-08-11 22:13:37,256:47085(0x7f0beb7fe700):ZOO_ERROR@handle_socket_error_msg@1643: Socket [192.168.122.11:2181] zk retcode=-7, errno=110(Connection timed out): connection to 192.168.122.11:2181 timed out (exceeded timeout by 4ms)
Aug 11 22:13:38 n3 chronos[47085]: [2015-08-11 22:13:38,829] INFO Client session timed out, have not heard from server in 8007ms for sessionid 0x54f205059630002, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn:1096)
Aug 11 22:13:40 n3 chronos[47085]: [2015-08-11 22:13:40,162] INFO Opening socket connection to server n2/192.168.122.12:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn:975)
Aug 11 22:13:40 n3 chronos[47085]: [2015-08-11 22:13:40,162] INFO Socket connection established to n2/192.168.122.12:2181, initiating session (org.apache.zookeeper.ClientCnxn:852)
Aug 11 22:13:40 n3 chronos[47085]: [2015-08-11 22:13:40,164] INFO Unable to read additional data from server sessionid 0x54f205059630002, likely server has closed socket, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn:1098)
Aug 11 22:13:40 n3 chronos[47085]: [2015-08-11 22:13:40,825] INFO Opening socket connection to server n4/192.168.122.14:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn:975)
Aug 11 22:13:40 n3 chronos[47085]: [2015-08-11 22:13:40,939] ERROR Connection timed out for connection string (n1:2181,n2:2181,n3:2181,n4:2181,n5:2181) and timeout (10000) / elapsed (10106) (org.apache.curator.ConnectionState:201)
Aug 11 22:13:40 n3 chronos[47085]: org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
Aug 11 22:13:40 n3 chronos[47085]: #011at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:198)
Aug 11 22:13:40 n3 chronos[47085]: #011at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:88)
Aug 11 22:13:40 n3 chronos[47085]: #011at org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:115)
Aug 11 22:13:40 n3 chronos[47085]: #011at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:793)
Aug 11 22:13:40 n3 chronos[47085]: #011at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:779)
Aug 11 22:13:40 n3 chronos[47085]: #011at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$400(CuratorFrameworkImpl.java:58)
Aug 11 22:13:40 n3 chronos[47085]: #011at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:265)
Aug 11 22:13:40 n3 chronos[47085]: #011at java.util.concurrent.FutureTask.run(FutureTask.java:262)
Aug 11 22:13:40 n3 chronos[47085]: #011at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
Aug 11 22:13:40 n3 chronos[47085]: #011at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
Aug 11 22:13:40 n3 chronos[47085]: #011at java.lang.Thread.run(Thread.java:745)
[...]
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,388] INFO State T_ct:1439356393356:0:8: does not exist yet. Adding to state (org.apache.mesos.chronos.scheduler.state.MesosStatePersistenceStore:77)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,418] INFO State update successful: true (org.apache.mesos.chronos.scheduler.state.MesosStatePersistenceStore:92)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,419] INFO Saving updated job:ScheduleBasedJob(R22/2015-08-12T05:14:12.356Z/PT59S,8,MEW=$(mktemp -p /tmp/chronos-test/); echo "8" >> $MEW; date -u -Ins >> $MEW; sleep 5; date -u -Ins >> $MEW;,PT22S,2,0,,,2,jepsen@jepsen.io,,,2015-08-12T05:12:20.100Z,,false,0.001,1.0,1.0,false,0,ListBuffer(),false,root,null,UTC,ListBuffer(),true,ListBuffer(),false,false,ListBuffer()) (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:666)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,419] INFO Persisting job '8' with data 'ScheduleBasedJob(R22/2015-08-12T05:14:12.356Z/PT59S,8,MEW=$(mktemp -p /tmp/chronos-test/); echo "8" >> $MEW; date -u -Ins >> $MEW; sleep 5; date -u -Ins >> $MEW;,PT22S,2,0,,,2,jepsen@jepsen.io,,,2015-08-12T05:12:20.100Z,,false,0.001,1.0,1.0,false,0,ListBuffer(),false,root,null,UTC,ListBuffer(),true,ListBuffer(),false,false,ListBuffer())' (org.apache.mesos.chronos.scheduler.state.MesosStatePersistenceStore:63)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,422] INFO Key for state exists already: J_8 (org.apache.mesos.chronos.scheduler.state.MesosStatePersistenceStore:79)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,437] INFO State update successful: true (org.apache.mesos.chronos.scheduler.state.MesosStatePersistenceStore:92)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,437] INFO Calling next for stream: R22/2015-08-12T05:14:12.356Z/PT59S, jobname: 8 (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:480)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,438] INFO Calling next for stream: R23/2015-08-12T05:13:48.719Z/PT56S, jobname: 7 (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:480)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,438] WARN ----------------------------------- (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:489)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,438] WARN Warning, no job found in graph for:7 (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:490)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,438] WARN ----------------------------------- (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:491)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,438] INFO Calling next for stream: R1/2015-08-12T05:13:48.447Z/PT42S, jobname: 6 (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:480)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,438] WARN ----------------------------------- (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:489)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,439] WARN Warning, no job found in graph for:6 (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:490)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,439] WARN ----------------------------------- (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:491)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,439] INFO Calling next for stream: R13/2015-08-12T05:13:28.555Z/PT38S, jobname: 5 (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:480)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,439] WARN ----------------------------------- (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:489)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,439] WARN Warning, no job found in graph for:5 (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:490)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,439] WARN ----------------------------------- (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:491)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,440] INFO Calling next for stream: R20/2015-08-12T05:13:24.216Z/PT30S, jobname: 4 (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:480)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,440] WARN ----------------------------------- (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:489)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,440] WARN Warning, no job found in graph for:4 (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:490)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,440] WARN ----------------------------------- (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:491)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,440] INFO Calling next for stream: R59/2015-08-12T05:13:45.689Z/PT60S, jobname: 3 (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:480)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,441] WARN ----------------------------------- (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:489)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,441] WARN Warning, no job found in graph for:3 (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:490)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,441] WARN ----------------------------------- (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:491)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,441] INFO Calling next for stream: R1/2015-08-12T05:13:41.396Z/PT47S, jobname: 2 (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:480)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,441] WARN ----------------------------------- (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:489)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,441] WARN Warning, no job found in graph for:2 (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:490)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,441] WARN ----------------------------------- (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:491)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,442] INFO Calling next for stream: R2/2015-08-12T05:13:15.311Z/PT31S, jobname: 1 (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:480)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,442] WARN ----------------------------------- (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:489)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,442] WARN Warning, no job found in graph for:1 (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:490)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,442] WARN ----------------------------------- (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:491)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,865] INFO Triggering: '8' (org.apache.mesos.chronos.scheduler.jobs.TaskManager:20)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,865] INFO Removing task mapping (org.apache.mesos.chronos.scheduler.jobs.TaskManager:118)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,865] INFO JobNotificationObserver does not handle JobQueued(ScheduleBasedJob(R22/2015-08-12T05:14:12.356Z/PT59S,8,MEW=$(mktemp -p /tmp/chronos-test/); echo "8" >> $MEW; date -u -Ins >> $MEW; sleep 5; date -u -Ins >> $MEW;,PT22S,2,0,,,2,jepsen@jepsen.io,,,2015-08-12T05:12:20.100Z,,false,0.001,1.0,1.0,false,0,ListBuffer(),false,root,null,UTC,ListBuffer(),true,ListBuffer(),false,false,ListBuffer()),ct:1439356393356:0:8:,0) (org.apache.mesos.chronos.scheduler.jobs.JobsObserver$:27)
Aug 11 22:16:37 n3 chronos[47085]: [2015-08-11 22:16:37,866] INFO Updating state for job (8) to queued (org.apache.mesos.chronos.scheduler.jobs.stats.JobStats:62)
Aug 11 22:17:31 n3 chronos[47085]: [2015-08-11 22:17:31,276] INFO Proxying request to n5:4400 . (org.apache.mesos.chronos.scheduler.api.RedirectFilter:37)
Aug 11 22:17:31 n3 chronos[47085]: [2015-08-11 22:17:31,353] INFO 192.168.122.1 -  -  [12/Aug/2015:05:17:31 +0000] "POST /scheduler/iso8601 HTTP/1.1" 204 0 "-" "Apache-HttpClient/4.5 (Java/1.8.0_40)" (mesosphere.chaos.http.ChaosRequestLog:15)
Aug 11 22:17:57 n3 chronos[47085]: [2015-08-11 22:17:57,451] INFO Proxying request to n5:4400 . (org.apache.mesos.chronos.scheduler.api.RedirectFilter:37)
Aug 11 22:17:57 n3 chronos[47085]: [2015-08-11 22:17:57,502] INFO 192.168.122.1 -  -  [12/Aug/2015:05:17:57 +0000] "POST /scheduler/iso8601 HTTP/1.1" 204 0 "-" "Apache-HttpClient/4.5 (Java/1.8.0_40)" (mesosphere.chaos.http.ChaosRequestLog:15)
Aug 11 22:18:14 n3 chronos[47085]: [2015-08-11 22:18:14,737] INFO Proxying request to n5:4400 . (org.apache.mesos.chronos.scheduler.api.RedirectFilter:37)
Aug 11 22:18:14 n3 chronos[47085]: [2015-08-11 22:18:14,787] INFO 192.168.122.1 -  -  [12/Aug/2015:05:18:14 +0000] "POST /scheduler/iso8601 HTTP/1.1" 204 0 "-" "Apache-HttpClient/4.5 (Java/1.8.0_40)" (mesosphere.chaos.http.ChaosRequestLog:15)
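For reference, the length of the retry window can be read straight off the syslog timestamps. A minimal sketch, using two timestamps copied from the excerpt above (the first socket timeout, and the first successful state update after the partition heals):

```python
from datetime import datetime

# Timestamps copied from the log excerpt: the first ZK socket timeout on
# n3, and the first "State update successful" after the partition heals.
first_timeout = datetime.strptime("2015-08-11 22:13:13,425",
                                  "%Y-%m-%d %H:%M:%S,%f")
first_update = datetime.strptime("2015-08-11 22:16:37,418",
                                 "%Y-%m-%d %H:%M:%S,%f")

# Window between losing ZK and resuming persisted state updates.
elapsed = first_update - first_timeout
print(elapsed)
```

This measures the gap between the first timeout and the first post-recovery state write; the retry loop itself may have ended somewhat earlier.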
junjizhi commented 8 years ago

Is this bug resolved? Can someone clarify the status of this thread?

air commented 8 years ago

It's open, so the current behaviour remains. This bug isn't critical, but the behaviour is inconsistent with Chronos's other failure modes. We'd welcome any community eyes to help resolve it.
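One way to make the failure modes consistent, sketched below with hypothetical names: route every detector that can conclude "the ZK session is gone" (the native ZK watcher, the Curator state listener, the Mesos driver callbacks) through a single abort routine, so the process either always exits or never does. This is an assumption-laden sketch, not how Chronos is structured.

```python
import os
import sys

def abort(reason, _exit=os._exit):
    """Single fail-fast exit path shared by all session-loss detectors.

    Hypothetical sketch: flush logs, then exit immediately, skipping
    cleanup handlers that might touch possibly-stale leader state.
    (_exit is injectable here purely so the routine can be exercised
    without killing the process.)
    """
    print(f"aborting: {reason}", file=sys.stderr)
    sys.stderr.flush()
    _exit(1)  # bypass atexit hooks; stale leader state must not be reused

# Every detector funnels into the same routine, so there is exactly one
# observable behaviour on session loss instead of two:
def on_curator_lost():
    abort("curator reported LOST")

def on_zk_session_expired():
    abort("zookeeper session expired")

def on_mesos_driver_stopped():
    abort("mesos scheduler driver stopped")
```

With one funnel, "crash sometimes, retry other times" becomes structurally impossible: changing the policy means changing one function.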