harjinder-flipkart opened this issue 4 years ago
Based on a recent investigation, I have updated the problem description above.
Chronos team, can you please help us resolve this issue?
I have put the Chronos thread dump here.
Relevant threads look like this:
...
"Thread-264485" #264523 prio=5 os_prio=0 tid=0x00007fd9d4006800 nid=0x5fb9 waiting for monitor entry [0x00007fda1c9da000]
java.lang.Thread.State: BLOCKED (on object monitor)
at org.apache.mesos.chronos.scheduler.jobs.JobScheduler.replaceJob(JobScheduler.scala:152)

"pool-4-thread-1" #48 prio=5 os_prio=0 tid=0x00007fd9ac006000 nid=0x6140 runnable [0x00007fd97fffe000]
java.lang.Thread.State: RUNNABLE
at org.apache.mesos.state.AbstractState$FetchFuture.get(Native Method)
at org.apache.mesos.state.AbstractState$FetchFuture.get(AbstractState.java:226)
at org.apache.mesos.chronos.scheduler.state.MesosStatePersistenceStore$$anonfun$getJobs$2.apply(MesosStatePersistenceStore.scala:106)
at org.apache.mesos.chronos.scheduler.state.MesosStatePersistenceStore$$anonfun$getJobs$2.apply(MesosStatePersistenceStore.scala:106)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at org.apache.mesos.chronos.scheduler.jobs.JobUtils$.loadJobs(JobUtils.scala:68)
at org.apache.mesos.chronos.scheduler.jobs.JobScheduler.liftedTree1$1(JobScheduler.scala:542)
at org.apache.mesos.chronos.scheduler.jobs.JobScheduler.mainLoop(JobScheduler.scala:540)
...
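Read together, these two threads suggest that the main loop is holding the JobScheduler lock while it waits on an unbounded ZK fetch, so every other caller of that lock (such as replaceJob) stays BLOCKED. Below is a minimal, self-contained Scala sketch of that pattern; every name in it is illustrative, not taken from the actual Chronos code:

```scala
import java.util.concurrent.CountDownLatch

// Minimal sketch of the locking pattern the two dumps above suggest.
// All names here are illustrative assumptions, NOT the actual Chronos code.
object BlockedSchedulerSketch {
  private val schedulerLock = new Object
  private val zkNeverAnswers = new CountDownLatch(1) // simulates a stuck ZK fetch

  // Stand-in for the main loop's job reload: holds the scheduler lock while
  // waiting, without a timeout, on the state store (like FetchFuture.get()).
  def mainLoopReload(): Unit = schedulerLock.synchronized {
    zkNeverAnswers.await() // never returns, so the lock is never released
  }

  // Stand-in for JobScheduler.replaceJob: needs the same lock, so it sits
  // BLOCKED (on object monitor), like Thread-264485 in the dump above.
  def replaceJob(name: String): Unit = schedulerLock.synchronized {
    println(s"replaced $name")
  }

  def main(args: Array[String]): Unit = {
    new Thread(new Runnable { def run(): Unit = mainLoopReload() }, "mainLoop").start()
    Thread.sleep(100) // let the main loop grab the lock first
    new Thread(new Runnable { def run(): Unit = replaceJob("job1") }, "api-thread").start()
    // A jstack of this JVM shows "api-thread" BLOCKED on the monitor held by
    // "mainLoop", mirroring the two threads in the dump.
  }
}
```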
@brndnmtthws, can you please look into this issue?
@harjinder-flipkart I haven't been involved with this project in years, so I'm not really in a position to help. Good luck with your debugging.
Can you send the Mesos state JSON?
The state JSON for the Mesos master is here: https://gist.github.com/harjinder-flipkart/58f1dfc8e077ee9a80f1b544cf87ff4c
I suspect Chronos is stuck with a single offer. Have you tried restarting it? It might also help to set offer_timeout on the Mesos master.
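For example, something along these lines (the exact duration syntax depends on your Mesos version, so treat this as a sketch):

```
# Rescind offers that a framework has held for too long (value is illustrative):
mesos-master --offer_timeout=5mins ...
# or via the environment, if the master is started from an init script:
export MESOS_OFFER_TIMEOUT=5mins
```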
Thanks for your reply, @janisz!
Yes, restarting Chronos and ZK brings the cluster back into a working state. Restarting Chronos/ZK is a workaround for the time being, but we are looking for a permanent fix and need your help :)
Also, I am not sure Chronos was stuck with a single offer. The thread dump shows that the Chronos thread was trying to load jobs and was waiting on ZK:
...
"pool-4-thread-1" #48 prio=5 os_prio=0 tid=0x00007fd9ac006000 nid=0x6140 runnable [0x00007fd97fffe000]
java.lang.Thread.State: RUNNABLE
at org.apache.mesos.state.AbstractState$FetchFuture.get(Native Method)
at org.apache.mesos.state.AbstractState$FetchFuture.get(AbstractState.java:226)
at org.apache.mesos.chronos.scheduler.state.MesosStatePersistenceStore$$anonfun$getJobs$2.apply(MesosStatePersistenceStore.scala:106)
at org.apache.mesos.chronos.scheduler.state.MesosStatePersistenceStore$$anonfun$getJobs$2.apply(MesosStatePersistenceStore.scala:106)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at org.apache.mesos.chronos.scheduler.jobs.JobUtils$.loadJobs(JobUtils.scala:68)
...
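One more detail from this trace: the fetch goes through the no-argument Future.get(), which waits indefinitely if ZK never answers while the scheduler lock is held. Purely as an illustration (not a patch against the actual MesosStatePersistenceStore code), a bounded fetch would at least let the loop fail fast and release the lock; the timeout value and the Option-based handling below are assumptions:

```scala
import java.util.concurrent.{Future, TimeUnit, TimeoutException}

object BoundedFetchSketch {
  // Sketch only: bound a state-store fetch so a stuck ZK connection surfaces
  // as an error instead of blocking the caller (and its lock) indefinitely.
  def fetchWithTimeout[T](future: Future[T], timeoutSeconds: Long): Option[T] =
    try Some(future.get(timeoutSeconds, TimeUnit.SECONDS))
    catch {
      case _: TimeoutException =>
        // let the caller retry, skip the entry, or abdicate leadership
        None
    }
}
```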
@janisz, any pointers on this?
Intermittent Chronos Issue: In our Chronos cluster, we have been encountering an intermittent issue where Chronos jobs stop being executed on Mesos. The sequence of observed events is as follows:
Whys:
Software Versions: