mesos / chronos

Fault tolerant job scheduler for Mesos which handles dependencies and ISO8601 based schedules
http://mesos.github.io/chronos/
Apache License 2.0

Chronos Intermittent Issue: Jobs get stuck #897

Open harjinder-flipkart opened 4 years ago

harjinder-flipkart commented 4 years ago

Intermittent Chronos Issue: In our Chronos cluster, we have been encountering an intermittent issue where Chronos jobs stop being executed on Mesos. The sequence of observed events is as follows:

Whys:

Software Versions:

harjinder-flipkart commented 4 years ago

Based upon recent investigation, I have updated the problem description above.

Chronos team, can you please help us resolve the issue?

harjinder-flipkart commented 4 years ago

I have posted a Chronos thread dump here.

Relevant threads look like this:

...
"Thread-264485" #264523 prio=5 os_prio=0 tid=0x00007fd9d4006800 nid=0x5fb9 waiting for monitor entry [0x00007fda1c9da000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at org.apache.mesos.chronos.scheduler.jobs.JobScheduler.replaceJob(JobScheduler.scala:152)

"pool-4-thread-1" #48 prio=5 os_prio=0 tid=0x00007fd9ac006000 nid=0x6140 runnable [0x00007fd97fffe000]
   java.lang.Thread.State: RUNNABLE
    at org.apache.mesos.state.AbstractState$FetchFuture.get(Native Method)
    at org.apache.mesos.state.AbstractState$FetchFuture.get(AbstractState.java:226)
    at org.apache.mesos.chronos.scheduler.state.MesosStatePersistenceStore$$anonfun$getJobs$2.apply(MesosStatePersistenceStore.scala:106)
    at org.apache.mesos.chronos.scheduler.state.MesosStatePersistenceStore$$anonfun$getJobs$2.apply(MesosStatePersistenceStore.scala:106)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
    at org.apache.mesos.chronos.scheduler.jobs.JobUtils$.loadJobs(JobUtils.scala:68)
    at org.apache.mesos.chronos.scheduler.jobs.JobScheduler.liftedTree1$1(JobScheduler.scala:542)
    at org.apache.mesos.chronos.scheduler.jobs.JobScheduler.mainLoop(JobScheduler.scala:540)
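
To illustrate what these two threads might be showing, here is a minimal Java sketch. It is not Chronos code. It assumes (this is not confirmed by the dump) that the main loop holds the same monitor that replaceJob is waiting for while it waits on a state-store fetch that never completes. The class name, thread names, and the lock object are all hypothetical. One difference from the real dump: the fetch there is inside a native method and so shows as RUNNABLE, whereas the pure-Java future below would show as WAITING.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;

public class StuckSchedulerSketch {
    // Hypothetical stand-in for the scheduler's monitor; not taken from Chronos.
    private static final Object schedulerLock = new Object();

    /** Reproduces the contention and returns the state observed for the second thread. */
    public static Thread.State reproduce() throws Exception {
        // Stand-in for a state-store fetch that does not return.
        CompletableFuture<String> pendingFetch = new CompletableFuture<>();
        CountDownLatch lockHeld = new CountDownLatch(1);

        // Plays the role of the main loop: takes the lock, then waits on the fetch.
        Thread mainLoop = new Thread(() -> {
            synchronized (schedulerLock) {
                lockHeld.countDown();
                try {
                    pendingFetch.get(); // unbounded wait while still holding the lock
                } catch (Exception e) {
                    // interrupted or failed; nothing to do in this sketch
                }
            }
        }, "main-loop-sketch");

        // Plays the role of replaceJob: needs the same lock to proceed.
        Thread replaceJob = new Thread(() -> {
            synchronized (schedulerLock) {
                // would update the job here
            }
        }, "replace-job-sketch");

        mainLoop.setDaemon(true);
        replaceJob.setDaemon(true);
        mainLoop.start();
        lockHeld.await();          // make sure the lock is taken first
        replaceJob.start();

        // Poll (for at most five seconds) until the second thread parks on the monitor.
        long deadline = System.nanoTime() + 5_000_000_000L;
        while (replaceJob.getState() != Thread.State.BLOCKED && System.nanoTime() < deadline) {
            Thread.sleep(10);
        }
        Thread.State observed = replaceJob.getState();

        // Complete the fetch so both threads can finish.
        pendingFetch.complete("done");
        mainLoop.join();
        replaceJob.join();
        return observed;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("replace-job-sketch state: " + reproduce());
    }
}
```

If that assumption holds, every caller that needs the lock stays blocked until the fetch returns, which would be consistent with jobs no longer being scheduled until a restart.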

harjinder-flipkart commented 4 years ago

@brndnmtthws can you please look into this issue?

brndnmtthws commented 4 years ago

@harjinder-flipkart I haven't been involved with this project in years, so I'm not really in a position to help. Good luck with your debugging.

janisz commented 4 years ago

Can you send the Mesos state JSON?

harjinder-flipkart commented 4 years ago

State JSON for mesos master is here: https://gist.github.com/harjinder-flipkart/58f1dfc8e077ee9a80f1b544cf87ff4c

janisz commented 4 years ago

I suspect Chronos is stuck with a single offer. Have you tried restarting it? It might be helpful to set offer_timeout on the Mesos master.

harjinder-flipkart commented 4 years ago

Thanks @janisz for your reply!

Yes, restarting Chronos and ZK brings the cluster back to a working state. Restarting Chronos/ZK is a workaround for the time being, but we are looking for a permanent solution and need your help :)

Also, I am not sure Chronos was stuck with a single offer. The thread dump shows that a Chronos thread was trying to load jobs and was waiting on ZK:

...
"pool-4-thread-1" #48 prio=5 os_prio=0 tid=0x00007fd9ac006000 nid=0x6140 runnable [0x00007fd97fffe000]
   java.lang.Thread.State: RUNNABLE
    at org.apache.mesos.state.AbstractState$FetchFuture.get(Native Method)
    at org.apache.mesos.state.AbstractState$FetchFuture.get(AbstractState.java:226)
    at org.apache.mesos.chronos.scheduler.state.MesosStatePersistenceStore$$anonfun$getJobs$2.apply(MesosStatePersistenceStore.scala:106)
    at org.apache.mesos.chronos.scheduler.state.MesosStatePersistenceStore$$anonfun$getJobs$2.apply(MesosStatePersistenceStore.scala:106)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
    at org.apache.mesos.chronos.scheduler.jobs.JobUtils$.loadJobs(JobUtils.scala:68)
...

harjinder-flipkart commented 4 years ago

@janisz any pointers on this?