Chronos Intermittent Issue: Jobs get stuck

harjinder-flipkart commented 4 years ago

Intermittent Chronos Issue: On our Chronos cluster, we intermittently hit an issue where Chronos jobs stop being executed on Mesos. The observed sequence of events is as follows:

Chronos jobs are not executed by Mesos.
The status of the jobs on the Chronos dashboard is ‘Queued’.
Mesos master logs show that:
-- the master has not been sending resource offers to the framework, i.e. Chronos;
-- the master keeps getting status updates from slaves for old tasks;
-- it keeps trying to forward those updates to Chronos;
-- Zookeeper and the slaves are not down; they are working fine.
After restarting Chronos and Zookeeper, the system works again and Chronos jobs start getting executed.

Whys:

Why do Chronos jobs stop getting executed? Chronos, as a Mesos framework (application), waits for resource offers from the Mesos master. The master normally sends resource offers at a high frequency, i.e. every 100 ms to a few seconds. In this case, however, the master stopped sending resource offers, and without them Chronos is stuck.
Why did the Mesos master stop sending resource offers? The Mesos slaves were occupied with FINISHED tasks. The slaves kept telling the master that a task was FINISHED, and the master kept trying to relay that to the Chronos leader and waiting for an ACK. Chronos was not sending the ACK.
Why did Chronos not send the ACK? The "JobScheduler::handleFinishedTask" thread in the Chronos leader was waiting on a ReentrantLock held by the "JobScheduler::mainLoop" thread.
Why did the "JobScheduler::mainLoop" thread not release the lock? The mainLoop thread was trying to reload jobs from ZK and was blocked on ZK.
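The chain of whys above comes down to one thread holding a lock across an unbounded wait on the state store. As a rough illustration only (this is not Chronos code; the class and method names below are hypothetical stand-ins), the following Java sketch shows one possible mitigation: bounding the wait with `Future.get(timeout, unit)` so the lock is always released and the status-update handler can proceed. It assumes the fetch result is exposed as a `java.util.concurrent.Future`, and it uses explicit `lock()`/`tryLock()` calls for clarity.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for the scheduler; none of these names are Chronos APIs.
class BoundedReloadSketch {
    private final ReentrantLock lock = new ReentrantLock();

    // Stand-in for the main loop's reload step. Returns true if the fetch
    // finished in time, false otherwise. The lock is released either way.
    boolean reloadJobs(Future<String> fetch, long timeoutMillis) {
        lock.lock();
        try {
            fetch.get(timeoutMillis, TimeUnit.MILLISECONDS); // bounded, unlike a bare get()
            return true;
        } catch (TimeoutException e) {
            fetch.cancel(true); // give up on this round; a later iteration can retry
            return false;
        } catch (ExecutionException e) {
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            lock.unlock();
        }
    }

    // Stand-in for the status-update handler. Returns true once it has held the lock.
    boolean handleFinishedTask(long waitMillis) {
        try {
            if (!lock.tryLock(waitMillis, TimeUnit.MILLISECONDS)) {
                return false;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        try {
            return true; // a real handler would update job state here
        } finally {
            lock.unlock();
        }
    }

    // Runs a reload against a store that never answers, then checks that the
    // handler can still take the lock from another thread.
    static boolean demo() {
        BoundedReloadSketch scheduler = new BoundedReloadSketch();
        Future<String> neverCompletes = new CompletableFuture<>();
        Thread mainLoop = new Thread(() -> scheduler.reloadJobs(neverCompletes, 200));
        mainLoop.start();
        boolean handled = scheduler.handleFinishedTask(5_000);
        try {
            mainLoop.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return handled;
    }

    public static void main(String[] args) {
        System.out.println("handler acquired lock: " + demo());
    }
}
```

With an unbounded `get()` in `reloadJobs`, the same demo would leave the handler waiting for as long as the store stays silent, which matches the symptom described above.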

Software Versions:

Chronos 3.0.3
Mesos 1.4.0
Zookeeper 3.4.5

harjinder-flipkart commented 4 years ago

Based upon recent investigation, I have updated the problem description above.

Chronos team, can you please help us resolve the issue?

harjinder-flipkart commented 4 years ago

I have kept Chronos thread dump here.

Relevant threads look like this:

...
"Thread-264485" #264523 prio=5 os_prio=0 tid=0x00007fd9d4006800 nid=0x5fb9 waiting for monitor entry [0x00007fda1c9da000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at org.apache.mesos.chronos.scheduler.jobs.JobScheduler.replaceJob(JobScheduler.scala:152)
    - waiting to lock <0x00000007042d73d0> (a java.util.concurrent.locks.ReentrantLock)
    at org.apache.mesos.chronos.scheduler.jobs.JobScheduler.handleFinishedTask(JobScheduler.scala:244)
    at org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework.statusUpdate(MesosJobFramework.scala:210)
    at sun.reflect.GeneratedMethodAccessor81.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:483)
    at com.google.inject.internal.DelegatingInvocationHandler.invoke(DelegatingInvocationHandler.java:37)
    at com.sun.proxy.$Proxy30.statusUpdate(Unknown Source)
...

"pool-4-thread-1" #48 prio=5 os_prio=0 tid=0x00007fd9ac006000 nid=0x6140 runnable [0x00007fd97fffe000]
   java.lang.Thread.State: RUNNABLE
    at org.apache.mesos.state.AbstractState$FetchFuture.get(Native Method)
    at org.apache.mesos.state.AbstractState$FetchFuture.get(AbstractState.java:226)
    at org.apache.mesos.chronos.scheduler.state.MesosStatePersistenceStore$$anonfun$getJobs$2.apply(MesosStatePersistenceStore.scala:106)
    at org.apache.mesos.chronos.scheduler.state.MesosStatePersistenceStore$$anonfun$getJobs$2.apply(MesosStatePersistenceStore.scala:106)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
    at org.apache.mesos.chronos.scheduler.jobs.JobUtils$.loadJobs(JobUtils.scala:68)
    at org.apache.mesos.chronos.scheduler.jobs.JobScheduler.liftedTree1$1(JobScheduler.scala:542)
    at org.apache.mesos.chronos.scheduler.jobs.JobScheduler.mainLoop(JobScheduler.scala:540)
    - locked <0x00000007042d73d0> (a java.util.concurrent.locks.ReentrantLock)
    at org.apache.mesos.chronos.scheduler.jobs.JobScheduler$$anon$1.run(JobScheduler.scala:516)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
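For anyone repeating this analysis on a full dump, pairing each blocked thread with the thread that holds its lock can be scripted. The Java sketch below is a hypothetical helper (the class `LockOwnerFinder` is not part of Chronos or the JDK). It assumes the usual jstack text layout, where each thread starts with a quoted name and monitor lines read `- waiting to lock <addr>` and `- locked <addr>`.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: maps each blocked thread to the thread holding its lock.
class LockOwnerFinder {
    // A thread section starts with the thread name in double quotes.
    private static final Pattern HEADER = Pattern.compile("^\"([^\"]+)\"");
    // Monitor lines carry the lock address in angle brackets.
    private static final Pattern LOCK =
            Pattern.compile("(waiting to lock|locked) <(0x[0-9a-fA-F]+)>");

    static Map<String, String> blockedBy(String dump) {
        Map<String, String> holderOf = new HashMap<>();         // lock address -> holder thread
        Map<String, String> waitingFor = new LinkedHashMap<>(); // waiter thread -> lock address
        String current = null;
        for (String line : dump.split("\\R")) {
            Matcher header = HEADER.matcher(line);
            if (header.find()) {
                current = header.group(1);
                continue;
            }
            Matcher lock = LOCK.matcher(line);
            if (current != null && lock.find()) {
                if (lock.group(1).equals("locked")) {
                    holderOf.put(lock.group(2), current);
                } else {
                    waitingFor.put(current, lock.group(2));
                }
            }
        }
        // Keep only waiters whose lock holder also appears in the dump.
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : waitingFor.entrySet()) {
            String holder = holderOf.get(entry.getValue());
            if (holder != null) {
                result.put(entry.getKey(), holder);
            }
        }
        return result;
    }

    // Usage: java LockOwnerFinder <path-to-thread-dump>
    public static void main(String[] args) throws Exception {
        String dump = new String(Files.readAllBytes(Paths.get(args[0])));
        blockedBy(dump).forEach((waiter, holder) ->
                System.out.println(waiter + " is blocked by " + holder));
    }
}
```

Run against the two threads quoted above, this kind of pairing would report the status-update thread as blocked by the thread running mainLoop, since both reference lock 0x00000007042d73d0.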

harjinder-flipkart commented 4 years ago

@brndnmtthws can you please look into this issue?

brndnmtthws commented 4 years ago

@harjinder-flipkart I haven't been involved with this project in years, so I'm not really in a position to help. Good luck with your debugging.

janisz commented 4 years ago

Can you send mesos state JSON?

harjinder-flipkart commented 4 years ago

State JSON for mesos master is here: https://gist.github.com/harjinder-flipkart/58f1dfc8e077ee9a80f1b544cf87ff4c

janisz commented 4 years ago

I suspect Chronos is stuck holding a single offer. Have you tried restarting it? It might help to set offer_timeout on the Mesos master.

harjinder-flipkart commented 4 years ago

Thanks for your reply, @janisz!

Yes, restarting Chronos and ZK brings the cluster back to a working condition. Restarting Chronos/ZK is our workaround for the time being, but we are looking for a permanent solution and need your help :)

Also, I am not sure that Chronos was stuck with a single offer. The thread dump shows that the Chronos thread was trying to load jobs and was waiting on ZK:

...
"pool-4-thread-1" #48 prio=5 os_prio=0 tid=0x00007fd9ac006000 nid=0x6140 runnable [0x00007fd97fffe000]
   java.lang.Thread.State: RUNNABLE
    at org.apache.mesos.state.AbstractState$FetchFuture.get(Native Method)
    at org.apache.mesos.state.AbstractState$FetchFuture.get(AbstractState.java:226)
    at org.apache.mesos.chronos.scheduler.state.MesosStatePersistenceStore$$anonfun$getJobs$2.apply(MesosStatePersistenceStore.scala:106)
    at org.apache.mesos.chronos.scheduler.state.MesosStatePersistenceStore$$anonfun$getJobs$2.apply(MesosStatePersistenceStore.scala:106)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
    at org.apache.mesos.chronos.scheduler.jobs.JobUtils$.loadJobs(JobUtils.scala:68)
...

harjinder-flipkart commented 4 years ago

@janisz any pointers on this?
