mesos / chronos

Fault tolerant job scheduler for Mesos which handles dependencies and ISO8601 based schedules
http://mesos.github.io/chronos/
Apache License 2.0

jobs getting stuck in "Queued" state #569

Open vbajaria opened 9 years ago

vbajaria commented 9 years ago

I have noticed this happen multiple times. A job gets stuck in Queued state but never really runs.

At the same time other jobs get submitted to mesos with no problems.

The only way to fix this problem is to restart Chronos; deleting the job and adding it back does not fix it.

I think the delete-and-re-add does not work because of the in-memory queue at https://github.com/mesos/chronos/blob/master/src/main/scala/org/apache/mesos/chronos/scheduler/jobs/TaskManager.scala#L42 ?

Does Chronos not keep this state in ZooKeeper?
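
To make the concern above concrete, here is a minimal sketch (not the actual TaskManager code; all names are hypothetical) of the pattern being described: job definitions are persisted, but the queue of pending task IDs lives only in process memory, so a bad entry survives a delete-and-re-add of the job and is only cleared by restarting the process.

import scala.collection.mutable

// Hypothetical sketch: job definitions are stored durably, but the task queue is RAM-only.
class InMemoryTaskQueue {
  // Pending task IDs, kept only for the lifetime of the Chronos process.
  private val queue = mutable.Queue[String]()

  def enqueue(taskId: String): Unit = queue.enqueue(taskId)

  def nextTask(): Option[String] =
    if (queue.isEmpty) None else Some(queue.dequeue())

  // Deleting and re-adding a job only touches the persisted job definition; any stale
  // entry wedged in this queue stays put until the process itself is restarted.
  def removeTasksForJob(jobName: String): Unit = {
    queue.dequeueAll(_.startsWith(jobName + ":"))
    ()
  }
}

The point is only that nothing here is written back to ZooKeeper; whether that matches the real TaskManager is exactly the question being asked.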

pegerto commented 9 years ago

Hello,

I am investigating an issue with the same symptoms and will try restarting Chronos to see if that solves it.

pegerto commented 9 years ago

Hi @vbajaria

Restarting Chronos solved my issue on Chronos 2.3.3.

I couldn't find any evidence of why the task stays in the queued state.

TheRockyOng commented 9 years ago

I'm experiencing the exact same issue, and restarting didn't really help. Is there any known workaround for this? I couldn't find any helpful logs either...

vbajaria commented 9 years ago

@pegerto where did you look for evidence? I think the issue could be due to one of two things (though I can't vouch for either): 1) the Mesos <==> Chronos communication has some problem and the job stays in the "queued" state forever, or 2) some timeout or network failure occurs when Chronos tries to submit the job to Mesos once an offer is accepted.

Again, I'm speculating, since I have not dived deeper.

deedubs commented 9 years ago

We're experiencing the same on chronos 2.4.0-0.1.20150828104228.ubuntu1404 and mesos 0.24.1-0.2.35.ubuntu1404. Jobs sitting in queued state with lots of available resources. Only restarting chronos solves the issue.

Time for a Chronos job to routinely restart chronos?

robertkhchan commented 9 years ago

Any update on this? I'm experiencing the same issue on chronos-2.3.4 and mesos 0.23.0.

tomwganem commented 9 years ago

I'm also experiencing this. I can submit a job and it never runs, either on its schedule or when trying to run it manually. We are running mesos 0.22.1 and chronos 2.3.4.

However, even restarting chronos doesn't seem to fix the issue. I've had a little more luck with clearing out the zookeeper node, but as far as viable workarounds go, that one sucks.

edit: Not even restarting chronos fixes the issue for me.

vbajaria commented 9 years ago

I am surprised the restart does not fix it; it has always fixed it for me.

I am sure the issue happens due to the in-memory queue and some race condition, but I haven't had a chance to reproduce it.

JohnOmernik commented 8 years ago

I had this happen over the weekend. We did a forced Mesos master switch (we had a Mesos master with a failing hard drive; it was still working, but we wanted to control and monitor the failover, so we stopped that master and it failed over as intended). However, after the failover, jobs hung in a queued state until we restarted Chronos.

deedubs commented 8 years ago

Our "solution" up to this point has been to use https://github.com/massiveco/ananke to get job stats into Prometheus and detect when jobs haven't run on schedule. We then automatically restart Chronos :grimacing: .

It's not pretty but it keeps our jobs running.

orlandohohmeier commented 8 years ago

Thanks for reporting this! To narrow down the problem, it would be great if you could provide some details regarding your setup and the actual jobs that get stuck. Do only specific jobs get stuck (e.g. due to some constraints)? Does this happen randomly, or always after a given time?

ekesken commented 8 years ago

I had the same problem, and restarting Chronos did not solve it either. After investigating my logs, I found that a Marathon framework task had wrongly been attributed to the Chronos framework on one of the slaves, which caused the framework to be deactivated. When we tried to restart the Chronos master, the following logs occurred:

Dec  7 12:35:25 mesos-master-node-003 chronos[15003]: Exception in thread "Thread-62550" scala.MatchError: collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128 (of class java.lang.String)
Dec  7 12:35:25 mesos-master-node-003 chronos[15003]:     at org.apache.mesos.chronos.scheduler.jobs.TaskUtils$.parseTaskId(TaskUtils.scala:132)
Dec  7 12:35:25 mesos-master-node-003 chronos[15003]:     at org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework.statusUpdate(MesosJobFramework.scala:215)
Dec  7 12:35:25 mesos-master-node-003 chronos[15003]:     at sun.reflect.GeneratedMethodAccessor93.invoke(Unknown Source)
Dec  7 12:35:25 mesos-master-node-003 chronos[15003]:     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
Dec  7 12:35:25 mesos-master-node-003 chronos[15003]:     at java.lang.reflect.Method.invoke(Method.java:497)
Dec  7 12:35:25 mesos-master-node-003 chronos[15003]:     at com.google.inject.internal.DelegatingInvocationHandler.invoke(DelegatingInvocationHandler.java:37)
Dec  7 12:35:25 mesos-master-node-003 chronos[15003]:     at com.sun.proxy.$Proxy31.statusUpdate(Unknown Source)
Dec  7 12:35:25 mesos-master-node-003 chronos[15003]: I1207 12:35:25.238782 15035 sched.cpp:1623] Asked to abort the driver
Dec  7 12:35:25 mesos-master-node-003 chronos[15003]: I1207 12:35:25.238836 15035 sched.cpp:856] Aborting framework '20150624-210230-117448108-5050-3678-0001'

which deactivated the Chronos framework again. You can check my logs in the following gist: https://gist.github.com/ekesken/f2edfd65cca8638b0136

I stopped the mesos-slave, removed the /tmp/mesos folder, and started it again, then restarted the Chronos master to reactivate the framework. I don't know if there is a better way to reactivate it.

I filed a bug report for Mesos as well: https://issues.apache.org/jira/browse/MESOS-4084
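
For readers hitting the same stack trace, here is a minimal, illustrative Scala sketch of that failure mode (the ID layout and names below are made up, not the real TaskUtils code): a status update arrives carrying a task ID in a format the scheduler does not expect, the non-exhaustive pattern match throws scala.MatchError, and the exception escaping statusUpdate causes the driver to abort and the framework to be deactivated.

// Illustrative only: what a non-exhaustive match on a task ID format looks like.
object ParseTaskIdSketch {
  // Hypothetical Chronos-style task ID layout: "ct:<dueEpochMs>:<attempt>:<jobName>"
  private val ChronosTaskId = """ct:(\d+):(\d+):(.+)""".r

  def parseTaskId(id: String): (String, Long, Int) = id match {
    case ChronosTaskId(due, attempt, jobName) => (jobName, due.toLong, attempt.toInt)
    // No fallback case: an ID from another framework, e.g.
    // "collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-...", throws
    // scala.MatchError here, and if that escapes statusUpdate the driver is aborted.
  }

  def main(args: Array[String]): Unit =
    println(parseTaskId("ct:1449486925000:0:myjob"))  // works; a Marathon-style ID would throw
}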

grepsr commented 8 years ago

For some reason, this happens in my datacenter every Saturday, which is weird.

Any idea when this can be solved? If someone knows what might be going on, I would love to take a stab at it.

Kulikowski commented 8 years ago

I am seeing the same problem: my jobs are stuck in the "queued" state when I am using constraints. Is there any fix for this?

wh88725 commented 8 years ago

I am seeing the same problem: every job gets stuck in the "queued" state and none of them run. I checked the mesos-master log and all the mesos-slave logs; no Chronos jobs are being submitted. Restarting Chronos does not fix it. I also checked the Chronos log and no exception or error occurs. I guess the Mesos <==> Chronos communication may have some problem, but I can see the Chronos framework in the Mesos UI, and I cannot find any log that would pinpoint a communication problem.

RockScience commented 8 years ago

That is quite an issue for a "fault tolerant" scheduler. Is anyone looking into this?

planenutz commented 8 years ago

How many Chronos nodes are in your cluster? Can you determine whether the Chronos node with the stuck jobs is the leader?

subratbasnet commented 8 years ago

I had to move to the Singularity Mesos framework because I could not solve this issue myself. Hope others have more luck!

dlsuzuki commented 8 years ago

I think I've encountered this issue a couple of times. Thankfully it doesn't seem to happen during normal operations, but if a job fails and I try to resubmit it for execution, then sometimes it won't get picked up by Mesos. Restarting Chronos fixed the problem.

ghost commented 8 years ago

Is there any workaround available? I have set up my mesos-master and Chronos on Ubuntu 14.04. Jobs are stuck in the queued state forever. I tried restarting Chronos and it didn't help either.

planenutz commented 8 years ago

Sounds like you don't have any slave nodes to service the job. That will definitely cause jobs to be stuck in the queue.

ghost commented 8 years ago

No. Even with slaves available, jobs are not getting executed.



debugger87 commented 8 years ago

I'm seeing the same problem and cannot reproduce it.

dlsuzuki commented 8 years ago

In my Mesos/Chronos clusters, I'm seeing this sort of behavior whenever something causes the Mesos masters to restart. The Chronos framework stops processing any jobs (they stay Pending) and the framework re-registers about once per minute. The Mesos master logs something like the following right before the re-registration:

E0714 12:52:30.188627 2589 process.cpp:1958] Failed to shutdown socket with fd 35: Transport endpoint is not connected

It doesn't look like I need to shut down the entire Mesos master cluster (or even break quorum) for this to happen. I just tried manually stopping the primary master of a three-node cluster, and I immediately started seeing the endpoint errors on the new primary.

dlsuzuki commented 8 years ago

Oh, looks like my problem is tied to https://github.com/mesos/chronos/issues/480. So I guess I'm waiting for 2.5.0.

dlsuzuki commented 8 years ago

I noticed that this problem can happen even without the primary master going down, if the load on its host gets high enough. As a workaround, I have put together a toolset by which Nagios probes the cluster to determine whether the Chronos framework has re-registered in the past 70 seconds. If that state persists for four minutes, Nagios fires off an Ansible playbook that restarts all of the Chronos nodes. Seems to work well in testing, and at least we'll know when the problem crops up.
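
To make that check concrete, here is a small sketch of just the decision logic (the Nagios probing and Ansible restart are not shown, and all names are hypothetical): treat the framework as flapping if it has re-registered within the last 70 seconds, and trigger a restart once that has been true continuously for four minutes, mirroring the thresholds in the comment above.

import java.time.{Duration, Instant}

// Hypothetical decision logic for a "Chronos framework is flapping" check.
object FlappingCheck {
  private val flapWindow   = Duration.ofSeconds(70)  // re-registered within the last 70s => flapping
  private val restartAfter = Duration.ofMinutes(4)   // flapping continuously for 4 minutes => restart

  private var flappingSince: Option[Instant] = None

  /** Returns true when the monitoring system should fire the restart playbook. */
  def shouldRestart(lastReregistered: Instant, now: Instant = Instant.now()): Boolean = {
    val flapping = Duration.between(lastReregistered, now).compareTo(flapWindow) < 0
    flappingSince = if (flapping) flappingSince.orElse(Some(now)) else None
    flappingSince.exists(since => Duration.between(since, now).compareTo(restartAfter) >= 0)
  }
}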

dlsuzuki commented 7 years ago

I'm doing some initial testing with Mesos 1.0.1 and Chronos 3.0.1: three nodes, each with a mesos-master and a Chronos instance (the latter in a Docker container). It looks like this particular problem behaves the same as it did with Chronos 2.4.0. If the original Chronos leader goes down, one of the others takes over, but the framework becomes inactive. Only when the original leader becomes leader again does the framework reactivate. To change which leader actually works, it seems that I have to restart the Mesos masters.

eyalzek commented 7 years ago

Looks like I'm hitting the same issue, running chronos 3.0.1 and mesos 1.1.0...

(screenshot attached)

code-haven commented 7 years ago

Had a similar issue with 2.4.0. New tasks were stuck in the 'queued' state. After spending some time debugging, our team found that we were passing 'runAsUser = null' while creating the Chronos job; this field tells Chronos which user to run the command as.

When we created a new job with the same payload but with 'runAsUser = root', the tasks got scheduled.
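
For anyone who wants to check this against their own jobs, here is a rough sketch of submitting a job definition with runAsUser set explicitly rather than null. The endpoint path and field names follow the Chronos 2.4.x REST docs as I understand them (newer releases may prefix the path with /v1), and the host, job name, owner, and schedule are placeholders; verify everything against your version's job schema.

import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets

// Hypothetical example: submit a scheduled job with runAsUser set explicitly.
object SubmitJobWithRunAsUser {
  def main(args: Array[String]): Unit = {
    val chronosUrl = "http://chronos.example.com:4400/scheduler/iso8601"  // placeholder host
    val jobJson =
      """{
        |  "name": "example-job",
        |  "command": "echo hello",
        |  "schedule": "R/2016-01-01T00:00:00Z/PT1H",
        |  "owner": "ops@example.com",
        |  "runAsUser": "root"
        |}""".stripMargin

    val conn = new URL(chronosUrl).openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setRequestProperty("Content-Type", "application/json")
    conn.setDoOutput(true)
    val out = conn.getOutputStream
    out.write(jobJson.getBytes(StandardCharsets.UTF_8))
    out.close()
    println(s"Chronos responded with HTTP ${conn.getResponseCode}")  // a 2xx response means the job was accepted
  }
}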

eyalzek commented 7 years ago

This happened again, and I just noticed that in the mesos-master UI I didn't have any idle resources, only offered ones. I wonder whether this is a symptom or the cause. Since we have two registered frameworks in Mesos, this can go either way: either all the resources were offered to the other framework because Chronos was in a bad state, or Chronos got into a bad state because Mesos wasn't offering it any resources...