mesos / chronos

Fault tolerant job scheduler for Mesos which handles dependencies and ISO8601 based schedules
http://mesos.github.io/chronos/
Apache License 2.0

chronos getting stuck #417

Open vikhyath opened 9 years ago

vikhyath commented 9 years ago

Hi,

We are trying to start using Chronos in our production environment, but while testing it, every so often Chronos stops scheduling jobs on Mesos. Our environment is CentOS 6.4 running Linux 2.6.32-358.el6.x86_64. The logs don't point to anything obvious, and both the Mesos and Chronos UIs are still up. Any pointers?

Thanks guys!

vikhyath commented 9 years ago

Here is some info from the zookeeper log which seems a little off:

2015-04-01 20:21:15,997 [myid:] - INFO [ProcessThread(sid:0 cport:-1)::PrepRequestProcessor@627] - Got user-level KeeperException when processing sessionid:0x14c76cd82100014 type:create cxid:0x551c6c28 zxid:0x1bff txntype:-1 reqpath:n/a Error Path:/chronos/state/state Error:KeeperErrorCode = NodeExists for /chronos/state/state

vikhyath commented 9 years ago

Sorry, I should have included this earlier.

java -Xmx512m -Djava.library.path=/usr/local/lib:/usr/lib64:/usr/lib -Djava.util.logging.SimpleFormatter.format=%2$s %5$s%6$s%n -cp /usr/bin/chronos org.apache.mesos.chronos.scheduler.Main --zk_hosts 10.18.179.212:2181 --master zk://10.18.179.212:2181/mesos --http_port 4400

java version "1.7.0_45" OpenJDK Runtime Environment (rhel-2.4.3.3.el6-x86_64 u45-b15) OpenJDK 64-Bit Server VM (build 24.45-b08, mixed mode)

elingg commented 9 years ago

Could you try removing the chronos state in zookeeper by running the zkCli with the command "rmr /chronos/state"?
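
(For anyone following along, roughly what that step looks like, using the ZooKeeper host/port from the command line above; the location of zkCli.sh varies by install. Note this wipes Chronos's stored state, so stop Chronos first.)

$ zkCli.sh -server 10.18.179.212:2181
[zk: 10.18.179.212:2181(CONNECTED) 0] rmr /chronos/state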

dangra commented 9 years ago

Same problem here. I tried removing /chronos/state and it is still stuck. It registered as a new framework because the frameworkId is stored under /chronos/state/state/frameworkId, so now I have two Chronos frameworks registered but only one actually running.

elingg commented 9 years ago

@dangra, could you try running curl -d "frameworkId=YOUR_FRAMEWORK_ID" -X POST http://YOUR_MESOS_URL:5050/master/shutdown to shut down your original Chronos framework?

dangra commented 9 years ago

The old framework is now listed as a terminated framework. Thanks.

$ curl http://head1:5050/master/shutdown -d frameworkId=20141202-035938-1405352852-5050-24734-0001 -XPOST -v
> POST /master/shutdown HTTP/1.1
> User-Agent: curl/7.35.0
> Host: head1:5050
> Accept: */*
> Content-Length: 54
> Content-Type: application/x-www-form-urlencoded
> 
* upload completely sent off: 54 out of 54 bytes
< HTTP/1.1 200 OK
< Date: Thu, 16 Apr 2015 20:30:30 GMT
< Content-Length: 0
< 
* Connection #0 to host head1 left intact

dangra commented 9 years ago

I can confirm it started working after resetting /chronos/state and removing the old Chronos frameworkId from Mesos.

elingg commented 9 years ago

Great, any objections to me closing this issue?

dangra commented 9 years ago

None from me. @vikhyath?

vikhyath commented 9 years ago

@elingg thanks! Does this make it a permanent solution, or will Chronos get stuck again after a while? Is there any insight you can give us as to why this might be happening?

dangra commented 9 years ago

It stopped working for me once I started the second Chronos master.

dangra commented 9 years ago

I have two servers running Chronos with the same ZooKeeper URI; both can connect to ZooKeeper and Mesos. The Chronos UI at port 4400 shows the same list of jobs on both, so I am pretty sure they are configured to use the same ZooKeeper server and Mesos master.

ii  chronos                                     2.3.2-0.1.20150207000917.u amd64                      Fault tolerant job scheduler for Mesos which handles dependencies and ISO8601 based schedule
ii  mesos                                       0.21.1-1.1.ubuntu1404      amd64                      Cluster resource manager with efficient resource isolation

vikhyath commented 9 years ago

@dangra so you ran into this stuck state only because you have two Chronos masters? My situation was that Chronos with a single master just stopped scheduling jobs on Mesos.

elingg commented 9 years ago

With two Chronos masters it makes sense that this could cause an issue; you would want to store your ZooKeeper state separately. With one, I'm not clear on how your ZK state got corrupted. Any idea what triggered this?

dangra commented 9 years ago

@elingg: not sure why I was under the impression that Chronos can run in HA mode with one master active and the other in standby; if that is not true, then I will look at running Chronos with Marathon instead.

elingg commented 9 years ago

@dangra you should be able to run two Chronos instances at once. The parameter you should specify is a different framework name for each Chronos instance, to make sure your state doesn't get corrupted; see https://github.com/mesos/chronos/blob/79a5459744ab1270d024944cbb4850f8ad30629e/src/main/scala/org/apache/mesos/chronos/scheduler/config/SchedulerConfiguration.scala#L100.
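
(Roughly, reusing vikhyath's launch command from earlier in the thread and adding the mesos_framework_name option from the linked SchedulerConfiguration; "chronos-secondary" is just a placeholder name.)

$ java -Xmx512m -Djava.library.path=/usr/local/lib:/usr/lib64:/usr/lib \
    -cp /usr/bin/chronos org.apache.mesos.chronos.scheduler.Main \
    --zk_hosts 10.18.179.212:2181 \
    --master zk://10.18.179.212:2181/mesos \
    --http_port 4400 \
    --mesos_framework_name chronos-secondary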

vikhyath commented 9 years ago

@elingg we run into this on a constant basis: one Chronos + one Mesos master + ZooKeeper on the same VM (and about 10 other VMs acting as Mesos slaves). Our environment is CentOS 6.4; it looks like many folks out there use Ubuntu.

dangra commented 9 years ago

I am chasing another lead here: even with one master it hangs, but always after one of my jobs fails. Our original two-master setup had been running for three months without issues, until recently we configured Chronos to send emails and one job started failing. Every time I restart Chronos it works until this job runs and fails. To be clear, the Mesos task for this job starts fine but the process returns non-zero. As additional info, I am using the Docker containerizer for all my Mesos tasks.
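
(For context, a stripped-down illustration of that kind of job, assuming Chronos's standard /scheduler/iso8601 endpoint; the job name, image, schedule, and owner below are placeholders, and the command simply exits non-zero to mimic the failure.)

$ curl -H 'Content-Type: application/json' -X POST \
    -d '{
          "name": "failing-job",
          "command": "exit 1",
          "schedule": "R/2015-04-20T00:00:00Z/PT1H",
          "epsilon": "PT30M",
          "owner": "alerts@example.com",
          "container": {"type": "DOCKER", "image": "ubuntu:14.04"}
        }' \
    http://localhost:4400/scheduler/iso8601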

vikhyath commented 9 years ago

Did you try exit(1) in the process when something goes wrong?

dangra commented 9 years ago

With 2 Chronos masters it makes sense that this could cause an issue. You would want to store your zookeeper state separately.

But using different paths per Chronos master to store state in ZooKeeper would defeat the point of running two Chronos masters for HA purposes, wouldn't it? They would register with different frameworkIds and be considered different Mesos frameworks too, so if one master goes down the other can't take over the former's jobs.

When I launch 2 masters using the same ZK uri (zk://host1,host2,host3/chronos) I can see both registering as "candidates":

[zk: localhost:2181(CONNECTED) 14] ls /chronos/state/candidate
[_c_1339d35c-93f5-4728-95ef-dc7e63b7798a-latch-0000000012, _c_3de858b3-f3de-436b-85ff-dd6ae339c1c7-latch-0000000011]
[zk: localhost:2181(CONNECTED) 15] get /chronos/state/candidate/_c_1339d35c-93f5-4728-95ef-dc7e63b7798a-latch-0000000012
head2.mydomain.com:4400
cZxid = 0x800045ea8
ctime = Fri Apr 17 19:36:21 CEST 2015
mZxid = 0x800045ea8
mtime = Fri Apr 17 19:36:21 CEST 2015
pZxid = 0x800045ea8
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x24cb997ff630052
dataLength = 31
numChildren = 0
[zk: localhost:2181(CONNECTED) 16] get /chronos/state/candidate/_c_3de858b3-f3de-436b-85ff-dd6ae339c1c7-latch-0000000011
head1.mydomain.com:4400
cZxid = 0x800033ae1
ctime = Fri Apr 17 04:38:33 CEST 2015
mZxid = 0x800033ae1
mtime = Fri Apr 17 04:38:33 CEST 2015
pZxid = 0x800033ae1
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x14cb998a1ef01d6
dataLength = 31
numChildren = 0

Is this setup possible, with the framework name being the only setting I need to set differently on each master, even though both masters reuse the frameworkId stored in ZooKeeper? I will try it, although it doesn't fix the original issue.

dangra commented 9 years ago

Did you try exit(1) in the process when something goes wrong?

@vikhyath Why should I try exit(1) if the process is already failing with non-zero?

vikhyath commented 9 years ago

Oops, read it as zero.

elingg commented 9 years ago

@dangra, you are correct that you need to use the same ZooKeeper path. I would recommend using different framework names. That was one possibility for the corrupted ZooKeeper state.

@vikhyath, there was an issue with Chronos jobs running Docker getting stuck in staging, which was recently fixed in Mesos. It seems like you have a different issue, but does this help at all: https://issues.apache.org/jira/browse/MESOS-2583? You might also want to try a newer version of Mesos that has this fix.

dangra commented 9 years ago

In my case Chronos hangs when it can't send emails on job failure. My SMTP server wasn't configured correctly: it was accepting Chronos's connection but rejecting mail for the domain I set in my job definition.

I think you can still consider it a bug in Chronos that other healthy jobs don't run when the SMTP server rejects its emails.
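
(If anyone else hits this, a rough way to check by hand whether the SMTP server accepts mail for the job owner's domain; hostnames and addresses below are placeholders, and depending on your nc version you may need to type the commands interactively instead of piping them. A rejection at the RCPT TO step is the kind of refusal described above.)

$ nc my-smtp-server 25 <<'EOF'
HELO chronos-host
MAIL FROM:<chronos@example.com>
RCPT TO:<alerts@example.com>
QUIT
EOF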

@elingg: thanks a lot for the hint on mesos_framework_name; I am using it now.