mesos / chronos

Fault tolerant job scheduler for Mesos which handles dependencies and ISO8601 based schedules
http://mesos.github.io/chronos/
Apache License 2.0
4.39k stars 529 forks source link

unable to run job on cluster but ok using local_scheduler_nozk.yml #30

Closed bentrigat closed 11 years ago

bentrigat commented 11 years ago

hello , am able to run 1 simple job (simple ls) when i use the provided local_scheduler_nozk.yml but when i switch to a 2 nodes cluster config (1 node with mesos master+chronos+zookeeper and 1 node as a simple mesos slave) the job is registered but never run. no trace of running in the logs (although using TRACE level) and no output for my job on any of the 2 nodes cluster , the log dont show launching of the job nor TASK_RUNNING or TASK_FAILURE i expected either. see gist for details about the job 'test7'. note i see the job in the chronos ui and it does not show either success nor error. also i made sure to use UTC timestamp ;) i also see the mesos master and slave in the mesos ui. /tmp/mesos/slave on the mesos-master/chronos server is empty the mesos tests at https://github.com/apache/mesos/blob/trunk/docs/Home.md all pass (java/python/c++). tried to submit from both chronos web ui and the chronos/curl cli to no avail.

please see https://gist.github.com/bentrigat/5503859

bentrigat commented 11 years ago

submitting similar job manually:

curl -L -X GET localhost:4400/scheduler/jobs [{"schedule":"R/2013-05-02T22:52:30Z/PT24H","name":"testjob14","command":"ls -aRtl /tmp >> /tmp/ben/f","epsilon":"PT15M","successCount":0,"errorCount":0,"executor":"","executorFlags":"","retries":2,"owner":"ben@toto.com","lastSuccess":"","lastError":"","async":false}]

curl -L -X PUT localhost:4400/scheduler/job/testjob14

generates this in the log :

INFO [2013-05-02 22:54:43,543] com.airbnb.scheduler.api.JobManagementResource: Manually triggering job:testjob14 DEBUG [2013-05-02 22:54:43,543] com.airbnb.scheduler.jobs.TaskManager: Adding task 'ct:testjob14:1367535283543:0' to queue 127.0.0.1 - - [02/May/2013:22:54:43 +0000] "PUT /scheduler/job/testjob14 HTTP/1.1" 204 0 3 3 DEBUG [2013-05-02 22:54:43,544] org.eclipse.jetty.server.Server: RESPONSE /scheduler/job/testjob14 204 handled=true DEBUG [2013-05-02 22:54:43,547] org.eclipse.jetty.io.nio.ChannelEndPoint: ishut BCEP@138f4128{l(/127.0.0.1:50128)<->r(/127.0.0.1:4400),open=true,ishut=false,oshut=false}-{BlockingHttpConnection@76949cfe,g=HttpGenerator{s=0,h=-1,b=-1,c=-1},p=HttpParser{s=-14,l=0,c=-3},r=1} DEBUG [2013-05-02 22:54:43,547] org.eclipse.jetty.http.HttpParser: filled -1/0 DEBUG [2013-05-02 22:54:43,547] org.eclipse.jetty.io.nio.ChannelEndPoint: close BCEP@138f4128{l(/127.0.0.1:50128)<->r(/127.0.0.1:4400),open=true,ishut=true,oshut=false}-{BlockingHttpConnection@76949cfe,g=HttpGenerator{s=0,h=-1,b=-1,c=-1},p=HttpParser{s=0,l=0,c=-3},r=1} DEBUG [2013-05-02 22:54:43,555] org.eclipse.jetty.server.AbstractHttpConnection: closed BlockingHttpConnection@76949cfe,g=HttpGenerator{s=0,h=-1,b=-1,c=-1},p=HttpParser{s=0,l=0,c=-3},r=1 DEBUG [2013-05-02 22:54:44,523] org.apache.zookeeper.ClientCnxn: Got ping response for sessionid: 0x13e674791160002 after 0ms DEBUG [2013-05-02 22:54:46,192] org.apache.zookeeper.ClientCnxn: Got ping response for sessionid: 0x13e674791160002 after 0ms DEBUG [2013-05-02 22:54:47,860] org.apache.zookeeper.ClientCnxn: Got ping response for sessionid: 0x13e674791160002 after 0ms DEBUG [2013-05-02 22:54:49,527] org.apache.zookeeper.ClientCnxn: Got ping response for sessionid: 0x13e674791160002 after 0ms DEBUG [2013-05-02 22:54:51,196] org.apache.zookeeper.ClientCnxn: Got ping response for sessionid: 0x13e674791160002 after 0ms DEBUG [2013-05-02 22:54:52,862] org.apache.zookeeper.ClientCnxn: Got ping response for sessionid: 0x13e674791160002 after 0ms DEBUG [2013-05-02 22:54:53,337] org.eclipse.jetty.http.HttpParser: filled 1262/1262 DEBUG [2013-05-02 22:54:53,338] org.eclipse.jetty.server.Server: REQUEST /scheduler/jobs on BlockingHttpConnection@55eb75d3,g=HttpGenerator{s=0,h=-1,b=-1,c=-1},p=HttpParser{s=-5,l=617,c=0},r=46 DEBUG [2013-05-02 22:54:53,338] org.eclipse.jetty.server.handler.ContextHandler: scope null||/scheduler/jobs @ o.e.j.s.ServletContextHandler{/,null} DEBUG [2013-05-02 22:54:53,338] org.eclipse.jetty.server.handler.ContextHandler: scope null||/scheduler/jobs @ o.e.j.s.ServletContextHandler{/,null} DEBUG [2013-05-02 22:54:53,338] org.eclipse.jetty.server.handler.ContextHandler: context=||/scheduler/jobs @ o.e.j.s.ServletContextHandler{/,null} DEBUG [2013-05-02 22:54:53,338] org.eclipse.jetty.servlet.ServletHandler: servlet |/scheduler|/jobs -> com.sun.jersey.spi.container.servlet.ServletContainer-723468318 DEBUG [2013-05-02 22:54:53,338] org.eclipse.jetty.servlet.ServletHandler: chain=com.yammer.dropwizard.servlets.ThreadNameFilter-664071981->com.airbnb.scheduler.api.RedirectFilter-1583883651->com.sun.jersey.spi.container.servlet.ServletContainer-723468318 DEBUG [2013-05-02 22:54:53,338] org.eclipse.jetty.servlet.ServletHandler: call filter com.yammer.dropwizard.servlets.ThreadNameFilter-664071981 DEBUG [2013-05-02 22:54:53,338] org.eclipse.jetty.servlet.ServletHandler: call filter com.airbnb.scheduler.api.RedirectFilter-1583883651 DEBUG [2013-05-02 22:54:53,338] org.eclipse.jetty.servlet.ServletHandler: call servlet com.sun.jersey.spi.container.servlet.ServletContainer-723468318 127.0.0.1 - - [02/May/2013:22:54:53 +0000] "GET /scheduler/jobs HTTP/1.1" 200 218 5 5 DEBUG [2013-05-02 22:54:53,342] org.eclipse.jetty.server.Server: RESPONSE /scheduler/jobs 200 handled=true DEBUG [2013-05-02 22:54:53,413] org.eclipse.jetty.http.HttpParser: filled 1244/1244 DEBUG [2013-05-02 22:54:53,413] org.eclipse.jetty.server.Server: REQUEST /scheduler/graph/csv on BlockingHttpConnection@4db991a8,g=HttpGenerator{s=0,h=-1,b=-1,c=-1},p=HttpParser{s=-5,l=617,c=0},r=144 DEBUG [2013-05-02 22:54:53,413] org.eclipse.jetty.server.handler.ContextHandler: scope null||/scheduler/graph/csv @ o.e.j.s.ServletContextHandler{/,null} DEBUG [2013-05-02 22:54:53,413] org.eclipse.jetty.server.handler.ContextHandler: scope null||/scheduler/graph/csv @ o.e.j.s.ServletContextHandler{/,null} DEBUG [2013-05-02 22:54:53,413] org.eclipse.jetty.server.handler.ContextHandler: context=||/scheduler/graph/csv @ o.e.j.s.ServletContextHandler{/,null} DEBUG [2013-05-02 22:54:53,413] org.eclipse.jetty.servlet.ServletHandler: servlet |/scheduler|/graph/csv -> com.sun.jersey.spi.container.servlet.ServletContainer-723468318 DEBUG [2013-05-02 22:54:53,413] org.eclipse.jetty.servlet.ServletHandler: chain=com.yammer.dropwizard.servlets.ThreadNameFilter-664071981->com.airbnb.scheduler.api.RedirectFilter-1583883651->com.sun.jersey.spi.container.servlet.ServletContainer-723468318 DEBUG [2013-05-02 22:54:53,413] org.eclipse.jetty.servlet.ServletHandler: call filter com.yammer.dropwizard.servlets.ThreadNameFilter-664071981 DEBUG [2013-05-02 22:54:53,414] org.eclipse.jetty.servlet.ServletHandler: call filter com.airbnb.scheduler.api.RedirectFilter-1583883651 DEBUG [2013-05-02 22:54:53,414] org.eclipse.jetty.servlet.ServletHandler: call servlet com.sun.jersey.spi.container.servlet.ServletContainer-723468318 127.0.0.1 - - [02/May/2013:22:54:53 +0000] "GET /scheduler/graph/csv HTTP/1.1" 200 41 3 3 DEBUG [2013-05-02 22:54:53,417] org.eclipse.jetty.server.Server: RESPONSE /scheduler/graph/csv 200 handled=true DEBUG [2013-05-02 22:54:54,530] org.apache.zookeeper.ClientCnxn: Got ping response for sessionid: 0x13e674791160002 after 0ms DEBUG [2013-05-02 22:54:56,197] org.apache.zookeeper.ClientCnxn: Got ping response for sessionid: 0x13e674791160002 after 0ms DEBUG [2013-05-02 22:54:57,866] org.apache.zookeeper.ClientCnxn: Got ping response for sessionid: 0x13e674791160002 after 0ms DEBUG [2013-05-02 22:54:59,533] org.apache.zookeeper.ClientCnxn: Got ping response for sessionid: 0x13e674791160002 after 0ms DEBUG [2013-05-02 22:55:01,202] org.apache.zookeeper.ClientCnxn: Got ping response for sessionid: 0x13e674791160002 after 0ms DEBUG [2013-05-02 22:55:02,869] org.apache.zookeeper.ClientCnxn: Got ping response for sessionid: 0x13e674791160002 after 0ms DEBUG [2013-05-02 22:55:03,339] org.eclipse.jetty.http.HttpParser: filled 1262/1262 DEBUG [2013-05-02 22:55:03,340] org.eclipse.jetty.server.Server: REQUEST /scheduler/jobs on BlockingHttpConnection@4db991a8,g=HttpGenerator{s=0,h=-1,b=-1,c=-1},p=HttpParser{s=-5,l=617,c=0},r=145 DEBUG [2013-05-02 22:55:03,340] org.eclipse.jetty.server.handler.ContextHandler: scope null||/scheduler/jobs @ o.e.j.s.ServletContextHandler{/,null} DEBUG [2013-05-02 22:55:03,340] org.eclipse.jetty.server.handler.ContextHandler: scope null||/scheduler/jobs @ o.e.j.s.ServletContextHandler{/,null} DEBUG [2013-05-02 22:55:03,340] org.eclipse.jetty.server.handler.ContextHandler: context=||/scheduler/jobs @ o.e.j.s.ServletContextHandler{/,null} DEBUG [2013-05-02 22:55:03,341] org.eclipse.jetty.servlet.ServletHandler: servlet |/scheduler|/jobs -> com.sun.jersey.spi.container.servlet.ServletContainer-723468318 DEBUG [2013-05-02 22:55:03,341] org.eclipse.jetty.servlet.ServletHandler: chain=com.yammer.dropwizard.servlets.ThreadNameFilter-664071981->com.airbnb.scheduler.api.RedirectFilter-1583883651->com.sun.jersey.spi.container.servlet.ServletContainer-723468318 DEBUG [2013-05-02 22:55:03,341] org.eclipse.jetty.servlet.ServletHandler: call filter com.yammer.dropwizard.servlets.ThreadNameFilter-664071981 DEBUG [2013-05-02 22:55:03,341] org.eclipse.jetty.servlet.ServletHandler: call filter com.airbnb.scheduler.api.RedirectFilter-1583883651 DEBUG [2013-05-02 22:55:03,341] org.eclipse.jetty.servlet.ServletHandler: call servlet com.sun.jersey.spi.container.servlet.ServletContainer-723468318 127.0.0.1 - - [02/May/2013:22:55:03 +0000] "GET /scheduler/jobs HTTP/1.1" 200 218 6 6 DEBUG [2013-05-02 22:55:03,345] org.eclipse.jetty.server.Server: RESPONSE /scheduler/jobs 200 handled=true DEBUG [2013-05-02 22:55:03,417] org.eclipse.jetty.http.HttpParser: filled 1244/1244 DEBUG [2013-05-02 22:55:03,417] org.eclipse.jetty.server.Server: REQUEST /scheduler/graph/csv on BlockingHttpConnection@55eb75d3,g=HttpGenerator{s=0,h=-1,b=-1,c=-1},p=HttpParser{s=-5,l=617,c=0},r=47 DEBUG [2013-05-02 22:55:03,418] org.eclipse.jetty.server.handler.ContextHandler: scope null||/scheduler/graph/csv @ o.e.j.s.ServletContextHandler{/,null} DEBUG [2013-05-02 22:55:03,418] org.eclipse.jetty.server.handler.ContextHandler: scope null||/scheduler/graph/csv @ o.e.j.s.ServletContextHandler{/,null} DEBUG [2013-05-02 22:55:03,418] org.eclipse.jetty.server.handler.ContextHandler: context=||/scheduler/graph/csv @ o.e.j.s.ServletContextHandler{/,null} DEBUG [2013-05-02 22:55:03,418] org.eclipse.jetty.servlet.ServletHandler: servlet |/scheduler|/graph/csv -> com.sun.jersey.spi.container.servlet.ServletContainer-723468318 DEBUG [2013-05-02 22:55:03,418] org.eclipse.jetty.servlet.ServletHandler: chain=com.yammer.dropwizard.servlets.ThreadNameFilter-664071981->com.airbnb.scheduler.api.RedirectFilter-1583883651->com.sun.jersey.spi.container.servlet.ServletContainer-723468318 DEBUG [2013-05-02 22:55:03,418] org.eclipse.jetty.servlet.ServletHandler: call filter com.yammer.dropwizard.servlets.ThreadNameFilter-664071981 DEBUG [2013-05-02 22:55:03,418] org.eclipse.jetty.servlet.ServletHandler: call filter com.airbnb.scheduler.api.RedirectFilter-1583883651 DEBUG [2013-05-02 22:55:03,418] org.eclipse.jetty.servlet.ServletHandler: call servlet com.sun.jersey.spi.container.servlet.ServletContainer-723468318 127.0.0.1 - - [02/May/2013:22:55:03 +0000] "GET /scheduler/graph/csv HTTP/1.1" 200 41 3 3 DEBUG [2013-05-02 22:55:03,420] org.eclipse.jetty.server.Server: RESPONSE /scheduler/graph/csv 200 handled=true DEBUG [2013-05-02 22:55:04,538] org.apache.zookeeper.ClientCnxn: Got ping response for sessionid: 0x13e674791160002 after 0ms DEBUG [2013-05-02 22:55:06,206] org.apache.zookeeper.ClientCnxn: Got ping response for sessionid: 0x13e674791160002 after 0ms DEBUG [2013-05-02 22:55:07,874] org.apache.zookeeper.ClientCnxn: Got ping response for sessionid: 0x13e674791160002 after 0ms DEBUG [2013-05-02 22:55:09,541] org.apache.zookeeper.ClientCnxn: Got ping response for sessionid: 0x13e674791160002 after 0ms DEBUG [2013-05-02 22:55:11,209] org.apache.zookeeper.ClientCnxn: Got ping response for sessionid: 0x13e674791160002 after 0ms

bentrigat commented 11 years ago

anybody had similar issue ? how else would i troubleshoot this ?

andykram commented 11 years ago

@bentrigat: How do you start chronos on your cluster? What parameters do you pass it? Do you include the dw.hostname parameter, similar to the example run script?

Additionally, is your provided chronos conf the one you attempt to boot your cluster with? If so, the zookeeperServers parameter is incorrect for a cluster configuration, as they'd each be referencing a different, embedded zookeeper.

bentrigat commented 11 years ago

thx @andykram .

i start the chronos with :+1: MESOS_LAUNCHER_DIR=/toto/home/bensix/FromFtp/repo/mesos/build/src/ MESOS_KILLTREE=/toto/home/bensix/FromFtp/repo/mesos/src/scripts/killtree.sh MESOS_NATIVE_LIBRARY=/toto/mesos6/lib/libmesos.so MESOS_NATIVE_LIBRARY=/toto/mesos6/lib/libmesos.so JAVA_HOME=/usr/lib/jvm/java-7-oracle java -cp /toto/home/bensix/FromFtp/repo/chronos/target/chronos-1.0-SNAPSHOT.jar com.airbnb.scheduler.Main server /toto/home/bensix/FromFtp/repo/chronos/config/toto_cluster_scheduler.yml > /tmp/ben/l 2>&1 &

what's dw.hostname for ?

yes the chronos conf i mentioned is the one that fails . If we should not use zookeeperServers when when we use another zk server than the embedded one , then what's the proper parameter and syntax for such , please ?

andykram commented 11 years ago

@bentrigat On second glance, it appears that chronos is not receiving any receive offers from mesos. Are you running a mesos-master and a mesos-slave connected to the same zookeeper?

bentrigat commented 11 years ago

yes i am , that was my goal.

bentrigat commented 11 years ago

so here is my current config : just 2 servers now:

on the zookeeper id 1 + mesos master + chonos / 64.106.186.37 zookeeper conf snippet:

server.1=64.106.186.37:2888:3888 server.2=64.106.186.58:2888:3888

chronos config snippet :

scheduleHorizonSeconds: 60 master: zk://64.106.186.37:2181/mesos zookeeperServers: 64.106.186.37:2181 zookeeperStateZnode: "/airbnb/service/chronos/state" zookeeperLeaderZnode: "/airbnb/service/chronos/leader" zookeeperCandidateZnode: "/airbnb/service/chronos/candidate"

mesos conf is : zk=zk://64.106.186.37:2181/toto master=64.106.186.37:5050

on the zookeeper id 2 + mesos slave / 64.106.186.58 mesos conf is :

zk=zk://64.106.186.37:2181/toto master=64.106.186.37:5050

both have :

cat mesos/deploy/masters 64.106.186.37

cat mesos/deploy/slaves 64.106.186.58

andykram commented 11 years ago

Ah, I see what's going on here.

In your chronos configuration file you specify:

master: zk://64.106.186.37:2181/mesos

Which tells chronos that mesos will live in /mesos. However, in your mesos configuration files you specify:

zk=zk://64.106.186.37:2181/toto

which will tell mesos live in /toto. As such, mesos will never see chronos and vice versa.

To fix this, either:

or

Additionally, it looks like you're only specifying one zookeeper server to both chronos and mesos. This is sufficient to work, but not a recommended configuration. To specify multiple zookeepers, use the following syntax for both chronos's master parameter and mesos's zk parameter:

zk://64.106.186.37:2181,64.106.186.58:2181/PATH

and for chronos's zookeeperServers parameter use:

64.106.186.37:2181,64.106.186.58:2181
bentrigat commented 11 years ago

ok that work snow. thanks ! can i point out that it would be nice to have some kind of notification when the job is not scheduled ? i think whatever the reason is the user would like to be aware the job wont run because it cant be scheduled. thanks again,

andykram commented 11 years ago

@bentrigat glad to hear your issue has been resolved! We're looking into adding reporting/metrics for cases such as not receiving resource offers, the task queue growing too large/quickly, etc.