mesos / chronos

Fault tolerant job scheduler for Mesos which handles dependencies and ISO8601 based schedules
http://mesos.github.io/chronos/
Apache License 2.0

Chronos Arithmetic Exception /Zero #614

Open Saurabh2004in opened 8 years ago

Saurabh2004in commented 8 years ago

Hi,

I am getting the exception below and would like to know what is causing it.

[2016-01-07 13:48:02,153] INFO Loading jobs (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:601)
[2016-01-07 13:48:02,240] INFO Registering jobs:55 (org.apache.mesos.chronos.scheduler.jobs.JobUtils$:74)
[2016-01-07 13:48:02,259] ERROR Loading tasks or jobs failed. Exiting. (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:605)
java.lang.ArithmeticException: / by zero
    at org.apache.mesos.chronos.scheduler.jobs.JobUtils$.calculateSkips(JobUtils.scala:157)
    at org.apache.mesos.chronos.scheduler.jobs.JobUtils$.skipForward(JobUtils.scala:119)
    at org.apache.mesos.chronos.scheduler.jobs.JobUtils$.makeScheduleStream(JobUtils.scala:107)
    at org.apache.mesos.chronos.scheduler.jobs.JobScheduler$$anonfun$6.apply(JobScheduler.scala:146)
    at org.apache.mesos.chronos.scheduler.jobs.JobScheduler$$anonfun$6.apply(JobScheduler.scala:146)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
    at scala.collection.mutable.ListBuffer.foreach(ListBuffer.scala:45)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
    at scala.collection.AbstractTraversable.map(Traversable.scala:104)
    at org.apache.mesos.chronos.scheduler.jobs.JobScheduler.registerJob(JobScheduler.scala:146)
    at org.apache.mesos.chronos.scheduler.jobs.JobUtils$.loadJobs(JobUtils.scala:75)
    at org.apache.mesos.chronos.scheduler.jobs.JobScheduler.liftedTree1$1(JobScheduler.scala:602)
    at org.apache.mesos.chronos.scheduler.jobs.JobScheduler.onElected(JobScheduler.scala:597)
    at org.apache.mesos.chronos.scheduler.jobs.JobScheduler$$anon$3.isLeader(JobScheduler.scala:568)
    at org.apache.curator.framework.recipes.leader.LeaderLatch$9.apply(LeaderLatch.java:644)
    at org.apache.curator.framework.recipes.leader.LeaderLatch$9.apply(LeaderLatch.java:640)
    at org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:92)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

It looks like calculateSkips in JobUtils.scala is throwing the exception. I just want to find out whether this is a Chronos bug or whether something in the cron expression is causing it.


jordmoz commented 8 years ago

I'm seeing this as well.

xargstop commented 8 years ago

I hit this issue too, and none of the Chronos instances can restart.

The reason is that Chronos allows a job to be created with a run interval of 0, e.g.

"schedule":"R0/2015-08-28T14:04:54.000+0800/PT0M"

The exception is only triggered later, when the jobs are reloaded from ZooKeeper, for example on restart.

I deleted the jobs with that configuration and Chronos restarted successfully.
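
The failure mode described above (a zero-length interval being accepted at creation time and only failing on reload) could be avoided by rejecting such schedules up front. Below is a minimal sketch in Python, not Chronos code: `period_seconds` and `validate_schedule` are hypothetical helpers, and the parser assumes the interval uses only the `PTnHnMnS` form seen in this thread.

```python
import re

# Hypothetical helper: handles only the time part of an ISO 8601 duration
# (PTnHnMnS), which covers the schedules quoted in this thread.
_PERIOD = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+(?:\.\d+)?)S)?$")


def period_seconds(period: str) -> float:
    """Convert a PTnHnMnS duration string to seconds."""
    match = _PERIOD.match(period)
    if match is None or not any(match.groups()):
        raise ValueError(f"unsupported or empty period: {period!r}")
    hours, minutes, seconds = (float(g) if g else 0.0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def validate_schedule(schedule: str) -> None:
    """Raise ValueError for a schedule whose interval is zero or malformed."""
    parts = schedule.split("/")
    if len(parts) != 3:
        raise ValueError(f"expected R<n>/<start>/<period>, got {schedule!r}")
    _repeat, _start, period = parts
    if period_seconds(period) == 0:
        raise ValueError(f"zero-length period in schedule: {schedule!r}")


if __name__ == "__main__":
    for candidate in ("R/2016-01-07T13:48:00Z/PT5M",
                      "R0/2015-08-28T14:04:54.000+0800/PT0M"):
        try:
            validate_schedule(candidate)
            print("accepted:", candidate)
        except ValueError as err:
            print("rejected:", err)
```

A check of this kind would catch the `PT0M` schedule before it is persisted, so a restart never has to deal with it.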

xtazz commented 8 years ago

@gongaiguo how do you delete these jobs when Chronos is not running?

Saurabh2004in commented 8 years ago

I set the else branch to zero; we don't need to skip any time if the interval is zero.
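
The guard described above can be sketched as follows. This is an illustration in Python rather than the actual Scala in JobUtils.scala, and the function name and signature are assumptions: the skip count is the number of whole periods between the schedule start and now, and a zero period short-circuits to 0 instead of dividing.

```python
from datetime import datetime, timedelta, timezone


def calculate_skips(now: datetime, start: datetime, period: timedelta) -> int:
    """Whole periods elapsed between start and now; 0 if nothing can be skipped."""
    # Guard: a zero (or negative) period would otherwise divide by zero.
    if period <= timedelta(0) or now <= start:
        return 0
    # timedelta // timedelta yields an int (floor division).
    return (now - start) // period


if __name__ == "__main__":
    start = datetime(2016, 1, 7, 13, 0, tzinfo=timezone.utc)
    now = start + timedelta(minutes=25)
    print(calculate_skips(now, start, timedelta(minutes=10)))
    print(calculate_skips(now, start, timedelta(0)))
```

Returning 0 avoids the exception, but the schedule still never advances, so a guard like this does not by itself make a zero-interval job meaningful.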

Saurabh2004in commented 8 years ago

The same patch is applied in #692.

xargstop commented 8 years ago

@xtazz I deleted them from zookeeper.

bfoussier commented 8 years ago

Hi,

I ran into this problem while doing HA tests. When Chronos restarts, it reloads the jobs stored in ZooKeeper and fails. The job was { "schedule": "R//P", "name": "create-volume-flocker-demo", "command"...}.

I applied the fix proposed at https://github.com/mesos/chronos/pull/692 and now Chronos loops infinitely:

[2016-07-20 09:21:49,968] INFO Calling next for stream: R/2016-07-18T09:38:44.236Z/PT0S, jobname: create-volume-flocker-demo (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:509)
[2016-07-20 09:21:49,968] INFO JobNotificationObserver does not handle JobSkipped(ScheduleBasedJob(R/2016-07-18T09:38:44.236Z/PT0S,create-volume-flocker-demo,docker volume create -d flocker --name apache_vol_2_staging -o size=45GB,PT60S,0,0,,,,2,,,,,,false,0.1,256.0,128.0,false,0,ListBuffer(),ListBuffer(),false,root,null,,ListBuffer(),true,ListBuffer(),false,false,ListBuffer()),2016-07-18T09:38:44.236Z) (org.apache.mesos.chronos.scheduler.jobs.JobsObserver$:27)
[2016-07-20 09:21:49,968] INFO JobStats does not handle JobSkipped(ScheduleBasedJob(R/2016-07-18T09:38:44.236Z/PT0S,create-volume-flocker-demo,docker volume create -d flocker --name apache_vol_2_staging -o size=45GB,PT60S,0,0,,,,2,,,,,,false,0.1,256.0,128.0,false,0,ListBuffer(),ListBuffer(),false,root,null,,ListBuffer(),true,ListBuffer(),false,false,ListBuffer()),2016-07-18T09:38:44.236Z) (org.apache.mesos.chronos.scheduler.jobs.JobsObserver$:27)
[2016-07-20 09:21:49,968] INFO tail: R/2016-07-18T09:38:44.236Z/PT0S now: 2016-07-20T09:21:48.145Z (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:563)

and then it starts over for the same job:

[2016-07-20 09:21:49,968] INFO Calling next for stream: R/2016-07-18T09:38:44.236Z/PT0S, jobname: create-volume-flocker-demo (org.apache.mesos.chronos.scheduler.jobs.JobScheduler:509)

Do I need another fix? Does the fix proposed at https://github.com/mesos/chronos/pull/692 prevent corrupted data from being stored in ZooKeeper? Is my data in ZooKeeper corrupted, and should I erase it?
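
One plausible reading of the log above, offered as an assumption rather than a confirmed diagnosis: with a `PT0S` period, advancing the schedule by one period leaves the start time unchanged, so a start time in the past can never catch up with the current time and the same job is skipped again and again. The Python sketch below illustrates only that arithmetic; `steps_to_catch_up` is a hypothetical function with an artificial step cap, not part of Chronos.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def steps_to_catch_up(start: datetime, now: datetime, period: timedelta,
                      max_steps: int = 1000) -> Optional[int]:
    """Count period-sized steps until start passes now; None if the cap is hit."""
    current = start
    for step in range(max_steps):
        if current > now:
            return step
        current += period  # with a zero period this never moves forward
    return None


if __name__ == "__main__":
    # Timestamps taken from the log lines quoted above (milliseconds dropped).
    start = datetime(2016, 7, 18, 9, 38, 44, tzinfo=timezone.utc)
    now = datetime(2016, 7, 20, 9, 21, 48, tzinfo=timezone.utc)
    print(steps_to_catch_up(start, now, timedelta(days=1)))
    print(steps_to_catch_up(start, now, timedelta(0)))
```

If that reading is right, guarding the division alone would not be enough, and the zero-interval job would still have to be removed or corrected.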