mesos / chronos

Fault tolerant job scheduler for Mesos which handles dependencies and ISO8601 based schedules
http://mesos.github.io/chronos/
Apache License 2.0

Multiple curl requests (to add jobs on Chronos) crashing Chronos server #253

Open panwaria opened 10 years ago

panwaria commented 10 years ago

I've written a bash script that issues a series of curl requests (~50) sequentially to create a Chronos job graph in one go. When I run the script, my Chronos server crashes with a core dump, so only some of the jobs get added. The jobs are added successfully if I add them individually.

Here is the failure stack trace:

[2014-09-05 15:21:36,095] INFO State J_chronos_job_34 does not exist yet. Adding to state (com.airbnb.scheduler.state.MesosStatePersistenceStore:146)
F0905 15:21:36.175230 27727 org_apache_mesos_state_AbstractState.cpp:319] Check failed: future->isReady()
* Check failure stack trace: *
    @ 0x7f4f1ecb199d google::LogMessage::Fail()
    @ 0x7f4f1ecb59b7 google::LogMessage::SendToLog()
    @ 0x7f4f1ecb3839 google::LogMessage::Flush()
    @ 0x7f4f1ecb3b3d google::LogMessageFatal::~LogMessageFatal()
    @ 0x7f4f1ec2ef90 Java_org_apache_mesos_state_AbstractState__1_1store_1get
    @ 0x7f4f18293d45 (unknown)
Aborted (core dumped)

Has anyone faced a similar issue? PS: I've also tried adding a 'sleep' between consecutive curl calls, but that didn't help.
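
For anyone trying to reproduce this, here is a minimal sketch of the kind of sequential submission described above, written in Python rather than bash so the request-building part can be checked without a running server. Everything specific in it is an assumption and not taken from the original script: the host and port, the job names (modeled on the `chronos_job_34` name in the log), the commands, the owner address, and the shape of the graph (one scheduled root followed by a chain of dependent jobs). The endpoint paths `/scheduler/iso8601` and `/scheduler/dependency` follow the Chronos REST API as commonly documented for that era; newer releases may use a different prefix.

```python
import json
import urllib.request

CHRONOS = "http://localhost:4400"  # assumed host and port; adjust for your setup


def build_job_graph(n):
    """Return (endpoint, payload) pairs: one scheduled root job, then a chain of dependents."""
    root = {
        "name": "chronos_job_0",
        "command": "echo root",  # placeholder command
        "schedule": "R/2014-09-05T16:00:00Z/PT1H",  # hypothetical schedule
        "epsilon": "PT15M",
        "owner": "owner@example.com",  # placeholder owner
    }
    requests = [("/scheduler/iso8601", root)]
    for i in range(1, n):
        job = {
            "name": f"chronos_job_{i}",
            "command": f"echo job {i}",
            "parents": [f"chronos_job_{i - 1}"],  # each job depends on the previous one
            "owner": "owner@example.com",
        }
        requests.append(("/scheduler/dependency", job))
    return requests


def post_job(endpoint, payload, base_url=CHRONOS):
    """POST one job definition as JSON and return the HTTP status code."""
    req = urllib.request.Request(
        base_url + endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status


def submit_all(requests, send=post_job):
    """Submit jobs one at a time, in order, so parents exist before their children."""
    return [send(endpoint, payload) for endpoint, payload in requests]
```

Calling `submit_all(build_job_graph(50))` would send the 50 requests back to back against a live server. The `send` parameter is only there so the ordering can be exercised with a stub instead of real HTTP calls.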

brndnmtthws commented 10 years ago

Interesting. Seems to be an issue with the state store. Is your ZK cluster healthy?

panwaria commented 10 years ago

Yes, the ZK cluster is absolutely healthy. I'm now posting jobs from the same machine that runs the Chronos server. One interesting thing I observed is that it works totally fine when I use 'localhost' instead of the IP address of the Chronos server.

chengweiv5 commented 10 years ago

This happened to me some time ago. I was also adding jobs in a while loop from a script, and those jobs had unlimited repetition and epsilon:PT15M. I didn't see this issue when the jobs were one-run jobs.
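
To make the two cases concrete, here is a rough sketch of how such job definitions might differ. It assumes the usual ISO 8601 repeating-interval form `R[n]/<start>/<interval>`, where leaving the count empty (`R/`) means unlimited repetition. Writing the one-run case as `R1` is an assumption about how those jobs were defined, and the helper names, start time, interval, command, and owner are all hypothetical.

```python
import json


def iso8601_schedule(start, interval, repetitions=None):
    """Build a repeating-interval string of the form R[n]/<start>/<interval>.

    repetitions=None leaves the count empty ("R/"), i.e. repeat without limit.
    """
    count = "" if repetitions is None else str(repetitions)
    return f"R{count}/{start}/{interval}"


def make_job(name, schedule, epsilon="PT15M"):
    """Return a job definition dict with the epsilon mentioned in this comment."""
    return {
        "name": name,
        "command": "echo hello",  # placeholder command
        "schedule": schedule,
        "epsilon": epsilon,
        "owner": "owner@example.com",  # placeholder owner
    }


start = "2014-09-05T16:00:00Z"  # hypothetical start time
unlimited = make_job("repeating_job", iso8601_schedule(start, "PT1H"))
one_run = make_job("one_run_job", iso8601_schedule(start, "PT1H", repetitions=1))

# JSON bodies as they might be sent to the scheduler
unlimited_body = json.dumps(unlimited)
one_run_body = json.dumps(one_run)
```

The only difference between the two payloads, besides the name, is the repetition count at the start of the `schedule` string.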

depay commented 10 years ago

In addition to what @chengweiv5 said, no error occurs if I disable the "persistJob/Task" procedure.

chengweiv5 commented 10 years ago

Filed a bug against Mesos: https://issues.apache.org/jira/browse/MESOS-1804

panwaria commented 10 years ago

Great. Thanks @chengweiv5 !

elingg commented 9 years ago

As an update, @connordoyle and @benh are working on this fix in Mesos.

panwaria commented 9 years ago

Great! Thanks @elingg , @connordoyle, @benh!

mcabalaji commented 9 years ago

@elingg, @benh, @connordoyle: Are there any further updates on this issue?

panwaria commented 9 years ago

Hi, is there any update on this issue?

elingg commented 9 years ago

The bug is marked as fixed in the latest Mesos. See https://issues.apache.org/jira/browse/MESOS-1804. Could you please upgrade your version of Mesos?