mesos / chronos

Fault tolerant job scheduler for Mesos which handles dependencies and ISO8601 based schedules
http://mesos.github.io/chronos/
Apache License 2.0

Slow memory leak in chronos when running over a long period of time #713

Open digitalyuki opened 8 years ago

digitalyuki commented 8 years ago

When chronos runs on a five-node Apache Mesos cluster, a slow memory leak gradually consumes all available memory on a single node of the mesos-master cluster. Sometimes the leak moves from one mesos-master node to another.

This is what we observed over the course of about a month: on two of the five nodes, the chronos process began consuming memory, one node at a time.

[screenshot: 2016-08-10 at 6:00:19 pm]

Restarting chronos on the affected node freed the memory, but immediately afterwards a different node in the mesos-master cluster started consuming memory.

This is our mesos-master cluster configuration:

dandew commented 8 years ago

We have the same issue, even when running with the latest master.

dandew commented 8 years ago

I've compared two heap dumps taken within ~20 minutes (and each after forcing a GC) and here are the top 10 new objects:

[screenshot: 2016-09-09 at 09:47:13]

Mmmh Akka?
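
For anyone who wants to repeat this comparison without a heap-dump viewer, here is a minimal sketch that diffs two class histograms and lists the classes that grew the most. It assumes the histograms were captured as text with `jmap -histo:live <pid>` (the `live` option triggers a full GC first); the function names `parse_histo` and `top_growth` are made up for this example and are not part of chronos or the JDK.

```python
import re
from typing import Dict, List, Tuple

# Matches data rows of `jmap -histo` output, e.g. "   1:   12345   6789012  [C".
# Header, separator and "Total" lines do not match and are skipped.
ROW = re.compile(r"^\s*\d+:\s+(\d+)\s+(\d+)\s+(\S+)")


def parse_histo(text: str) -> Dict[str, Tuple[int, int]]:
    """Map class name -> (instances, bytes) from `jmap -histo` text."""
    rows: Dict[str, Tuple[int, int]] = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            rows[m.group(3)] = (int(m.group(1)), int(m.group(2)))
    return rows


def top_growth(before: str, after: str, n: int = 10) -> List[Tuple[str, int, int]]:
    """Return up to n classes whose byte footprint grew the most.

    Each entry is (class name, delta instances, delta bytes), largest first.
    Classes that did not grow are left out.
    """
    old, new = parse_histo(before), parse_histo(after)
    deltas = []
    for cls, (inst, size) in new.items():
        old_inst, old_size = old.get(cls, (0, 0))
        deltas.append((cls, inst - old_inst, size - old_size))
    deltas.sort(key=lambda d: d[2], reverse=True)
    return [d for d in deltas[:n] if d[2] > 0]
```

Usage would be along the lines of saving two histograms some minutes apart (`jmap -histo:live <pid> > before.txt`, then `after.txt`) and calling `top_growth(open("before.txt").read(), open("after.txt").read())`. This only sees objects on the Java heap, so it cannot explain growth in native memory.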

dandew commented 8 years ago

After upgrading to Akka 2.4.10, I do not observe any leaked objects from this framework anymore.

@digitalyuki could you try the same on your side and see if it fixes the issue too?

sahilsk commented 8 years ago

The chronos process is using 62% of memory. There seems to be a bug in chronos, and it needs restarting once in a while.

 PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
24599 root      20   0 8558568 4.517g  11276 S   3.0 61.8 988:40.69 java -Djava.library.path=/usr/local/lib:/usr/lib64:/usr/lib -Djava.util.logging.SimpleFormatter.format=%2%5%6%n -Xmx512m -cp /usr/bin/+

root     13951  0.0  0.0  10460   932 pts/7    S+   15:11   0:00 grep --color=auto 24599
root     24599  0.5 61.8 8558568 4735920 ?     Sl   May24 988:40 java -Djava.library.path=/usr/local/lib:/usr/lib64:/usr/lib -Djava.util.logging.SimpleFormatter.format=%2%5%6%n -Xmx512m -cp /usr/bin/chronos org.apache.mesos.chronos.scheduler.Main --zk_hosts zk://zookeeper1.prod-xxx-mesos.xxx.net:2181,zookeeper2.prod-xxx-mesos.xxx.net:2181,zookeeper3.prod-xxx-mesos.xxx.net:2181 --master zk://zookeeper1.prod-xxx-mesos.xxx.net:2181,zookeeper2.prod-xxx-mesos.xxx.net:2181,zookeeper3.prod-xxx-mesos.xxx.net:2181/mesos --http_port 4040 --mail_from xxxchronos@abc.com --mail_server localhost:25

Any fix?
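
One detail in the output above: the process was started with `-Xmx512m`, yet its resident size is about 4.5 GB. That suggests most of the memory is held outside the Java heap (for example native allocations, thread stacks or direct buffers), although the output alone cannot say which. The sketch below just makes that comparison explicit from a `ps` line; `xmx_kib` and `rss_over_heap` are hypothetical helper names, and the RSS value is assumed to be in KiB as `ps` reports it.

```python
import re
from typing import Optional

# Multipliers from the -Xmx unit suffix to KiB.
_UNITS = {"k": 1, "m": 1024, "g": 1024 * 1024}


def xmx_kib(cmdline: str) -> Optional[int]:
    """Return the -Xmx value found in a java command line, in KiB.

    Returns None if no -Xmx flag is present.
    """
    m = re.search(r"-Xmx(\d+)([kKmMgG]?)", cmdline)
    if not m:
        return None
    value, unit = int(m.group(1)), m.group(2).lower()
    if not unit:
        return value // 1024  # a bare number is a size in bytes
    return value * _UNITS[unit]


def rss_over_heap(rss_kib: int, cmdline: str) -> Optional[float]:
    """Ratio of resident set size to the configured maximum heap."""
    limit = xmx_kib(cmdline)
    return rss_kib / limit if limit else None


if __name__ == "__main__":
    # Values taken from the ps output above: RSS 4735920 KiB, -Xmx512m.
    ratio = rss_over_heap(4735920, "java -Xmx512m -cp /usr/bin/chronos")
    print(f"RSS is {ratio:.1f}x the configured max heap")
```

A ratio far above 1 means heap-only tools will not show the whole picture. If the JVM in use supports it, starting with `-XX:NativeMemoryTracking=summary` and running `jcmd <pid> VM.native_memory summary` is one way to see where the JVM's own non-heap memory goes.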

dandew commented 8 years ago

@sahilsk Have you tried what I did?

digitalyuki commented 8 years ago

More recently we're on chronos 2.5 (chronos-2.5.0-0.1.20160824153434.ubuntu1404-mesos-1.0.0 from mesosphere/chronos). Chronos can still end up consuming a substantial amount of memory (a little more than 4.5 GB), but after that point it seems to free up some of it.

[screenshot: 2016-10-25 at 10:42:20 am]

brndnmtthws commented 7 years ago

Try tuning the heap settings?