digitalyuki opened this issue 8 years ago
We have the same issue, even when running with the latest master.
I've compared two heap dumps taken within ~20 minutes (and each after forcing a GC) and here are the top 10 new objects:
Mmmh Akka?
After upgrading to Akka 2.4.10, I do not observe any leaked objects from this framework anymore.
@digitalyuki could you try the same on your side and see if it fixes the issue too?
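Not part of the original comment, but for anyone pinning the dependency by hand: a minimal sketch of forcing Akka 2.4.10 in an sbt build. The actual Chronos build may differ (it may use Maven), so treat the setting and the module coordinates below as assumptions to adapt, not the project's real build config.

```scala
// build.sbt — hypothetical sketch: force Akka 2.4.10 across the dependency tree.
// "akka-actor" is an assumption; list whichever Akka modules your build pulls in.
dependencyOverrides += "com.typesafe.akka" %% "akka-actor" % "2.4.10"
```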
The Chronos process is consuming 62% of memory. It seems there is a bug in Chronos, and it needs restarting once in a while.
```
PID   USER  PR  NI  VIRT     RES     SHR    S  %CPU  %MEM  TIME+      COMMAND
24599 root  20  0   8558568  4.517g  11276  S  3.0   61.8  988:40.69  java -Djava.library.path=/usr/local/lib:/usr/lib64:/usr/lib -Djava.util.logging.SimpleFormatter.format=%2%5%6%n -Xmx512m -cp /usr/bin/+

root  13951  0.0  0.0   10460    932     pts/7  S+  15:11  0:00    grep --color=auto 24599
root  24599  0.5  61.8  8558568  4735920 ?      Sl  May24  988:40  java -Djava.library.path=/usr/local/lib:/usr/lib64:/usr/lib -Djava.util.logging.SimpleFormatter.format=%2%5%6%n -Xmx512m -cp /usr/bin/chronos org.apache.mesos.chronos.scheduler.Main --zk_hosts zk://zookeeper1.prod-xxx-mesos.xxx.net:2181,zookeeper2.prod-xxx-mesos.xxx.net:2181,zookeeper3.prod-xxx-mesos.xxx.net:2181 --master zk://zookeeper1.prod-xxx-mesos.xxx.net:2181,zookeeper2.prod-xxx-mesos.xxx.net:2181,zookeeper3.prod-xxx-mesos.xxx.net:2181/mesos --http_port 4040 --mail_from xxxchronos@abc.com --mail_server localhost:25
```
Any fix?
@sahilsk Have you tried what I did?
More recently we're on Chronos 2.5 (chronos-2.5.0-0.1.20160824153434.ubuntu1404-mesos-1.0.0 from mesosphere/chronos), and while Chronos can still end up consuming a fairly substantial amount of memory (a little more than 4.5 GB), it does seem to free some of it up after that point.
Try tuning the heap settings?
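One thing worth noting before touching heap flags: the command line above already pins the heap to -Xmx512m while the process RES is around 4.5 GB, so most of the footprint may not be on the Java heap at all. Below is a minimal sketch (my own names, not part of Chronos) of checking, via the standard JMX beans, whether the growth is heap, non-heap, or direct-buffer memory; if the heap stays near 512 MB while RES keeps climbing, the leak would appear to be off-heap (e.g. native allocations from the Mesos JNI library) and -Xmx tuning is unlikely to help.

```scala
import java.lang.management.{BufferPoolMXBean, ManagementFactory, MemoryMXBean}
import scala.collection.JavaConverters._

// Hypothetical helper: report where the JVM's memory actually sits, so the
// observed RES growth can be attributed to heap vs. off-heap usage.
object MemoryReport {
  def main(args: Array[String]): Unit = {
    val mem: MemoryMXBean = ManagementFactory.getMemoryMXBean
    val heap = mem.getHeapMemoryUsage      // bounded by -Xmx
    val nonHeap = mem.getNonHeapMemoryUsage

    println(s"heap     used=${heap.getUsed} max=${heap.getMax}")
    println(s"non-heap used=${nonHeap.getUsed}")

    // Direct and mapped buffer pools live outside the -Xmx limit.
    val pools = ManagementFactory.getPlatformMXBeans(classOf[BufferPoolMXBean]).asScala
    pools.foreach(p => println(s"buffer pool ${p.getName}: used=${p.getMemoryUsed}"))
  }
}
```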
When Chronos is running on a five-node Apache Mesos cluster, there is a gradual memory leak that eventually consumes all available memory on a single node of the mesos-master cluster. Sometimes the leak jumps from one mesos-master node to another.
This is what we observed over the course of about a month: on two nodes out of five, the Chronos process began consuming memory, one node at a time.
The workaround, restarting Chronos on the affected node, freed the memory, but immediately afterwards a different node in the mesos-master cluster started consuming memory instead.
This is our mesos-master cluster configuration: