Executors (kibana) shut down due to unknown error

philwinder commented 8 years ago

The kibana executors failed 6 times over the course of a weekend (and were restarted, yay mesos!). All other services are running (i.e. ES hasn't shut down over the weekend).

Investigate the attached logs (stderr.txt, stdout.txt) to find out why.

sadovnikov commented 8 years ago

The executor is shut down when it reaches its cgroup memory limit. Basically, Kibana runs out of memory. This can be seen in the output of dmesg -T:

[2016-02-02 07:09:26]  Task in /system.slice/docker-50c016035226708bbb3e86c24f3e5e5a7f7a0b9448ca16b74b7e62781045e91a.scope killed as a result of limit of /system.slice/docker-50c016035226708bbb3e86c24f3e5e5a7f7a0b9448ca16b74b7e62781045e91a.scope
[2016-02-02 07:09:26]  memory: usage 1048576kB, limit 1048576kB, failcnt 78
[2016-02-02 07:09:26]  memory+swap: usage 1048576kB, limit 9007199254740991kB, failcnt 0
[2016-02-02 07:09:26]  kmem: usage 0kB, limit 9007199254740991kB, failcnt 0
[2016-02-02 07:09:26]  Memory cgroup stats for /system.slice/docker-50c016035226708bbb3e86c24f3e5e5a7f7a0b9448ca16b74b7e62781045e91a.scope: cache:0KB rss:1048576KB rss_huge:38912KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:1048572KB inactive_file:0KB active_file:0KB unevictable:0KB
[2016-02-02 07:09:26]  [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[2016-02-02 07:09:26]  [ 1946]   999  1946   491442   264248    1039        0             0 node
[2016-02-02 07:09:26]  Memory cgroup out of memory: Kill process 1975 (node) score 1011 or sacrifice child
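The key line in the output above is the "memory: usage ..., limit ..." summary: usage equals the limit (1048576 kB, i.e. 1 GiB) and failcnt is non-zero. A minimal sketch for checking this automatically in saved dmesg output follows; the function names are hypothetical and the regular expression is only matched against the line format shown in this log, so other kernel versions may need a different pattern.

```python
import re

# Matches the cgroup summary line printed by the kernel OOM killer, e.g.
# "memory: usage 1048576kB, limit 1048576kB, failcnt 78".
# It deliberately does not match the "memory+swap:" or "kmem:" lines.
MEMORY_LINE = re.compile(
    r"\bmemory: usage (?P<usage>\d+)kB, "
    r"limit (?P<limit>\d+)kB, failcnt (?P<failcnt>\d+)"
)


def parse_cgroup_memory(line):
    """Return usage/limit/failcnt as integers (kB), or None if no match."""
    match = MEMORY_LINE.search(line)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}


def hit_memory_limit(line):
    """True if the line reports cgroup memory usage at or above its limit."""
    stats = parse_cgroup_memory(line)
    return stats is not None and stats["usage"] >= stats["limit"]


if __name__ == "__main__":
    sample = "[2016-02-02 07:09:26]  memory: usage 1048576kB, limit 1048576kB, failcnt 78"
    print(parse_cgroup_memory(sample))
    print(hit_memory_limit(sample))
```

Feeding each line of the saved dmesg output through hit_memory_limit is enough to tell a cgroup OOM kill apart from other executor failures.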

sadovnikov commented 8 years ago

Based on https://github.com/elastic/kibana/issues/5170#issuecomment-163042525 and https://github.com/elastic/kibana/pull/5451, code changes are required in the framework to set the NODE_OPTIONS environment variable for executors, and we need to move to Kibana version 4.4.
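A rough sketch of the idea, not the framework's actual code: derive a Node.js heap cap from the container memory limit and pass it to the executor through NODE_OPTIONS. It assumes the cap is set with the V8 flag --max-old-space-size (in MB); the function names and the 0.75 heap fraction are hypothetical, and the thread does not state which value was eventually used.

```python
def node_options_for_limit(container_limit_mb, heap_fraction=0.75):
    """Build a NODE_OPTIONS value that caps the V8 old-space heap.

    heap_fraction is a hypothetical safety margin: the heap is kept below
    the cgroup limit so that non-heap memory still fits in the container.
    """
    if container_limit_mb <= 0:
        raise ValueError("container_limit_mb must be positive")
    if not 0 < heap_fraction <= 1:
        raise ValueError("heap_fraction must be in (0, 1]")
    heap_mb = int(container_limit_mb * heap_fraction)
    return f"--max-old-space-size={heap_mb}"


def executor_environment(base_env, container_limit_mb):
    """Return a copy of base_env with NODE_OPTIONS set for the executor."""
    env = dict(base_env)  # copy so the caller's mapping is not mutated
    env["NODE_OPTIONS"] = node_options_for_limit(container_limit_mb)
    return env


if __name__ == "__main__":
    # 1024 MB matches the 1048576 kB cgroup limit seen in the dmesg output.
    print(executor_environment({"PATH": "/usr/bin"}, 1024))
```

With the heap capped below the cgroup limit, Node.js should garbage-collect more aggressively before the kernel OOM killer steps in.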

sadovnikov commented 8 years ago

Currently being tested on the Alpha cluster.

sadovnikov commented 8 years ago

It's fixed in 0.3.1
