kostya / eye

Process monitoring tool. Inspired from Bluepill and God.
MIT License
1.19k stars 89 forks source link

Eye processes randomly lost #201

Closed innovia closed 6 years ago

innovia commented 6 years ago

Hi Konstantin

I have eye 0.9.2 that from time to time some processes and groups crashes without recovery.

I have to manually reboot the server / load eye where possible.

As you can see in my config i have hard limits on cpu and memory on all processes.

I even turned of the logging of EYE since I had a reason to believe its causing OOM killer to kill EYE.

I've also added ook killer nice commands to cron

*/1 * * * * root /usr/bin/pgrep -f "eye monitoring" | while read PID; do echo -17 > /proc/$PID/oom_adj; done
*/1 * * * * root /usr/bin/pgrep -f "redis-server" | while read PID; do echo -17 > /proc/$PID/oom_adj; done

I need to stabilize this server. please help me out :)

here's my config file

#!/usr/bin/env ruby
Eye.load '/etc/eye/mailer.rb' # mailer set params (like variables)
Eye.load '/etc/eye/cloudwatch.rb'
Eye.load '/etc/eye/config.rb' # config assign params values

Eye.application :reDash do
  working_dir "/opt/redash/current"

  load_env "/root/.env"

  trigger :flapping, times: 3, within: 1.minute, retry_in: 5.minutes
  notify :by_email, :info
  notify :cloudwatch, :info

  process(:nginx) do
    depend_on :gunicorn
    pid_file "/var/run/nginx.pid"
    stdall "/opt/redash/logs/nginx.log"
    start_command "/usr/sbin/nginx"
    stop_signals [:QUIT, 30.seconds, :TERM, 15.seconds, :KILL]
    restart_command "kill -HUP {PID}"
    monitor_children do
      restart_command 'kill -TERM {PID}'
      check :memory, below: 200.megabytes, times: [3,5]
    end
    daemonize true
  end

  process(:redis) do
    pid_file "/var/run/redis.pid"
    stdall "/opt/redash/logs/redis.log"
    start_command "/usr/local/bin/redis-server /etc/redis/6379.conf"
    stop_signals [:TERM, 30.seconds, :QUIT]
    restart_command  "kill -HUP {{PID}}"
    daemonize true
  end

  process(:gunicorn) do
    uid 'redash'
    gid 'nogroup'
    depend_on :redis
    pid_file "/var/run/gunicorn/gunicorn.pid"
    stdall "/opt/redash/logs/gunicorn.log"
    start_command "gunicorn -b unix:///var/run/gunicorn/gunicorn.sock --name redash -w 4 --max-requests 1000 redash.wsgi:app"
    stop_signals [:TERM, 30.seconds, :QUIT]
    restart_command  "kill -HUP {{PID}}"
    daemonize true
    monitor_children do
      stop_command "kill -TERM {PID}"
      check :cpu, :every => 30, :below => 80, :times => 3
      check :memory, :every => 30, :below => 250.megabytes, :times => [3,5]
    end
  end

  process(:flower) do
    uid 'redash'
    gid 'nogroup'
    pid_file "/var/run/celery/flower.pid"
    stdall "/opt/redash/logs/flower.log"
    start_command "celery flower -A redash.worker --address=0.0.0.0 --persistent"
    stop_signals [:TERM, 30.seconds, :QUIT]
    restart_command  "kill -HUP {{PID}}"
    check :cpu, :every => 30, :below => 80, :times => 3
    check :memory, :every => 30, :below => 250.megabytes, :times => [3,5]
    daemonize true
  end

  process(:celery_worker) do
    uid 'redash'
    gid 'nogroup'
    pid_file "/var/run/celery/celery_worker.pid"
    stdall "/opt/redash/logs/celery_worker.log"
    start_command "celery worker --app=redash.worker --beat -c2 -Qqueries,celery --maxtasksperchild=10 -Ofair --autoscale=6,3 -n redash_celery_worker@%h"
    stop_signals [:TERM, 30.seconds, :QUIT]
    restart_command  "kill -HUP {{PID}}"
    daemonize true
    monitor_children do
      stop_command "kill -TERM {PID}"
      check :cpu, :every => 30, :below => 80, :times => 3
      check :memory, :every => 30, :below => 400.megabytes, :times => [3,5]
    end
  end

   process(:celery_schedule_worker) do
    uid 'redash'
    gid 'nogroup'
    pid_file "/var/run/celery/celery_schedule_worker.pid"
    stdall "/opt/redash/logs/celery_schedule_worker.log"
    start_command "celery worker --app=redash.worker -c2 -Qscheduled_queries --maxtasksperchild=10 -Ofair  --autoscale=2,4 -n redash_celery_scheduled@%h"
    stop_signals [:TERM, 30.seconds, :QUIT]
    restart_command  "kill -HUP {{PID}}"
    daemonize true
    monitor_children do
      stop_command "kill -TERM {PID}"
      check :cpu, :every => 30, :below => 80, :times => 3
      check :memory, :every => 30, :below => 400.megabytes, :times => [3,5]
    end
  end
end
kostya commented 6 years ago

are you enable Eye logger? it write to it every things that happen. you probably can find a problem here.

innovia commented 6 years ago

Ill try. And update you

innovia commented 6 years ago

i'll re-open this if the issue persist