kostya / eye

Process monitoring tool. Inspired from Bluepill and God.
MIT License
1.19k stars 89 forks source link

Eye crashes unless run in foreground #237

Open brandondrew opened 4 years ago

brandondrew commented 4 years ago

This works (note the -f):

app_rsp@ip-10-0-0-36:/var/www/rsp/current$ bundle exec eye load -f lib/eye/sidekiq.eye
2020.06.26 13:49:46 INFO  -- [Eye] starting Eye v0.10.0 (c) 2012-2017 @kostya <27950>
2020.06.26 13:49:46 INFO  -- [Eye] => command: load /var/www/rsp/releases/20200621015518/lib/eye/sidekiq.eye
2020.06.26 13:49:46 INFO  -- [Eye] => loading: ["/var/www/rsp/releases/20200621015518/lib/eye/sidekiq.eye"]
config
/var/www/rsp/releases/20200621015518
2020.06.26 13:49:46 INFO  -- [Eye] loading: /var/www/rsp/releases/20200621015518/lib/eye/sidekiq.eye
2020.06.26 13:49:46 INFO  -- [sidekiq:__default__] call:
2020.06.26 13:49:46 INFO  -- [sidekiq:__default__] schedule :monitor (load by user)
2020.06.26 13:49:46 INFO  -- [sidekiq:__default__] => monitor  (load by user)
2020.06.26 13:49:46 INFO  -- [sidekiq:__default__] starting async with 0.2s chain monitor
2020.06.26 13:49:46 INFO  -- [Eye] <= loading: ["/var/www/rsp/releases/20200621015518/lib/eye/sidekiq.eye"]
2020.06.26 13:49:46 INFO  -- [Eye] <= command: load /var/www/rsp/releases/20200621015518/lib/eye/sidekiq.eye (0.061073686s)
2020.06.26 13:49:46 INFO  -- [sidekiq:sidekiq] schedule :monitor (monitor by user)
2020.06.26 13:49:46 INFO  -- [sidekiq:sidekiq] => monitor  (monitor by user)
2020.06.26 13:49:46 INFO  -- [sidekiq:__default__] <= monitor
2020.06.26 13:49:46 INFO  -- [sidekiq:sidekiq] load_external_pid_file: pid_file not found
2020.06.26 13:49:46 INFO  -- [sidekiq:sidekiq] switch :starting [:unmonitored => :starting] monitor by user
2020.06.26 13:49:46 INFO  -- [sidekiq:sidekiq] daemonizing: `bundle exec sidekiq -C ./config/sidekiq.yml -e production` with start_grace: 10.0s, env: '', <27973> (in /var/www/rsp/releases/20200621015518)
2020.06.26 13:49:46 INFO  -- [sidekiq:sidekiq] sleeping for :start_grace 10.0
2020.06.26 13:49:56 INFO  -- [sidekiq:sidekiq] switch :started [:starting => :up] monitor by user

Without -f it FAILS:

app_rsp@ip-10-0-0-36:/var/www/rsp/current$ bundle exec eye load lib/eye/sidekiq.eye
Killed

To test whether this was because it somehow lost the environment variable when run in the background, I hardcoded the value 'production' in the lib/eye/sidekiq.eye:

    #start_command "bundle exec sidekiq -C ./config/sidekiq.yml -e #{ENV['RAILS_ENV']}"
    start_command "bundle exec sidekiq -C ./config/sidekiq.yml -e production"

and it was still instantly killed.

When I watch the sidekiq log I don't see anything appear at all, unless I use the -f flag with eye load. And when I tail the eye log, there is likewise nothing that appears. So there's no log output that gives me a clue as to what's going on.

Checking the configuration, eye tells me it's fine:

app_rsp@ip-10-0-0-36:/var/www/rsp/current$ bundle exec eye check lib/eye/sidekiq.eye
config
/var/www/rsp/releases/20200621015518
Config ok!
brandondrew commented 4 years ago

My config file (with the change for simplifying testing)

def sidekiq_process(proxy, name)

  proxy.process(name) do
    #start_command "bundle exec sidekiq -C ./config/sidekiq.yml -e #{ENV['RAILS_ENV']}"
    start_command "bundle exec sidekiq -C ./config/sidekiq.yml -e production"
    pid_file "tmp/pids/#{name}.pid"
    stdall "log/#{name}.log"

    daemonize true
    # trigger :flapping, times: 10, within: 1.minute, retry_in: 10.minutes
    stop_signals [:TSTP, 0, :TERM, 10.seconds, :KILL]
    start_grace 10.seconds
    start_timeout 90.seconds
    stop_timeout 30.seconds

    # check :cpu, every: 30, below: 100, times: 5
    # check :memory, every: 30, below: 300.megabytes, times: 5
  end
end

Eye.application "sidekiq" do
  working_dir File.expand_path('../..', File.join(File.dirname(__FILE__)))
  sidekiq_process self, :sidekiq
end

Eye.config do
  current = File.expand_path('../..', File.join(File.dirname(__FILE__)))
  logger "#{current}/log/eye.log"
end
kostya commented 4 years ago

looks like your ~/.eye folder can not be writable, or there some old pid, try to delete it.

brandondrew commented 4 years ago

The original problem went away after rebooting the server. (I never thought I would see that happen on Linux.) But it came back several weeks later, so I tried troubleshooting again. I found that using nohup to put eye in the background instead instead of letting eye do it itself is one workaround (which unfortunately makes things more complicated, and by default logs to nohup.out and is therefore not an ideal solution):

nohup bundle exec eye load -f /var/www/rsp/current/lib/eye &

I then returned here and read your reply, and found that you were exactly right about the old pid file needing to be deleted.

Why would an old pid be left around? Do you have any ideas?

Besides a bug, the only theory I could think of is that someone has not been cleanly rebooting the server. I do have confirmation that the system administrator sometimes reboots from the AWS console rather than the shell, although I don't know if that would be a problem or not.

Is it possible to make eye (when started up) not simply believe it is already running based on the presence of the pid file, but also verify this by checking for a running process? That would avoid the situation I found myself in.

In the meantime, I have written a script that I trigger with cron:

#!/usr/bin/env bash

# This addresses the intermittent problem of eye not cleaning up its old PIDs....
# An old PID prevents eye from starting up normally in the background.

APP_DIR="/var/www/rsp/current"
EYE_LOG="${APP_DIR}/log/eye.log"
EYE_PID=$(cat $HOME/.eye/pid 2> /dev/null)      # get the PID
kill -0 ${EYE_PID} 2> /dev/null                 # return an error if PID is missing; this won't kill anything
ERROR=$(echo $?)                                # get the error code

if [[ ${ERROR} != "0" ]]; then                  # if the error code indicates eye was not running, then start it up
  echo "ERROR: eye is not running... attempting to start..." >> "$EYE_LOG"
  rm -f $EYE_PID
  export PATH="$HOME/.rbenv/bin:$PATH"
  eval "$(rbenv init -)"
  { date; cd /var/www/rsp/current; bundle exec eye load /var/www/rsp/current/lib/eye; date; } >> "${EYE_LOG}" 2>&1
else
  echo "SUCCESSFULLY VERIFIED THAT eye IS RUNNING.  " >> "${EYE_LOG}"
fi

These are my crontab entries:

@reboot /var/www/rsp/current/scripts/ensure_eye_is_running.sh
* * * * * /var/www/rsp/current/scripts/ensure_eye_is_running.sh

I hope my solution is helpful to someone, but I think ideally eye itself would do this very same thing when it starts up.