kostya / eye

Process monitoring tool. Inspired from Bluepill and God.
MIT License
1.19k stars 86 forks source link

unexpected server response :corrupted_data #111

Closed Zhomart closed 9 years ago

Zhomart commented 9 years ago

I started getting this error unexpected server response :corrupted_data when I try to check info using $ eye i. Restarting eye helps. But the error appears in sometime again.

I have ~22 processes to monitor.

My eye conf file

def app_log(name)
  "/home/projects/my_project/current/log/#{name}-production.log"
end

def resque_process(eye, name, queues)
  rails_env = eye.env['RAILS_ENV']

  eye.process(name) do
    pid_file  "/home/projects/my_project/current/tmp/pids/resque.#{name}.pid"
    stdall    app_log("resque.#{name}")

    env 'QUEUE' => queues, 'TERM_CHILD' => '1' # global env for each processes

    start_command "/usr/local/rvm/bin/bootup_bundle exec rake resque:work"

    start_timeout 20.seconds

    stop_signals [:QUIT, 15.seconds, :TERM, 2.seconds, :KILL]

    daemonize true

    check :cpu, :every => 30, :below => 100, :times => [ 2, 13 ]

    max_memory = 400
    check :memory, :every => 30, :below => max_memory.megabytes, :times => 5
  end
end

def puma_process(puma_socket, index)
  process :"puma-#{index}" do
    puma_pid = "/home/projects/my_project/current/tmp/pids/puma-#{index}.pid"
    pid_file puma_pid
    stdall app_log("puma-#{index}")

    start_options = [
      # "--preload",
      # "-w ",
      "-t 32:64",
      "-e production",
      "-b unix://#{puma_socket}",
      "--pidfile #{puma_pid}",
    ]
    start_command "/usr/local/rvm/bin/bootup_bundle exec puma #{start_options.join(' ')}"

    daemonize true

    restart_grace 10.seconds # just sleep this until process get up status
                            # (maybe enought to puma soft restart)

    start_grace 10.seconds
    start_timeout 40.seconds

    stop_signals [:QUIT, 5.seconds, :TERM, 2.second, :KILL]

    check :socket, addr: "unix:#{puma_socket}", every: 15.seconds, times: 3,
                    timeout: 2.seconds

    check :cpu, :every => 30, :below => 80, :times => 3

    max_memory = 400
    check :memory, :every => 30, :below => max_memory.megabytes, :times => [3,5]
  end
end

Eye.application "my_project" do
  env "RAILS_ENV" => 'production'

  working_dir "/home/projects/my_project/current"

  stdall '/var/log/eye/my_project.log' # stdout,err logs for processes by default

  trigger :flapping, times: 10, within: 1.minute, retry_in: 10.minutes
  check :cpu, every: 10.seconds, below: 100, times: 3

  group 'puma' do
      puma_process("/tmp/my_project-0.sock", 0)
      puma_process("/tmp/my_project-1.sock", 1)
      puma_process("/tmp/my_project-2.sock", 2)
      puma_process("/tmp/my_project-3.sock", 3)
  end

  group 'resque' do
      resque_process(self, 'general-0', '*')
      resque_process(self, 'general-1', '*')
      resque_process(self, 'general-2', '*')
      resque_process(self, 'general-3', '*')
      resque_process(self, 'general-4', '*')
      resque_process(self, 'general-5', '*')
      resque_process(self, 'general-6', '*')
      resque_process(self, 'general-7', '*')
      resque_process(self, 'general-8', '*')
      resque_process(self, 'general-9', '*')
      resque_process(self, 'general-10', '*')
      resque_process(self, 'general-11', '*')
  end

  process("py_cluster") do
    working_dir "/home/projects/my_project/current/work/py_cluster"

    pid_file  "/home/projects/my_project/current/tmp/pids/py_cluster.pid"
    stdall    app_log("py_cluster")

    start_command "python2 cluster_server.py"

    start_timeout 5.seconds

    stop_signals [:QUIT, 5.seconds, :TERM, 2.seconds, :KILL]

    daemonize true

    check :cpu, :every => 30, :below => 100, :times => 7
    check :memory, :every => 30, :below => 400.megabytes, :times => 5
  end

end

In the log files I found this

13.02.2015 20:40:32 ERROR -- [my_project:resque:general-7] check:memory(<400Mb) attempted to call a dead actor ["/usr/local/rvm/gems/ruby-2.1.5/gems/celluloid-0.15.2/lib/celluloid/proxies/sync_proxy.rb:23:in `method_missing'", "/usr/local/rvm/gems/ruby-2.1.5/gems/eye-0.6.4/lib/eye/system_resources.rb:9:in `memory'", "/usr/local/rvm/gems/ruby-2.1.5/gems/eye-0.6.4/lib/eye/checker/memory.rb:10:in `get_value'", "/usr/local/rvm/gems/ruby-2.1.5/gems/eye-0.6.4/lib/eye/checker.rb:133:in `get_value_safe'", "/usr/local/rvm/gems/ruby-2.1.5/gems/eye-0.6.4/lib/eye/checker.rb:105:in `check'", "/usr/local/rvm/gems/ruby-2.1.5/gems/eye-0.6.4/lib/eye/process/watchers.rb:60:in `watcher_tick'", "/usr/local/rvm/gems/ruby-2.1.5/gems/eye-0.6.4/lib/eye/process/watchers.rb:42:in `call'", "/usr/local/rvm/gems/ruby-2.1.5/gems/eye-0.6.4/lib/eye/process/watchers.rb:42:in `block in add_watcher'", "/usr/local/rvm/gems/ruby-2.1.5/gems/celluloid-0.15.2/lib/celluloid/actor.rb:416:in `block in task'", "/usr/local/rvm/gems/ruby-2.1.5/gems/celluloid-0.15.2/lib/celluloid/tasks.rb:55:in `block in initialize'", "/usr/local/rvm/gems/ruby-2.1.5/gems/celluloid-0.15.2/lib/celluloid/tasks/task_fiber.rb:13:in `block in create'"]
13.02.2015 20:40:35 ERROR -- [my_project:puma:puma-2] check:cpu(<80%) attempted to call a dead actor ["/usr/local/rvm/gems/ruby-2.1.5/gems/celluloid-0.15.2/lib/celluloid/proxies/sync_proxy.rb:23:in `method_missing'", "/usr/local/rvm/gems/ruby-2.1.5/gems/eye-0.6.4/lib/eye/system_resources.rb:13:in `cpu'", "/usr/local/rvm/gems/ruby-2.1.5/gems/eye-0.6.4/lib/eye/checker/cpu.rb:10:in `get_value'", "/usr/local/rvm/gems/ruby-2.1.5/gems/eye-0.6.4/lib/eye/checker.rb:133:in `get_value_safe'", "/usr/local/rvm/gems/ruby-2.1.5/gems/eye-0.6.4/lib/eye/checker.rb:105:in `check'", "/usr/local/rvm/gems/ruby-2.1.5/gems/eye-0.6.4/lib/eye/process/watchers.rb:60:in `watcher_tick'", "/usr/local/rvm/gems/ruby-2.1.5/gems/eye-0.6.4/lib/eye/process/watchers.rb:42:in `call'", "/usr/local/rvm/gems/ruby-2.1.5/gems/eye-0.6.4/lib/eye/process/watchers.rb:42:in `block in add_watcher'", "/usr/local/rvm/gems/ruby-2.1.5/gems/celluloid-0.15.2/lib/celluloid/actor.rb:416:in `block in task'", "/usr/local/rvm/gems/ruby-2.1.5/gems/celluloid-0.15.2/lib/celluloid/tasks.rb:55:in `block in initialize'", "/usr/local/rvm/gems/ruby-2.1.5/gems/celluloid-0.15.2/lib/celluloid/tasks/task_fiber.rb:13:in `block in create'"]
13.02.2015 20:40:35 ERROR -- [my_project:puma:puma-2] check:memory(<400Mb) attempted to call a dead actor ["/usr/local/rvm/gems/ruby-2.1.5/gems/celluloid-0.15.2/lib/celluloid/proxies/sync_proxy.rb:23:in `method_missing'", "/usr/local/rvm/gems/ruby-2.1.5/gems/eye-0.6.4/lib/eye/system_resources.rb:9:in `memory'", "/usr/local/rvm/gems/ruby-2.1.5/gems/eye-0.6.4/lib/eye/checker/memory.rb:10:in `get_value'", "/usr/local/rvm/gems/ruby-2.1.5/gems/eye-0.6.4/lib/eye/checker.rb:133:in `get_value_safe'", "/usr/local/rvm/gems/ruby-2.1.5/gems/eye-0.6.4/lib/eye/checker.rb:105:in `check'", "/usr/local/rvm/gems/ruby-2.1.5/gems/eye-0.6.4/lib/eye/process/watchers.rb:60:in `watcher_tick'", "/usr/local/rvm/gems/ruby-2.1.5/gems/eye-0.6.4/lib/eye/process/watchers.rb:42:in `call'", "/usr/local/rvm/gems/ruby-2.1.5/gems/eye-0.6.4/lib/eye/process/watchers.rb:42:in `block in add_watcher'", "/usr/local/rvm/gems/ruby-2.1.5/gems/celluloid-0.15.2/lib/celluloid/actor.rb:416:in `block in task'", "/usr/local/rvm/gems/ruby-2.1.5/gems/celluloid-0.15.2/lib/celluloid/tasks.rb:55:in `block in initialize'", "/usr/local/rvm/gems/ruby-2.1.5/gems/celluloid-0.15.2/lib/celluloid/tasks/task_fiber.rb:13:in `block in create'"]
kostya commented 9 years ago

seems you have some exception, can you find it in eye log?

Zhomart commented 9 years ago

I found only this kind of many errors:

13.02.2015 20:40:32 ERROR -- [my_project:resque:general-7] check:memory(<400Mb) attempted to call a dead actor ["/usr/local/rvm/gems/ruby-2.1.5/gems/celluloid-0.15.2/lib/celluloid/proxies/sync_proxy.rb:23:in `method_missing'", "/usr/local/rvm/gems/ruby-2.1.5/gems/eye-0.6.4/lib/eye/system_resources.rb:9:in `memory'", "/usr/local/rvm/gems/ruby-2.1.5/gems/eye-0.6.4/lib/eye/checker/memory.rb:10:in `get_value'", "/usr/local/rvm/gems/ruby-2.1.5/gems/eye-0.6.4/lib/eye/checker.rb:133:in `get_value_safe'", "/usr/local/rvm/gems/ruby-2.1.5/gems/eye-0.6.4/lib/eye/checker.rb:105:in `check'", "/usr/local/rvm/gems/ruby-2.1.5/gems/eye-0.6.4/lib/eye/process/watchers.rb:60:in `watcher_tick'", "/usr/local/rvm/gems/ruby-2.1.5/gems/eye-0.6.4/lib/eye/process/watchers.rb:42:in `call'", "/usr/local/rvm/gems/ruby-2.1.5/gems/eye-0.6.4/lib/eye/process/watchers.rb:42:in `block in add_watcher'", "/usr/local/rvm/gems/ruby-2.1.5/gems/celluloid-0.15.2/lib/celluloid/actor.rb:416:in `block in task'", "/usr/local/rvm/gems/ruby-2.1.5/gems/celluloid-0.15.2/lib/celluloid/tasks.rb:55:in `block in initialize'", "/usr/local/rvm/gems/ruby-2.1.5/gems/celluloid-0.15.2/lib/celluloid/tasks/task_fiber.rb:13:in `block in create'"]
kostya commented 9 years ago

before first this error, should be exception

kostya commented 9 years ago

looks like Sigar exception, what OS are use using?

try this in rib:

require 'eye'
Eye::Control
Eye::SystemResources.memory($$)
Eye::SystemResources.cpu($$)
Zhomart commented 9 years ago

Currently the error happened several times when processes were under heavy load.

While looking at eye.log I found more info

13.02.2015 20:38:55 ERROR -- [celluloid] Eye::Process crashed!
Celluloid::DeadActorError: attempted to call a dead actor
13.02.2015 20:38:55 ERROR -- [celluloid] Eye::Group crashed!
Celluloid::DeadActorError: attempted to call a dead actor
13.02.2015 20:38:55 ERROR -- [celluloid] Eye::Controller crashed!
Celluloid::DeadActorError: attempted to call a dead actor
13.02.2015 20:38:55 ERROR -- [celluloid] Eye::Server crashed!
Celluloid::DeadActorError: attempted to call a dead actor

...

13.02.2015 20:38:23 ERROR -- [celluloid] Eye::Process crashed!
Celluloid::DeadTaskError: cannot resume a dead task (can't alloc machine stack to fiber: Cannot allocate memory)
13.02.2015 20:38:23 ERROR -- [celluloid] Eye::Process crashed!
Celluloid::DeadTaskError: cannot resume a dead task (can't alloc machine stack to fiber: Cannot allocate memory)
Celluloid::DeadTaskError: cannot resume a dead task (can't alloc machine stack to fiber: Cannot allocate memory)
13.02.2015 20:38:23 ERROR -- [celluloid] Eye::Process: ERROR HANDLER CRASHED!

I'll try to change max_mem and reduce number of workers, then try stress test again.

Results of your script:

2.1.5 :001 > require 'eye'
 => true
2.1.5 :002 > Eye::Control
 => #<Celluloid::ActorProxy(Eye::Controller:0x67ca3c) @applications=[] @current_config=#<Eye::Config:0x00000000c404f0 @settings={}, @applications={}> @logger=#<Eye::Logger:0x00000000c15728 @prefix="Eye", @subprefix=nil, @pref_string="[Eye] ">>
2.1.5 :003 > Eye::SystemResources.memory($$)
 => 19415040
2.1.5 :004 > Eye::SystemResources.cpu($$)
 => 0.0
2.1.5 :005 >
kostya commented 9 years ago

this is not exception, marked as [celluloid], usually celluloid output when eye quit. Find first exception before first attempted to call a dead actor

Zhomart commented 9 years ago

I couldn't find any exceptions.