mojombo / god

Ruby process monitor
http://godrb.com
MIT License
2.21k stars 520 forks source link

Ruby 1.9.1 and god #6

Open shadowhand opened 14 years ago

shadowhand commented 14 years ago

Attempting to run god with Ruby 1.9.1 results in the following error:

F [2009-11-27 06:14:27] FATAL: Unhandled exception in driver loop - (NoMethodError): undefined method `critical=' for Thread:Class
/usr/lib/ruby/gems/1.9.1/gems/god-0.7.18/lib/god/driver.rb:100:in `ensure in pop'
/usr/lib/ruby/gems/1.9.1/gems/god-0.7.18/lib/god/driver.rb:100:in `pop'
/usr/lib/ruby/gems/1.9.1/gems/god-0.7.18/lib/god/driver.rb:166:in `block (2 levels) in initialize'
/usr/lib/ruby/gems/1.9.1/gems/god-0.7.18/lib/god/driver.rb:164:in `loop'
/usr/lib/ruby/gems/1.9.1/gems/god-0.7.18/lib/god/driver.rb:164:in `block in initialize'
/usr/lib/ruby/gems/1.9.1/gems/god-0.7.18/lib/god/driver.rb:112:in `push': undefined method `critical=' for Thread:Class (NoMethodError)
    from /usr/lib/ruby/gems/1.9.1/gems/god-0.7.18/lib/god/driver.rb:192:in `message'
    from /usr/lib/ruby/gems/1.9.1/gems/god-0.7.18/lib/god/task.rb:151:in `move'
    from /usr/lib/ruby/gems/1.9.1/gems/god-0.7.18/lib/god/watch.rb:96:in `monitor'
    from /usr/lib/ruby/gems/1.9.1/gems/god-0.7.18/lib/god.rb:608:in `block in start'
    from /usr/lib/ruby/gems/1.9.1/gems/god-0.7.18/lib/god.rb:608:in `each'
    from /usr/lib/ruby/gems/1.9.1/gems/god-0.7.18/lib/god.rb:608:in `start'
    from /usr/lib/ruby/gems/1.9.1/gems/god-0.7.18/lib/god.rb:636:in `at_exit'
    from /usr/lib/ruby/gems/1.9.1/gems/god-0.7.18/lib/god.rb:669:in `block in <top (required)>'
    from /usr/lib/ruby/gems/1.9.1/gems/god-0.7.18/lib/god/cli/run.rb:87:in `fork'
    from /usr/lib/ruby/gems/1.9.1/gems/god-0.7.18/lib/god/cli/run.rb:87:in `run_daemonized'
    from /usr/lib/ruby/gems/1.9.1/gems/god-0.7.18/lib/god/cli/run.rb:17:in `dispatch'
    from /usr/lib/ruby/gems/1.9.1/gems/god-0.7.18/lib/god/cli/run.rb:8:in `initialize'
    from /usr/lib/ruby/gems/1.9.1/gems/god-0.7.18/bin/god:119:in `new'
    from /usr/lib/ruby/gems/1.9.1/gems/god-0.7.18/bin/god:119:in `<top (required)>'
    from /usr/bin/god:19:in `load'
    from /usr/bin/god:19:in `<main>'
shadowhand commented 14 years ago

The same error happens with god 0.7.22 as well.

cmelbye commented 14 years ago

It looks like God is no longer using that "critical=" method of Thread which no longer exists in Ruby 1.9, so it might work now.

eric commented 14 years ago

god 0.8 has removed the usage of Thread.critical=.

I believe there may be a couple other problems that need to be solved before 1.9 works, though.

geoffgarside commented 14 years ago

I'm getting this with 0.8.0 on Ruby 1.9.1

    Unhandled exception in driver loop - (NotImplementedError): timeout is not implemented yet 
    /usr/local/lib/ruby/1.9/monitor.rb:92:in `wait'
    /usr/local/lib/ruby19/gems/1.9/gems/god-0.8.0/lib/god/driver.rb:88:in `block in pop'
    /usr/local/lib/ruby/1.9/monitor.rb:190:in `mon_synchronize'
    /usr/local/lib/ruby19/gems/1.9/gems/god-0.8.0/lib/god/driver.rb:82:in `pop' 
    /usr/local/lib/ruby19/gems/1.9/gems/god-0.8.0/lib/god/driver.rb:146:in `block (2 levels) in initialize' 
    /usr/local/lib/ruby19/gems/1.9/gems/god-0.8.0/lib/god/driver.rb:144:in `loop' 
    /usr/local/lib/ruby19/gems/1.9/gems/god-0.8.0/lib/god/driver.rb:144:in `block in initialize'

I've not had a chance to dig into it though, but I'll have a look over the next few days.

igrigorik commented 14 years ago

bump

I think this should be promoted to critical...

eric commented 14 years ago

igrigorik: Did the fix you tried end up working?

igrigorik commented 14 years ago

Partially. I installed your branch which introduces compat19.rb + grabbed thread.rb from latest head of ruby 1.9 to make it work. That solved the "driver loop" exception show above, but there are other bugs still.

Namely, god appears to hang indefinitely when it tries to fork a new process: I see the child that it's trying to spawn in the tree, and it appears to be stuck in that state. (Doesn't happen everytime & unfortunately don't know the exact conditions under which this happens)

eric commented 14 years ago

Could you attach gdb to the process and give the output of running "where"?

That's odd that it would be hanging there — there isn't a lot that it does before running exec().

igrigorik commented 14 years ago

Attaching to the god process itself yields:

(gdb) where
#0  0x00002b9122f581ad in pthread_cond_destroy@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000000051622e in terminate_atfork_i (key=, val=, current_th=)
    at thread_pthread.c:178
#2  0x00000000004b00b9 in st_foreach (table=0x7dc8f0, func=0x516120 , arg=23970512) at st.c:708
#3  0x0000000000513d88 in rb_thread_atfork () at thread.c:2649
#4  0x000000000046f60e in rb_f_fork (obj=) at process.c:2618
#5  0x00000000004fedef in vm_call_cfunc (th=0x16dc2d0, reg_cfp=0x2aaaae83dc48, num=0, recv=16886520, blockptr=0x2aaaae83dc70, flag=202, me=0x8bbdd0)
    at vm_insnhelper.c:385
#6  0x0000000000511b43 in vm_call_method (th=0x16dc2d0, cfp=dwarf2_read_address: Corrupted DWARF expression.
) at vm_insnhelper.c:510
#7  0x00000000005045d6 in vm_exec_core (th=0x16dc2d0, initial=) at insns.def:994
#8  0x0000000000509a04 in vm_exec (th=0x16dd5b8) at vm.c:1099
#9  0x000000000050a0c3 in invoke_block_from_c (th=0x16dc2d0, block=0x2aaaae83df30, self=16897880, argc=0, argv=0x0, blockptr=0x0, cref=0x0)
    at vm.c:543
#10 0x000000000050a948 in loop_i () at vm.c:573
#11 0x0000000000416ae9 in rb_rescue2 (b_proc=0x50a910 , data1=0, r_proc=0, data2=0) at eval.c:574
#12 0x00000000004fe739 in rb_f_loop (self=16897880) at vm_eval.c:605
#13 0x00000000004fedef in vm_call_cfunc (th=0x16dc2d0, reg_cfp=0x2aaaae83df08, num=0, recv=16897880, blockptr=0x2aaaae83df30, flag=202, me=0x864210)
    at vm_insnhelper.c:385
#14 0x0000000000511b43 in vm_call_method (th=0x16dc2d0, cfp=dwarf2_read_address: Corrupted DWARF expression.
) at vm_insnhelper.c:510
#15 0x00000000005045d6 in vm_exec_core (th=0x16dc2d0, initial=) at insns.def:994
#16 0x0000000000509a04 in vm_exec (th=0x16dd5b8) at vm.c:1099
#17 0x000000000050a0c3 in invoke_block_from_c (th=0x16dc2d0, block=0x142fd40, self=16897880, argc=0, argv=0x0, blockptr=0x0, cref=0x0) at vm.c:543
#18 0x000000000050a5d2 in rb_vm_invoke_proc (th=0x16dc2d0, proc=0x142fd40, self=16897880, argc=0, argv=0x101ac70, blockptr=0x0) at vm.c:590
#19 0x00000000005197d5 in thread_start_func_2 (th=0x16dc2d0, stack_start=0x40613120) at thread.c:387
#20 0x00000000005199ee in thread_start_func_1 (th_ptr=0x16dd5b8) at thread_pthread.c:351
#21 0x00002b9122f541b5 in start_thread () from /lib64/libpthread.so.0
#22 0x00002b9123afa36d in clone () from /lib64/libc.so.6
#23 0x0000000000000000 in ?? ()

Also, strace:

Process 10887 attached - interrupt to quit
futex(0x16dd5b8, FUTEX_WAIT, 2, NULL 
Process 10887 detached

When it's in this state, the process is completely unresponsive.. god log / status, etc., all return "server not available".

eric commented 14 years ago

I've committed another method to remove the need to grab thread.rb. It is available here:

http://github.com/eric/god/tree/condition-variable-1.9-support

I'm still trying to figure out why it would be hanging here.

igrigorik commented 14 years ago

Eric, don't think that solved it. I'm trying to find the actual sequence that reproduces the error, but so far this it the pattern I'm seeing: startups fine, loads all the processes it should. Then I start randomly killing processes (kill -9) to get god to bring them back up. Works well for the most part, except every once in a while, it get's stuck. In ps I can see that god is trying to fork a process, but it's stuck in that state.. Below is the output of god process itself, and the child that's trying to fork:

god process itself..

God process itself:
-------------------

(gdb) where
#0  0x00002aec325a1376 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000000516561 in native_sleep (th=0x7dc500, tv=0x0) at thread_pthread.c:618
#2  0x0000000000519edc in thread_join_sleep (arg=) at thread.c:770
#3  0x0000000000416714 in rb_ensure (b_proc=0x519e80 , data1=140737486727824, e_proc=0x513170 , 
    data2=140737486727824) at eval.c:671
#4  0x0000000000514c7c in thread_join (target_th=0x1830c70, delay=1e+30) at thread.c:637
#5  0x0000000000514da2 in thread_join_m (argc=, argv=, self=) at thread.c:718
#6  0x00000000004fedef in vm_call_cfunc (th=0x7dc500, reg_cfp=0x2aec324a5a90, num=0, recv=15750920, blockptr=0x0, flag=0, me=0x8d9b60)
    at vm_insnhelper.c:385
#7  0x0000000000511b43 in vm_call_method (th=0x7dc500, cfp=dwarf2_read_address: Corrupted DWARF expression.
) at vm_insnhelper.c:510
#8  0x00000000005045d6 in vm_exec_core (th=0x7dc500, initial=) at insns.def:994
#9  0x0000000000509a04 in vm_exec (th=0x7dc594) at vm.c:1099
#10 0x000000000050a0c3 in invoke_block_from_c (th=0x7dc500, block=0x1619f80, self=8631080, argc=0, argv=0x0, blockptr=0x0, cref=0x0) at vm.c:543
#11 0x000000000050a5d2 in rb_vm_invoke_proc (th=0x7dc500, proc=0x1619f80, self=8631080, argc=0, argv=0x10456a0, blockptr=0x0) at vm.c:590
#12 0x0000000000418ced in rb_exec_end_proc () at eval_jump.c:134
#13 0x0000000000418df4 in ruby_finalize_0 () at eval.c:107
#14 0x0000000000418ea9 in ruby_cleanup (ex=0) at eval.c:142
#15 0x0000000000419209 in ruby_stop (ex=8242580) at eval.c:217
#16 0x000000000046f65d in rb_f_fork (obj=) at process.c:2623
#17 0x00000000004fedef in vm_call_cfunc (th=0x7dc500, reg_cfp=0x2aec324a5c48, num=0, recv=16884920, blockptr=0x2aec324a5c70, flag=0, me=0x8bbdd0)
    at vm_insnhelper.c:385
#18 0x0000000000511b43 in vm_call_method (th=0x7dc500, cfp=dwarf2_read_address: Corrupted DWARF expression.
) at vm_insnhelper.c:510
#19 0x00000000005045d6 in vm_exec_core (th=0x7dc500, initial=) at insns.def:994
#20 0x0000000000509a04 in vm_exec (th=0x7dc594) at vm.c:1099
#21 0x000000000050be7b in vm_call0 (th=0x7dc500, recv=16884920, id=448, argc=1, argv=0x2aec323a60b0, me=0x1217660) at vm_eval.c:62
#22 0x00000000005105cd in rb_funcall2 (recv=16884920, mid=448, argc=1, argv=0x2aec323a60b0) at vm_eval.c:266
#23 0x0000000000449402 in rb_class_new_instance (argc=1, argv=0x2aec323a60b0, klass=) at object.c:1494
#24 0x00000000004fedef in vm_call_cfunc (th=0x7dc500, reg_cfp=0x2aec324a5e00, num=1, recv=16885080, blockptr=0x0, flag=0, me=0x84e330)
    at vm_insnhelper.c:385
#25 0x0000000000511b43 in vm_call_method (th=0x7dc500, cfp=dwarf2_read_address: Corrupted DWARF expression.
) at vm_insnhelper.c:510
#26 0x00000000005045d6 in vm_exec_core (th=0x7dc500, initial=) at insns.def:994
#27 0x0000000000509a04 in vm_exec (th=0x7dc594) at vm.c:1099
#28 0x0000000000509d71 in rb_iseq_eval (iseqval=16382880) at vm.c:1296
#29 0x000000000054e875 in rb_load_internal (fname=16492120, wrap=) at load.c:293
#30 0x000000000054e9cc in rb_f_load (argc=, argv=) at load.c:366
#31 0x00000000004fedef in vm_call_cfunc (th=0x7dc500, reg_cfp=0x2aec324a5f08, num=1, recv=8631080, blockptr=0x0, flag=0, me=0x8c61f0)
    at vm_insnhelper.c:385
#32 0x0000000000511b43 in vm_call_method (th=0x7dc500, cfp=dwarf2_read_address: Corrupted DWARF expression.
) at vm_insnhelper.c:510
#33 0x00000000005045d6 in vm_exec_core (th=0x7dc500, initial=) at insns.def:994

child process:

(gdb) where
#0  0x00002aec325a11ad in pthread_cond_destroy@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000000051622e in terminate_atfork_i (key=, val=, current_th=)
    at thread_pthread.c:178
#2  0x00000000004b00b9 in st_foreach (table=0x7dc8f0, func=0x516120 , arg=23356528) at st.c:708
#3  0x0000000000513d88 in rb_thread_atfork () at thread.c:2649
#4  0x000000000046f60e in rb_f_fork (obj=) at process.c:2618
#5  0x00000000004fedef in vm_call_cfunc (th=0x1646470, reg_cfp=0x2aaaae63bb98, num=0, recv=17068080, blockptr=0x2aaaae63bbc0, flag=202, me=0x8bbdd0)
    at vm_insnhelper.c:385
#6  0x0000000000511b43 in vm_call_method (th=0x1646470, cfp=dwarf2_read_address: Corrupted DWARF expression.
) at vm_insnhelper.c:510
#7  0x00000000005045d6 in vm_exec_core (th=0x1646470, initial=) at insns.def:994
#8  0x0000000000509a04 in vm_exec (th=0x1573918) at vm.c:1099
#9  0x000000000050a0c3 in invoke_block_from_c (th=0x1646470, block=0x2aaaae63bf30, self=17092960, argc=0, argv=0x0, blockptr=0x0, cref=0x0)
    at vm.c:543
#10 0x000000000050a948 in loop_i () at vm.c:573
#11 0x0000000000416ae9 in rb_rescue2 (b_proc=0x50a910 , data1=0, r_proc=0, data2=0) at eval.c:574
#12 0x00000000004fe739 in rb_f_loop (self=17092960) at vm_eval.c:605
#13 0x00000000004fedef in vm_call_cfunc (th=0x1646470, reg_cfp=0x2aaaae63bf08, num=0, recv=17092960, blockptr=0x2aaaae63bf30, flag=202, me=0x864210)
    at vm_insnhelper.c:385
#14 0x0000000000511b43 in vm_call_method (th=0x1646470, cfp=dwarf2_read_address: Corrupted DWARF expression.
) at vm_insnhelper.c:510
#15 0x00000000005045d6 in vm_exec_core (th=0x1646470, initial=) at insns.def:994
#16 0x0000000000509a04 in vm_exec (th=0x1573918) at vm.c:1099
#17 0x000000000050a0c3 in invoke_block_from_c (th=0x1646470, block=0x1815bb0, self=17092960, argc=0, argv=0x0, blockptr=0x0, cref=0x0) at vm.c:543
#18 0x000000000050a5d2 in rb_vm_invoke_proc (th=0x1646470, proc=0x1815bb0, self=17092960, argc=0, argv=0x10470b8, blockptr=0x0) at vm.c:590
#19 0x00000000005197d5 in thread_start_func_2 (th=0x1646470, stack_start=0x40511120) at thread.c:387
#20 0x00000000005199ee in thread_start_func_1 (th_ptr=0x1573918) at thread_pthread.c:351
#21 0x00002aec3259d1b5 in start_thread () from /lib64/libpthread.so.0
#22 0x00002aec3314336d in clone () from /lib64/libc.so.6
#23 0x0000000000000000 in ?? ()
igrigorik commented 14 years ago

A few more observations: if I kill the forked process, god responds to commands, but it doesn't bring up the process on which it blocked in the first place. However, if I issue an explicit "god restart proc", it does seem to come back.

trying to figure out the actual pattern

eric commented 14 years ago

This is the same issue as #8.

mojombo commented 14 years ago

The Ruby 1.9 mutex patches are now in God 0.9.0. If anyone has additional feedback on how well it works now, I'd much appreciate it!

ambethia commented 14 years ago

It works fine for a while, but eventually:

$ ruby --version

ruby 1.9.1p420 (2010-02-04) [x86_64-linux]

$ god --version

Version 0.9.0

$ top

top - 11:31:46 up 1 day,  1:24,  1 user,  load average: 1.35, 1.19, 1.11
Tasks:  89 total,   1 running,  88 sleeping,   0 stopped,   0 zombie
Cpu(s): 52.5%us,  0.3%sy,  0.0%ni, 47.0%id,  0.2%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   4058360k total,  2327260k used,  1731100k free,   144404k buffers
Swap: 11888060k total,        0k used, 11888060k free,  1877324k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                                                                                                                         
 1107 root      20   0 87852  22m 2948 S  100  0.6   1498:55 god                                                                                                                                                                                                              
14996 deploy    20   0  174m  69m 3128 S    5  1.8   0:49.47 ruby                                                                                                                                                                                                             
  822 mongodb   20   0 4210m 1.5g 1.5g S    0 39.7   0:58.81 mongod                                                                                                                                                                                                           
 1189 www-data  20   0  414m 4908 1832 S    0  0.1   0:03.24 apache2
[...]

$ sudo strace -p 1107

Process 1107 attached - interrupt to quit
futex(0xf7c594, FUTEX_WAIT_PRIVATE, 5, NULL

$ sudo gdb --pid=1107

[...]
(gdb) where
\#0  0x00007f08fc6cf5a9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
\#1  0x00000000004e5245 in native_sleep (th=0xf7c500, tv=0x0) at thread_pthread.c:614
\#2  0x00000000004e63b1 in sleep_forever (th=0xf7c500, deadlockable=1) at thread.c:774
\#3  0x00000000004e641a in thread_join_sleep (arg=<value optimized out>) at thread.c:608
\#4  0x0000000000418adb in rb_ensure (b_proc=<value optimized out>, data1=<value optimized out>, e_proc=<value optimized out>, data2=<value optimized out>) at eval.c:680
\#5  0x00000000004e7d5c in thread_join (target_th=0x2128d90, delay=1e+30) at thread.c:641
\#6  0x00000000004e7e42 in thread_join_m (argc=<value optimized out>, argv=<value optimized out>, self=<value optimized out>) at thread.c:722
\#7  0x00000000004d1e12 in vm_call_cfunc (th=<value optimized out>, reg_cfp=<value optimized out>, num=<value optimized out>, id=<value optimized out>, oid=<value optimized out>, recv=2, klass=16421720, flag=0, mn=0xfa66b8, blockptr=0x0) at vm_insnhelper.c:384
\#8  0x00000000004dc019 in vm_call_method (th=0xf7c500, cfp=0x7f08fcaefa10, num=0, blockptr=0x213, flag=0, id=0, mn=0xfa6680, recv=22885408) at vm_insnhelper.c:517
\#9  0x00000000004d6f0c in vm_exec_core (th=0xf7c500, initial=<value optimized out>) at insns.def:999
\#10 0x00000000004d9ec2 in vm_exec (th=<value optimized out>) at vm.c:1082
\#11 0x00000000004db340 in invoke_block_from_c (th=<value optimized out>, proc=<value optimized out>, self=<value optimized out>, argc=<value optimized out>, argv=<value optimized out>, blockptr=<value optimized out>) at vm.c:526
\#12 rb_vm_invoke_proc (th=<value optimized out>, proc=<value optimized out>, self=<value optimized out>, argc=<value optimized out>, argv=<value optimized out>, blockptr=<value optimized out>) at vm.c:573
\#13 0x000000000041a960 in rb_exec_end_proc () at eval_jump.c:133
\#14 0x000000000041aa24 in ruby_finalize_0 () at eval.c:108
\#15 0x000000000041aafe in ruby_cleanup (ex=<value optimized out>) at eval.c:143
\#16 0x000000000041ad39 in ruby_stop (ex=16237972) at eval.c:218
\#17 0x0000000000461107 in rb_f_fork (obj=<value optimized out>) at process.c:2394
\#18 0x00000000004d1e12 in vm_call_cfunc (th=<value optimized out>, reg_cfp=<value optimized out>, num=<value optimized out>, id=<value optimized out>, oid=<value optimized out>, recv=2, klass=16792552, flag=8, mn=0xfbb8e8, blockptr=0x7f08fcaefc18)
    at vm_insnhelper.c:384
\#19 0x00000000004dc019 in vm_call_method (th=0xf7c500, cfp=0x7f08fcaefbf0, num=0, blockptr=0x213, flag=8, id=0, mn=0xfbb878, recv=27407576) at vm_insnhelper.c:517
\#20 0x00000000004d6f0c in vm_exec_core (th=0xf7c500, initial=<value optimized out>) at insns.def:999
\#21 0x00000000004d9ec2 in vm_exec (th=<value optimized out>) at vm.c:1082
---Type <return> to continue, or q <return> to quit---
igrigorik commented 13 years ago

Bump. Same issue, still there on 1.9.2 p0, another trace: https://gist.github.com/9ccfda78c52356e74291

igrigorik commented 13 years ago

Good news.. we have a Ruby patch which address this. Full details: https://gist.github.com/762807#comments

(unfortunately rubylang redmine is down atm)

dbackeus commented 13 years ago

Is this fix included in the new 1.9.2 release (p180)?

igrigorik commented 13 years ago

According to the changelog @ http://svn.ruby-lang.org/repos/ruby/tags/v1_9_2_180/ChangeLog

Unfortunately, don't see it..

http://redmine.ruby-lang.org/repositories/revision/ruby-19?rev=30272