tomwilkie / prom-run

Periodically run a command and exports its return code as a prometheus metric.
116 stars 12 forks source link

Hang in reap() after timeout #7

Open bboreham opened 6 years ago

bboreham commented 6 years ago

Return from the http endpoint:

  <h2>Prometheus Command Runner</h2>
  <p>"/data/service-conf/ansible/ansible-diff prod" output:</p>
  <pre>command timed out</pre>
  <p>Run at 2018-11-03 19:46:53.80704496 &#43;0000 UTC took 10m0.00211331s<p>

Stack traces via kill -SIGQUIT:

goroutine 7 [syscall, 3982 minutes]:
syscall.Syscall6(0x3d, 0xffffffffffffffff, 0xc4205a3c44, 0x0, 0x0, 0x0, 0x0, 0x7fbbd37a8000, 0x0, 0x0)
    /usr/local/go/src/syscall/asm_linux_amd64.s:44 +0x5
syscall.wait4(0xffffffffffffffff, 0xc4205a3c44, 0x0, 0x0, 0x24, 0xc4205a3cc0, 0x3)
    /usr/local/go/src/syscall/zsyscall_linux_amd64.go:172 +0x88
syscall.Wait4(0xffffffffffffffff, 0xc4205a3c90, 0x0, 0x0, 0x3, 0x0, 0x0)
    /usr/local/go/src/syscall/syscall_linux.go:256 +0x51
main.reap()
    /home/ubuntu/.go_workspace/src/github.com/weaveworks/docker-ansible/vendor/github.com/tomwilkie/prom-run/main.go:35 +0xa9
main.main.func3(0xc420015a30, 0x7ffd23561149, 0x27, 0xc42000a130, 0x1, 0x1, 0xc420015a40, 0xc420015a90, 0xc42012a2c0, 0xc420015a80, ...)
    /home/ubuntu/.go_workspace/src/github.com/weaveworks/docker-ansible/vendor/github.com/tomwilkie/prom-run/main.go:113 +0x633
created by main.main
    /home/ubuntu/.go_workspace/src/github.com/weaveworks/docker-ansible/vendor/github.com/tomwilkie/prom-run/main.go:133 +0x333

goroutine 1351637 [chan receive, 3982 minutes]:
os/exec.(*Cmd).Wait(0xc42067c2c0, 0x0, 0x0)
    /usr/local/go/src/os/exec/exec.go:442 +0x124
os/exec.(*Cmd).Run(0xc42067c2c0, 0xc42005a150, 0x824d70)
    /usr/local/go/src/os/exec/exec.go:279 +0x48
os/exec.(*Cmd).CombinedOutput(0xc42067c2c0, 0x50a514, 0xc4205b6e80, 0xffffffffffffffff, 0xc4206aef90, 0x53df89)
    /usr/local/go/src/os/exec/exec.go:494 +0x140
main.main.func3.1(0xc42067c2c0, 0xc420482060)
    /home/ubuntu/.go_workspace/src/github.com/weaveworks/docker-ansible/vendor/github.com/tomwilkie/prom-run/main.go:88 +0x2f
created by main.main.func3
    /home/ubuntu/.go_workspace/src/github.com/weaveworks/docker-ansible/vendor/github.com/tomwilkie/prom-run/main.go:90 +0x32d

I believe the source is https://github.com/tomwilkie/prom-run/blob/1e15c08f93a0286d8945b784ccad2dbdf0000def/main.go#L35

So both goroutines calling wait are stuck. I wonder if there is a race?

tomwilkie commented 6 years ago

What version are you running? From https://github.com/weaveworks/docker-ansible/blob/master/vendor/manifest I see its 1e15c08f93a0286d8945b784ccad2dbdf0000def

I'd try latest; I re-wrote this a while back to fix a bunch of issues.

bboreham commented 6 years ago

I did look at the latest code, and it doesn't seem any different in this respect - it still calls Wait4() to wait indefinitely.

tomwilkie commented 6 years ago

But it should never call it concurrently. On Wed, 7 Nov 2018 at 17:58, Bryan Boreham notifications@github.com wrote:

I did look at the latest code, and it doesn't seem any different in this respect - it still calls Wait4() to wait indefinitely.

— You are receiving this because you commented.

Reply to this email directly, view it on GitHub https://github.com/tomwilkie/prom-run/issues/7#issuecomment-436718084, or mute the thread https://github.com/notifications/unsubscribe-auth/AAbGhXYRKDwT1xgNogXy20P60-wtCmP8ks5usx9TgaJpZM4YQkkQ .

bboreham commented 6 years ago

Look at the stack traces - exec is calling wait and reap is calling wait.

tomwilkie commented 6 years ago

Unless I'm mistaken, master never calls exec and wait concurrently...

bboreham commented 6 years ago

However, reflecting on the ps output I saw, I wonder if the sub-process has not exited at all.

From the docs I expect cancel() to kill it.

bboreham commented 6 years ago

The code on master calls defer cancel() - how would that work on a timeout?

tomwilkie commented 6 years ago

Thats just idiomatic use of context; its a noop in this case.

This is where it ends up being used: https://github.com/golang/go/blob/go1.11/src/os/exec/exec.go#L408, and the waitDone channel will be closed first.