influxdata / kapacitor

Open source framework for processing, monitoring, and alerting on time series data
MIT License
2.31k stars 493 forks source link

Trace logging seems to be on permanently #1377

Open pmyjavec opened 7 years ago

pmyjavec commented 7 years ago

While running v1.2.2 Kapacitor seems to be dumping a bunch of traces to standard out.

Here are some of the example traces:

goroutine 2323 [chan receive, 10 minutes]:
github.com/influxdata/kapacitor.(*group).run(0xc4207d4050, 0x0, 0x0)
        /home/builder/go/src/github.com/influxdata/kapacitor/join.go:324 +0xb2
github.com/influxdata/kapacitor.(*JoinNode).getGroup.func1(0xc4207d4050, 0xc4204e6160, 0xc4204e3020)
        /home/builder/go/src/github.com/influxdata/kapacitor/join.go:278 +0x35
created by github.com/influxdata/kapacitor.(*JoinNode).getGroup
        /home/builder/go/src/github.com/influxdata/kapacitor/join.go:286 +0x2c5

net.runtime_pollWait(0x7f770b3205b0, 0x72, 0xa)
        /usr/local/go/src/runtime/netpoll.go:160 +0x59
net.(*pollDesc).wait(0xc479db5db0, 0x72, 0xc455e1e7b0, 0xc420016140)
net.(*pollDesc).waitRead(0xc479db5db0, 0x12116c0, 0xc420016140)
net.(*netFD).Read(0xc479db5d50, 0xc467381000, 0x1000, 0x1000, 0x0, 0x12116c0, 0xc420016140)
net.(*conn).Read(0xc42036c5f8, 0xc467381000, 0x1000, 0x1000, 0x0, 0x0, 0x0)

Here is a link to the running configuration: https://gist.github.com/pmyjavec/57910435a55ef4ba11e7982b40e86c66

pmyjavec commented 7 years ago

@nathanielc,

My bad, I think I got it wrong yesterday.

According to @kostasb the behaviour is pretty normal and is triggered when the process runs out of memory and generates a crash log does that sound accurate?

If this is the case, do you think it's possible to surround the crash log with something to indicate that's what we're looking at? It would be really neat to have something like:

2017/05/15 18:35:19 --- BEGIN CRASH LOG ---

goroutine 2323 [chan receive, 10 minutes]:
github.com/influxdata/kapacitor.(*group).run(0xc4207d4050, 0x0, 0x0)
        /home/builder/go/src/github.com/influxdata/kapacitor/join.go:324 +0xb2
github.com/influxdata/kapacitor.(*JoinNode).getGroup.func1(0xc4207d4050, 0xc4204e6160, 0xc4204e3020)
        /home/builder/go/src/github.com/influxdata/kapacitor/join.go:278 +0x35
created by github.com/influxdata/kapacitor.(*JoinNode).getGroup
        /home/builder/go/src/github.com/influxdata/kapacitor/join.go:286 +0x2c5

net.runtime_pollWait(0x7f770b3205b0, 0x72, 0xa)
        /usr/local/go/src/runtime/netpoll.go:160 +0x59
net.(*pollDesc).wait(0xc479db5db0, 0x72, 0xc455e1e7b0, 0xc420016140)
net.(*pollDesc).waitRead(0xc479db5db0, 0x12116c0, 0xc420016140)
net.(*netFD).Read(0xc479db5d50, 0xc467381000, 0x1000, 0x1000, 0x0, 0x12116c0, 0xc420016140)
net.(*conn).Read(0xc42036c5f8, 0xc467381000, 0x1000, 0x1000, 0x0, 0x0, 0x0)

2017/05/15 18:35:19 --- END CRASH LOG ---

Thanks!

nathanielc commented 7 years ago

@pmyjavec It would be nice to surround it but its part of the Go runtime so I don't think we have access to how it outputs.

The fact that its crashing is not good, can you share the very first lines of a crash, sometimes the crash is an OOMing error, other times its a bug in the code, so it would be good to confirm its not the later.

pmyjavec commented 7 years ago

@nathanielc please see the issue referenced by @kostasb for details on the crashes

pmyjavec commented 7 years ago

@nathanielc,

Actually I managed to find the library used that can wrap the panic in something informative. Here it is: https://github.com/mitchellh/panicwrap

Using that would be really cool, what do you think? By the way I would be interested in helping implement something like that :)

nathanielc commented 7 years ago

Good find I wasn't able to find anything like that in my quick search. Unfortunately I do not think it is a viable solution. The panicwrap is a process wrapper not simply a library. So that would mean that to run Kapacitor you would have to run two process, the panicwrap process and the actual Kapacitor daemon process. That change has significant repercussions for most deployment mechanisms.

pmyjavec commented 7 years ago

Hey @nathanielc, I would like to understand how this would have significant repercussions for most deployment mechanisms? As a user / SRE working with Kapacitor I can't see how this would be a negative change or force me to change my deployment strategy in anyway.

Users wouldn't actually need to run two commands, they would still just use kapacitord as the entry point and panic wrapper just re-executes the program.

Sure this would run the main() loop as a sub-process / fork, but that's it's purpose and is really quite transparent to the user while adding major benefits such as useful control over the crash logging and provide a more helpful user experience.