sni / lmd

Livestatus Multitool Daemon - Create livestatus federation from multiple sources
https://labs.consol.de/omd/packages/lmd/
GNU General Public License v3.0

Panic: got no index when updating delta comments #46

Closed danirod closed 5 years ago

danirod commented 5 years ago

We're seeing this panic being thrown in lmd.log ~when requests for the logs table are received by lmd. These requests are initiated by a batch that is requesting multiple logs. The request that is being handled when this panic happens gets interrupted; further requests are never made because of a connection error~.

Update: as explained in a comment below, I got to see that these panics have nothing to do with requests to a particular table such as logs.

We're still investigating this issue, looking for additional clues that don't involve restarting critical services such as Shinken, as this is a big deployment where the downtime during a reload of the Shinken configuration would take quite some time. However, I'm opening this issue in case this is actually a bug in lmd itself.

[2018-11-07 11:31:13][Panic][peer.go:1108] got null index
[2018-11-07 11:31:13][Error][peer.go:2696] [livestatus-socket by socket] Panic: got null index
[2018-11-07 11:31:13][Error][peer.go:2697] [livestatus-socket by socket] goroutine 42 [running]:
runtime/debug.Stack(0xc4201b6600, 0x92fc34, 0xe)
    /home/danirod/.gvm/gos/go1.9.2/src/runtime/debug/stack.go:24 +0xa7
main.logPanicExitPeer(0xc420120180)
    /home/danirod/go/src/github.com/sni/lmd/lmd/peer.go:2697 +0x12f
panic(0x87e9a0, 0xc42358eb80)
    /root/.gvm/gos/go1.9.2/src/runtime/panic.go:491 +0x283
github.com/kdar/factorlog.(*FactorLog).Panicf(0xc4201b6600, 0x92fde6, 0xe, 0x0, 0x0, 0x0)
    /home/danirod/go/src/github.com/kdar/factorlog/factorlog.go:411 +0x145
main.(*Peer).UpdateDeltaCommentsOrDowntimes(0xc420120180, 0x92c49d, 0x8, 0x0, 0x0)
    /home/danirod/go/src/github.com/sni/lmd/lmd/peer.go:1108 +0x177b
main.(*Peer).UpdateDeltaTables(0xc420120180, 0x92d76a)
    /home/danirod/go/src/github.com/sni/lmd/lmd/peer.go:779 +0xb1d
main.(*Peer).periodicUpdate(0xc420120180, 0xc420073e7e, 0xc420073e80)
    /home/danirod/go/src/github.com/sni/lmd/lmd/peer.go:398 +0x370
main.(*Peer).updateLoop(0xc420120180)
    /home/danirod/go/src/github.com/sni/lmd/lmd/peer.go:329 +0x302
main.(*Peer).Start.func1(0xc420120180)
    /home/danirod/go/src/github.com/sni/lmd/lmd/peer.go:228 +0x59
created by main.(*Peer).Start
    /home/danirod/go/src/github.com/sni/lmd/lmd/peer.go:225 +0x17f

[2018-11-07 11:31:13][Error][peer.go:2699] [livestatus-socket by socket] LastQuery:
[2018-11-07 11:31:13][Error][peer.go:2700] [livestatus-socket by socket] GET comments
ResponseHeader: fixed16
OutputFormat: json
Columns: author comment entry_time entry_type expires expire_time id is_service persistent source type host_name service_description host_contacts service_contacts
Filter: id = 429
[and then a list of about a thousand filters on the id column that I've omitted for brevity]
sni commented 5 years ago

Did you really request logs? Because the panic says GET comments. Anyway, could you try the latest HEAD? I fixed a few things.

danirod commented 5 years ago

Okay, I'm really sorry for keeping this issue hanging for so many days. Anyway, we could still reproduce this on a version compiled from HEAD, at least HEAD as it was 13 days ago, which was at 0ffe74514e122f3da45543109842be22c943e317 (here is the tree).

However, here is an update: I've come to the conclusion that this is not related to log requests at all, after looking at the source code behind this stack trace and after checking 2 to 3 weeks of rotated lmd logs on our servers.

The stack trace indicates that the panic happens when updating delta tables in updateLoop. Now, correct me if I'm wrong, but the peers are updated in a separate goroutine that is started when the peer starts.

https://github.com/sni/lmd/blob/0ffe74514e122f3da45543109842be22c943e317/lmd/peer.go#L229-L237 (And here is the Sourcegraph for that code excerpt - wonderful tool for navigating through function definitions and function calls, by the way)
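For context, the pattern I'm describing looks roughly like the sketch below. This is not the actual lmd code, just a minimal, hypothetical reconstruction of the referenced excerpt: each peer starts one goroutine that periodically runs the delta updates, so those updates run concurrently with everything else lmd is doing (such as answering incoming requests).

```go
// Hypothetical sketch of the per-peer update loop pattern, not the real lmd code.
package main

import (
	"fmt"
	"time"
)

type Peer struct {
	name     string
	shutdown chan struct{}
}

// Start launches the background update loop, roughly like Peer.Start in peer.go.
func (p *Peer) Start() {
	go p.updateLoop()
}

func (p *Peer) updateLoop() {
	ticker := time.NewTicker(500 * time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case <-p.shutdown:
			return
		case <-ticker.C:
			// in lmd this is periodicUpdate -> UpdateDeltaTables ->
			// UpdateDeltaCommentsOrDowntimes, which is where the panic fires
			p.periodicUpdate()
		}
	}
}

func (p *Peer) periodicUpdate() {
	fmt.Println(p.name, "running delta update")
}

func main() {
	p := &Peer{name: "livestatus-socket", shutdown: make(chan struct{})}
	p.Start()
	time.Sleep(2 * time.Second)
	close(p.shutdown)
}
```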

In fact, after grepping for all the occurrences of the "got null index" panic in our logs using the -B 1 flag (so that it also prints the line before each match), I see that the table being requested may be different each time. Sometimes it's logs, sometimes it's downtimes, or hosts. Therefore, the fact that I saw a log request right before the panic trace may be pure coincidence with no further relationship. It could just be a line that was sent to stdout milliseconds before the panic. (Those request logs are noisy at some points during the day.)

I'm sorry for the confusion on this; I will update the issue title. However, as I said, we could still replicate this with a more recent version. What is interesting is that, after dumping the stack trace for this panic, the request and the response are attached.

The request is the same as the one in my first message of the issue. It queries the comments table, filtering by a lot of comment IDs. The response is also present, but pasting it here publicly would get me in trouble. However, it honors the OutputFormat request header and looks like a big JSON array of JSON arrays. I'll inspect the contents of Shinken's response to see if there is something interesting there, such as an invalid field or something like that. Who knows, maybe it's another bug on Shinken's side.
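To illustrate what I mean by an array of arrays (the values below are made up, only the shape matches), a response honoring OutputFormat: json for the query above would look and decode roughly like this:

```go
// Illustrative only: made-up response body for the comments query above.
// One inner array per comment, columns in the same order as the Columns
// header, so index 6 is "id".
package main

import (
	"encoding/json"
	"fmt"
	"log"
)

func main() {
	body := []byte(`[
	  ["admin","ack",1541589073,4,0,0,429,0,1,0,1,"somehost","",[],[]],
	  ["admin","downtime",1541589100,2,0,0,430,1,1,0,2,"otherhost","PING",[],[]]
	]`)

	var rows [][]interface{}
	if err := json.Unmarshal(body, &rows); err != nil {
		log.Fatal(err)
	}
	for _, row := range rows {
		fmt.Println("comment id:", row[6])
	}
}
```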

sni commented 5 years ago

It's true that each peer starts a separate goroutine to update its data periodically. However, many things happen in parallel, for example incoming requests. In theory, there is no point at which there is no index, except at startup or when the core reloaded and everything is fetched fresh. That's why LMD panics at that point, because something is going seriously wrong then. While logging the panic, it prints the current query, so it might at least be related to many things running at the same time. Usually the peer uses locking to prevent such things, but maybe we missed a lock somewhere and a query tries to read an index while it's being removed.
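To illustrate what I mean (hypothetical names, not the actual lmd code): the index lookup and the index rebuild both have to hold the peer lock. If either side skips it, a query can observe a missing or half-rebuilt index, which is exactly the situation the panic guards against.

```go
// Hypothetical sketch of the locking that should protect the index.
package main

import (
	"fmt"
	"sync"
)

// Peer holds a lookup index, e.g. comment id -> row number.
type Peer struct {
	lock  sync.RWMutex
	index map[int]int
}

// lookup reads the index under the read lock.
func (p *Peer) lookup(id int) (int, bool) {
	p.lock.RLock()
	defer p.lock.RUnlock()
	row, ok := p.index[id]
	return row, ok
}

// rebuildIndex swaps the index under the write lock, e.g. after the core
// reloaded and everything is fetched fresh.
func (p *Peer) rebuildIndex(fresh map[int]int) {
	p.lock.Lock()
	defer p.lock.Unlock()
	p.index = fresh
}

func main() {
	p := &Peer{index: map[int]int{429: 0}}

	var wg sync.WaitGroup
	wg.Add(2)
	go func() { // reader: a query resolving a comment id
		defer wg.Done()
		if _, ok := p.lookup(429); !ok {
			fmt.Println("no index for comment 429") // the "got no index" case
		}
	}()
	go func() { // writer: a concurrent refresh replacing the index
		defer wg.Done()
		p.rebuildIndex(map[int]int{})
	}()
	wg.Wait()
}
```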

dgilm commented 5 years ago

May be related to this? https://github.com/sni/lmd/commit/e388b51053e417e6752efd0fb18bc563a2470622

sni commented 5 years ago

I don't think so, that commit is not even a single day old.

dgilm commented 5 years ago

I mean, maybe this commit fixes the problem? The comments table, a panic with indexes... it smells quite similar.

sni commented 5 years ago

Will close this for now, please raise a new issue if things are still going wrong.