Closed: danirod closed this issue 5 years ago
did you really request logs? Because the panic says GET comments. Anyway, could you try the latest HEAD? I fixed a few things.
Okay, I'm really sorry for keeping this issue on hold for so many days. Anyway, we could still reproduce this on a version compiled from HEAD, at least HEAD as it was 13 days ago, which was at 0ffe74514e122f3da45543109842be22c943e317 (here is the tree).
However, here is an update: after looking at the source code behind this stack trace, and after checking 2 to 3 weeks of rotated lmd logs on our servers, I've come to the conclusion that this is not related to log requests at all.
The stack trace indicates that the panic happens when updating delta tables in updateLoop. Now, correct me if I'm wrong, but the peers are updated in a separate goroutine that is started when the peer starts.
https://github.com/sni/lmd/blob/0ffe74514e122f3da45543109842be22c943e317/lmd/peer.go#L229-L237 (And here is the Sourcegraph for that code excerpt - wonderful tool for navigating through function definitions and function calls, by the way)
In fact, after grepping for all the occurrences of the "got null index" panic in our logs using the -B 1 flag (so that it also returns the line before each match), I see that the table being requested may be different each time. Sometimes it's logs, sometimes it's downtimes, or hosts. Therefore, the fact that I saw a log request right before the panic trace may be pure coincidence: it could just be a line that was sent to stdout milliseconds before the panic. (Those request logs are noisy at some points during the day.)
I'm sorry for the confusion on this; I will update the issue title. However, as I said, we could still replicate this with a more recent version. What is interesting is that, after dumping the stack trace for this panic, the request and the response are attached.
The request is the same as the one in my first message of the issue: a request to the comments table filtering by a lot of comment IDs. The response is also present, but pasting it here publicly would get me in trouble. It honors the OutputFormat request header, though, and it looks like a big JSON array of JSON arrays. I'll inspect the contents of Shinken's response to see if there is something interesting there, such as an invalid field or something like that. Who knows, maybe it's another bug on Shinken's side.
It's true that each peer starts a separate goroutine to update its data periodically. However, many things happen in parallel, for example incoming requests. In theory, there should never be a point where there is no index, except at startup or when the core reloads and everything is fetched fresh. That's why LMD panics at that point: something is going seriously wrong then. While logging the panic, it prints the current query, so it might just be related to many things running at the same time. Usually the peer uses locking to prevent such things, but maybe we missed a lock somewhere and a query tries to read an index while it's being removed.
May be related to this? https://github.com/sni/lmd/commit/e388b51053e417e6752efd0fb18bc563a2470622
I don't think so, that commit is not a single day old.
I mean, maybe this commit fixes the problem? The comments table, a panic with indexes... it smells quite similar.
Will close this for now, please raise a new issue if anything is still going wrong.
We're seeing this panic being thrown in lmd.log ~when requests for the logs table are received by lmd. These requests are initiated by a batch job that requests multiple logs. The request being handled when this panic happens gets interrupted; further requests are never made because of a connection error~.
Update: as explained in a comment below, I got to see that these panics have nothing to do with requests to a particular table such as logs.
We're still investigating this issue, looking for additional clues that don't involve restarting critical services such as Shinken, since this is a large deployment where the downtime during a reload of the Shinken configuration would take quite some time. However, I'm opening this issue in case this is actually a bug in lmd itself.