danirod closed this issue 5 years ago
From what I see in the logs, when the peer goes down, the following should happen.
The panics always happen between stages 2 and 3.
Sounds like a race condition somewhere. It's also a bit more complicated: between 1) and 2) there is a soft down state, where LMD keeps the data so it can still answer requests during backend reloads. After some time, LMD puts the peer into a hard down state and removes all data.
However, the UpdateDeltaCommentsOrDowntimes function should not run in either the soft or the hard down state.
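To make that more concrete, here is a minimal sketch with made-up names and locking (this is not LMD's actual code): the update bails out while the peer is down, and the code that drops the tables takes the same lock as the code that writes into them. Without that shared lock, the tables can be set to nil between the status check and the write, which panics with "assignment to entry in nil map".

```go
// Minimal sketch, not LMD's actual types: a delta update that is skipped
// while the peer is in a soft or hard down state, with the status check,
// the map write and the table teardown all protected by the same lock.
package sketch

import "sync"

type PeerStatus int

const (
	PeerUp       PeerStatus = iota
	PeerSoftDown // backend reload: data is kept to keep answering requests
	PeerHardDown // after some time: all data is removed
)

type Peer struct {
	lock     sync.Mutex
	status   PeerStatus
	comments map[int64]string
}

// updateDeltaComments stands in for UpdateDeltaCommentsOrDowntimes.
func (p *Peer) updateDeltaComments(id int64, text string) {
	p.lock.Lock()
	defer p.lock.Unlock()
	if p.status != PeerUp {
		return // never touch the tables while the peer is soft or hard down
	}
	p.comments[id] = text
}

// setHardDown drops all data; it must hold the same lock as the update above,
// otherwise p.comments can become nil between the check and the write.
func (p *Peer) setHardDown() {
	p.lock.Lock()
	defer p.lock.Unlock()
	p.status = PeerHardDown
	p.comments = nil
}
```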
Please try again with the latest HEAD. I added an additional test to check whether that index exists.
Alright, thank you!
I was never able to reproduce this issue locally, so we'll just verify that no panics happen on our production server once this patch gets deployed. No other critical errors arise from this bug, since the process manager takes care of restarting lmd if it panics, so I'm closing this issue because I have no ETA on when it might happen again. If I don't re-open the issue, that will be good news.
Saddened to announce that something similar to #46 is happening here again. However, the stack trace is slightly different, and the panic reason is different as well:

assignment to entry in nil map

located in:

This time I have more detail to provide, although I haven't been able to reproduce this bug locally despite trying hard. I'm just providing data discovered after analyzing days of lmd logs on one of our client installations. I even increased the logging level in the hope of catching some extra details, but I've had no luck (and got a gigabyte-sized log file as a side effect).
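For context, this is the generic failure mode in Go (just an illustration, not LMD's code): writing to a map that is nil, either because it was never initialized or because it was reset to nil when the data was dropped, panics with exactly this message.

```go
package main

import "fmt"

func main() {
	// The zero value of a map is nil: declared but never initialized.
	var comments map[string]string

	// Reads from a nil map are fine and return the zero value.
	fmt.Println(comments["42"]) // prints an empty line

	// A write, however, panics with "assignment to entry in nil map",
	// the same message as in the report above.
	comments["42"] = "some comment"
}
```

Reads from a nil map are safe and just return the zero value, which would explain why only the update path blows up.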
What I see is that the panic always happens after the connection with the site is lost. In other words, up to a minute before the panic happens I always see the warning
site went offline: dial unix ...{path to livestatus socket file}... : connect: resource temporarily unavailable
in the logs. The implication only goes one way: a peer going down does not mean that lmd will crash, but lmd only ever crashes after a peer has gone down. I'm wondering if the same is true for issue #46, but I haven't checked yet.

Happy to provide more feedback if needed, although I still have no idea how to reproduce this locally because I don't understand how the peer is failing in this case.