sni / lmd

Livestatus Multitool Daemon - Create livestatus federation from multiple sources
https://labs.consol.de/omd/packages/lmd/
GNU General Public License v3.0

Panic: assignment to entry in nil map #53

Closed: danirod closed this issue 5 years ago

danirod commented 5 years ago

I'm sad to report that something similar to #46 is happening here again. However, the stack trace is slightly different, and the panic reason is different as well: assignment to entry in nil map, located in:

    runtime/debug.Stack(0xc4201c4030, 0x9472c1, 0xe)
            /root/.gvm/gos/go1.9.2/src/runtime/debug/stack.go:24 +0xa7
    main.logPanicExitPeer(0xc42028a0c0)
            /var/omnibus/src/src/github.com/sni/lmd/lmd/peer.go:2754 +0x12f
    panic(0x8b9ce0, 0x9a4dd0)
            /root/.gvm/gos/go1.9.2/src/runtime/panic.go:491 +0x283
    main.(*DataTable).AddItem(...)
            /var/omnibus/src/src/github.com/sni/lmd/lmd/peer.go:153
    main.(*Peer).UpdateDeltaCommentsOrDowntimes(0xc42028a0c0, 0x943b38, 0x8, 0x0, 0x0)
            /var/omnibus/src/src/github.com/sni/lmd/lmd/peer.go:1128 +0x1453
    main.(*Peer).UpdateDeltaTables(0xc42028a0c0, 0x944e05)
            /var/omnibus/src/src/github.com/sni/lmd/lmd/peer.go:798 +0xb1d
    main.(*Peer).periodicUpdate(0xc42028a0c0, 0xc42ee75e7e, 0xc42ee75e80)
            /var/omnibus/src/src/github.com/sni/lmd/lmd/peer.go:415 +0x370
    main.(*Peer).updateLoop(0xc42028a0c0)
            /var/omnibus/src/src/github.com/sni/lmd/lmd/peer.go:346 +0x30b
    main.(*Peer).Start.func1(0xc42028a0c0)
            /var/omnibus/src/src/github.com/sni/lmd/lmd/peer.go:245 +0x59
    created by main.(*Peer).Start
            /var/omnibus/src/src/github.com/sni/lmd/lmd/peer.go:242 +0x17f
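
For reference, this is the kind of panic Go raises when code writes into a map that was never initialized with make() or that has been reset to nil. A minimal, self-contained sketch with made-up names, not LMD's actual types:

    package main

    // table is a made-up stand-in, not LMD's DataTable. Its map field is
    // declared but never initialized, so the first write into it panics.
    type table struct {
        index map[string]int
    }

    func main() {
        t := &table{}        // t.index is nil
        t.index["host1"] = 0 // panic: assignment to entry in nil map
    }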

This time I have more detail to provide, although I haven't been able to reproduce the bug locally despite trying hard. The data below comes from analyzing days of lmd logs on one of our client installations. I even increased the logging level in the hope of catching extra details, but had no luck (and got a gigabyte log file as a side effect).

What I see is that the panic always happens after the connection with the site is lost. In other words, up to a minute before the panic I always find the warning `site went offline: dial unix ...{path to livestatus socket file}... : connect: resource temporarily unavailable` in the logs. The implication only goes one way: a peer going down does not mean that lmd will crash, but lmd only ever crashes after a peer has gone down. I'm wondering if the same is true for issue #46, but I haven't checked yet.

Happy to provide more feedback if needed, although I still have no idea how to reproduce this locally because I don't understand how the peer is failing in this case.

danirod commented 5 years ago

From what I see in the logs, the following happens when the peer goes down:

  1. lmd cannot contact the site, so it logs the 'site went offline' message.
  2. Further requests received by lmd return empty results, as lmd has no data for that peer. Looking at the code, I see that whenever a peer disconnects, lmd clears the data for that peer, so this makes sense.
  3. Eventually lmd manages to reconnect to the peer and starts fetching the initial data again. It logs "fetched %d initial %s" messages and, when it finishes, "objects are created".
  4. Further requests sent to lmd are handled properly because the data tables have been populated.

Panics always happen between stages 2 and 3; a sketch of how that window could produce this panic is below.
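
Purely as an illustration, with made-up names rather than LMD's real structures: if the data for a peer is cleared on disconnect and a delta update still runs before the initial data has been fetched again, the write lands in a nil map.

    package main

    import "sync"

    // peer is a made-up stand-in for a peer's cached tables.
    type peer struct {
        mu       sync.Mutex
        comments map[string]int
    }

    // clearData simulates stage 2: the peer went down, so its data is dropped.
    func (p *peer) clearData() {
        p.mu.Lock()
        defer p.mu.Unlock()
        p.comments = nil // nil until the initial data is fetched again
    }

    // updateDeltaComments simulates the periodic delta update. If it runs
    // after clearData but before the tables are rebuilt (between stages 2
    // and 3), the write below panics with "assignment to entry in nil map".
    func (p *peer) updateDeltaComments() {
        p.mu.Lock()
        defer p.mu.Unlock()
        p.comments["42"] = 1
    }

    func main() {
        p := &peer{comments: map[string]int{}}
        p.clearData()           // peer went down, data removed
        p.updateDeltaComments() // stale delta update: panics
    }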

sni commented 5 years ago

Sounds like a race condition somewhere. It's a bit more complicated as well: between 1) and 2) there is a soft-down state in which LMD keeps the data to answer requests during backend reloads. After some time, LMD puts the peer into a hard down state and removes all data. However, the UpdateDeltaCommentsOrDowntimes function should not run in either the soft or hard down state.
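
Roughly, the intended rule looks like this (illustrative names only, not the actual peer.go code):

    package main

    import "fmt"

    // Hypothetical peer states illustrating the behaviour described above.
    type peerStatus int

    const (
        peerUp       peerStatus = iota // connected: full and delta updates run
        peerSoftDown                   // backend reload: cached data kept, delta updates skipped
        peerHardDown                   // peer gone: data removed, delta updates skipped
    )

    // shouldRunDeltaUpdate encodes the rule that delta updates (including
    // comments/downtimes) only run while the peer is up.
    func shouldRunDeltaUpdate(s peerStatus) bool {
        return s == peerUp
    }

    func main() {
        fmt.Println(shouldRunDeltaUpdate(peerSoftDown)) // false
    }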

sni commented 5 years ago

Please try again with the latest HEAD. I added an additional test to see if that index exists.
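
The idea is roughly the following (an illustrative sketch, not the actual patch): check that the index exists before writing, and skip the item otherwise.

    package main

    import "fmt"

    // addItem is a hypothetical example of such a defensive test: verify
    // the index map exists before writing into it, instead of assuming it
    // is always initialized.
    func addItem(index map[string]int, key string, value int) error {
        if index == nil {
            return fmt.Errorf("index not initialized, skipping %q", key)
        }
        index[key] = value
        return nil
    }

    func main() {
        var idx map[string]int // nil: the peer's data was cleared
        if err := addItem(idx, "host1", 0); err != nil {
            fmt.Println(err) // logged instead of panicking
        }
    }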

danirod commented 5 years ago

Alright, thank you!

I was never able to reproduce this issue locally, so we'll simply check that no panics happen on our production server once this patch gets deployed. No other critical errors arise from this bug, as the process manager takes care of restarting lmd when it panics, and I don't have an ETA for when the deployment will happen, so I'm closing this issue. If I don't re-open it, that's good news.