vbatoufflet / go-livestatus

MK Livestatus binding for Go
BSD 3-Clause "New" or "Revised" License

Livestatus massive command submission can hang if the response is not handled #14

Open riton opened 2 years ago

riton commented 2 years ago

Description

We're using go-livestatus to submit a huge amount of passive check results to naemon through the livestatus unix socket. Very important detail: we're opening only one connection to submit 900+ check results.

Many of our probes submit check results for production and test clusters. Sometimes the hosts and services exist in the naemon configuration, sometimes they don't. This is no real problem for naemon: it simply emits a warning in its logs and ignores the event.

Everything was working fine until one of our probes started to submit, through livestatus, a very large number of check results for services that do not exist in naemon. We then saw the livestatus connection block after ~500 results, and no more check results could be submitted through that connection.

What we think happened

After a little analysis, we were able to identify the problem. On successful commands, livestatus does NOT respond with anything. On failed commands, livestatus responds with the error.

The current go-livestatus code does not read the command response from livestatus. We think that after a certain number of failed commands on the same connection, livestatus blocks if nothing reads its responses. I didn't dig into the livestatus code, so I'm not 100% sure of the exact reason here.
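To illustrate the hypothesis, here is a rough sketch that bypasses go-livestatus and writes failing commands straight to the unix socket without ever reading back. The socket path, the external command line (COMMAND [timestamp] NAME;args terminated by a newline) and the assumption that naemon keeps the connection open across commands all come from our setup, so treat it as a sketch rather than a reference:

package main

import (
    "fmt"
    "net"
    "time"
)

func main() {
    conn, err := net.Dial("unix", "/var/cache/naemon/live")
    if err != nil {
        panic(err)
    }
    defer conn.Close()

    for i := 0; i < 900; i++ {
        cmd := fmt.Sprintf(
            "COMMAND [%d] PROCESS_SERVICE_CHECK_RESULT;hostname-no-exist-%d;service-description-no-exist-%d;0;Output\n",
            time.Now().Unix(), i, i)

        // Only write, never read the error responses. If the hypothesis is
        // right, the unread responses eventually fill the socket buffers and
        // the write blocks; the deadline turns that hang into a visible error.
        conn.SetWriteDeadline(time.Now().Add(2 * time.Second))
        if _, err := conn.Write([]byte(cmd)); err != nil {
            fmt.Printf("write blocked at command %d: %v\n", i, err)
            return
        }
    }

    fmt.Println("all commands written without blocking")
}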

Can we reproduce the problem?

Yes, the following code can reproduce the problem:

package main

import (
    "fmt"
    "time"

    livestatus "github.com/vbatoufflet/go-livestatus"
    lnagios "github.com/vbatoufflet/go-livestatus/nagios"
)

func main() {
    // Single client/connection used for all 900 commands below.
    c := livestatus.NewClient("unix", "/var/cache/naemon/live")
    defer c.Close()

    // Build 900 check result commands for hosts and services that do not
    // exist in the naemon configuration.
    var lCmds []livestatus.Command
    for i := 0; i < 900; i++ {
        lCmd := lnagios.ProcessServiceCheckResult(
            fmt.Sprintf("hostname-no-exist-%d", i),
            fmt.Sprintf("service-description-no-exist-%d", i),
            0,
            "Output",
        )
        lCmds = append(lCmds, *lCmd)
    }

    // Submit every command over that single connection.
    for idx, cmd := range lCmds {
        resp, err := c.Exec(cmd)
        if err != nil {
            panic(fmt.Sprintf("[%d] failed: %v\n", idx, err))
        }

        fmt.Printf("[%d] resp = %v\n", idx, resp)
    }
}

This program submits 900 check results to naemon for hosts and services that do not exist in the naemon configuration, so livestatus returns an error for each command.

Software versions

naemon server

[root@rocky8 ~]# rpm -qa '*naemon*' | sort
libnaemon-1.3.0-13.16.x86_64
naemon-core-1.3.0-13.16.x86_64
naemon-livestatus-1.3.0-11.16.x86_64
naemon-thruk-1.3.0-10.16.noarch

Go client

# go.mod
[...]
require github.com/vbatoufflet/go-livestatus v0.0.0-20190218065636-65182dd594b0
[...]
❯ go version
go version go1.17.7 linux/amd64

Known workarounds

A simple workaround is to Close() the livestatus connection after a small number of check results have been submitted, then open a new one to continue. This creates other problems (like the livestatus socket being temporarily unavailable) when a burst of check results occurs, but this workaround is available right now and easy to use.
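A minimal sketch of that workaround, reusing the same client calls as the repro above and an arbitrary batch size of 100 (any value comfortably below the ~500 mark should do):

package main

import (
    "fmt"

    livestatus "github.com/vbatoufflet/go-livestatus"
    lnagios "github.com/vbatoufflet/go-livestatus/nagios"
)

// batchSize is arbitrary, just kept well below the ~500-command point at
// which we saw the connection block.
const batchSize = 100

// submitAll sends every command, recycling the livestatus connection after
// each batch so unread error responses never pile up on a single connection.
func submitAll(cmds []livestatus.Command) error {
    c := livestatus.NewClient("unix", "/var/cache/naemon/live")

    for i, cmd := range cmds {
        if _, err := c.Exec(cmd); err != nil {
            c.Close()
            return fmt.Errorf("command %d failed: %w", i, err)
        }

        if (i+1)%batchSize == 0 {
            c.Close()
            c = livestatus.NewClient("unix", "/var/cache/naemon/live")
        }
    }

    c.Close()
    return nil
}

func main() {
    // Same 900 commands as in the repro, for hosts/services that do not exist.
    var cmds []livestatus.Command
    for i := 0; i < 900; i++ {
        cmds = append(cmds, *lnagios.ProcessServiceCheckResult(
            fmt.Sprintf("hostname-no-exist-%d", i),
            fmt.Sprintf("service-description-no-exist-%d", i),
            0,
            "Output",
        ))
    }

    if err := submitAll(cmds); err != nil {
        panic(err)
    }
}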

Discussion

We're currently using the close-and-reopen-the-livestatus-socket workaround.

Another solution would be to properly handle the response from livestatus. The real problem is that, AFAIK, livestatus does not respond with anything if a COMMAND executed successfully. It only responds if an error occurred.

I'm quite curious to know if someone else has already triggered this behavior.

I'm also curious to know if you have any better idea about how to handle this in the go-livestatus library.

Thanks in advance

riton commented 2 years ago

I've quickly developed a version of the Command.handle() function that has a switch to read a response from livestatus. See https://github.com/ccin2p3/go-livestatus/blob/feature/cmd_read_result/command.go#L95

I may be missing something. Maybe there is a magic flag / header that tells livestatus to ACK and write back a success when a command has executed successfully.

I've checked how Thruk (the WebUI and more) handles this kind of thing. It seems that it also does a select() on the socket to see if something is available to read. Code is at https://github.com/sni/Thruk/blob/v2.46.3/lib/Monitoring/Livestatus.pm#L1155

The solution I've implemented is similar. I don't like it at all: it's like putting a sleep in production code.
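For reference, the Go equivalent of that select()-with-timeout check could look roughly like the helper below. readOptionalResponse is just a name made up for this sketch, and net.Pipe only stands in for the real livestatus socket so the example is self-contained:

package main

import (
    "fmt"
    "net"
    "time"
)

// readOptionalResponse checks whether the peer wrote anything back after a
// command was submitted. The short read deadline plays the role of Thruk's
// select()-with-timeout: no data within the window is treated as success.
func readOptionalResponse(conn net.Conn, wait time.Duration) (string, error) {
    if err := conn.SetReadDeadline(time.Now().Add(wait)); err != nil {
        return "", err
    }
    // Clear the deadline so later reads on the connection are unaffected.
    defer conn.SetReadDeadline(time.Time{})

    buf := make([]byte, 4096)
    n, err := conn.Read(buf)
    if err != nil {
        if netErr, ok := err.(net.Error); ok && netErr.Timeout() {
            return "", nil // nothing written back: presumably success
        }
        return "", err
    }
    return string(buf[:n]), nil // the peer wrote something, most likely an error
}

func main() {
    // Demonstrate both cases on an in-memory pipe instead of a real socket.
    client, server := net.Pipe()
    defer client.Close()
    defer server.Close()

    // Case 1: the peer stays silent, as livestatus does on success.
    resp, err := readOptionalResponse(client, 200*time.Millisecond)
    fmt.Printf("silent peer: resp=%q err=%v\n", resp, err)

    // Case 2: the peer writes something back, as livestatus does on failure.
    go server.Write([]byte("example error reply\n"))
    resp, err = readOptionalResponse(client, 200*time.Millisecond)
    fmt.Printf("error reply: resp=%q err=%v\n", resp, err)
}

The obvious downside is the same as with the select() approach: the timeout is a guess, which is exactly the sleep-in-production feeling described above.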

I'm absolutely open to any other solution!

Note: if you'd prefer me to open a Pull Request to discuss a possible solution, just ask.

Thanks.

riton commented 2 years ago

For information, I've tried to reproduce the problem with the livestatus.py Python API using this example code.

I was able to reproduce the problem.

Since the problem also shows up outside of Go, I'm starting to doubt that go-livestatus should be modified at all...

vbatoufflet commented 2 years ago

Hi @riton,

It's been a while since I last used go-livestatus on production systems.

IIRC, back in the day I never hit this issue, hence – sadly – I can't think of a specific solution to circumvent it at the moment.

You're right that having a sleep-like system to get around this is far from ideal. Maybe having a way to enable this behavior, optional and disabled by default, could do the trick. But I haven't looked into the details yet.
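Purely to illustrate that idea (none of these identifiers exist in go-livestatus today; this only sketches what "optional and disabled by default" could look like):

package livestatus // hypothetical sketch, not the real package layout

import "time"

// clientOptions and WithCommandResponseRead are made-up names: the point is
// only that reading command responses would be opt-in, off by default.
type clientOptions struct {
    readCommandResponse bool          // drain any error reply after a command
    responseWait        time.Duration // how long to wait before assuming success
}

// WithCommandResponseRead enables reading back whatever livestatus writes
// after a command, waiting at most `wait` before assuming it succeeded.
func WithCommandResponseRead(wait time.Duration) func(*clientOptions) {
    return func(o *clientOptions) {
        o.readCommandResponse = true
        o.responseWait = wait
    }
}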

That being said, having a PR would be greatly appreciated!