riton opened this issue 2 years ago
I've quickly developed a version of the Command.handle() function that has a switch to read a response from livestatus. See https://github.com/ccin2p3/go-livestatus/blob/feature/cmd_read_result/command.go#L95

I may be missing something. Maybe there is a magic flag / header to tell livestatus to ACK and write back a success when a command has executed successfully.
I've checked to see how Thruk (the WebUI and more) handles this kind of thing. It seems that they are also doing a select() on the socket to see if something is available to read. Code is at https://github.com/sni/Thruk/blob/v2.46.3/lib/Monitoring/Livestatus.pm#L1155

The solution I've implemented is similar, and I don't like it at all. It feels like putting a sleep in production code.
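To illustrate the idea, here is a minimal sketch using a read deadline on the unix socket rather than a raw select(). This is not the code from the branch linked above; the socket path, host/service names, and the 100ms timeout are made-up values, and it talks to livestatus directly instead of going through go-livestatus:

```go
package main

import (
	"fmt"
	"net"
	"os"
	"time"
)

// writeCommandAndPeek writes a raw livestatus external command, then waits a
// short time to see whether livestatus wrote anything back. Livestatus only
// answers COMMANDs on error, so a read timeout is treated as success.
func writeCommandAndPeek(conn net.Conn, cmd string) error {
	if _, err := fmt.Fprintf(conn, "%s\n", cmd); err != nil {
		return err
	}

	// Small read deadline: the moral equivalent of select() with a timeout.
	if err := conn.SetReadDeadline(time.Now().Add(100 * time.Millisecond)); err != nil {
		return err
	}
	defer conn.SetReadDeadline(time.Time{}) // clear the deadline afterwards

	buf := make([]byte, 4096)
	n, err := conn.Read(buf)
	if err != nil {
		if ne, ok := err.(net.Error); ok && ne.Timeout() {
			return nil // nothing to read: assume the command succeeded
		}
		return err
	}
	return fmt.Errorf("livestatus replied: %s", buf[:n])
}

func main() {
	conn, err := net.Dial("unix", "/var/cache/naemon/live") // assumed socket path
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer conn.Close()

	ts := time.Now().Unix()
	cmd := fmt.Sprintf("COMMAND [%d] PROCESS_SERVICE_CHECK_RESULT;myhost;myservice;0;OK", ts)
	if err := writeCommandAndPeek(conn, cmd); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```

The timeout-based wait is exactly the part I dislike: a command that takes longer than the deadline to fail looks like a success.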
I'm absolutely open to any other solution!
Note: If you'd prefer me to open a Pull Request to discuss a possible solution, just ask.
Thanks.
For information, I've tried to reproduce the problem with the livestatus.py Python API using this example code, and I was able to reproduce it there as well.

I'm starting to doubt that go-livestatus should be modified...
Hi @riton,
It's been a while since I've used go-livestatus on production systems.

IIRC, back in the day, I never hit this issue, so sadly I can't think of a specific solution to circumvent it at the moment.

You're right that having a sleep-like system to get around this is far from ideal. Maybe a way to enable this behavior, optional and disabled by default, could do the trick. But I haven't looked into the details yet.
That being said, having a PR would be greatly appreciated!
Description
We're using go-livestatus to submit a huge amount of passive check results to naemon through the livestatus unix socket. Very important detail: we're opening only one connection to submit 900+ check results.

Many of our probes submit check results for production and test clusters. Sometimes the hosts and services exist in the naemon configuration, sometimes they don't. This is no real problem for naemon: it simply emits a warning in its logs and ignores the event.

Everything was working fine until one of our probes started to submit a very large amount of check results for services that do not exist in naemon, using livestatus. We then saw the livestatus connection blocking after ~500 results, and no more check results could be submitted through the connection.

What we think happened
After a little analysis, we were able to identify the problem. On successful commands, livestatus does NOT respond with anything. On failed commands, livestatus responds with the error.

The current go-livestatus code does not read command responses from livestatus. We think that after a certain amount of failed commands on the same connection, livestatus blocks if nothing reads its responses. I didn't dig into the livestatus code, so I'm not 100% sure of the exact reason here.

Can we reproduce the problem
Yes, the following code can reproduce the problem:
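A minimal sketch of such a reproduction is below. It bypasses go-livestatus and writes the external commands to the unix socket directly (the library behaves the same here, since it never reads command responses either); the socket path, host, and service names are made up:

```go
package main

import (
	"fmt"
	"log"
	"net"
	"time"
)

func main() {
	conn, err := net.Dial("unix", "/var/cache/naemon/live") // assumed socket path
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	ts := time.Now().Unix()
	for i := 0; i < 900; i++ {
		// Host and service deliberately absent from the naemon configuration,
		// so livestatus answers each command with an error that we never read.
		cmd := fmt.Sprintf(
			"COMMAND [%d] PROCESS_SERVICE_CHECK_RESULT;no-such-host;no-such-service;0;OK - fake result %d",
			ts, i,
		)
		if _, err := fmt.Fprintf(conn, "%s\n", cmd); err != nil {
			// On our setup, writes start blocking after roughly 500 commands.
			log.Fatalf("write failed at command %d: %v", i, err)
		}
	}
	log.Println("all 900 commands written")
}
```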
This program submits 900 check results to naemon for a host and a service that do not exist in the naemon configuration. livestatus will then return an error for each command.

Software versions
naemon server
Go client
Known workarounds
A simple workaround is to Close() the livestatus connection after a small number of submitted check results and open a new one to continue (see the sketch below). This creates other problems (like the livestatus socket being temporarily unavailable) when a burst of check results occurs. But this workaround is available right now and easy to use.
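A rough sketch of what this workaround looks like, again using the raw socket rather than go-livestatus; the reconnect threshold of 100 and the socket path are arbitrary:

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// submitter writes raw livestatus external commands and reopens the
// connection every maxPerConn commands to avoid the blocking behavior.
type submitter struct {
	socketPath string
	maxPerConn int
	conn       net.Conn
	count      int
}

func (s *submitter) submit(cmd string) error {
	// Reconnect before the connection has a chance to block.
	if s.conn == nil || s.count >= s.maxPerConn {
		if s.conn != nil {
			s.conn.Close()
		}
		conn, err := net.Dial("unix", s.socketPath)
		if err != nil {
			return err
		}
		s.conn = conn
		s.count = 0
	}
	if _, err := fmt.Fprintf(s.conn, "%s\n", cmd); err != nil {
		return err
	}
	s.count++
	return nil
}

func main() {
	s := &submitter{socketPath: "/var/cache/naemon/live", maxPerConn: 100}
	ts := time.Now().Unix()
	for i := 0; i < 900; i++ {
		cmd := fmt.Sprintf("COMMAND [%d] PROCESS_SERVICE_CHECK_RESULT;myhost;myservice;0;OK", ts)
		if err := s.submit(cmd); err != nil {
			fmt.Println("submit failed:", err)
			return
		}
	}
}
```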
Discussion

We're currently using the close-and-reopen-the-livestatus-socket workaround.

Another solution would be to properly handle the response from livestatus. The real problem is that, AFAIK, livestatus does not respond with anything if a COMMAND executed successfully. It only responds with something if an error occurred.

I'm quite curious to know if someone else has already triggered this behavior.
I'm also curious to know if you have any better idea about how to handle this in the go-livestatus library.

Thanks in advance.