sni / Thruk

Thruk is a multibackend monitoring webinterface for Naemon, Nagios, Icinga and Shinken using the Livestatus API.
http://www.thruk.org

Removing downtime via API /r/system/cmd/del_downtime_by_host_name fails to delete downtime #1212

Open savv3 opened 1 year ago

savv3 commented 1 year ago

Using the REST API, /r/system/cmd/del_downtime_by_host_name sometimes fails to delete a downtime even though the API reports success. Waiting a bit and resubmitting the POST seems to work. But if you POST one del_downtime_by_host_name and then another right after it, the delete fails.

Thruk version 2.50

Steps to reproduce the behavior:

  1. Add multiple servers to downtime
  2. Remove one server with a POST to the REST API endpoint /r/system/cmd/del_downtime_by_host_name, with the hostname as POST data
  3. Check the Thruk web UI. If the server's downtime is removed, try the next server
  4. At some point the Thruk API will fail to remove a server's downtime
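Step 2 might look like the following minimal sketch. The base URL, the `X-Thruk-Auth-Key` header, and the form field name `hostname` are assumptions for illustration and are not confirmed by this report; only the endpoint path comes from the issue. The function only builds the request and does not send it.

```python
import urllib.parse
import urllib.request

ENDPOINT = "/r/system/cmd/del_downtime_by_host_name"  # path taken from the issue


def build_del_downtime_request(base_url, hostname, api_key):
    """Build (but do not send) the POST request for one host.

    Assumptions: `base_url` is the Thruk root (hypothetical, e.g.
    https://monitor.example.com/thruk), the API key goes in the
    X-Thruk-Auth-Key header, and the form field is called `hostname`.
    """
    url = base_url.rstrip("/") + ENDPOINT
    body = urllib.parse.urlencode({"hostname": hostname}).encode("ascii")
    req = urllib.request.Request(url, data=body, method="POST")
    req.add_header("X-Thruk-Auth-Key", api_key)
    return req
```

To actually submit it, pass the returned object to `urllib.request.urlopen(req)` and inspect the response body; as described in this issue, a success response does not guarantee that the downtime was removed.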

Expected behavior: the server's downtime should be removed.

Additional context: I've been testing the API with Postman and Ansible. The delete sometimes works and sometimes doesn't. It seems to be related to timing: if I remove a server and it works, then try the next one and that fails, waiting a number of seconds and trying again usually works. It seems like there is some internal processing going on that's not behaving correctly.
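The wait-and-resubmit workaround described above can be sketched as a small loop that verifies the result instead of trusting the API response. This is only a workaround sketch, not a fix. `send_delete` and `has_downtime` are hypothetical callables supplied by the caller (for example, one that POSTs to the endpoint and one that queries the downtime list), and the attempt count and delay are arbitrary.

```python
import time


def delete_with_verify(hostname, send_delete, has_downtime,
                       attempts=5, delay=2.0, sleep=time.sleep):
    """Resubmit the delete until the downtime is really gone.

    send_delete(hostname)  -- hypothetical: submits the delete command
    has_downtime(hostname) -- hypothetical: True while a downtime exists
    Returns the attempt number that succeeded, or None if all failed.
    """
    for attempt in range(1, attempts + 1):
        send_delete(hostname)
        sleep(delay)  # give the backend time to process the command
        if not has_downtime(hostname):
            return attempt
    return None
```

Passing `sleep` as a parameter keeps the loop easy to test without real waiting.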

savv3 commented 5 months ago

I've had another look at this. I've tested with a list of servers: server[1-9].example.com

When I run the deletes against the endpoint /r/system/cmd/del_downtime_by_host_name, the first run removes every other server, leaving server[2,4,6,8].example.com still in downtime. For each server that is not removed, the following error appears in naemon.log:

Error: External command failed -> DEL_DOWNTIME_BY_HOST_NAME;server2.example.com;;1717056046;TESTCOMMENT
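To see which hosts were affected, the failed commands can be pulled out of naemon.log with a short script. This is a sketch that assumes every failure line has the same shape as the one quoted above; the regular expression is derived from that single example, and the reading of the arguments as host name, service description, start time, and comment is an assumption.

```python
import re

# Pattern derived from the single log line quoted in this issue.
FAILED_CMD = re.compile(
    r"Error: External command failed -> (?P<command>[A-Z_]+);(?P<args>.*)$"
)


def parse_failed_command(line):
    """Return (command, [args]) for a failure line, or None if no match."""
    match = FAILED_CMD.search(line.rstrip("\n"))
    if match is None:
        return None
    return match.group("command"), match.group("args").split(";")


def failed_hosts(lines, command="DEL_DOWNTIME_BY_HOST_NAME"):
    """Collect the first argument (assumed: host name) of each failure."""
    hosts = []
    for line in lines:
        parsed = parse_failed_command(line)
        if parsed and parsed[0] == command and parsed[1]:
            hosts.append(parsed[1][0])
    return hosts
```

Feeding the log file to `failed_hosts(open("naemon.log"))` (path hypothetical) would list the hosts whose delete command was rejected.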

Enabling the Naemon debug log does not provide any further details: when a request fails, it never shows up in the debug log.