svensp / hcloud_ocf

Hetzner Cloud Pacemaker OCF Resource Agents, FloatingIp and STONITH device
MIT License

Error: not running(7) #4

Open rjams opened 5 years ago

rjams commented 5 years ago

The OCF resource runs fine, sometimes for a week without any trouble. But suddenly this error occurred:

Failed Actions:
* vip_failover_monitor_30000 on app1 'not running' (7): call=55, status=complete, exitreason='none',
    last-rc-change='Thu Mar 14 08:14:07 2019', queued=0ms, exec=0ms

When this error shows up, it occurs more than once. What is the problem? Hetzner said everything is running without a problem.

svensp commented 5 years ago

'not running' is a valid response from the agent's monitor action. It means:

* The agent successfully retrieved the server from the API.
* The agent successfully retrieved the floating IP from the API.
* The server ID in the floating IP did not match the server's ID.

So the API is telling the agent that the IP address is not assigned to the server it expects.
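To make that decision concrete, here is a minimal sketch of the monitor logic as described. It is not the agent's actual code: `monitor`, `fetch_server` and `fetch_floating_ip` are hypothetical names, the dictionaries only mimic the shape of the API's JSON objects, and mapping every retrieval failure to the generic OCF error is an assumption made for illustration.

```python
# Standard OCF return codes used by Pacemaker resource agents.
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_NOT_RUNNING = 7


def monitor(fetch_server, fetch_floating_ip):
    """Return an OCF code for the floating IP resource.

    fetch_server and fetch_floating_ip are hypothetical callables that
    return dicts shaped like the API's server and floating IP objects.
    """
    try:
        server = fetch_server()
        floating_ip = fetch_floating_ip()
    except Exception:
        # Assumption: any retrieval failure is reported as a generic
        # error, so it can never be mistaken for 'not running'.
        return OCF_ERR_GENERIC

    # Both lookups succeeded, so the result depends only on the data:
    # is the floating IP assigned to this server?
    if floating_ip.get("server") == server["id"]:
        return OCF_SUCCESS
    return OCF_NOT_RUNNING
```

With this structure, 'not running' (7) can only be returned when both lookups succeeded and the returned data disagreed about the assignment.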

Errors while retrieving either the server or the floating IP are reported with different error codes. So the only way I can see the fault lying in the agent is if it identified the wrong server as its own representation in the API. Currently the agent iterates over the servers in the API and checks whether their public IP address is present on the machine it is running on. I can't see a false positive happening here.
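The server lookup described above could be sketched roughly as follows. This is an illustration under assumptions, not the agent's implementation: `find_own_server` is a hypothetical helper, the server dicts assume the `public_net.ipv4.ip` layout of the API's server objects, and the list of local addresses is passed in rather than read from the machine.

```python
def find_own_server(servers, local_addresses):
    """Pick the API server whose public IPv4 address is configured locally.

    servers: list of dicts shaped like API server objects (assumed layout).
    local_addresses: iterable of IP address strings present on this machine.
    """
    local = set(local_addresses)
    matches = []
    for server in servers:
        # Tolerate servers without a public IPv4 entry.
        ipv4 = (server.get("public_net") or {}).get("ipv4") or {}
        if ipv4.get("ip") in local:
            matches.append(server)

    # Exactly one match is expected; anything else means the agent
    # cannot safely decide which API server represents this machine.
    if len(matches) != 1:
        raise LookupError(f"expected exactly one matching server, found {len(matches)}")
    return matches[0]
```

Since a public IPv4 address belongs to one server at a time, a wrong match would require another server's address to be configured on the local machine, which is why a false positive seems unlikely.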

Since it only happens rarely and without a discernible trigger, my current guess is that it is in fact the Hetzner API that is returning wrong data. I am open to different ideas, though.