sardinasystems / fleeting-plugin-openstack

GitLab fleeting plugin for OpenStack
Apache License 2.0

need handling for instances in ERROR state #21

Closed: davebiffuk closed this issue 1 day ago

davebiffuk commented 1 week ago

In the Update routine, there is no explicit handling of instances which end up in the ERROR state. This situation can occur if an instance fails to launch, or fails due to a hypervisor fault: https://github.com/sardinasystems/fleeting-plugin-openstack/blob/88e02359c33bee0e2b3b3aae1254e19d79876d4c/provider.go#L111

As a result, ERROR instances can clutter up the OpenStack project and confuse the autoscaling logic. For example, if there are two ERROR instances and the autoscaler is configured with idle_count = 2, then no further instances will be created and no CI jobs will run.

It seems that a reasonable self-healing behaviour would be to log a message about the ERROR instance and then attempt to delete it. Unfortunately I don't have the Go skills to suggest a patch.
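
For illustration, a rough sketch of what that could look like (assuming gophercloud v2 and hclog; the helper name and exact calls here are made up for the example, not taken from provider.go):

```go
import (
	"context"

	"github.com/gophercloud/gophercloud/v2"
	"github.com/gophercloud/gophercloud/v2/openstack/compute/v2/servers"
	"github.com/hashicorp/go-hclog"
)

// cleanupErrorInstances is an illustrative helper, not part of the plugin:
// it logs and then deletes any server that Nova reports in ERROR state,
// so the instance no longer counts toward idle_count.
func cleanupErrorInstances(ctx context.Context, log hclog.Logger,
	compute *gophercloud.ServiceClient, srvs []servers.Server) {
	for _, srv := range srvs {
		if srv.Status != "ERROR" {
			continue
		}
		// Keep the fault visible in the runner log before removing the server.
		log.Error("instance in ERROR state, deleting",
			"id", srv.ID, "name", srv.Name, "fault", srv.Fault.Message)

		if err := servers.Delete(ctx, compute, srv.ID).ExtractErr(); err != nil {
			log.Error("failed to delete ERROR instance", "id", srv.ID, "err", err)
		}
	}
}
```

The idea is simply to filter on Status == "ERROR" during the existing Update pass and leave everything else on its normal path.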

vooon commented 1 week ago

Well, yes, the ERROR state confuses the plugin. Earlier versions of the autoscaler had some trouble when servers disappeared. I'm not sure whether that's still a problem, and that experience is also from the Senlin-based one...

But auto-deletion by the plugin itself may not be feasible. Those error messages are usually useful for fixing your cluster, and you want to see them; nobody is going to look through runner logs for a long time. That is more a job for a monitoring system.

Personally, I have rarely seen these ERROR states (and when I do get them, usually all further attempts fail as well).

Also note: you can simply restart the runner, and it will then delete all the old instances.

vooon commented 1 week ago

https://gitlab.com/gitlab-org/fleeting/fleeting/-/blob/main/provider/interface.go

Also, I'm not sure what status to report to the autoscaler...
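
For what it's worth, one option (only a sketch, assuming the State constants exposed by the fleeting provider package) would be to report ERROR servers as deleted once the plugin has removed them, so the autoscaler requests replacements instead of counting them as idle capacity:

```go
import "gitlab.com/gitlab-org/fleeting/fleeting/provider"

// mapNovaStatus is only a sketch of how Nova statuses could be reported to
// the fleeting autoscaler. The ERROR branch assumes the plugin deletes the
// broken server itself, so reporting it as deleted lets the autoscaler
// provision a replacement rather than treat it as usable capacity.
func mapNovaStatus(status string) provider.State {
	switch status {
	case "BUILD", "REBUILD":
		return provider.StateCreating
	case "ACTIVE":
		return provider.StateRunning
	case "DELETED", "SOFT_DELETED", "ERROR":
		return provider.StateDeleted
	default:
		// Anything else is reported as "creating" purely for this sketch;
		// the real plugin will have its own handling.
		return provider.StateCreating
	}
}
```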

davebiffuk commented 1 day ago

Thanks for the reply. Yes, it's difficult to see what is best to do, and I take your point about logs/monitoring. We only saw this issue during an OpenStack upgrade, when the API was intermittently unavailable. I think we'll rig up something that checks for "ERROR" instances and alerts on them, so I'll close this issue now.

We've had mixed results with restarting the runner. It doesn't always realise that it should act on old instances. I couldn't see a pattern to that.