Closed — davebiffuk closed this issue 1 day ago
Well, yes, the error state confuses the plugin. Earlier versions of the autoscaler had some trouble if servers were gone. I'm unsure if that's still a problem, and that was with the senlin-based one...
But autodeletion by the plugin itself may be unfeasible. Usually those error messages are useful for fixing your cluster, and you want to see them. No one looks at runner logs for long, and that's more of a job for a monitoring system.
Personally, I've rarely hit these error states (and when I do, usually all attempts fail as well).
Also note: you can just restart the runner, and it will then delete all the old instances.
https://gitlab.com/gitlab-org/fleeting/fleeting/-/blob/main/provider/interface.go
Also, I'm not sure what status the plugin should report to the autoscaler for such instances...
Thanks for the reply. Yes, it's difficult to see what's best to do, and I take your point about logs/monitoring. We only saw this issue during an OpenStack upgrade when the API was intermittently unavailable. I think we'll rig up something that checks and alerts on "ERROR" instances, so I'll close this issue now.
We've had mixed results with restarting the runner: it doesn't always realise that it should act on the old instances, and I couldn't see a pattern to when it does.
In the Update routine, there is no explicit handling of instances which end up in the ERROR state. This situation can occur if an instance fails to launch, or fails due to a hypervisor fault: https://github.com/sardinasystems/fleeting-plugin-openstack/blob/88e02359c33bee0e2b3b3aae1254e19d79876d4c/provider.go#L111
As a result, ERROR instances can clutter up the OpenStack project and confuse the autoscaling logic. For example, if there are two ERROR instances and the autoscaler is configured with
idle_count = 2
then no further instances will be created, and no CI jobs will run. It seems that reasonable self-healing behaviour would be to log a message about each ERROR instance and then attempt to delete it. Unfortunately I don't have the Go skills to suggest a patch.
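For illustration, the log-and-delete idea could be sketched roughly as below. This is a minimal standalone sketch, not a patch against the plugin: the `Instance` struct and the `deleteFn` callback are hypothetical stand-ins for the real server list and delete call (e.g. gophercloud's compute API), and the real `Update` routine would also need to report the removal back to fleeting.

```go
package main

import (
	"fmt"
	"log"
)

// Instance is a hypothetical, simplified view of an OpenStack server.
type Instance struct {
	ID     string
	Status string // e.g. "ACTIVE", "BUILD", "ERROR"
}

// reapErrored logs and deletes instances stuck in the ERROR state and
// returns the instances that remain. deleteFn stands in for the real
// OpenStack delete call; on delete failure the instance is kept so a
// later pass can retry.
func reapErrored(instances []Instance, deleteFn func(id string) error) []Instance {
	var kept []Instance
	for _, inst := range instances {
		if inst.Status == "ERROR" {
			log.Printf("instance %s is in ERROR state, deleting", inst.ID)
			if err := deleteFn(inst.ID); err != nil {
				log.Printf("failed to delete instance %s: %v", inst.ID, err)
				kept = append(kept, inst)
			}
			continue
		}
		kept = append(kept, inst)
	}
	return kept
}

func main() {
	instances := []Instance{
		{ID: "a", Status: "ACTIVE"},
		{ID: "b", Status: "ERROR"},
	}
	var deleted []string
	kept := reapErrored(instances, func(id string) error {
		deleted = append(deleted, id)
		return nil
	})
	fmt.Println(len(kept), deleted)
}
```

Keeping an instance when the delete call fails (rather than dropping it) matters for the intermittent-API case described above: the next `Update` cycle simply retries instead of leaking the server.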