Closed Schnitzel closed 9 months ago
Correction: when the controller is restarted and the pods from the previous run still exist, the task pod is not restarted. Instead, the controller recognizes that the pod has finished and marks the task as succeeded. Again, though, it only marks some of the tasks as done:
I've been trying to see how this occurs, but it always happens while I am asleep, and I haven't been able to determine whether the problem is with the broker or in the remote.
There are some fixes in main that I'm working through at the moment that might help reduce this issue, though. I will hopefully run them through the test infra this week.
I haven't heard reports of this happening again, and as there have been a lot of improvements to the API and remote-controller in the last ~2 years, I think this can be closed. If the issue presents itself again, we can re-open.
I'm currently battling an issue with an EKS-CN cluster which has a lot of tasks pending (controller version uselagoon/remote-controller:v0.4.0). It seems that the controller lost its connection to RabbitMQ, and when I restarted it, around 50 tasks were started all at the same time. While all task pods were started, it seems the controller is losing track of the tasks, and even though the pods have completed, the tasks are not marked as successful, see:
When I restart the controller pod, the tasks that are still pending are restarted, and eventually all tasks succeed, but the actual tasks run multiple times.
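
For illustration only, here is a rough sketch of the kind of startup reconciliation that would avoid running tasks twice: before starting anything, compare the pending tasks against the pods that already exist, and only start a pod for tasks that have none. This is not the remote-controller's actual code. The `Task` type, the `reconcile` function, and the status strings are hypothetical names invented for this sketch; only the pod phase names follow Kubernetes.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class PodPhase(Enum):
    # Subset of the Kubernetes pod phases relevant to this sketch.
    PENDING = "Pending"
    RUNNING = "Running"
    SUCCEEDED = "Succeeded"
    FAILED = "Failed"


@dataclass
class Task:
    # Hypothetical task record; the real controller tracks much more.
    name: str
    status: str = "pending"  # pending | running | succeeded | failed


def reconcile(tasks: List[Task], pods: Dict[str, PodPhase]) -> List[str]:
    """Update task statuses from existing pods.

    `pods` maps a task name to the phase of the pod that already exists
    for it (assumption: at most one pod per task). Returns the names of
    the tasks that have no pod yet and therefore need one started.
    """
    to_start: List[str] = []
    for task in tasks:
        if task.status in ("succeeded", "failed"):
            continue  # already finished, nothing to do
        phase = pods.get(task.name)
        if phase is None:
            # No pod from a previous run: safe to start one.
            to_start.append(task.name)
        elif phase is PodPhase.SUCCEEDED:
            # Pod finished while the controller was away: record it
            # instead of running the task again.
            task.status = "succeeded"
        elif phase is PodPhase.FAILED:
            task.status = "failed"
        else:
            # Pod is still pending or running: adopt it, do not restart.
            task.status = "running"
    return to_start
```

With something like this, a restart with 50 pending tasks would only start pods for the tasks that genuinely have none, and completed pods would be picked up as succeeded in a single pass.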