uselagoon / remote-controller

A group of controllers for handling Lagoon builds and tasks in Kubernetes or Openshift

Many tasks started at the same time cause controller to miss task success #145

Closed Schnitzel closed 7 months ago

Schnitzel commented 2 years ago

I'm currently battling an issue with an EKS-CN cluster that has a lot of tasks pending (controller version uselagoon/remote-controller:v0.4.0). It seems that the controller lost its connection to RabbitMQ, and when I restarted it, around 50 tasks were started all at the same time.

While all task pods were started, it seems the controller is losing track of the tasks: even though the pods have completed, the tasks are not marked as successful, see:

[Screenshots taken 2022-06-28 at 07:45:21 and 07:49:08]

When I restart the controller pod, the tasks that are still pending are restarted, and eventually all tasks succeed, but the actual tasks run multiple times.

Schnitzel commented 2 years ago

Correction: when the controller is restarted and the pods from the previous run still exist, the task pod is not restarted. Instead, the controller realizes that the pod is done and marks the task as succeeded. But again, it only marks some of the tasks as done:

[Screenshot taken 2022-06-28 at 08:07:59]

shreddedbacon commented 2 years ago

I've been trying to see how this occurs, but it always happens when I am asleep, and I haven't been able to determine whether the problem is with the broker or in the remote.

There are some fixes in main that I'm working through at the moment that might help reduce this issue, though. I will hopefully run them through the test infra this week.

shreddedbacon commented 7 months ago

I haven't heard reports of this happening again, and as there have been a lot of improvements to the API and remote-controller in the last ~2 years, I think this can be closed. If the issue presents itself again, we can re-open.