jenkinsci / docker-swarm-plugin

Jenkins plugin that allows adding a Docker Swarm as a cloud agent provider
https://plugins.jenkins.io/docker-swarm/
MIT License
55 stars 47 forks

Prevent premature removal of agent because of race condition in DockerSwarmAgentRetentionStrategy #115

Closed ronny-stauffer closed 1 year ago

ronny-stauffer commented 2 years ago

This fixes the problem of agents being removed too early by the Swarm Plugin due to a race condition which can happen in the DockerSwarmAgentRetentionStrategy class.

It can happen that an agent (Docker container) cannot start immediately after the Swarm Plugin creates it, possibly due to a lack of free resources in the Docker swarm. The agent then comes online late, maybe one or several minutes after the initial "connection" to the DockerSwarmComputer (see hudson.model.Computer.getConnectTime() and the connectTime variable in DockerSwarmAgentRetentionStrategy.check()). If the moment the agent comes online coincides with a retention strategy run and Jenkins has not yet dispatched a build task to the agent, the condition in DockerSwarmAgentRetentionStrategy.check(), c.isOnline() && isTimeout && (!isTaskAccepted || isTaskCompleted), is met and the agent is inadvertently deleted by the retention strategy. If this happens, the agent container is removed from the Docker swarm and the assigned build task stays in the build queue forever.
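To make the timing concrete, here is a minimal, self-contained sketch of the race. It is not the plugin's code: the class RetentionRaceSketch, the helper timedOut, the one-minute timeout, and the timestamps are all hypothetical. It assumes isTimeout is derived from connectTime, as the references above suggest. Only the boolean condition is taken from the description. The second evaluation measures the timeout from the moment the agent came online, which is one possible way to avoid the race and not necessarily what this PR does.

```java
import java.time.Duration;
import java.time.Instant;

public class RetentionRaceSketch {

    /** The removal condition as quoted in the description above. */
    static boolean shouldRemove(boolean online, boolean isTimeout,
                                boolean isTaskAccepted, boolean isTaskCompleted) {
        return online && isTimeout && (!isTaskAccepted || isTaskCompleted);
    }

    /** Hypothetical helper: true if more than `timeout` has elapsed since `since`. */
    static boolean timedOut(Instant since, Instant now, Duration timeout) {
        return Duration.between(since, now).compareTo(timeout) > 0;
    }

    public static void main(String[] args) {
        Duration timeout = Duration.ofMinutes(1);                      // assumed value
        Instant connectTime = Instant.parse("2024-01-01T00:00:00Z");   // initial "connection"
        Instant onlineTime = connectTime.plus(Duration.ofMinutes(3));  // container started late
        Instant checkTime = onlineTime.plusSeconds(1);                 // retention run right after

        // No task has been dispatched yet: isTaskAccepted = false, isTaskCompleted = false.
        boolean fromConnect =
            shouldRemove(true, timedOut(connectTime, checkTime, timeout), false, false);
        boolean fromOnline =
            shouldRemove(true, timedOut(onlineTime, checkTime, timeout), false, false);

        System.out.println("timeout measured from connectTime -> remove: " + fromConnect);
        System.out.println("timeout measured from onlineTime  -> remove: " + fromOnline);
    }
}
```

With these assumed timestamps, the first evaluation marks the freshly started agent for removal although it never had a chance to accept a task, while the second leaves it alone.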

David-Villeneuve commented 2 years ago

Hi. This is a much cleaner approach, but I applied the patch and found that there is still a problem where a job can hang.

1. Put Jenkins in shutdown mode.
2. Trigger a job. It goes pending.
3. Wait 5 minutes, then cancel the shutdown.

The job sits waiting for the host to be ready, but it is never created, so it is pending forever.

ronny-stauffer commented 2 years ago

> Hi. This is a much cleaner approach, but I applied the patch and found that there is still a problem where a job can hang.
>
> 1. Put Jenkins in shutdown mode.
> 2. Trigger a job. It goes pending.
> 3. Wait 5 minutes, then cancel the shutdown.
>
> The job sits waiting for the host to be ready, but it is never created, so it is pending forever.

Thank you. And you're right: This PR only complements yours and doesn't replace it.