Prevent premature removal of agent because of race condition in DockerSwarmAgentRetentionStrategy

ronny-stauffer commented 2 years ago

This fixes the problem of agents being removed too early by the Swarm Plugin due to a race condition which can happen in the DockerSwarmAgentRetentionStrategy class.

An agent (Docker container) may be unable to start immediately after the Swarm Plugin creates it, possibly because the Docker swarm lacks free resources. The agent then comes online late, perhaps one or several minutes after the initial "connection" to the DockerSwarmComputer (see hudson.model.Computer.getConnectTime() and the connectTime variable in DockerSwarmAgentRetentionStrategy.check()). If the agent comes online just as the retention strategy runs, and Jenkins has not yet dispatched a build task to it, the condition in DockerSwarmAgentRetentionStrategy.check(), namely c.isOnline() && isTimeout && (!isTaskAccepted || isTaskCompleted), is met and the retention strategy inadvertently deletes the agent. When this happens, the agent container is removed from the Docker swarm and the assigned build task stays in the build queue forever.
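To make the timing concrete, here is a minimal, self-contained sketch of the decision described above. It is a simplified model, not the plugin's code: the class RetentionRaceSketch, both method names, and the one-minute idle timeout used in the note below are assumptions for illustration. Only the boolean expression itself is taken from the description. Measuring the timeout from the moment the agent came online is shown as one possible guard and is not necessarily what this PR does.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical, simplified model of the removal decision; not the plugin's actual code. */
final class RetentionRaceSketch {

    /** The condition quoted in the description, with the computer state passed in as plain booleans. */
    static boolean shouldRemove(boolean online, boolean isTimeout,
                                boolean isTaskAccepted, boolean isTaskCompleted) {
        return online && isTimeout && (!isTaskAccepted || isTaskCompleted);
    }

    /** True if more than idleTimeout has elapsed between the reference instant and now. */
    static boolean isTimeout(Instant since, Instant now, Duration idleTimeout) {
        return Duration.between(since, now).compareTo(idleTimeout) > 0;
    }
}
```

Suppose the connect time is t0, the agent comes online at t0 + 180 s, and the retention check runs at t0 + 181 s with an assumed idle timeout of 60 s. Measured from the connect time, isTimeout is already true, so shouldRemove(true, true, false, false) returns true and the freshly started agent is removed before a task reaches it. Measured from the time the agent came online, only 1 s has elapsed, isTimeout is false, and the agent is kept.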

[x] Make sure you are opening from a topic/feature/bugfix branch (right side) and not your main branch!
[x] Ensure that the pull request title represents the desired changelog entry
[x] Please describe what you did
[x] Link to relevant issues in GitHub or Jira
[x] Link to relevant pull requests, esp. upstream and downstream changes
[x] Ensure you have provided tests that demonstrate the feature works or the issue is fixed

David-Villeneuve commented 2 years ago

Hi. This is a much cleaner approach, but I applied the patch and found that there is still a problem where a job can hang.

1. Put Jenkins in shutdown mode.
2. Trigger a job. It goes pending.
3. Wait 5 minutes, then cancel the shutdown.
4. The job sits waiting for the host to be ready, but the host is never created, so the job is pending forever.

ronny-stauffer commented 2 years ago

> Hi. This is a much cleaner approach, but I applied the patch and found that there is still a problem where a job can hang.
>
> 1. Put Jenkins in shutdown mode.
> 2. Trigger a job. It goes pending.
> 3. Wait 5 minutes, then cancel the shutdown.
> 4. The job sits waiting for the host to be ready, but the host is never created, so the job is pending forever.

Thank you. And you're right: This PR only complements yours and doesn't replace it.

jenkinsci / docker-swarm-plugin

Prevent premature removal of agent because of race condition in DockerSwarmAgentRetentionStrategy #115