Closed ronny-stauffer closed 1 year ago
Hi. This is a much cleaner approach, but I applied the patch and found that there is still a problem where a job can hang.
1. Put Jenkins in shutdown mode.
2. Trigger a job. It goes pending.
3. Wait 5 minutes, then cancel the shutdown.
4. The job sits waiting for the host to be ready, but it is never created, so it is pending forever.
Thank you. And you're right: This PR only complements yours and doesn't replace it.
This fixes the problem of agents being removed too early by the Swarm Plugin due to a race condition which can happen in the DockerSwarmAgentRetentionStrategy class.
An agent (Docker container) may be unable to start immediately after the Swarm Plugin creates it, possibly because the Docker swarm lacks free resources. The agent then comes online late, perhaps one or several minutes after the initial "connection" to the DockerSwarmComputer (see hudson.model.Computer.getConnectTime() and the connectTime variable in DockerSwarmAgentRetentionStrategy.check()). If the moment the agent comes online coincides with a retention strategy run, and Jenkins has not yet dispatched a build task to the agent, the condition in DockerSwarmAgentRetentionStrategy.check(), `c.isOnline() && isTimeout && (!isTaskAccepted || isTaskCompleted)`, is met and the retention strategy inadvertently deletes the agent. When this happens, the agent container is removed from the Docker swarm and the assigned build task stays in the build queue forever.
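To make the race easier to follow, here is a minimal standalone sketch of the condition quoted above. It is not the plugin's code: the class `RetentionCheckSketch`, both method names, the one-minute timeout, and the assumption that `isTimeout` is derived from the time elapsed since `connectTime` are all hypothetical and chosen only for illustration.

```java
import java.time.Duration;

// Hypothetical model of the removal condition; not the plugin's actual class.
public class RetentionCheckSketch {

    // Mirrors the quoted condition:
    // c.isOnline() && isTimeout && (!isTaskAccepted || isTaskCompleted)
    public static boolean shouldRemove(boolean online, boolean isTimeout,
                                       boolean isTaskAccepted, boolean isTaskCompleted) {
        return online && isTimeout && (!isTaskAccepted || isTaskCompleted);
    }

    // Assumption: the timeout is measured from the connect time,
    // not from the moment the agent actually came online.
    public static boolean isTimeout(long connectTimeMillis, long nowMillis, Duration timeout) {
        return nowMillis - connectTimeMillis > timeout.toMillis();
    }

    public static void main(String[] args) {
        Duration timeout = Duration.ofMinutes(1);      // assumed value
        long connectTime = 0L;                         // initial "connection"
        long now = Duration.ofMinutes(3).toMillis();   // container started late

        boolean timedOut = isTimeout(connectTime, now, timeout);

        // Still offline: the condition does not fire.
        System.out.println("offline, no task:      remove="
                + shouldRemove(false, timedOut, false, false));
        // Just came online, task not dispatched yet: the agent is removed.
        System.out.println("online, no task yet:   remove="
                + shouldRemove(true, timedOut, false, false));
        // Online with a running task: the agent is kept.
        System.out.println("online, task running:  remove="
                + shouldRemove(true, timedOut, true, false));
    }
}
```

Under these assumptions, the second case is the problematic one: the timeout has already elapsed by the time the agent comes online, so a retention run that happens before Jenkins dispatches the task sees an online, "idle" agent and removes it.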