[JENKINS-20967] Cloud provisioning called when Jenkins is quieting Down - Githubissues

[JENKINS-20967] Cloud provisioning called when Jenkins is quieting Down #10844

Open timja opened 10 years ago

timja commented 10 years ago

If Jenkins is quieting down and there are builds in the queue, nodes are still provisioned from any clouds.

Ideally, Jenkins would not provision new slaves when it is supposed to be quieting down.

Originally reported by recampbell, imported from: Cloud provisioning called when Jenkins is quieting Down

status: Open
priority: Major
resolution: Unresolved
imported: 2022/01/10

timja commented 10 years ago

It seems that the root cause is that each label's LoadStatistics are still reporting queue lengths over zero.

So a simple fix would be to have hudson.slaves.NodeProvisioner#update skip provisioning when Jenkins.isQuietingDown() returns true.
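A minimal sketch of that guard, written as a standalone model rather than against Jenkins core. ControllerState, FakeCloud and ProvisionerSketch are hypothetical stand-ins invented for illustration; only the idea of checking the quiet-down flag before provisioning comes from the comment above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the controller's quiet-down flag (not the real Jenkins class). */
class ControllerState {
    private volatile boolean quietingDown;

    boolean isQuietingDown() { return quietingDown; }

    void setQuietingDown(boolean value) { quietingDown = value; }
}

/** Hypothetical stand-in for a cloud; it only records which labels were requested. */
class FakeCloud {
    final List<String> launched = new ArrayList<>();

    void provision(String label) { launched.add(label); }
}

/** Hypothetical model of a provisioner's periodic update step. */
class ProvisionerSketch {
    private final ControllerState state;
    private final FakeCloud cloud;

    ProvisionerSketch(ControllerState state, FakeCloud cloud) {
        this.state = state;
        this.cloud = cloud;
    }

    /** Requests one agent per queued item and returns the number of agents requested. */
    int update(String label, int queueLength) {
        // The proposed guard: request nothing while the controller is quieting down.
        if (state.isQuietingDown()) {
            return 0;
        }
        for (int i = 0; i < queueLength; i++) {
            cloud.provision(label);
        }
        return queueLength;
    }
}
```

Note that this guard treats every queued item the same way, so it would also stop provisioning for work that is still allowed to run during quiet-down.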

timja commented 9 years ago

thomassuckow:

Changing NodeProvisioner would create a deadlock with NonBlockingTasks such as a Matrix Build: their slaves might never be created. I think it would be more appropriate to modify the behaviour of countBuildable*() in Queue so that it only counts tasks that are not blocked by shutdown.

I have another pull request that manipulates countBuildable; I may open a pull request for this after that one is accepted.
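The counting approach suggested above could look roughly like the following standalone sketch. QueueItemSketch and BuildableCounter are hypothetical names, and the nonBlocking flag is only an assumed model of a task that is exempt from quiet-down; the real countBuildable*() methods in Queue are not reproduced here.

```java
import java.util.List;

/** Hypothetical queue item; nonBlocking models a task that quiet-down does not block. */
class QueueItemSketch {
    final String name;
    final boolean nonBlocking;

    QueueItemSketch(String name, boolean nonBlocking) {
        this.name = name;
        this.nonBlocking = nonBlocking;
    }
}

/** Hypothetical counter that ignores items held back by shutdown. */
class BuildableCounter {
    static int countBuildable(List<QueueItemSketch> items, boolean quietingDown) {
        int count = 0;
        for (QueueItemSketch item : items) {
            // During quiet-down, only non-blocking items are still considered runnable.
            boolean blockedByShutdown = quietingDown && !item.nonBlocking;
            if (!blockedByShutdown) {
                count++;
            }
        }
        return count;
    }
}
```

With a count like this, load statistics would report zero demand for ordinary builds during quiet-down, while work that is still allowed to run would continue to trigger provisioning.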

timja commented 9 years ago

Better wait for https://github.com/jenkinsci/jenkins/pull/1596.

timja commented 9 years ago

Any news?

timja commented 8 years ago

Highlight.

timja commented 8 years ago

thomassuckow:

For anyone interested: I had started work on straightening out countBuildable*, but a conflicting change made mine unmergeable. I don't have the time to look into this in the near future, but my work is still at https://github.com/thomassuckow/jenkins/commits/feature/fix-stuck-queue

timja commented 5 years ago

stephenconnolly:

Removing myself as assignee. My current work assignments do not provide sufficient bandwidth to review these issues, and in the majority of cases I am only assigned by virtue of being the default assignee. For the credentials-api and scm-api related plugins I have permission to allocate time to reviewing changes to these APIs themselves to ensure they remain cohesive, but that can be handled through PR reviews rather than by assigning issues in JIRA.

timja commented 5 years ago

I've just hit this issue in my own working environment. Fortunately I found this issue report first: I was thinking of coding the workaround described in Ryan's initial comment, and I hadn't considered Thomas' concerns.

SitRep:
So, back in 2015, Jesse said to wait for PR 1596 - that was merged in early 2016.
Thomas's PR is still readable, but it was closed due to inactivity early this year (2018).
Looking at the history for NodeProvisioner, Stephen wrote most of it - kinda ironic that Stephen un-assigned this only a week ago :-/

TL;DR: That PR needs a lot of tidying up to extract the core intended changes, followed by a review by folks who know this code.

timja commented 5 years ago

The PR being linked to is for JENKINS-27034, which sounds unrelated. I think thomassuckow was merely saying that the fixes for both would touch similar areas of code, so he wanted to serialize them. If there is a PR open for this issue, it is not mentioned here.

I would not be inclined to waste much more time on Queue + Cloud + NodeProvisioner when there is a more straightforward way of provisioning a “one-shot” agent on demand for a particular build, exemplified by the dockerNode step in docker-plugin.

timja commented 5 years ago

From what I've read, it's the incorrect counting of the runnable workload that's causing this issue. It may well be that the fix for JENKINS-27034 will help fix this issue (or perhaps even fix it entirely), i.e. this issue may just be a symptom of JENKINS-27034.

Also, I would not consider time spent fixing Queue/Cloud/NodeProvisioner as time wasted - that's all core cloud functionality that's used to provide executors by all cloud plugins (e.g. we use docker, vSphere and OpenStack; there are others).

I appreciate that dockerNode is useful, but pipeline-specified one-shot nodes aren't the answer to everything. When it takes a long time for a node to start up (e.g. fully featured VMs rather than lightweight containers), it's important to have clouds configured to supply nodes (with a retention strategy that is not "one shot") in order to maintain build throughput.

FYI I didn't encounter this issue via the docker-plugin; I noticed this because the Jenkins core was asking the vsphere-plugin for new nodes (where dockerNode isn't a viable replacement) and I was monitoring my vSphere cloud at the time. There may well have been OpenStack and Docker nodes being created as well (but I wasn't monitoring those at the time).

timja commented 5 years ago

> This issue may just be a symptom of JENKINS-27034.

Might be. A functional test ought to be able to find out.

> it's important to have clouds configured to supply nodes (with a retention strategy that is not "one shot") in order to maintain build throughput

Well, there is nothing stopping an implementation from keeping a pool of booted and warm VMs ready for use. But yes, this was off-topic.
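As a rough illustration of the warm-pool idea only, here is a standalone sketch. WarmPoolSketch and its bootVm method are hypothetical and do not correspond to any Jenkins or plugin API; a real implementation would boot VMs asynchronously and handle failures.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical pool that keeps a fixed number of already-booted VMs ready for use. */
class WarmPoolSketch {
    private final Deque<String> ready = new ArrayDeque<>();
    private final int targetSize;
    private int booted;

    WarmPoolSketch(int targetSize) {
        this.targetSize = targetSize;
        refill();
    }

    /** Placeholder for a slow VM boot; here it only hands out a new name. */
    private String bootVm() {
        booted++;
        return "vm-" + booted;
    }

    /** Boots VMs until the pool is back at its target size. */
    private void refill() {
        while (ready.size() < targetSize) {
            ready.addLast(bootVm());
        }
    }

    /** Hands out a warm VM if one is ready, otherwise boots one, then tops the pool up. */
    String acquire() {
        String vm = ready.isEmpty() ? bootVm() : ready.pollFirst();
        refill();
        return vm;
    }

    int readyCount() { return ready.size(); }

    int bootedCount() { return booted; }
}
```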

timja commented 2 years ago

[Originally related to: JENKINS-27565]