Gitaffe opened this issue 8 months ago
I'm not able to reproduce it either, sorry, but some more details are needed to fully reproduce it.
Maybe this helps to reproduce it in a test case. Please try to find the minimal settings.
Did you see #15? I disabled a lot of tests to get the plugin working with newer Jenkins BOMs.
So maybe we need to fix the tests or code to get the right behavior.
I also think Jenkins tries to use the agent from the last run before it asks the load balancer. So as long as there is a free slot on the agent from the previous run, it will run there again?
Hi there, sorry, I was out of the country. On the last comment: no, it's not using the agent from the last run. In this particular setup it will always use the preferred node until it coincides with another job on a timer, in which case it doesn't even see the preferred node as being available.
Checking right now, for example, it manifests where every fifth run lands on the undesirable, lower-ranked node, because this job is on a 2-minute timer and another is on a 5-minute timer. So every ten minutes this job floats onto that agent (meaning that agent is perpetually kept alive during the week unless we set an undesirably low idle time, since we use on-demand agents).
In answer to the questions: we have a rich setup, so I'm not sure this would be that useful. The simplest case is simply with:
However, to answer in case something is relevant:
In this case, 8. But what is relevant is how many are running. If there's no agent running, only the controller (i.e. one agent with more than one executor), it works fine: the jobs start up on the controller (in parallel, too). If there's more than one agent (where the scores are such that the controller should always be favoured, as in the description above), it will always exclude the first-chosen node from the list evaluated for the second job.
Six executors on each is what we run, but anything more than one produces the behaviour in our testing (obviously with one executor it CAN'T run in parallel on the same node, so that doesn't seem relevant). Labels: we have a lot; in this case it's `monitoring`, and there are four nodes labelled like that, but the problem we're trying to solve is when there are two active nodes (a controller and an agent) with that label running.
The controller also has the labels `controller`, `monitoring-preferred` and `long-running`, which are also used in job-level preference scores, but the non-controller agents don't have these.
The System configuration has these rules:
Scoring by Node Preferences: Scale for Node Preferences: 10; Scale for Project Preferences: 20
Scoring by Node Loads: Scale for Scores: 10; Score for Idle Executor: 4; Score for Busy Executor: -4
However, in my experience changing these values doesn't have an effect because, as shown in the original report, even if we artificially and massively increase the score for the built-in node, it's simply not considered as a place where the second job can run.
The jobs use the `monitoring` label. During the week there are many other jobs causing this contention, but it also happens at the weekend when only these two jobs run regularly for many hours, and as soon as anything else provokes demand to start up a non-controller agent, that agent gets used every ten minutes when the job schedules clash. I'm sorry I can't immediately identify anything relevant in that disabled-tests ticket (#15).
Two executors each on master and "jenkinsagent". Both jobs (TestJob1 & TestJob2) start at the same time (by cron/timer). I added a log statement at https://github.com/jenkinsci/scoring-load-balancer-plugin/blob/1ee4e030f17d4eb333e5ce0a1b92c1ed9ea4f2c8/src/main/java/jp/ikedam/jenkins/plugins/scoringloadbalancer/ScoringLoadBalancer.java#L147, at (what I believe is) the entry point of this plugin; the output is shown below.
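For context, a minimal sketch of the kind of log statement that produces the output below, assuming the plugin's existing java.util.logging LOGGER and the public `executors` field of MappingWorksheet (the exact line actually added may differ):

```java
// Inside ScoringLoadBalancer.map(Queue.Task task, MappingWorksheet worksheet):
// dump the executor chunks Jenkins offers for this task, so we can see which
// nodes/executors the plugin is actually allowed to score.
LOGGER.log(java.util.logging.Level.INFO,
        "map - Task: {0}, worksheet.executors: {1}",
        new Object[]{task, worksheet.executors});
```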
```text
Jun 21, 2024 2:20:00 PM INFO jp.ikedam.jenkins.plugins.scoringloadbalancer.ScoringLoadBalancer map
map - Task: hudson.model.FreeStyleProject@77540832[TestJob1], worksheet.executors: [[JobOffer[ #0], JobOffer[ #1]], [JobOffer[jenkinsagent #0], JobOffer[jenkinsagent #1]]]
Jun 21, 2024 2:20:00 PM INFO jp.ikedam.jenkins.plugins.scoringloadbalancer.ScoringLoadBalancer reportScores
Scoring for hudson.model.FreeStyleProject@77540832[TestJob1]:
jenkinsagent: 1000
: -500
Jun 21, 2024 2:20:00 PM INFO jp.ikedam.jenkins.plugins.scoringloadbalancer.ScoringLoadBalancer map
map - Task: hudson.model.FreeStyleProject@71823f88[TestJob2], worksheet.executors: [[JobOffer[ #0], JobOffer[ #1]]]
Jun 21, 2024 2:20:00 PM INFO jp.ikedam.jenkins.plugins.scoringloadbalancer.ScoringLoadBalancer reportScores
Scoring for hudson.model.FreeStyleProject@71823f88[TestJob2]:
: -500
```
It seems that the node is already missing from the worksheet data that Jenkins passes to the plugin.
```text
Jun 21, 2024 3:15:00 PM FINEST hudson.model.Queue
JobOffer[ #1] is a potential candidate for task TestJob2
Jun 21, 2024 3:15:00 PM FINEST hudson.model.Queue
JobOffer[jenkinsagent #0] rejected TestJob2: Executor slot already in use
Jun 21, 2024 3:15:00 PM FINEST hudson.model.Queue
JobOffer[ #0] is a potential candidate for task TestJob2
Jun 21, 2024 3:15:00 PM FINEST hudson.model.Queue
JobOffer[jenkinsagent #1] rejected TestJob2: Executor slot already in use
```
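(For anyone trying to reproduce: one way, under my assumptions, to surface these messages is to raise the `hudson.model.Queue` logger to FINEST, e.g. from the Jenkins script console, which accepts this Java-style statement; a "System Log" recorder still has to capture FINEST records.)

```java
// Raise the hudson.model.Queue logger to FINEST so the JobOffer accept/reject
// decisions shown above are emitted; a log recorder at FINEST must also exist.
java.util.logging.Logger.getLogger("hudson.model.Queue")
        .setLevel(java.util.logging.Level.FINEST);
```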
https://github.com/jenkinsci/jenkins/blob/252b7dce8a96257c28e8946b338efc78bb489f6b/core/src/main/java/hudson/model/Queue.java#L291-L292 ~(I'm not sure if this really is the source of this error message, but it very much looks like it.)~ Edit: confirmed.
I found the issue: It's the cache in https://github.com/jenkinsci/jenkins/blob/252b7dce8a96257c28e8946b338efc78bb489f6b/core/src/main/java/hudson/model/Queue.java#L1646-L1653 that was introduced for JENKINS-58101.
parked.values() contains all executors from all nodes, so in my example above 4 elements. After TestJob1 is successfully bound to e.g. jenkinsagent Executor #0, there are 3 executors left. When evaluating the executors that are available for TestJob2 (parked.values() still contains the same 4 elements), jenkinsagent Executor #0 is rejected (because it was just occupied by TestJob1) and this rejection is cached for the whole node. When evaluating jenkinsagent Executor #1, the rejection is loaded from the cache, despite the executor being available.
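To make the failure mode easier to follow, here is a deliberately simplified, hypothetical sketch of the caching pattern described above; it is not the actual hudson.model.Queue code, and the names and structure are invented for illustration:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified illustration of the per-node rejection cache
// described above; this is NOT the actual hudson.model.Queue code.
public class RejectionCacheSketch {

    record Executor(String node, int index, boolean busy) {}

    // Builds the list of executors offered to a task. A rejection caused by
    // one busy executor is cached for the whole *node*, so a later free
    // executor on the same node is skipped without ever being evaluated.
    static List<Executor> candidates(List<Executor> parked) {
        Map<String, Boolean> nodeRejectionCache = new HashMap<>();
        List<Executor> offered = new ArrayList<>();
        for (Executor e : parked) {
            if (Boolean.TRUE.equals(nodeRejectionCache.get(e.node()))) {
                continue; // cached rejection applies to every executor of the node
            }
            if (e.busy()) {
                nodeRejectionCache.put(e.node(), true); // keyed by node, not by executor
                continue;
            }
            offered.add(e);
        }
        return offered;
    }

    public static void main(String[] args) {
        List<Executor> parked = List.of(
                new Executor("", 0, false),            // built-in #0, free
                new Executor("", 1, false),            // built-in #1, free
                new Executor("jenkinsagent", 0, true), // just bound to TestJob1
                new Executor("jenkinsagent", 1, false) // free, but never offered
        );
        // Prints only the two built-in executors: jenkinsagent #1 is missing,
        // matching the worksheet logged above for TestJob2.
        System.out.println(candidates(parked));
    }
}
```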
This issue only occurs for simultaneously started builds, because for builds that are invoked at a different point in time parked.values() only contains executors that are available:
Thanks! That sounds exactly right and matches my observations.
I can't comment on the correctness of the workaround as I'm not familiar with the code (I presume that if this works, the null return value essentially throttles processing of the queue and the builds are just delayed?), but I hope this is considered for merging!
Since this really is a hack that comes with its own consequences, like builds being delayed for some seconds, and relies on things like system time and magic numbers as well, I'm unsure whether this really should be merged as is.
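For readers following along, a purely hypothetical sketch of the kind of deferral workaround being discussed: the class name, the 5-second threshold, and the fallback to LoadBalancer.CONSISTENT_HASH are all illustrative assumptions, not the actual proposed patch (the plugin would use its own scoring logic instead).

```java
import hudson.model.LoadBalancer;
import hudson.model.Queue;
import hudson.model.queue.MappingWorksheet;
import hudson.model.queue.MappingWorksheet.Mapping;

// Hypothetical sketch only: defer mapping for very young queue items so that
// a stale per-node rejection from a simultaneously started build has been
// forgotten by the time the item is mapped. Builds are delayed as a result.
public class DeferringLoadBalancerSketch extends LoadBalancer {

    // Stand-in for the plugin's real scoring logic.
    private final LoadBalancer fallback = LoadBalancer.CONSISTENT_HASH;

    @Override
    public Mapping map(Queue.Task task, MappingWorksheet worksheet) {
        long inQueueMillis = System.currentTimeMillis() - worksheet.item.getInQueueSince();
        if (inQueueMillis < 5_000) { // "magic number" delay
            // Returning null keeps the item in the queue; Jenkins retries it
            // on a later maintenance cycle.
            return null;
        }
        return fallback.map(task, worksheet);
    }
}
```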
Can you open a draft PR, so we can work on it together?
Jenkins and plugins versions report
Environment
```text Jenkins: 2.440.2 OS: Linux - 5.14.0-362.8.1.el9_3.x86_64 Java: 21.0.2 - Eclipse Adoptium (OpenJDK 64-Bit Server VM) --- PrioritySorter:5.1.0 ace-editor:1.1 adoptopenjdk:1.5 analysis-model-api:12.1.0 ansible:307.va_1f3ef06575a_ ansicolor:1.0.4 ant:497.v94e7d9fffa_b_9 antisamy-markup-formatter:162.v0e6ec0fcfcf6 apache-httpcomponents-client-4-api:4.5.14-208.v438351942757 apache-httpcomponents-client-5-api:5.3.1-1.0 asm-api:9.6-3.v2e1fa_b_338cd7 authentication-tokens:1.53.v1c90fd9191a_b_ aws-credentials:231.v08a_59f17d742 aws-java-sdk:1.12.671-445.ve02f9b_558f2e aws-java-sdk-api-gateway:1.12.671-445.ve02f9b_558f2e aws-java-sdk-autoscaling:1.12.671-445.ve02f9b_558f2e aws-java-sdk-cloudformation:1.12.671-445.ve02f9b_558f2e aws-java-sdk-cloudfront:1.12.671-445.ve02f9b_558f2e aws-java-sdk-codebuild:1.12.671-445.ve02f9b_558f2e aws-java-sdk-codedeploy:1.12.671-445.ve02f9b_558f2e aws-java-sdk-ec2:1.12.671-445.ve02f9b_558f2e aws-java-sdk-ecr:1.12.671-445.ve02f9b_558f2e aws-java-sdk-ecs:1.12.671-445.ve02f9b_558f2e aws-java-sdk-efs:1.12.671-445.ve02f9b_558f2e aws-java-sdk-elasticbeanstalk:1.12.671-445.ve02f9b_558f2e aws-java-sdk-elasticloadbalancingv2:1.12.671-445.ve02f9b_558f2e aws-java-sdk-iam:1.12.671-445.ve02f9b_558f2e aws-java-sdk-kinesis:1.12.671-445.ve02f9b_558f2e aws-java-sdk-lambda:1.12.671-445.ve02f9b_558f2e aws-java-sdk-logs:1.12.671-445.ve02f9b_558f2e aws-java-sdk-minimal:1.12.671-445.ve02f9b_558f2e aws-java-sdk-organizations:1.12.671-445.ve02f9b_558f2e aws-java-sdk-secretsmanager:1.12.671-445.ve02f9b_558f2e aws-java-sdk-sns:1.12.671-445.ve02f9b_558f2e aws-java-sdk-sqs:1.12.671-445.ve02f9b_558f2e aws-java-sdk-ssm:1.12.671-445.ve02f9b_558f2e badge:1.9.1 bootstrap4-api:4.6.0-6 bootstrap5-api:5.3.3-1 bouncycastle-api:2.30.1.77-225.v26ea_c9455fd9 branch-api:2.1152.v6f101e97dd77 build-blocker-plugin:1.7.9 build-failure-analyzer:2.5.0 build-name-setter:2.4.2 build-timeout:1.32 build-timestamp:1.0.3 build-user-vars-plugin:1.9 build-with-parameters:76.v9382db_f78962 caffeine-api:3.1.8-133.v17b_1ff2e0599 changelog-history:1.7 checks-api:2.0.2 claim:543.v0061de6b_dcb_d cloud-stats:336.v788e4055508b_ cloudbees-folder:6.901.vb_4c7a_da_75da_3 command-launcher:107.v773860566e2e commons-lang3-api:3.13.0-62.v7d18e55f51e2 commons-text-api:1.11.0-95.v22a_d30ee5d36 compact-columns:1.185.vf3851b_4d31fe conditional-buildstep:1.4.3 config-file-provider:968.ve1ca_eb_913f8c configurationslicing:548.ve92d48e66b_f8 copy-project-link:106.veb_028794a_844 copyartifact:722.v0662a_9b_e22a_c credentials:1337.v60b_d7b_c7b_c9f credentials-binding:657.v2b_19db_7d6e6d custom-tools-plugin:0.8 dashboard-view:2.508.va_74654f026d1 data-tables-api:2.0.2-1 delivery-pipeline-plugin:1.4.2 dependency-check-jenkins-plugin:5.5.0 dependency-track:4.3.1 dependencyanalyzer:0.7 depgraph-view:1.0.5 description-setter:239.vd0a_6b_785f92d disk-usage:1.2 display-url-api:2.200.vb_9327d658781 docker-commons:439.va_3cb_0a_6a_fb_29 docker-java-api:3.3.4-86.v39b_a_5ede342c docker-plugin:1.6 docker-workflow:572.v950f58993843 durable-task:550.v0930093c4b_a_6 ec2:1648.vf3d852e00486 echarts-api:5.5.0-1 email-ext:2.105 email-ext-recipients-column:27.vb_9404db_b_018d emailext-template:1.5 enhanced-old-build-discarder:1.4 envinject:2.908.v66a_774b_31d93 envinject-api:1.199.v3ce31253ed13 extended-choice-parameter:381.v360a_25ea_017c extended-read-permission:53.v6499940139e5 external-monitor-job:215.v2e88e894db_f8 extra-columns:1.26 extra-tool-installers:139.v723fee51b_7f2 font-awesome-api:6.5.1-3 forensics-api:2.4.0 generic-webhook-trigger:2.1.0 
git:5.2.1 git-client:4.7.0 git-parameter:0.9.19 git-server:114.v068a_c7cc2574 gitea:1.4.7 github:1.38.0 github-api:1.318-461.v7a_c09c9fa_d63 github-branch-source:1785.v99802b_69816c groovy:457.v99900cb_85593 groovy-postbuild:228.vcdb_cf7265066 gson-api:2.10.1-15.v0d99f670e0a_7 h2-api:11.1.4.199-12.v9f4244395f7a_ handlebars:3.0.8 handy-uri-templates-2-api:2.1.8-30.v7e777411b_148 heavy-job:1.1 htmlpublisher:1.33 http_request:1.18 instance-identity:185.v303dc7c645f9 ionicons-api:56.v1b_1c8c49374e jackson2-api:2.17.0-379.v02de8ec9f64c jakarta-activation-api:2.1.3-1 jakarta-mail-api:2.1.3-1 javadoc:243.vb_b_503b_b_45537 javax-activation-api:1.2.0-6 javax-mail-api:1.6.2-9 jaxb:2.3.9-1 jdk-tool:73.vddf737284550 jjwt-api:0.11.5-112.ve82dfb_224b_a_d job-dsl:1.87 job-node-stalker:1.0.5 job-restrictions:0.8 jobConfigHistory:1229.v3039470161a_d joda-time-api:2.12.7-29.v5a_b_e3a_82269a_ join:1.21 jquery:1.12.4-1 jquery-detached:1.2.1 jquery3-api:3.7.1-2 jsch:0.2.16-86.v42e010d9484b_ json-api:20240303-41.v94e11e6de726 json-path-api:2.9.0-58.v62e3e85b_a_655 junit:1265.v65b_14fa_f12f0 ldap:719.vcb_d039b_77d0d lockable-resources:1246.v28b_e4cc6fa_16 log-parser:2.3.3 mailer:472.vf7c289a_4b_420 mapdb-api:1.0.9-40.v58107308b_7a_7 matrix-auth:3.2.2 matrix-project:822.824.v14451b_c0fd42 maven-plugin:3.23 metrics:4.2.21-449.v6960d7c54c69 mina-sshd-api-common:2.12.1-101.v85b_e08b_780dd mina-sshd-api-core:2.12.1-101.v85b_e08b_780dd momentjs:1.1.1 multiple-scms:0.8 naginator:1.436.vb_e769dcb_cdf6 next-executions:327.v136ff959e97b_ node-iterator-api:55.v3b_77d4032326 nodelabelparameter:1.12.0 okhttp-api:4.11.0-172.vda_da_1feeb_c6e pam-auth:1.10 parameterized-scheduler:262.v00f3d90585cc parameterized-trigger:787.v665fcf2a_830b_ percentage-du-node-column:0.1.0 periodic-reincarnation:1.13 pipeline-aggregator-view:104.v94a_e5f6cdb_c3 pipeline-build-step:540.vb_e8849e1a_b_d8 pipeline-github-lib:42.v0739460cda_c4 pipeline-graph-analysis:216.vfd8b_ece330ca_ pipeline-groovy-lib:704.vc58b_8890a_384 pipeline-input-step:491.vb_07d21da_1a_fb_ pipeline-maven:1396.veb_f07b_2fc1d8 pipeline-maven-api:1396.veb_f07b_2fc1d8 pipeline-milestone-step:119.vdfdc43fc3b_9a_ pipeline-model-api:2.2184.v0b_358b_953e69 pipeline-model-definition:2.2184.v0b_358b_953e69 pipeline-model-extensions:2.2184.v0b_358b_953e69 pipeline-rest-api:2.34 pipeline-stage-step:312.v8cd10304c27a_ pipeline-stage-tags-metadata:2.2184.v0b_358b_953e69 pipeline-stage-view:2.34 pipeline-utility-steps:2.16.2 plain-credentials:179.vc5cb_98f6db_38 plugin-util-api:4.1.0 popper-api:1.16.1-3 popper2-api:2.11.6-4 powershell:2.1 prism-api:1.29.0-13 promoted-builds:945.v597f5c6a_d3fd publish-over:0.22 publish-over-ssh:1.25 purge-job-history:1.6 rebuild:330.v645b_7df10e2a_ redmine:0.21 resource-disposer:0.23 role-strategy:713.vb_3837801b_8cc run-condition:1.7 saml:4.464.vea_cb_75d7f5e0 scm-api:689.v237b_6d3a_ef7f scoring-load-balancer:81.vb_82d531ff049 script-security:1326.vdb_c154de8669 scriptler:348.v5d461e205da_a_ secondary-timestamper-plugin:1.1 simple-theme-plugin:176.v39740c03a_a_f5 sitemonitor:0.6 slack:684.v833089650554 slave-setup:1.16 snakeyaml-api:2.2-111.vc6598e30cc65 sonar:2.17.2 ssh:2.6.1 ssh-agent:346.vda_a_c4f2c8e50 ssh-credentials:334.v7732563deee1 ssh-slaves:2.948.vb_8050d697fec ssh-steps:2.0.68.va_d21a_12a_6476 ssh2easy:1.6 sshd:3.322.v159e91f6a_550 structs:337.v1b_04ea_4df7c8 subversion:1256.vee91953217b_6 thinBackup:1.19 throttle-concurrents:2.14 timestamper:1.26 token-macro:400.v35420b_922dcb_ translation:1.16 trilead-api:2.142.v748523a_76693 unicorn:0.1.1 
uno-choice:2.8.3 validating-string-parameter:183.v3748e79b_9737 variables-replace-plugin:1.0.0 variant:60.v7290fc0eb_b_cd view-job-filters:369.ve0513a_a_f5524 warnings-ng:11.2.2 windows-slaves:1.8.1 workflow-aggregator:596.v8c21c963d92d workflow-api:1291.v51fd2a_625da_7 workflow-basic-steps:1049.v257a_e6b_30fb_d workflow-cps:3889.v937e0b_3412d3 workflow-durable-task-step:1331.vc8c2fed35334 workflow-job:1400.v7fd111b_ec82f workflow-multibranch:773.vc4fe1378f1d5 workflow-scm-step:427.v4ca_6512e7df1 workflow-step-api:657.v03b_e8115821b_ workflow-support:896.v175a_a_9c5b_78f ws-cleanup:0.45 zapper:1.0.7 zentimestamp:4.2
```
What Operating System are you using (both controller, and any agents involved in the problem)?
Linux (Fedora), but running in the ubi9 Docker container from the official Docker Hub image.
Reproduction steps
Expected Results
I expect all jobs to run on the highly-preferred node.
Actual Results
Whichever job is processed second NEVER has the node selected for the first job available to it.
For example, in the example below there are 3 agents: the built-in node (which appears with no label) and two other agents. We're highly favouring one node for example purposes, but it happens at any scale of preferences.
When all the jobs have the same set of permissible labels, we see the below. Note that `lnx-aarch64-agent-23a` does NOT appear for the second job, but it is available in the third and fourth jobs. This seems consistent no matter which nodes we set to be preferred, what the set of running agents is, and what the set of labels is.
The only variation I've noticed is that when the first job has a different permissible label set (i.e. just the built-in controller), that node is not available to the second job; then the node the second job selects is not available to the third job either; but they all come back in the fourth job's set.
Anything else?
I've gone through the code here and some of the Jenkins core code at https://github.com/jenkinsci/jenkins/blob/master/core/src/main/java/hudson/model/queue/MappingWorksheet.java, but I'm not very familiar with it and can't see an obvious reason why the executors would not be available, nor tell whether it's a Jenkins core or plugin problem.
Without the scoring plugin enabled, the lack of granular logging and the default load balancing mean I can't coerce Jenkins into replicating the issue, so if the cause lies in how Jenkins core behaves, it may simply never be visible outside this plugin. Even so, one would hope the plugin could work around the behaviour to stay consistent.
Are you interested in contributing a fix?
Unfortunately I've spent some time looking, but I don't understand the expected interaction with Jenkins well enough to propose a fix.