Closed jetersen closed 2 years ago
Another build with agent being removed: https://ci.jenkins.io/job/Tools/job/bom/job/master/1075/
[2022-07-04T21:41:11.039Z] Cannot contact jnlp-maven-11-c4hfh: hudson.remoting.RequestAbortedException: java.nio.channels.ClosedChannelException
[2022-07-04T21:46:14.791Z] Could not connect to jnlp-maven-11-c4hfh to send interrupt signal to process
This is really troublesome for longer builds such as Jenkins, ATH, BOM, or git-plugin, if agents being removed breaks the build.
What plugin is it that says the build should fail if the agent is removed? Why not retry the steps on a new agent if the agent is removed?
@jglick do you think there is something we could improve in the BOM build pipeline to retry the build if some failure condition is met? Such as: agent removed, then retry the plugin test?
Yet another: https://ci.jenkins.io/job/Tools/job/bom/job/master/1076/
Yet another: https://ci.jenkins.io/job/Tools/job/bom/job/master/1077/
The agent availability check job runs every 4 hours to check that ci.jenkins.io agents can be allocated. It has been failing much more frequently in the last few days.
Hello @jetersen, thanks for reporting.
We have different (parallel) issues on ci.jenkins.io that make it hard to tackle.
However, in the jobs you reported, the common denominator is that they are all BOM builds.
This job is a big consumer of executors on Kubernetes agents: ~180 per build, while we only provide ~150 pods simultaneously. That creates pressure, but even so, the way ci.jenkins.io behaves is weird.
@jglick is working on an improvement to the `retry` instruction that could help builds be automatically re-triggered on such errors: tracked in https://github.com/jenkins-infra/helpdesk/issues/2984. This one should help make this less irritating.
Also, we have @lemeurherve, who is working on increasing the partnership with DigitalOcean so we could have more compute capacity.
> The agent availability check job runs every 4 hours to check that ci.jenkins.io agents can be allocated. It has been failing much more frequently in the last few days.
After checking the build history of the acceptance job, I confirm that it is a different kind of failure: they are all about VM agents that cannot be started because of a public IP quota in Azure. Work in progress on this.
> something we could improve in BOM build pipeline to retry build
Note there is already a crude check: https://github.com/jenkinsci/bom/blob/c2b4fb2fe2690cb8abc160f774ba71cb1a5efecb/Jenkinsfile#L51-L52
> Note there is already a crude check: https://github.com/jenkinsci/bom/blob/c2b4fb2fe2690cb8abc160f774ba71cb1a5efecb/Jenkinsfile#L51-L52
That retry does not work when the agent is removed, as far as I can see. The pipeline basically grinds to a halt.
Yup, looking at the consoleText for https://ci.jenkins.io/job/Tools/job/bom/job/master/1075/ I only see `Attempt 1 of 2` echoes. No `Attempt 2 of 2`.
Hmm, it should work except in cases where the controller was restarted in the middle. I think the problem is that `FlowInterruptedException.actualInterruption` is getting defaulted to `true`. Something else to fix.
We suspect agents on spot instances are being killed as AWS reclaims them.
We switched from "on demand" to "spot" EC2 highmem instances to reduce the infra budget from 12k€ to 9k€ per month. We cannot exceed 10k€, so maybe we should stop using EC2 for ATH.
As noted elsewhere by @jtnord, the only time we are guaranteed to have a spot instance is 2 minutes.
Spot instances can be terminated whenever Amazon feels like they will make more money by giving the underlying hardware to someone not using spot (i.e. there is no spare capacity). You get 2 minutes' notice of this, so you know the agent (host) will always be around for 2 minutes. Every minute you use beyond those 2 minutes is a chance that the host will be reclaimed. That percentage is not fixed; it varies at certain times of the day due to demand :slightly_smiling_face: But for argument's sake, say it were fixed (at least within, say, a 2-hour period) at a 1% (0.01) chance of reclamation in any given minute: after 62 minutes the chance the instance is still alive is only about 54%. So the longer your task, the less you should actually use spot.
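The arithmetic above can be checked with a simple geometric survival model (the constant 1%-per-minute reclamation rate is the comment's illustrative assumption, not a measured AWS figure):

```python
# Survival probability of a spot instance under a constant per-minute
# reclamation chance (simple geometric model; the 1% rate is purely
# illustrative, matching the assumption in the comment above).
def survival(per_minute_reclaim: float, minutes: int) -> float:
    return (1.0 - per_minute_reclaim) ** minutes

print(round(survival(0.01, 62), 3))  # ~0.536: barely better than a coin flip
print(round(survival(0.01, 10), 3))  # short tasks are far safer
```

This is why short branches (a few minutes) tolerate spot well while hour-long test suites do not.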
@lemeurherve could the ec2 template, when using spot instances, use a larger set of instancePools? Also, could the spot instance template default to onDemand if no spot instances are available?
We have been seeing instability in the acceptance-test-harness jobs presumably because of the spot reclamation.
I say presumably as I have no access to AWS to actually tell whether this is the case. See https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/finding-an-interrupted-Spot-Instance.html for how to see spot instance reclamation.
Whilst we now have the `retry` from @jglick, and this does at least seem to be better for long-running jobs, it may not be the best thing (a branch takes approx 40 minutes in ATH, so we may be better off running more branches, subject to the limits mentioned above, or not using spot here at all but on demand).
Additionally, I think the ec2 plugin (I assume we are using that for spot instances) should really note in the build log that the agent is being terminated, so you know why it has gone.
> @lemeurherve the ec2 template when using spot instances could use larger set of instancePools?
For the EKS cluster that provides container agents for ci.jenkins.io (used by BOM builds, for instance), it's already the case, yes. For the EC2 VM agents (type highmem, used by ATH for instance), we don't know if it is possible: currently checking the EC2 plugin used for that.
> Also potentially the spot instance template could default to onDemand if no spot instances are available?
That is the default behavior of what we configured for the EC2 VM agents, yep. Good tip!
> I say presumably as I have no access to AWS to actually tell if this is (or is not the case). see docs.aws.amazon.com/AWSEC2/latest/UserGuide/finding-an-interrupted-Spot-Instance.html for how to see spot instance reclamation.
From https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-request-status.html
> @lemeurherve the ec2 template when using spot instances could use larger set of instancePools?

> For the EKS cluster that provides container agents for ci.jenkins.io (=> BOM builds for instance) it's already the case yes. For the EC2 VM agents (type highmem, used by ATH for instance), we don't know if it is possible: currently checking the EC2 plugin used for that.
@jetersen we've checked the ec2 plugin config on ci.jenkins.io and in its docs, but we didn't find a way to configure an instance pool for it.
@res0nance would know better about the EC2 plugin as he is one of the maintainers.
Perhaps instance type accepts comma separation?
So close, yet no cigar: https://github.com/jenkinsci/ec2-plugin/blob/4f54ce9ea53331c7801b3314015d1530b123b642/src/main/java/hudson/plugins/ec2/SpotConfiguration.java#L201
Perhaps someone is willing to contribute a fix? :)
Potentially also include the default to onDemand option?
Just did the same on Azure, scoping to only the VMs of type highmem spawned by ci.jenkins.io in the past 6 months:
The spot cost saving on Azure is roughly the same as in AWS (~60%) for these instance sizes:
Is there any objection to the Jenkins Infra team disabling the spot mode for all highmem templates (both EC2 and Azure)?
With the following rationale:
=> Please note that this change would not have any direct effect on the BOM build. It might benefit it indirectly by not consuming spot instances in the same region.
Note that most branches of bom builds complete pretty quickly (a few minutes). It is just a handful of plugins that have very slow test suites (up to an hour or so). Theoretically we could use `parallel-test-executor` to split these up further; it just seemed like too much hassle to set up.
All bom agents are K8s, FYI. As of yesterday, these builds are using node retry (https://github.com/jenkinsci/bom/pull/1249).
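For reference, the pattern that PR introduces looks roughly like the following (a minimal sketch, not the exact bom Jenkinsfile; it assumes a Jenkins version whose `retry` step supports conditions, where `agent()` retries when the agent is removed and `nonresumable()` covers a controller restart mid-step; the label and shell command are illustrative):

```groovy
// Sketch: retry the whole node block on a fresh agent if the
// original agent disappears (e.g. a reclaimed spot instance).
retry(count: 2, conditions: [agent(), nonresumable()]) {
    node('maven-11') {       // label is hypothetical
        checkout scm
        sh 'mvn -B verify'   // command is illustrative
    }
}
```

The key difference from a plain `retry(2) { ... }` is the conditions list: without it, an agent-removal interruption is treated as a build abort rather than something worth retrying.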
> Is there any objection for the Jenkins Infra team disabling the spot mode for all highmem templates (both EC2 and Azure)?
No objection from me.
Filed https://issues.jenkins.io/browse/JENKINS-68963 for the ec2 plugin to make the reason for the agent disappearing clear for a spot instance.
Seems like the PR https://github.com/jenkins-infra/jenkins-infra/pull/2262 helped, along with the node retry.
I'm proceeding to close this issue: feel free to reopen (or open a new one) if you see new flakiness.
Thanks a lot!
Service(s)
ci.jenkins.io
Summary
Seems agents are removed quite frequently:
It happened 4 times for this build: https://ci.jenkins.io/job/Tools/job/bom/view/change-requests/job/PR-1240/
Reproduction steps
No response