jenkins-infra / helpdesk

Open your Infrastructure related issues here for the Jenkins project
https://github.com/jenkins-infra/helpdesk/issues/new/choose

ci.jenkins.io agents are very flaky #3031

Closed jetersen closed 2 years ago

jetersen commented 2 years ago

Service(s)

ci.jenkins.io

Summary

Seems agents are removed quite frequently:

09:16:17  Cannot contact jnlp-maven-11-8srg9: hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@20e353cb:JNLP4-connect connection from ec2-3-143-84-83.us-east-2.compute.amazonaws.com/3.143.84.83:3319": Remote call on JNLP4-connect connection from ec2-3-143-84-83.us-east-2.compute.amazonaws.com/3.143.84.83:3319 failed. The channel is closing down or has closed down
09:21:11  Agent jnlp-maven-11-8srg9 was deleted; cancelling node body
09:21:11  Could not connect to jnlp-maven-11-8srg9 to send interrupt signal to process

It happened 4 times for this build: https://ci.jenkins.io/job/Tools/job/bom/view/change-requests/job/PR-1240/

Reproduction steps

No response

jetersen commented 2 years ago

Another build with agent being removed: https://ci.jenkins.io/job/Tools/job/bom/job/master/1075/

[2022-07-04T21:41:11.039Z] Cannot contact jnlp-maven-11-c4hfh: hudson.remoting.RequestAbortedException: java.nio.channels.ClosedChannelException
[2022-07-04T21:46:14.791Z] Could not connect to jnlp-maven-11-c4hfh to send interrupt signal to process

This is really troublesome for longer builds such as Jenkins, ATH, BOM, or git-plugin, since an agent being removed breaks the build.

Which plugin decides that the build should fail if an agent is removed? Why not retry the steps on a new agent when the agent is removed?

@jglick do you think there is something we could improve in the BOM build pipeline to retry the build when a given failure condition is met? For example: if the agent is removed, retry the plugin test?

Yet another: https://ci.jenkins.io/job/Tools/job/bom/job/master/1076/
Yet another: https://ci.jenkins.io/job/Tools/job/bom/job/master/1077/

MarkEWaite commented 2 years ago

The agent availability check job runs every 4 hours to check that ci.jenkins.io agents can be allocated. It has been failing much more frequently in the last few days.

dduportal commented 2 years ago

Hello @jetersen , thanks for reporting.

We have different (parallel) issues on ci.jenkins.io that make it hard to tackle.

However, in the jobs you reported, the common denominator is that they are all BOM builds.

This job is a big consumer of executors on Kubernetes agents: ~180 per build, while we only provide ~150 pods simultaneously. That creates pressure, but the way ci.jenkins.io behaves is still odd.

@jglick is working on an improvement to the retry instruction that could help such builds be automatically re-triggered on these errors: tracked in https://github.com/jenkins-infra/helpdesk/issues/2984. This should make the problem less of an irritant.
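For context, the improvement tracked in that issue is about attaching error-category conditions to the Pipeline `retry` step, so only infrastructure failures trigger a retry. A minimal sketch of what a build could look like with it (the `agent()` and `nonresumable()` condition names and the `maven-11` label are assumptions for illustration, not something this thread confirms):

```groovy
// Sketch: retry the whole node block, but only when the failure is an
// agent-related error (e.g. the agent was removed) or a non-resumable
// interruption -- not a genuine test failure.
retry(count: 2, conditions: [agent(), nonresumable()]) {
    node('maven-11') { // hypothetical agent label
        checkout scm
        sh 'mvn -B verify'
    }
}
```

With conditions attached, a plain test failure is not retried, so compute is only spent re-running builds that died for infrastructure reasons.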

Also, @lemeurherve is working on increasing the partnership with DigitalOcean so we can get more compute capacity.

dduportal commented 2 years ago

The agent availability check job runs every 4 hours to check that ci.jenkins.io agents can be allocated. It has been failing much more frequently in the last few days.

After checking the build history of the acceptance job, I confirm that it is a different kind of failure: those failures are all about VM agents that cannot be started because of a public IP quota in Azure. Work in progress on this.

jglick commented 2 years ago

something we could improve in the BOM build pipeline to retry the build

Note there is already a crude check: https://github.com/jenkinsci/bom/blob/c2b4fb2fe2690cb8abc160f774ba71cb1a5efecb/Jenkinsfile#L51-L52

#2984 would allow us to limit this to infrastructure issues, so we do not waste time retrying branches that failed for genuine reasons.

jetersen commented 2 years ago

Note there is already a crude check: https://github.com/jenkinsci/bom/blob/c2b4fb2fe2690cb8abc160f774ba71cb1a5efecb/Jenkinsfile#L51-L52

That retry does not work when the agent is removed, as far as I can see. The pipeline basically grinds to a halt.

Yup, looking at the consoleText for https://ci.jenkins.io/job/Tools/job/bom/job/master/1075/ I only see Attempt 1 of 2 echoes. No Attempt 2 of 2.

jglick commented 2 years ago

Hmm, it should work except in cases where the controller was restarted in the middle. I think the problem is that FlowInterruptedException.actualInterruption is getting defaulted to true. Something else to fix.

lemeurherve commented 2 years ago

We suspect agents on spot instances are being killed as AWS reclaims them.

We switched from "on demand" to "spot" EC2 highmem instances to reduce the infra budget from 12k€ to 9k€ per month. We cannot exceed 10k€, so maybe we should stop using EC2 for ATH.

As noted elsewhere by @jtnord, the most time we are guaranteed to have a spot instance is 2 minutes:

Spot instances can be terminated whenever Amazon decides it will make more money by giving the underlying hardware to someone not using spot (i.e. there is no spare capacity). You get 2 minutes' notice of this, so you know the agent (host) will always be around for 2 minutes. Every minute you use beyond those 2 is a chance that the host will be reclaimed. That percentage is not fixed; it varies at certain times of the day with demand :slightly_smiling_face: But for argument's sake, say it was fixed (at least within, say, a 2-hour period): if the chance it is reclaimed in any minute is say 0.01%, after 62 minutes the chance the instance is still alive is only 55%. So the longer your task is, the less you should actually use spot.
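As an aside, the arithmetic in that quote only works out if the per-minute reclaim chance is read as roughly 1% rather than 0.01% (treating reclamation as independent from minute to minute):

```
P(alive after n minutes) = (1 - p)^n
p = 1%    : 0.99^62   ≈ 0.54   (the ~55% quoted)
p = 0.01% : 0.9999^62 ≈ 0.994
```

Either way, the qualitative conclusion stands: survival probability decays exponentially with build length, so long builds are the worst fit for spot.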

Plan of action:

jetersen commented 2 years ago

@lemeurherve the ec2 template when using spot instances could use a larger set of instancePools? Also, potentially the spot instance template could default to onDemand if no spot instances are available?

jtnord commented 2 years ago

We have been seeing instability in the acceptance-test-harness jobs presumably because of the spot reclamation.

I say presumably because I have no access to AWS to tell whether this is (or is not) the case. See https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/finding-an-interrupted-Spot-Instance.html for how to see spot instance reclamation.

Whilst we now have the retry from @jglick, and this does at least seem to help long-running jobs, it may not be the best approach. A branch takes approximately 40 minutes in ATH, so we may be better off running more branches (subject to the limits mentioned above), or not using spot here at all and using on demand instead.

jtnord commented 2 years ago

Additionally, I think the ec2 plugin (I assume we are using that for spot instances) should really note in the build log that the instance is being terminated, so you know why the agent has gone.

lemeurherve commented 2 years ago

FYI, here are the "highmem" EC2 instances we used for ci.jenkins.io over the last year: image

Another summary about spot instances: image (Note: the highlighted line includes the EC2 instances used for ci.jenkins.io highmem agents, but also the EKS cluster `cik8s` used for infra.ci.jenkins.io)
dduportal commented 2 years ago

@lemeurherve the ec2 template when using spot instances could use a larger set of instancePools?

For the EKS cluster that provides container agents for ci.jenkins.io (=> BOM builds, for instance), it's already the case, yes. For the EC2 VM agents (type highmem, used by ATH for instance), we don't know whether it is possible: we are currently checking the EC2 plugin used for that.

Also potentially the spot instance template could default to onDemand if no spot instances are available?

That is already the default behavior of what we configured for the EC2 VM agents, yep, good tip!

lemeurherve commented 2 years ago

I say presumably because I have no access to AWS to tell whether this is (or is not) the case. See docs.aws.amazon.com/AWSEC2/latest/UserGuide/finding-an-interrupted-Spot-Instance.html for how to see spot instance reclamation.

Here are the instances reclaimed (no "instance-stopped-no-capacity"): image

From https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-request-status.html

image
lemeurherve commented 2 years ago

@lemeurherve the ec2 template when using spot instances could use a larger set of instancePools?

For the EKS cluster that provides container agents for ci.jenkins.io (=> BOM builds, for instance), it's already the case, yes. For the EC2 VM agents (type highmem, used by ATH for instance), we don't know whether it is possible: we are currently checking the EC2 plugin used for that.

@jetersen we've checked the ec2 plugin config on ci.jenkins.io and its documentation, but we didn't find a way to configure an instance pool for it.

Screenshot: image
jetersen commented 2 years ago

@res0nance would know better about the EC2 plugin as he is one of the maintainers.

Perhaps instance type accepts comma separation?

So close, yet no cigar: https://github.com/jenkinsci/ec2-plugin/blob/4f54ce9ea53331c7801b3314015d1530b123b642/src/main/java/hudson/plugins/ec2/SpotConfiguration.java#L201

Perhaps someone is willing to contribute a fix? :)

Potentially also include the default to onDemand option?

dduportal commented 2 years ago

Just did the same check on Azure, scoped to only the VMs of type highmem spawned by ci.jenkins.io over the past 6 months:

Click to see screenshot Capture d’écran 2022-07-07 à 16 23 09

The spot cost saving on Azure is roughly the same as on AWS (~60%) for these instance sizes:

Click to see screenshot Capture d’écran 2022-07-07 à 16 22 06
dduportal commented 2 years ago

Is there any objection to the Jenkins Infra team disabling the spot mode for all highmem templates (both EC2 and Azure)? With the following rationale:

=> Please note that this change would not have any direct effect on the BOM build. It might benefit it indirectly by not consuming spot instances in the same region.

jglick commented 2 years ago

Note that most branches of bom builds complete pretty quickly (a few minutes). It is just a handful of plugins that have very slow test suites (up to an hour or so). Theoretically we could use parallel-test-executor to split these up further, it just seemed like too much hassle to set up.
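For reference, a minimal sketch of the split-up approach mentioned above, using the parallel-test-executor plugin's `splitTests` step. The parallelism count, the `maven-11` label, and the Maven invocation are hypothetical; only the `splitTests` step and its inclusion/exclusion output are the plugin's documented shape:

```groovy
// Sketch: split a slow test suite across 5 parallel agents.
// `splitTests` (parallel-test-executor plugin) returns one split per branch,
// each carrying a list of test names and whether it is an inclusion list.
def splits = splitTests parallelism: count(5), generateInclusions: true
def branches = [:]
splits.eachWithIndex { split, i ->
    branches["split${i}"] = {
        node('maven-11') { // hypothetical agent label
            checkout scm
            // Hand the split's test list to Surefire as an inclusions or
            // exclusions file, depending on what the split describes.
            writeFile file: 'tests.txt', text: split.list.join('\n')
            String prop = split.includes ? 'surefire.includesFile' : 'surefire.excludesFile'
            sh "mvn -B -D${prop}=tests.txt test" // illustrative invocation
            junit '**/target/surefire-reports/*.xml'
        }
    }
}
parallel branches
```

Each branch then runs a roughly equal share of the suite, which keeps individual node blocks short and less exposed to agent loss.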

All bom agents are K8s FYI. As of yesterday, these builds are using node retry (https://github.com/jenkinsci/bom/pull/1249).

MarkEWaite commented 2 years ago

Is there any objection to the Jenkins Infra team disabling the spot mode for all highmem templates (both EC2 and Azure)?

No objection from me.

jtnord commented 2 years ago

Filed https://issues.jenkins.io/browse/JENKINS-68963 for the ec2 plugin to make the reason for the agent disappearing clear for a spot instance.

dduportal commented 2 years ago

It seems that the PR https://github.com/jenkins-infra/jenkins-infra/pull/2262 helped, along with the node retry.

I'm proceeding to close this issue: feel free to reopen (or open a new one) if you see new flakiness.

Thanks a lot!