Describe the bug
We have an EC2 Fleet set up to run instances on an AutoScaling Group, with a default minimum size of 0 (we don't want instances running when no jobs are running). We set up the fleet using Terraform and configure it within Jenkins using CasC (the CasC configuration for the fleet is below).
However, we are seeing that nodes created by this AutoScaling Group are terminated after ~15 minutes of uptime, due to an event being sent to AWS telling the ASG to scale down from 1 to 0. This happens consistently: we had a job that would restart if its node was lost, and it ran repeatedly overnight, restarting every 15 minutes.
In the CloudTrail events, we see the ASG receive a request from the ec2-fleet-plugin to scale down from 1 to 0 nodes, and then the instance is terminated, even though a job is running and Scale In protection is enabled. In the ASG events, we see that the scale-down request was prevented by Scale In protection, but the node is terminated anyway.
I saw a few other tickets where this issue was related to the maximum jobs for the node; ours is set to -1 to allow unlimited uses. In addition, the node never actually finishes any job, as the job it's running takes longer than 15 minutes.
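For anyone trying to confirm the Scale In protection state described above, here is a minimal sketch that lists each instance in an ASG together with its lifecycle state and protection flag. It assumes a boto3 Auto Scaling client is passed in; the function name and the ASG name in the usage comment are hypothetical and not taken from our setup.

```python
def scale_in_protection_report(asg_client, group_name):
    """Return (instance_id, lifecycle_state, protected) for every instance in the ASG.

    asg_client is expected to behave like boto3.client("autoscaling").
    """
    response = asg_client.describe_auto_scaling_groups(
        AutoScalingGroupNames=[group_name]
    )
    report = []
    for group in response["AutoScalingGroups"]:
        for instance in group["Instances"]:
            report.append(
                (
                    instance["InstanceId"],
                    instance["LifecycleState"],
                    instance["ProtectedFromScaleIn"],
                )
            )
    return report


# Usage against a real account (requires the third-party boto3 package;
# "my-asg-name" is a placeholder, not our real ASG name):
#   import boto3
#   print(scale_in_protection_report(boto3.client("autoscaling"), "my-asg-name"))
```

Passing the client in as a parameter keeps the function easy to exercise with a stub, without AWS credentials.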
To Reproduce
Create an ASG with a default desired capacity of 0
Configure the ASG in the Jenkins EC2 Fleet with an original desired capacity of 0
Start a long-running (more than 15 minutes) job in AWS that runs on this ASG
After ~15 minutes, the node is terminated, and a new node is started immediately.
Logs
From the ASG Events (note that the time of the scale-down cancel is 12:35:20Z):
Status | Description | Cause | Start time | End time
Successful | Terminating EC2 instance: i-08912acdba64190f4 | At 2024-03-18T12:36:08Z an instance was taken out of service in response to an EC2 health check indicating it has been terminated or stopped. | 2024 March 18, 08:36:08 AM -04:00 | 2024 March 18, 08:36:50 AM -04:00
Cancelled | Could not scale to desired capacity because all remaining instances are protected from scale-in. | At 2024-03-18T12:35:20Z a user request update of AutoScalingGroup constraints to min: 0, max: 15, desired: 0 changing the desired capacity from 1 to 0. At 2024-03-18T12:35:31Z group reached equilibrium. | 2024 March 18, 08:35:31 AM -04:00 | 2024 March 18, 08:35:31 AM -04:00
Successful | Launching a new EC2 instance: i-08912acdba64190f4 | At 2024-03-18T12:28:08Z an instance was launched in response to an unhealthy instance needing to be replaced. | 2024 March 18, 08:28:11 AM -04:00 | 2024 March 18, 08:32:17 AM -04:00
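To line the ASG events up on one timeline, the following sketch pulls the first UTC timestamp out of each cause text and computes the gaps between events. The cause strings are copied from the events above; the dictionary keys and function names are only illustrative.

```python
import re
from datetime import datetime, timezone

# Cause texts copied from the ASG events above; the keys are illustrative labels.
EVENTS = {
    "launch": "At 2024-03-18T12:28:08Z an instance was launched in response "
              "to an unhealthy instance needing to be replaced.",
    "scale_in_cancelled": "At 2024-03-18T12:35:20Z a user request update of "
                          "AutoScalingGroup constraints to min: 0, max: 15, "
                          "desired: 0 changing the desired capacity from 1 to 0.",
    "terminate": "At 2024-03-18T12:36:08Z an instance was taken out of service "
                 "in response to an EC2 health check indicating it has been "
                 "terminated or stopped.",
}

TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z")


def first_timestamp(cause):
    """Parse the first ISO-8601 UTC timestamp found in an ASG cause string."""
    match = TIMESTAMP.search(cause)
    if match is None:
        raise ValueError("no timestamp found in cause text")
    parsed = datetime.strptime(match.group(), "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def seconds_between(earlier, later):
    """Seconds elapsed between two labelled events in EVENTS."""
    delta = first_timestamp(EVENTS[later]) - first_timestamp(EVENTS[earlier])
    return delta.total_seconds()


print(seconds_between("scale_in_cancelled", "terminate"))  # 48.0
print(seconds_between("launch", "terminate"))  # 480.0
```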
From the AWS CloudTrail logs for the instance and for this AutoScaling Group (both events are at 12:35:20Z as well, and were initiated by the ec2-fleet-plugin):
I've included the best logs I could find. There are no additional FINE logs in the ec2fleet logger or default logger in Jenkins that have any additional information.
Jenkins logs around the time of the scale-down request (there is no mention of a scale-down, only that the node is already terminated; there are no log entries at 12:35:20Z):
Mar 18 12:35:01 ip-10-2-6-10 jenkins[1868]: 2024-03-18 12:35:01.295+0000 [id=84] INFO c.a.j.e.EC2FleetOnlineChecker#run: No connection to node 'ec2-fleet-linux-medium i-123'. Attempting to connect and waiting before retry
Mar 18 12:35:01 ip-10-2-6-10 jenkins[1868]: 2024-03-18 12:35:01.376+0000 [id=63770] WARNING h.plugins.sshslaves.SSHLauncher#launch: SSH Launch of i-123 on 10.2.10.16 failed in 80 ms
Mar 18 12:35:06 ip-10-2-6-10 jenkins[1868]: 2024-03-18 12:35:06.266+0000 [id=42] INFO c.a.j.e.NoDelayProvisionStrategy#apply: label [ec2-fleet-linux-medium]: No excess workload, provisioning not needed.
Mar 18 12:35:06 ip-10-2-6-10 jenkins[1868]: 2024-03-18 12:35:06.266+0000 [id=42] INFO c.a.j.e.NoDelayProvisionStrategy#apply: label [ec2-fleet-windows-large]: No excess workload, provisioning not needed.
Mar 18 12:35:13 ip-10-2-6-10 jenkins[1868]: 2024-03-18 12:35:13.353+0000 [id=84] INFO c.a.j.e.EC2FleetOnlineChecker#run: No connection to node 'ec2-fleet-windows-large i-234'. Attempting to connect and waiting before retry
Mar 18 12:35:14 ip-10-2-6-10 jenkins[1868]: 2024-03-18 12:35:14.058+0000 [id=63770] WARNING h.plugins.sshslaves.SSHLauncher#launch: SSH Launch of i-234 on 10.2.0.176 failed in 703 ms
Mar 18 12:35:16 ip-10-2-6-10 jenkins[1868]: 2024-03-18 12:35:16.266+0000 [id=39] INFO c.a.j.e.NoDelayProvisionStrategy#apply: label [ec2-fleet-linux-medium]: No excess workload, provisioning not needed.
Mar 18 12:35:16 ip-10-2-6-10 jenkins[1868]: 2024-03-18 12:35:16.266+0000 [id=39] INFO c.a.j.e.NoDelayProvisionStrategy#apply: label [ec2-fleet-windows-large]: No excess workload, provisioning not needed.
Mar 18 12:35:16 ip-10-2-6-10 jenkins[1868]: 2024-03-18 12:35:16.296+0000 [id=84] INFO c.a.j.e.EC2FleetOnlineChecker#run: No connection to node 'ec2-fleet-linux-medium i-123'. Attempting to connect and waiting before retry
Mar 18 12:35:16 ip-10-2-6-10 jenkins[1868]: 2024-03-18 12:35:16.363+0000 [id=63770] WARNING h.plugins.sshslaves.SSHLauncher#launch: SSH Launch of i-123 on 10.2.10.16 failed in 66 ms
Mar 18 12:35:23 ip-10-2-6-10 jenkins[1868]: 2024-03-18 12:35:23.136+0000 [id=44] INFO c.a.j.ec2fleet.EC2FleetCloud#info: ec2-fleet-windows-large [ec2-fleet-windows-large] Fleet 'ec2-fleet-windows-large' no longer has the instance 'i-234'. Removing instance from Jenkins
Mar 18 12:35:23 ip-10-2-6-10 jenkins[1868]: 2024-03-18 12:35:23.136+0000 [id=63770] INFO c.a.j.e.EC2FleetAutoResubmitComputerLauncher#afterDisconnect: DISCONNECTED: ec2-fleet-windows-large i-234
Mar 18 12:35:23 ip-10-2-6-10 jenkins[1868]: 2024-03-18 12:35:23.137+0000 [id=63770] INFO c.a.j.e.EC2FleetAutoResubmitComputerLauncher#afterDisconnect: Start retriggering executors for ec2-fleet-windows-large i-234
Mar 18 12:35:23 ip-10-2-6-10 jenkins[1868]: 2024-03-18 12:35:23.137+0000 [id=63770] INFO c.a.j.e.EC2FleetAutoResubmitComputerLauncher#afterDisconnect: Finished retriggering executors for ec2-fleet-windows-large i-234
Mar 18 12:35:23 ip-10-2-6-10 jenkins[1868]: 2024-03-18 12:35:23.352+0000 [id=44] INFO c.a.j.e.EC2RetentionStrategy#isIdleForTooLong: Instance ec2-fleet-windows-medium i-04c2baecb91271418 has been idle for too long (Age: 23006529, Max Age: 300000).
Mar 18 12:35:23 ip-10-2-6-10 jenkins[1868]: 2024-03-18 12:35:23.352+0000 [id=44] INFO c.a.j.ec2fleet.EC2FleetCloud#info: ec2-fleet-windows-medium [ec2-fleet-windows-medium ec2-fleet] Not scheduling instance 'i-345' for termination because we need a minimum of 2 instance(s) running
Mar 18 12:35:23 ip-10-2-6-10 jenkins[1868]: 2024-03-18 12:35:23.352+0000 [id=44] INFO c.a.j.e.EC2RetentionStrategy#isIdleForTooLong: Instance:ec2-fleet-windows-medium i-456 Age: 39979 Max Age:300000
Mar 18 12:35:23 ip-10-2-6-10 jenkins[1868]: 2024-03-18 12:35:23.352+0000 [id=44] INFO c.a.j.e.EC2RetentionStrategy#isIdleForTooLong: Instance:ec2-fleet-linux-medium i-123 Age: 7056 Max Age:300000
Mar 18 12:35:23 ip-10-2-6-10 jenkins[1868]: 2024-03-18 12:35:23.353+0000 [id=44] INFO c.a.j.ec2fleet.EC2FleetCloud#info: ec2-fleet-windows-large [ec2-fleet-windows-large] Skipping label update, the Jenkins node for instance 'i-234' was null
Mar 18 12:35:26 ip-10-2-6-10 jenkins[1868]: 2024-03-18 12:35:26.266+0000 [id=46] INFO c.a.j.e.NoDelayProvisionStrategy#apply: label [ec2-fleet-linux-medium]: No excess workload, provisioning not needed.
Mar 18 12:35:26 ip-10-2-6-10 jenkins[1868]: 2024-03-18 12:35:26.266+0000 [id=46] INFO c.a.j.e.NoDelayProvisionStrategy#apply: label [ec2-fleet-windows-large]: No excess workload, provisioning not needed.
Mar 18 12:35:28 ip-10-2-6-10 jenkins[1868]: 2024-03-18 12:35:28.354+0000 [id=84] INFO c.a.j.e.EC2FleetOnlineChecker#run: No connection to node 'ec2-fleet-windows-large i-234'. Waiting before retry
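Since the Jenkins log interleaves several fleets, a small filter makes it easier to follow a single instance. This is a rough sketch that assumes the line layout shown above (journald prefix, then the Jenkins timestamp, thread id, level, logger, and message); the regex and function name are my own and not part of the plugin.

```python
import re

# Matches the Jenkins portion of each journald line shown above.
LOG_LINE = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\+0000\s+"
    r"\[id=(?P<thread>\d+)\]\s+(?P<level>[A-Z]+)\s+"
    r"(?P<logger>\S+): (?P<message>.*)"
)


def entries_for_instance(lines, instance_id):
    """Return (timestamp, level, logger) for lines whose message names the instance."""
    # Word boundaries keep "i-123" from also matching a longer id such as "i-1234".
    wanted = re.compile(r"\b" + re.escape(instance_id) + r"\b")
    entries = []
    for line in lines:
        match = LOG_LINE.search(line)
        if match and wanted.search(match.group("message")):
            entries.append(
                (match.group("timestamp"), match.group("level"), match.group("logger"))
            )
    return entries


# Three lines copied from the log excerpt above.
SAMPLE = [
    "Mar 18 12:35:01 ip-10-2-6-10 jenkins[1868]: 2024-03-18 12:35:01.376+0000 [id=63770] WARNING h.plugins.sshslaves.SSHLauncher#launch: SSH Launch of i-123 on 10.2.10.16 failed in 80 ms",
    "Mar 18 12:35:06 ip-10-2-6-10 jenkins[1868]: 2024-03-18 12:35:06.266+0000 [id=42] INFO c.a.j.e.NoDelayProvisionStrategy#apply: label [ec2-fleet-linux-medium]: No excess workload, provisioning not needed.",
    "Mar 18 12:35:23 ip-10-2-6-10 jenkins[1868]: 2024-03-18 12:35:23.136+0000 [id=44] INFO c.a.j.ec2fleet.EC2FleetCloud#info: ec2-fleet-windows-large [ec2-fleet-windows-large] Fleet 'ec2-fleet-windows-large' no longer has the instance 'i-234'. Removing instance from Jenkins",
]

for entry in entries_for_instance(SAMPLE, "i-234"):
    print(entry)
```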
Environment Details
Plugin Version?
3.2.0 and 2.7.0 (tried both versions)
Jenkins Version? 2.440.1
Spot Fleet or ASG? ASG
Label based fleet? Yes
Linux or Windows? Windows
EC2Fleet Configuration as Code