jenkinsci / ec2-fleet-plugin

The EC2 Fleet plugin launches EC2 instances as worker nodes for Jenkins CI server, automatically scaling the capacity with the load.
https://plugins.jenkins.io/ec2-fleet/
Apache License 2.0

EC2 ASG agents are not assigned to Jenkins fleet tags - Error during fleet '<fleet_name>' stats update java.lang.NullPointerException #429

Open tofanadrian3000 opened 10 months ago

tofanadrian3000 commented 10 months ago

Issue Details

Describe the bug

We have multiple Amazon EC2 Fleets. All of them were working fine until yesterday (January 10th 2024). As of today, the EC2 Fleet plugin scales up the ASGs behind the fleets and the ASG instances are started, but they are not assigned to the fleets' tags. As a result, Jenkins no longer tries to connect them as agents. Since they are never connected, the plugin keeps scaling the ASGs up until they reach maximum capacity, without any of the instances actually being used as agents.

To Reproduce

  1. Create an Amazon EC2 Fleet cloud by selecting any existing ASG as the "EC2 Fleet" and assigning any tag to it.
  2. The EC2 Fleet plugin scales the ASG up whenever a build is pending for a new agent with that tag.
  3. The ASG is scaled up.
  4. The new ASG instance is started.
  5. The new ASG instance is not assigned to the tag.

Logs

Jan 11, 2024 10:29:16 AM INFO com.amazon.jenkins.ec2fleet.CloudNanny doRun Error during fleet 'g11n-lre-rus-asg' stats update java.lang.NullPointerException

Jan 11, 2024 10:29:17 AM INFO com.amazon.jenkins.ec2fleet.CloudNanny doRun Error during fleet 'lt-pc-ci-asg' stats update java.lang.NullPointerException

Jan 11, 2024 10:29:17 AM INFO com.amazon.jenkins.ec2fleet.EC2FleetCloud info lt-infra-win-ci-asg [lt-infra-win-ci-asg] Set target capacity to '5'

Jan 11, 2024 10:29:17 AM INFO com.amazon.jenkins.ec2fleet.CloudNanny doRun Error during fleet 'lt-infra-win-ci-asg' stats update java.lang.NullPointerException

Jan 11, 2024 10:29:17 AM INFO com.amazon.jenkins.ec2fleet.CloudNanny doRun Error during fleet 'win2022-ec2-fleet' stats update java.lang.NullPointerException

Jan 11, 2024 10:29:17 AM INFO com.amazon.jenkins.ec2fleet.CloudNanny doRun Error during fleet 'g11n-lre-chs-asg' stats update java.lang.NullPointerException

Jan 11, 2024 10:29:18 AM INFO com.amazon.jenkins.ec2fleet.CloudNanny doRun Error during fleet 'g11n-lre-kor-asg' stats update java.lang.NullPointerException

Jan 11, 2024 10:29:18 AM INFO com.amazon.jenkins.ec2fleet.CloudNanny doRun Error during fleet 'nv-lin-ci-asg' stats update java.lang.NullPointerException

Jan 11, 2024 10:29:18 AM INFO com.amazon.jenkins.ec2fleet.CloudNanny doRun Error during fleet 'tc2-lin-import-asg' stats update java.lang.NullPointerException
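When many fleets fail at once, it can help to pull the distinct fleet names out of the Jenkins log before comparing them against the configured clouds. The following is a minimal sketch, assuming the message format shown in the log lines above; the class name `FleetLogScan` and the method `affectedFleets` are hypothetical helpers for illustration, not part of the plugin.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: not part of the ec2-fleet plugin.
class FleetLogScan {
    // Matches the CloudNanny message text as it appears in the log excerpt above.
    private static final Pattern FLEET_ERROR =
            Pattern.compile("Error during fleet '([^']+)' stats update");

    /** Returns the distinct fleet names found in error lines, in first-seen order. */
    static List<String> affectedFleets(List<String> logLines) {
        Set<String> fleets = new LinkedHashSet<>();
        for (String line : logLines) {
            Matcher m = FLEET_ERROR.matcher(line);
            // A single physical line may contain more than one fused log entry.
            while (m.find()) {
                fleets.add(m.group(1));
            }
        }
        return new ArrayList<>(fleets);
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
            "Jan 11, 2024 10:29:16 AM INFO com.amazon.jenkins.ec2fleet.CloudNanny doRun "
                + "Error during fleet 'g11n-lre-rus-asg' stats update java.lang.NullPointerException",
            "Jan 11, 2024 10:29:17 AM INFO com.amazon.jenkins.ec2fleet.CloudNanny doRun "
                + "Error during fleet 'lt-pc-ci-asg' stats update java.lang.NullPointerException");
        System.out.println(affectedFleets(sample));
    }
}
```

The sketch only groups the errors by fleet; it says nothing about the cause of the NullPointerException.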


Environment Details

Plugin Version? 3.2.0

Jenkins Version? 2.414.3

Spot Fleet or ASG? ASG

Label based fleet? No

Linux or Windows? So far Windows, but it may happen on Linux as well (we haven't tested that yet, but I don't think the OS is relevant in this case)

EC2Fleet Configuration as Code

It's just a small part but:

<clouds>
  <com.amazon.jenkins.ec2fleet.EC2FleetCloud plugin="ec2-fleet@3.2.0">
    <actions/>
    <name>ubuntu22-ec2-fleet</name>
    <awsCredentialsId></awsCredentialsId>
    <region>eu-central-1</region>
    <endpoint></endpoint>
    <fleet>ubuntu22tplv2asg_asg</fleet>
    <fsRoot></fsRoot>
    <computerConnector class="hudson.plugins.sshslaves.SSHConnector" plugin="ssh-slaves@2.916.vd17b_43357ce4">
      <port>22</port>
      <credentialsId>jenkins-ubuntu22-asg-slaves-ssh-key</credentialsId>
      <launchTimeoutSeconds>60</launchTimeoutSeconds>
      <maxNumRetries>10</maxNumRetries>
      <retryWaitTime>15</retryWaitTime>
      <sshHostKeyVerificationStrategy class="hudson.plugins.sshslaves.verifiers.NonVerifyingKeyVerificationStrategy"/>
      <tcpNoDelay>true</tcpNoDelay>
    </computerConnector>
    <privateIpUsed>true</privateIpUsed>
    <alwaysReconnect>true</alwaysReconnect>
    <labelString>ubuntu22-ec2-fleet</labelString>
    <idleMinutes>5</idleMinutes>
    <minSize>1</minSize>
    <maxSize>5</maxSize>
    <minSpareSize>0</minSpareSize>
    <numExecutors>50</numExecutors>
    <addNodeOnlyIfRunning>false</addNodeOnlyIfRunning>
    <restrictUsage>true</restrictUsage>
    <scaleExecutorsByWeight>false</scaleExecutorsByWeight>
    <executorScaler class="com.amazon.jenkins.ec2fleet.EC2FleetCloud$NoScaler">
      <numExecutors>50</numExecutors>
    </executorScaler>
    <initOnlineTimeoutSec>300</initOnlineTimeoutSec>
    <cloudStatusIntervalSec>10</cloudStatusIntervalSec>
    <maxTotalUses>1000</maxTotalUses>
    <disableTaskResubmit>false</disableTaskResubmit>
    <noDelayProvision>false</noDelayProvision>
  </com.amazon.jenkins.ec2fleet.EC2FleetCloud>
  <com.amazon.jenkins.ec2fleet.EC2FleetCloud plugin="ec2-fleet@3.2.0">
    <name>lt-pc-ci-asg</name>
    <awsCredentialsId></awsCredentialsId>
    <region>eu-central-1</region>
    <endpoint></endpoint>
    <fleet>lt-pc-ci-v2-asg_asg</fleet>
    <fsRoot>C:\jenkins</fsRoot>
    <computerConnector class="hudson.plugins.sshslaves.SSHConnector" plugin="ssh-slaves@2.916.vd17b_43357ce4">
      <port>22</port>
      <credentialsId>jenkins-agents-lrelrpauto-account</credentialsId>
      <launchTimeoutSeconds>60</launchTimeoutSeconds>
      <maxNumRetries>10</maxNumRetries>
      <retryWaitTime>15</retryWaitTime>
      <sshHostKeyVerificationStrategy class="hudson.plugins.sshslaves.verifiers.NonVerifyingKeyVerificationStrategy"/>
      <tcpNoDelay>true</tcpNoDelay>
    </computerConnector>
    <privateIpUsed>true</privateIpUsed>
    <alwaysReconnect>true</alwaysReconnect>
    <labelString>lt-pc-ci-asg</labelString>
    <idleMinutes>5</idleMinutes>
    <minSize>0</minSize>
    <maxSize>10</maxSize>
    <minSpareSize>0</minSpareSize>
    <numExecutors>1</numExecutors>
    <addNodeOnlyIfRunning>false</addNodeOnlyIfRunning>
    <restrictUsage>true</restrictUsage>
    <scaleExecutorsByWeight>false</scaleExecutorsByWeight>
    <executorScaler class="com.amazon.jenkins.ec2fleet.EC2FleetCloud$NoScaler">
      <numExecutors>1</numExecutors>
    </executorScaler>
    <initOnlineTimeoutSec>300</initOnlineTimeoutSec>
    <cloudStatusIntervalSec>10</cloudStatusIntervalSec>
    <maxTotalUses>-1</maxTotalUses>
    <disableTaskResubmit>false</disableTaskResubmit>
    <noDelayProvision>false</noDelayProvision>
  </com.amazon.jenkins.ec2fleet.EC2FleetCloud>
</clouds>

Anything else unique about your setup? All the fleets (including the Windows ones) are configured to connect to the agents using ssh. I don't know if it's relevant in this case and it's not quite "unique" but maybe the information helps.

ajax-koval-i commented 9 months ago

@tofanadrian3000 I have the same issue: after the cloud initializes, my ASG launches, for example, one EC2 Spot Instance, but I cannot see that instance under Jenkins -> Manage Jenkins -> Nodes.

The instance does exist in AWS, though. Do you have the same problem?

tofanadrian3000 commented 9 months ago

Yep, it seems similar indeed.

tofanadrian3000 commented 9 months ago

I've just tried again to use an ec2 fleet and it seems to be working fine again now. I haven't changed absolutely anything about it so I've no idea what happened.

ajax-koval-i commented 9 months ago

I've just tried again to use an ec2 fleet and it seems to be working fine again now. I haven't changed absolutely anything about it so I've no idea what happened.

Did you use an ASG or an EC2 Fleet?

tofanadrian3000 commented 9 months ago

All my Jenkins clouds are created as "Amazon EC2 Fleet" and in the "EC2 Fleet" input field, I'm selecting between my AWS ASGs.

ajax-koval-i commented 9 months ago

@tofanadrian3000 Did you use this cloud with freestyle projects, or with pipelines?

tofanadrian3000 commented 9 months ago

Pipelines

lukolszewski commented 7 months ago

Same issue. Does anyone have a workaround? This is making our Jenkins unusable. I tried recreating the fleets and restarting Jenkins; nothing helps. We have a very similar config and the same situation. Our logs fill up with:

024-03-15 12:09:44.074+0000 [id=60] INFO c.a.jenkins.ec2fleet.CloudNanny#doRun: Error during fleet 'XXXXXX' stats update
java.lang.NullPointerException
    at com.amazon.jenkins.ec2fleet.EC2FleetCloud.updateByState(EC2FleetCloud.java:634)
    at com.amazon.jenkins.ec2fleet.EC2FleetCloud.update(EC2FleetCloud.java:512)
    at com.amazon.jenkins.ec2fleet.CloudNanny.doRun(CloudNanny.java:57)
    at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:92)
    at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:67)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
    at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)

lukolszewski commented 7 months ago

A normal Jenkins restart doesn't make any difference. A hard restart (using systemctl) seems to make the problem go away (who knows for how long).

murtaza64 commented 5 months ago

We faced this issue and this is what turned out to be the solution for us:

We noticed that the CloudNanny errors in the log coincided with the time that we had deleted another EC2 Fleet Cloud via the Jenkins UI, but there were still three nodes from that cloud connected to the cluster. We manually deleted those three nodes from the Jenkins UI, and then the other EC2 Fleet cloud that was having issues started connecting agents again.
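The check behind this workaround, finding nodes that still reference a cloud which is no longer configured, can be sketched roughly as follows. This is only an illustration on plain data, not Jenkins or plugin code: `NodeInfo`, `orphanedNodes`, and all node and cloud names below are hypothetical, and how the node-to-cloud mapping is obtained from a real Jenkins instance is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical illustration: models the check on plain data, without any Jenkins API.
class OrphanedNodeCheck {
    /** Stand-in for a Jenkins node: its name and the name of the cloud that created it. */
    static final class NodeInfo {
        final String nodeName;
        final String cloudName;

        NodeInfo(String nodeName, String cloudName) {
            this.nodeName = nodeName;
            this.cloudName = cloudName;
        }
    }

    /** Returns the names of nodes whose owning cloud is not among the configured clouds. */
    static List<String> orphanedNodes(List<NodeInfo> nodes, Set<String> configuredClouds) {
        List<String> orphans = new ArrayList<>();
        for (NodeInfo node : nodes) {
            if (!configuredClouds.contains(node.cloudName)) {
                orphans.add(node.nodeName);
            }
        }
        return orphans;
    }

    public static void main(String[] args) {
        // Assumed example data: "deleted-fleet" was removed in the UI, but its nodes remain.
        Set<String> configuredClouds = Set.of("lt-pc-ci-asg");
        List<NodeInfo> nodes = List.of(
            new NodeInfo("i-aaa", "lt-pc-ci-asg"),
            new NodeInfo("i-bbb", "deleted-fleet"),
            new NodeInfo("i-ccc", "deleted-fleet"));
        System.out.println(orphanedNodes(nodes, configuredClouds));
    }
}
```

Nodes reported this way are the candidates to delete manually, as was done with the three leftover nodes described above.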

mrsombre commented 3 months ago

Today our fleets got stuck as described in this issue. The method @murtaza64 suggested helped: I manually deleted some nodes and Jenkins became operational again.