Open tofanadrian3000 opened 10 months ago
@tofanadrian3000 I have the same issue: after the cloud is initialized, my ASG starts, for example, one EC2 spot instance, but I cannot see this instance in Jenkins -> Manage Jenkins -> Nodes.
The instance does exist in AWS, though. Do you have the same problem?
Yep, it seems similar indeed.
I've just tried again to use an ec2 fleet and it seems to be working fine again now. I haven't changed absolutely anything about it so I've no idea what happened.
Did you use an ASG or an EC2 Fleet?
All my Jenkins clouds are created as "Amazon EC2 Fleet" and in the "EC2 Fleet" input field, I'm selecting between my AWS ASGs.
@tofanadrian3000 did you use this cloud with freestyle projects, or with pipelines?
Pipelines
Same issue. Does anyone have any workarounds? This is making our Jenkins unusable now. I tried recreating the fleets and restarting Jenkins, but nothing helps. We have a very similar config and the same situation. Our logs fill up with:
024-03-15 12:09:44.074+0000 [id=60] INFO c.a.jenkins.ec2fleet.CloudNanny#doRun: Error during fleet 'XXXXXX' stats update
java.lang.NullPointerException
    at com.amazon.jenkins.ec2fleet.EC2FleetCloud.updateByState(EC2FleetCloud.java:634)
    at com.amazon.jenkins.ec2fleet.EC2FleetCloud.update(EC2FleetCloud.java:512)
    at com.amazon.jenkins.ec2fleet.CloudNanny.doRun(CloudNanny.java:57)
    at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:92)
    at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:67)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
    at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
A normal Jenkins restart doesn't make any difference. A hard restart (using systemctl) seems to make the problem go away (who knows for how long).
We faced this issue and this is what turned out to be the solution for us:
We noticed that the CloudNanny errors in the log coincided with the time that we had deleted another EC2 Fleet Cloud via the Jenkins UI, but there were still three nodes from that cloud connected to the cluster. We manually deleted those three nodes from the Jenkins UI, and then the other EC2 Fleet cloud that was having issues started connecting agents again.
Today our fleets got stuck as described in this issue. The method @murtaza64 suggested helped: I manually deleted some nodes and Jenkins became operational again.
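The workaround above comes down to finding nodes whose owning cloud no longer exists. A minimal sketch of that check follows, assuming you can already obtain the list of configured cloud names and a node-to-cloud mapping (for example by hand from the Jenkins UI). The function name, the instance IDs and the deleted cloud name are hypothetical and are not part of the plugin.

```python
from typing import Dict, Iterable, List


def orphaned_nodes(configured_clouds: Iterable[str],
                   node_to_cloud: Dict[str, str]) -> List[str]:
    """Return the nodes whose owning cloud is no longer configured.

    node_to_cloud maps a node name to the name of the cloud that created it.
    How that mapping is collected is an assumption left to the reader.
    """
    known = set(configured_clouds)
    return sorted(node for node, cloud in node_to_cloud.items()
                  if cloud not in known)


if __name__ == "__main__":
    # Hypothetical data: one cloud was deleted while its node stayed connected.
    clouds = ["lt-pc-ci-asg", "ubuntu22-ec2-fleet"]
    nodes = {
        "i-0aaa": "lt-pc-ci-asg",
        "i-0bbb": "deleted-fleet-cloud",
    }
    for name in orphaned_nodes(clouds, nodes):
        print("candidate for manual deletion:", name)
```

The sketch only reports candidates. Deleting them is still a manual step in the Jenkins UI, as described above.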
Issue Details
Describe the bug
We have multiple Amazon EC2 Fleets. All of them were working fine until yesterday (January 10th 2024). As of today, the ec2 fleet plugin started to scale the ASGs behind the fleets up, the ASG instances are started but they are not assigned to the fleet's tags. Therefore, Jenkins is not trying to connect them as agents anymore. Since they are not connected as agents, the plugin keeps scaling the ASGs up until they reach the maximum capacity without any of the agents being actually used as agents.
To Reproduce
Logs
Jan 11, 2024 10:29:16 AM INFO com.amazon.jenkins.ec2fleet.CloudNanny doRun Error during fleet 'g11n-lre-rus-asg' stats update java.lang.NullPointerException
Jan 11, 2024 10:29:17 AM INFO com.amazon.jenkins.ec2fleet.CloudNanny doRun Error during fleet 'lt-pc-ci-asg' stats update java.lang.NullPointerException
Jan 11, 2024 10:29:17 AM INFO com.amazon.jenkins.ec2fleet.EC2FleetCloud info lt-infra-win-ci-asg [lt-infra-win-ci-asg] Set target capacity to '5'
Jan 11, 2024 10:29:17 AM INFO com.amazon.jenkins.ec2fleet.CloudNanny doRun Error during fleet 'lt-infra-win-ci-asg' stats update java.lang.NullPointerException
Jan 11, 2024 10:29:17 AM INFO com.amazon.jenkins.ec2fleet.CloudNanny doRun Error during fleet 'win2022-ec2-fleet' stats update java.lang.NullPointerException
Jan 11, 2024 10:29:17 AM INFO com.amazon.jenkins.ec2fleet.CloudNanny doRun Error during fleet 'g11n-lre-chs-asg' stats update java.lang.NullPointerException
Jan 11, 2024 10:29:18 AM INFO com.amazon.jenkins.ec2fleet.CloudNanny doRun Error during fleet 'g11n-lre-kor-asg' stats update java.lang.NullPointerException
Jan 11, 2024 10:29:18 AM INFO com.amazon.jenkins.ec2fleet.CloudNanny doRun Error during fleet 'nv-lin-ci-asg' stats update java.lang.NullPointerException
Jan 11, 2024 10:29:18 AM INFO com.amazon.jenkins.ec2fleet.CloudNanny doRun Error during fleet 'tc2-lin-import-asg' stats update java.lang.NullPointerException
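To see quickly which fleets are affected, the log lines above can be scanned for the CloudNanny error message. This is a minimal sketch, assuming the message format is exactly as shown in these logs. The function name and the regular expression are illustrative and do not come from the plugin.

```python
import re
from collections import Counter

# Matches the CloudNanny message as it appears in the logs above;
# the fleet name is the single-quoted part.
FLEET_ERROR = re.compile(r"Error during fleet '([^']+)' stats update")


def fleets_with_errors(log_text: str) -> Counter:
    """Count stats-update errors per fleet name found in log_text."""
    return Counter(FLEET_ERROR.findall(log_text))


# A few lines copied from the logs in this issue.
SAMPLE = """\
Jan 11, 2024 10:29:16 AM INFO com.amazon.jenkins.ec2fleet.CloudNanny doRun Error during fleet 'g11n-lre-rus-asg' stats update java.lang.NullPointerException
Jan 11, 2024 10:29:17 AM INFO com.amazon.jenkins.ec2fleet.CloudNanny doRun Error during fleet 'lt-pc-ci-asg' stats update java.lang.NullPointerException
Jan 11, 2024 10:29:17 AM INFO com.amazon.jenkins.ec2fleet.EC2FleetCloud info lt-infra-win-ci-asg [lt-infra-win-ci-asg] Set target capacity to '5'
"""

if __name__ == "__main__":
    for fleet, count in sorted(fleets_with_errors(SAMPLE).items()):
        print(f"{fleet}: {count}")
```

Lines that do not contain the error message, such as the "Set target capacity" line, are ignored.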
Environment Details
Plugin Version? 3.2.0
Jenkins Version? 2.414.3
Spot Fleet or ASG? ASG
Label based fleet? No
Linux or Windows? So far Windows, but it may happen on Linux as well (we haven't tested that yet, but I don't think the OS is relevant in this case)
EC2Fleet Configuration as Code
It's just a small part but:
<clouds>
  <com.amazon.jenkins.ec2fleet.EC2FleetCloud plugin="ec2-fleet@3.2.0">
    <actions/>
    <name>ubuntu22-ec2-fleet</name>
    <awsCredentialsId></awsCredentialsId>
    <region>eu-central-1</region>
    <endpoint></endpoint>
    <fleet>ubuntu22tplv2asg_asg</fleet>
    <fsRoot></fsRoot>
    <computerConnector class="hudson.plugins.sshslaves.SSHConnector" plugin="ssh-slaves@2.916.vd17b_43357ce4">
      <port>22</port>
      <credentialsId>jenkins-ubuntu22-asg-slaves-ssh-key</credentialsId>
      <launchTimeoutSeconds>60</launchTimeoutSeconds>
      <maxNumRetries>10</maxNumRetries>
      <retryWaitTime>15</retryWaitTime>
      <sshHostKeyVerificationStrategy class="hudson.plugins.sshslaves.verifiers.NonVerifyingKeyVerificationStrategy"/>
      <tcpNoDelay>true</tcpNoDelay>
    </computerConnector>
    <privateIpUsed>true</privateIpUsed>
    <alwaysReconnect>true</alwaysReconnect>
    <labelString>ubuntu22-ec2-fleet</labelString>
    <idleMinutes>5</idleMinutes>
    <minSize>1</minSize>
    <maxSize>5</maxSize>
    <minSpareSize>0</minSpareSize>
    <numExecutors>50</numExecutors>
    <addNodeOnlyIfRunning>false</addNodeOnlyIfRunning>
    <restrictUsage>true</restrictUsage>
    <scaleExecutorsByWeight>false</scaleExecutorsByWeight>
    <executorScaler class="com.amazon.jenkins.ec2fleet.EC2FleetCloud$NoScaler">
      <numExecutors>50</numExecutors>
    </executorScaler>
    <initOnlineTimeoutSec>300</initOnlineTimeoutSec>
    <cloudStatusIntervalSec>10</cloudStatusIntervalSec>
    <maxTotalUses>1000</maxTotalUses>
    <disableTaskResubmit>false</disableTaskResubmit>
    <noDelayProvision>false</noDelayProvision>
  </com.amazon.jenkins.ec2fleet.EC2FleetCloud>
  <com.amazon.jenkins.ec2fleet.EC2FleetCloud plugin="ec2-fleet@3.2.0">
    <name>lt-pc-ci-asg</name>
    <awsCredentialsId></awsCredentialsId>
    <region>eu-central-1</region>
    <endpoint></endpoint>
    <fleet>lt-pc-ci-v2-asg_asg</fleet>
    <fsRoot>C:\jenkins</fsRoot>
    <computerConnector class="hudson.plugins.sshslaves.SSHConnector" plugin="ssh-slaves@2.916.vd17b_43357ce4">
      <port>22</port>
      <credentialsId>jenkins-agents-lrelrpauto-account</credentialsId>
      <launchTimeoutSeconds>60</launchTimeoutSeconds>
      <maxNumRetries>10</maxNumRetries>
      <retryWaitTime>15</retryWaitTime>
      <sshHostKeyVerificationStrategy class="hudson.plugins.sshslaves.verifiers.NonVerifyingKeyVerificationStrategy"/>
      <tcpNoDelay>true</tcpNoDelay>
    </computerConnector>
    <privateIpUsed>true</privateIpUsed>
    <alwaysReconnect>true</alwaysReconnect>
    <labelString>lt-pc-ci-asg</labelString>
    <idleMinutes>5</idleMinutes>
    <minSize>0</minSize>
    <maxSize>10</maxSize>
    <minSpareSize>0</minSpareSize>
    <numExecutors>1</numExecutors>
    <addNodeOnlyIfRunning>false</addNodeOnlyIfRunning>
    <restrictUsage>true</restrictUsage>
    <scaleExecutorsByWeight>false</scaleExecutorsByWeight>
    <executorScaler class="com.amazon.jenkins.ec2fleet.EC2FleetCloud$NoScaler">
      <numExecutors>1</numExecutors>
    </executorScaler>
    <initOnlineTimeoutSec>300</initOnlineTimeoutSec>
    <cloudStatusIntervalSec>10</cloudStatusIntervalSec>
    <maxTotalUses>-1</maxTotalUses>
    <disableTaskResubmit>false</disableTaskResubmit>
    <noDelayProvision>false</noDelayProvision>
  </com.amazon.jenkins.ec2fleet.EC2FleetCloud>
</clouds>
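When comparing several clouds at once, it can help to pull out just the name, backing ASG and size limits from a configuration like the one above. The following is a rough sketch using Python's standard XML parser. It assumes the fragment is available as a string with `<clouds>` as its root element. The function name is hypothetical, and the sample is trimmed to the fields it reads.

```python
import xml.etree.ElementTree as ET

# Element name used for each cloud in the configuration above.
CLOUD_TAG = "com.amazon.jenkins.ec2fleet.EC2FleetCloud"


def summarize_clouds(xml_text: str) -> list:
    """Return name, fleet (ASG) and size limits for each EC2 Fleet cloud."""
    root = ET.fromstring(xml_text)
    summary = []
    for cloud in root.iter(CLOUD_TAG):
        summary.append({
            # findtext only looks at direct children, so nested elements
            # such as executorScaler/numExecutors are not picked up.
            "name": cloud.findtext("name"),
            "fleet": cloud.findtext("fleet"),
            "minSize": int(cloud.findtext("minSize")),
            "maxSize": int(cloud.findtext("maxSize")),
        })
    return summary


# Trimmed copy of the configuration above, keeping only the fields read here.
SAMPLE_XML = """\
<clouds>
  <com.amazon.jenkins.ec2fleet.EC2FleetCloud plugin="ec2-fleet@3.2.0">
    <name>ubuntu22-ec2-fleet</name>
    <fleet>ubuntu22tplv2asg_asg</fleet>
    <minSize>1</minSize>
    <maxSize>5</maxSize>
  </com.amazon.jenkins.ec2fleet.EC2FleetCloud>
  <com.amazon.jenkins.ec2fleet.EC2FleetCloud plugin="ec2-fleet@3.2.0">
    <name>lt-pc-ci-asg</name>
    <fleet>lt-pc-ci-v2-asg_asg</fleet>
    <minSize>0</minSize>
    <maxSize>10</maxSize>
  </com.amazon.jenkins.ec2fleet.EC2FleetCloud>
</clouds>
"""

if __name__ == "__main__":
    for cloud in summarize_clouds(SAMPLE_XML):
        print(cloud)
```

The `maxSize` values are the ceilings the ASGs climb towards when instances start but never connect as agents, as described in the bug report.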
Anything else unique about your setup? All the fleets (including the Windows ones) are configured to connect to the agents using ssh. I don't know if it's relevant in this case and it's not quite "unique" but maybe the information helps.