Closed limeman40 closed 10 months ago
Is it possible the VM is getting overloaded? What are the metrics like?
I am not seeing high metrics on any of the VMs spun up from gallery images. Is there anything else I should look at on my end?
Maybe ask Microsoft about the health events?
I did have a support case open with them, and I have asked them to escalate the issue. From the messages I am seeing in the Azure logs, it seems like it may be a hypervisor issue.
I had another question. I had an issue with Jenkins where I accidentally deleted system files for it, and I had to restore the whole VM from a snapshot. It seems to be working fine. However, it almost seems like this issue lines up with that timeline.
Is there a way I could export the existing plugin configuration, so that I could remove the plugin and completely reinstall it? I am just curious whether that could help with this issue.
Using this plugin would probably be the easiest: https://github.com/jenkinsci/configuration-as-code-plugin
Otherwise, you could copy the config out of the clouds section of config.xml.
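For the second option, a minimal Java sketch of pulling the clouds section out of a Jenkins config.xml might look like the following. It assumes the cloud configuration sits in an element named clouds, as the comment above suggests. The class name CloudsSectionExport, the method extractClouds, and the sample XML (including the exampleCloud element) are hypothetical and only there for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class CloudsSectionExport {

    // Returns the first <clouds> element serialized as XML, or "" if there is none.
    // Assumption: the cloud configuration lives in an element named "clouds".
    public static String extractClouds(String configXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(configXml.getBytes(StandardCharsets.UTF_8)));
        NodeList nodes = doc.getElementsByTagName("clouds");
        if (nodes.getLength() == 0) {
            return "";
        }
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(nodes.item(0)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample used when no path is given; a real config.xml has far more content.
        String sample = "<hudson><numExecutors>2</numExecutors>"
                + "<clouds><exampleCloud><name>Azure-Cloud</name></exampleCloud></clouds>"
                + "</hudson>";
        String xml = args.length > 0 ? Files.readString(Paths.get(args[0])) : sample;
        System.out.println(extractClouds(xml));
    }
}
```

Run it against a copy of config.xml rather than the live file, and keep the output somewhere safe before removing the plugin. It only reads the file, so putting the section back would still be a manual step.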
I ended up just jotting down all the configuration in a couple of text files, then pulled the plugin out and reinstalled it. I am curious whether this helps fix it.
I will have to keep an eye on this tomorrow. I will come back and close this bug if this does solve it.
That did not fix the issue. How does the cleanup process work in the plugin? I am wondering whether there is some hiccup on the Azure side that it recovers from, but the plugin thinks the VM is broken and has it removed.
Is it possible this is some kind of race condition I am seeing?
Unsure. I haven't used the Pool retention strategy in a while. I use the idle one with a timeout of 5 minutes and it works fine.
Can you do some testing on Pool Retention?
I will give idle retention a try and see if it works better.
I just saw this in the Jenkins logs:
java.io.IOException: Agent failed to connect, even though the launcher didn't report it. See the log output for details.
    at hudson.slaves.SlaveComputer.lambda$_connect$0(SlaveComputer.java:325)
Caused: java.util.concurrent.ExecutionException
    at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
    at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)
    at com.microsoft.azure.vmagent.AzureVMCloud$2.call(AzureVMCloud.java:856)
Caused: com.microsoft.azure.vmagent.exceptions.AzureCloudException
    at com.microsoft.azure.vmagent.exceptions.AzureCloudException.create(AzureCloudException.java:54)
    at com.microsoft.azure.vmagent.exceptions.AzureCloudException.create(AzureCloudException.java:33)
    at com.microsoft.azure.vmagent.AzureVMCloud$2.call(AzureVMCloud.java:885)
    at com.microsoft.azure.vmagent.AzureVMCloud$2.call(AzureVMCloud.java:808)
    at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
    at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:80)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
this error as well:
java.lang.Exception: Node ProvisioningActivity for Azure-Cloud/winagent/null (-1363608790) has lost. Mark as failure
    at com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask.cleanCloudStatistics(AzureVMAgentCleanUpTask.java:577)
    at com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask.clean(AzureVMAgentCleanUpTask.java:596)
    at com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask.lambda$execute$1(AzureVMAgentCleanUpTask.java:604)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
This seems to be when it is trying to spin up a VM
I made this change and it is working better, but I am still getting random disconnects from Azure. Is there anything I can do to get more details from the plugin on why this is happening?
I also set up an SSH logger in Jenkins to see whether it might be some kind of SSH disconnect. I am seeing this in those logs:
Failed connecting to host 10.188.0.39:22.
java.net.NoRouteToHostException: No route to host (Host unreachable)
    at java.base/java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.base/java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:412)
    at java.base/java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:255)
    at java.base/java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:237)
    at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.base/java.net.Socket.connect(Socket.java:609)
    at java.base/java.net.Socket.connect(Socket.java:558)
    at java.base/java.net.Socket.<init>(Socket.java:454)
    at java.base/java.net.Socket.<init>(Socket.java:231)
    at com.jcraft.jsch.Util.lambda$createSocket$0(Util.java:389)
Caused: com.jcraft.jsch.JSchException
    at com.jcraft.jsch.Util.createSocket(Util.java:417)
    at com.jcraft.jsch.Session.connect(Session.java:217)
    at com.jcraft.jsch.Session.connect(Session.java:187)
    at com.microsoft.azure.vmagent.remote.AzureVMAgentSSHLauncher.getRemoteSession(AzureVMAgentSSHLauncher.java:311)
    at com.microsoft.azure.vmagent.remote.AzureVMAgentSSHLauncher.connectToSsh(AzureVMAgentSSHLauncher.java:457)
    at com.microsoft.azure.vmagent.remote.AzureVMAgentSSHLauncher.launch(AzureVMAgentSSHLauncher.java:111)
    at hudson.slaves.SlaveComputer.lambda$_connect$0(SlaveComputer.java:297)
    at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
    at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:80)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
We have had some luck changing to the Idle retention strategy. The disconnects still happen, but now an agent might last over an hour before it does.
We have also done a test where we just statically connect an agent and those do not disconnect at all. So I am thinking it is some issue with the cleanup process for this plugin.
Does the cleanup process not take into account any of the agent Node Monitoring changes? I have response time turned off in mine so that Jenkins will not randomly disconnect agents. Whatever it is seems to be on the plugin side. We enjoy using this plugin, but we will have to stop if it keeps being an unstable solution for us.
Failed connecting to host 10.188.0.39:22. java.net.NoRouteToHostException: No route to host (Host unreachable) at java.base/java.net.PlainSocketImpl.socketConnect(Native Method) at
That could be on initial startup, before the VM is available. If it happens while the VM is running, something is wrong.
Really unsure. All I can say is we use it for 1000s of builds a day and it works really well without this issue.
I am seeing a lot of Health Event messages in the Activity Logs in Azure, like the ones below. Is it possible the cleanup process is cleaning up things that are in use?
"details": "This virtual machine is stopped and deallocated as requested by an authorized user or process.",
"title": "Down: Virtual machine has been unavailable for 15 minutes",
I am not sure where to look at this point. It seems like the cleanup process is cleaning up things that should not be cleaned up.
How does this class work, for instance: com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask?
Is it possible something about this is broken in the current version?
What logging can I turn on that might give me more of an idea what is happening?
Enabling com.microsoft.azure (there should already be a log recorder set up for this) should give you all of the plugin's logging.
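Jenkins logs through java.util.logging, so com.microsoft.azure is an ordinary logger name, and a level set on it applies to every logger below it. In Jenkins itself you would normally set this on a log recorder under Manage Jenkins > System Log rather than in code. As a rough standalone sketch of what enabling that logger amounts to (the class name AzureLoggerSketch and the sample message are hypothetical):

```java
import java.util.logging.ConsoleHandler;
import java.util.logging.Level;
import java.util.logging.Logger;

public class AzureLoggerSketch {

    // Keep a strong reference: java.util.logging only holds loggers weakly,
    // so a configured logger can otherwise be garbage collected.
    private static final Logger AZURE = Logger.getLogger("com.microsoft.azure");

    // Lets every level through on the parent logger and attaches one console handler.
    public static Logger enableVerbose() {
        AZURE.setLevel(Level.ALL);
        if (AZURE.getHandlers().length == 0) {
            ConsoleHandler handler = new ConsoleHandler();
            handler.setLevel(Level.ALL); // handlers filter too; the default is INFO
            AZURE.addHandler(handler);
        }
        return AZURE;
    }

    public static void main(String[] args) {
        enableVerbose();
        // Child loggers inherit the parent's level, so FINE records are now published.
        Logger child = Logger.getLogger("com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask");
        child.fine("hypothetical FINE message, for illustration only");
    }
}
```

The detail worth noting is that both the logger and the handler have a level. Raising only one of them still hides the finer messages.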
you figured it out?
No, I just opened a new issue. It seems more related to the cleanup task.
Jenkins and plugins versions report
What Operating System are you using (both controller, and any agents involved in the problem)?
Controller: Ubuntu 22.04.3 LTS
Agents: Ubuntu 22.04.3 LTS
Agent: Windows Server 2019 Datacenter
Reproduction steps
Allow the instance to spin up VMs based on gallery images, then wait for them to disconnect.
Expected Results
The VMs stay up and complete the jobs that Jenkins tells them to build
Actual Results
VMs disconnect at random intervals, often while they are in the middle of building code.
Anything else?
This just started happening a couple of weeks ago. We are on the latest release of this plugin. I am unsure what to do. I am also seeing some messages in the Azure logs, but I am not sure who is to blame at this point.
I am seeing a lot of these messages in the Jenkins.log file:
On the Azure side I see things like this: