jenkinsci / azure-vm-agents-plugin

This repo is for azure vm agents plugin for jenkins. Azure devops CICD is the team which owns it for now
https://plugins.jenkins.io/azure-vm-agents/

Plugins cleanup actions is removing VMs that are in use and working. #481

Closed limeman40 closed 9 months ago

limeman40 commented 10 months ago

Jenkins and plugins versions report

Environment ```text Jenkins: 2.432 OS: Linux - 6.2.0-1016-azure Java: 11.0.20.1 - Ubuntu (OpenJDK 64-Bit Server VM) --- Office-365-Connector:4.20.2 ace-editor:1.1 ansible:285.v2f044b_eb_7a_3e ant:497.v94e7d9fffa_b_9 antisamy-markup-formatter:162.v0e6ec0fcfcf6 apache-httpcomponents-client-4-api:4.5.14-208.v438351942757 apache-httpcomponents-client-5-api:5.2.1-1.1 async-http-client:1.9.40.0 authentication-tokens:1.53.v1c90fd9191a_b_ azure-acs:1.0.4 azure-ad:433.v1982e2b_b_4a_fe azure-app-service:1.0.2 azure-artifact-manager:133.vf94ad3455cdc azure-cli:0.9 azure-commons:1.1.3 azure-container-agents:253.vd2f5cd5c5040 azure-container-registry-tasks:0.6.5 azure-credentials:293.vb_d506148f506 azure-credentials-ext:1.0 azure-function:0.3.3 azure-keyvault:228.va_31b_a_451e7d6 azure-sdk:157.v855da_0b_eb_dc2 azure-vm-agents:883.v63c930b_025dc azure-vmss:0.2.4 badge:1.9.1 bitbucket:223.vd12f2bca5430 blackduck-detect:9.0.0 block-queued-job:0.2.0 blueocean-bitbucket-pipeline:1.27.9 blueocean-commons:1.27.9 blueocean-core-js:1.27.9 blueocean-jwt:1.27.9 blueocean-pipeline-api-impl:1.27.9 blueocean-pipeline-scm-api:1.27.9 blueocean-rest:1.27.9 blueocean-rest-impl:1.27.9 blueocean-web:1.27.9 bootstrap4-api:4.6.0-6 bootstrap5-api:5.3.2-2 bouncycastle-api:2.29 branch-api:2.1135.v8de8e7899051 build-user-vars-plugin:1.9 caffeine-api:3.1.8-133.v17b_1ff2e0599 changes-since-last-success:0.6 checks-api:2.0.2 cloud-stats:320.v96b_65297a_4b_b_ cloudbees-bitbucket-branch-source:848.v42c6a_317eda_e cloudbees-folder:6.858.v898218f3609d command-launcher:107.v773860566e2e commons-httpclient3-api:3.1-3 commons-lang3-api:3.13.0-62.v7d18e55f51e2 commons-text-api:1.11.0-94.v3e1f4a_926e49 conditional-buildstep:1.4.3 config-file-provider:959.vcff671a_4518b_ copyartifact:722.v0662a_9b_e22a_c credentials:1309.v8835d63eb_d8a_ credentials-binding:642.v737c34dea_6c2 crx-content-package-deployer:1.9 data-tables-api:1.13.6-5 datadog:5.6.0 digitalocean-plugin:1.3.1 display-url-api:2.200.vb_9327d658781 
docker-commons:439.va_3cb_0a_6a_fb_29 docker-java-api:3.3.1-79.v20b_53427e041 durable-task:523.va_a_22cf15d5e0 echarts-api:5.4.0-7 envinject:2.908.v66a_774b_31d93 envinject-api:1.199.v3ce31253ed13 extended-read-permission:53.v6499940139e5 extensible-choice-parameter:1.8.1 external-monitor-job:215.v2e88e894db_f8 favorite:2.4.3 font-awesome-api:6.4.2-1 generic-webhook-trigger:1.88.0 git:5.2.1 git-client:4.5.0 git-parameter:0.9.19 git-server:99.va_0826a_b_cdfa_d github:1.37.3.1 github-api:1.316-451.v15738eef3414 github-branch-source:1741.va_3028eb_9fd21 github-pullrequest:0.5.0 gitlab-api:5.3.0-91.v1f9a_fda_d654f gitlab-branch-source:684.vea_fa_7c1e2fe3 google-metadata-plugin:0.5 google-oauth-plugin:1.318.vb_39c5db_e3041 gradle:2.9 handlebars:3.0.8 handy-uri-templates-2-api:2.1.8-22.v77d5b_75e6953 htmlpublisher:1.32 instance-identity:185.v303dc7c645f9 ionicons-api:56.v1b_1c8c49374e jackson2-api:2.15.3-372.v309620682326 jakarta-activation-api:2.0.1-3 jakarta-mail-api:2.0.1-3 javadoc:243.vb_b_503b_b_45537 javax-activation-api:1.2.0-6 javax-mail-api:1.6.2-9 jaxb:2.3.9-1 jdk-tool:73.vddf737284550 jenkins-design-language:1.27.9 jersey2-api:2.41-133.va_03323b_a_1396 jjwt-api:0.11.5-77.v646c772fddb_0 jnr-posix-api:3.1.18-1 jobConfigHistory:1229.v3039470161a_d jquery:1.12.4-1 jquery-detached:1.2.1 jquery3-api:3.7.1-1 jsch:0.2.8-65.v052c39de79b_2 junit:1240.vf9529b_881428 kubernetes-cd:2.3.1 kubernetes-client-api:6.8.1-224.vd388fca_4db_3b_ kubernetes-credentials:0.11 label-linked-jobs:6.0.1 ldap:711.vb_d1a_491714dc lockable-resources:1185.v0c528656ce04 mailer:463.vedf8358e006b_ mapdb-api:1.0.9-28.vf251ce40855d matrix-auth:3.2.1 matrix-project:818.v7eb_e657db_924 maven-plugin:3.23 mercurial:1260.vdfb_723cdcc81 metrics:4.2.18-442.v02e107157925 mina-sshd-api-common:2.11.0-86.v836f585d47fa_ mina-sshd-api-core:2.11.0-86.v836f585d47fa_ momentjs:1.1.1 msbuild:1.30 nexus-jenkins-plugin:3.16.510.v4d23e22cf563 node-iterator-api:55.v3b_77d4032326 node-sharing-executor:2.0.8 
oauth-credentials:0.646.v02b_66dc03d2e okhttp-api:4.11.0-157.v6852a_a_fa_ec11 pam-auth:1.10 pipeline-build-step:516.v8ee60a_81c5b_9 pipeline-graph-analysis:202.va_d268e64deb_3 pipeline-groovy-lib:689.veec561a_dee13 pipeline-input-step:477.v339683a_8d55e pipeline-milestone-step:111.v449306f708b_7 pipeline-model-api:2.2151.ve32c9d209a_3f pipeline-model-definition:2.2151.ve32c9d209a_3f pipeline-model-extensions:2.2151.ve32c9d209a_3f pipeline-rest-api:2.34 pipeline-stage-step:305.ve96d0205c1c6 pipeline-stage-tags-metadata:2.2151.ve32c9d209a_3f pipeline-stage-view:2.34 pipeline-utility-steps:2.16.0 plain-credentials:143.v1b_df8b_d3b_e48 plugin-util-api:3.6.0 popper-api:1.16.1-3 popper2-api:2.11.6-4 powershell:2.1 promoted-builds:936.va_571a_a_b_f8da_5 pubsub-light:1.18 rebuild:320.v5a_0933a_e7d61 resource-disposer:0.23 run-condition:1.7 saml:4.429.v9a_781a_61f1da_ scm-api:683.vb_16722fb_b_80b_ script-security:1275.v23895f409fb_d service-fabric:1.6 shelve-project-plugin:3.2 snakeyaml-api:2.2-111.vc6598e30cc65 ssh:2.6.1 ssh-agent:346.vda_a_c4f2c8e50 ssh-credentials:308.ve4497b_ccd8f4 ssh-slaves:2.916.vd17b_43357ce4 ssh2easy:1.6 sshd:3.312.v1c601b_c83b_0e stashNotifier:1.439.v202358346a_7d strict-crumb-issuer:2.1.1 structs:325.vcb_307d2a_2782 synopsys-coverity:3.0.3 thinBackup:1.18 timestamper:1.26 token-macro:384.vf35b_f26814ec trilead-api:2.84.v72119de229b_7 uno-choice:2.8.1 variant:60.v7290fc0eb_b_cd windows-azure-storage:386.v673495b0a5de windows-slaves:1.8.1 workflow-aggregator:596.v8c21c963d92d workflow-api:1283.v99c10937efcb_ workflow-basic-steps:1042.ve7b_140c4a_e0c workflow-cps:3806.va_3a_6988277b_2 workflow-cps-global-lib:609.vd95673f149b_b workflow-durable-task-step:1289.v4d3e7b_01546b_ workflow-job:1360.vc6700e3136f5 workflow-multibranch:756.v891d88f2cd46 workflow-scm-step:415.v434365564324 workflow-step-api:639.v6eca_cd8c04a_a_ workflow-support:865.v43e78cc44e0d ws-cleanup:0.45 ```

What Operating System are you using (both controller, and any agents involved in the problem)?

Controller: Ubuntu 22.04.3 LTS; Agents: Windows Server 2019 Datacenter and Ubuntu 22.04.3 LTS

Reproduction steps

  1. Have the plugin spin up any VM gallery image using any Idle Retention Strategy
  2. Wait about 2 hours and the cleanup task will remove the VM for no reason
  3. I have to watch Jenkins and the pull request runs from the multibranch pipeline and stop/restart them as the VM agents drop offline during builds.

Expected Results

The VMs stay connected for as long as the chosen Idle Retention Strategy specifies.

Actual Results

VMs disconnect at various intervals and I have to babysit builds all day, which is not a great use of my time.

Anything else?

Things have gotten better after I switched to "Azure VM Idle Retention Strategy" from "Azure VM Pool Retention Strategy (Experimental)"

Previously it was happening more often, roughly every 20 to 30 minutes; now it happens about every 2 hours.

I opened a support case with Microsoft and they looked at the back end to see what was going on. I put a delete lock on the resource group so they could see what is trying to delete the VMs. It was the application we set up for this integration: the app ID lines up with what the support engineer is seeing in the logs.

This used to work fine but now it has been broken for a while. One other thing to note is that I had to restore the Jenkins controller VM from a snapshot after some Jenkins files were deleted. I am not sure how that would cause this to happen, though. I think it is more likely a new bug in this plugin.

limeman40 commented 10 months ago

I hope someone will look into this. This issue is causing me to have to babysit Jenkins builds all day.

limeman40 commented 10 months ago

Can you please tell me how the cleanup process works? I am seeing this in the activity logs:

"properties": { "title": "Down: Virtual machine has been unavailable for 15 minutes", "details": "Unknown", "currentHealthStatus": "Unavailable", "previousHealthStatus": "Unavailable", "type": "Downtime", "cause": "UserInitiated"

I feel like there is some timeout of 15 minutes and it is cleaning up VMs that are still in use.

timja commented 10 months ago

These are the three tasks that are run: https://github.com/search?q=repo%3Ajenkinsci%2Fazure-vm-agents-plugin%20AsyncPeriodic&type=code

Code shouldn't be too hard to follow, the first and third will be the most interesting I think.

limeman40 commented 10 months ago

Sure, but do you have any ideas what could cause this race condition?

My co-worker had this thought: “it wouldn't be the quotas, I'm wondering if we are hitting the cap set in the plugin”

Are there any caps set in the plugin? I feel like without more guidance I am looking for a needle in a haystack.

I do have some Java experience but it has been a bit since I coded anything in it.

limeman40 commented 10 months ago

It is definitely something with the plugin. I have a ticket open with Azure on this issue. They have seen in the logs that the application ID asking for the VM to be deleted is the app ID we set up for this integration into Azure. I may be able to look into this next week, but it has been a while since I have done anything in Java and I am unsure of what I will be able to work out.

limeman40 commented 10 months ago

I still have no idea on this. It seems to be some kind of race condition, something in the cleanup part of the code. VMs typically only stay connected for about 2 hours and then things go sideways.

I am surprised nobody else has run into this issue.

timja commented 10 months ago

Unsure, we would sometimes have ours up for many hours and definitely don't hit an issue like this

limeman40 commented 10 months ago

The only way I have been able to correlate anything is that I see messages in the Azure activity log health events saying the VM was unavailable for 10 to 15 minutes.

However, I am not seeing anything in the plugin logs saying it will remove that VM, so I am not sure what is happening. Is there anything besides the cleanup functions that could cause this?

It is a shame. We have used this plugin for probably 2 years without issue and now all of a sudden there is some problem.

I have even tried to pull the plugin out completely and put it back and the issue persists.
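For the missing plugin log output described above, one option is to raise the log level for the plugin's package. In Jenkins this is normally done with a log recorder under Manage Jenkins > System Log. The sketch below shows the same idea with plain `java.util.logging`, assuming the plugin logs through `java.util.logging` under the package name `com.microsoft.azure.vmagent` (taken from the source paths linked in this thread). The child logger name `HypotheticalCleanUpTask` and the log messages are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

public class PluginLogLevelSketch {
    // Package name taken from the plugin's source paths; adjust if it differs.
    static final String PLUGIN_LOGGER_NAME = "com.microsoft.azure.vmagent";

    // Keep a strong reference: LogManager only holds loggers weakly.
    static final Logger PLUGIN_LOGGER = Logger.getLogger(PLUGIN_LOGGER_NAME);

    /** In-memory handler so the sketch can show which records get through. */
    public static final class CollectingHandler extends Handler {
        public final List<LogRecord> records = new ArrayList<>();

        @Override public void publish(LogRecord record) {
            if (isLoggable(record)) {
                records.add(record);
            }
        }
        @Override public void flush() { }
        @Override public void close() { }
    }

    /** Lowers the threshold on the plugin's parent logger and attaches a handler. */
    public static CollectingHandler enableFineLogging() {
        CollectingHandler handler = new CollectingHandler();
        handler.setLevel(Level.ALL);
        PLUGIN_LOGGER.setLevel(Level.FINE);
        PLUGIN_LOGGER.addHandler(handler);
        return handler;
    }

    public static void main(String[] args) {
        CollectingHandler handler = enableFineLogging();
        // Child loggers inherit the FINE level from the parent configured above.
        Logger child = Logger.getLogger(PLUGIN_LOGGER_NAME + ".HypotheticalCleanUpTask");
        child.fine("would delete VM: example-vm"); // recorded
        child.finest("too detailed");              // filtered out by the FINE threshold
        System.out.println(handler.records.size() + " record(s) captured");
    }
}
```

If a deletion really is initiated by the plugin, a FINE or lower level on that package is where a matching message would most likely show up, though that depends on what the plugin actually logs.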

timja commented 10 months ago

You could maybe add logging here: https://github.com/jenkinsci/azure-vm-agents-plugin/blob/d35d6b366b733b5475a354d0f39815051bbecf04/src/main/java/com/microsoft/azure/vmagent/remote/AzureVMAgentSSHLauncher.java#L247-L250

to see why it's closing.

Is there anything in the agent log (may be hard to get)?


Moon shot but maybe an inbound agent would work better? They should be more resilient.

limeman40 commented 10 months ago

Can you give me a few more details on how I would add logging to this section?

Could you give me an example of what this would look like code-wise? I guess I can look up how to create an HPI out of my changes.

Also, are you suggesting I use a JNLP connection instead? I can try it tomorrow and see if I have better luck. I am just making sure I understand what you are suggesting.

timja commented 10 months ago

> Can you give me a few more details on how I would add logging to this section?
>
> Could you give me an example of what this would look like code wise? I guess I can lookup how to create an HPI out of my changes.

Add something similar to https://github.com/jenkinsci/azure-vm-agents-plugin/blob/d35d6b366b733b5475a354d0f39815051bbecf04/src/main/java/com/microsoft/azure/vmagent/remote/AzureVMAgentSSHLauncher.java#L261C13-L261C89 above line 248, then run `mvn clean install -P quick-build`. The HPI will be in the `target` directory.
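As a rough idea of what such a log statement could look like, here is a minimal, self-contained sketch. It does not reproduce the linked launcher code: the `Session` interface, the `closeWithLogging` helper, and the message format are all hypothetical stand-ins. The only real API used is `java.util.logging`. Passing a `Throwable` to the logger prints the call stack, which shows which code path asked for the close.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class CloseLoggingSketch {
    private static final Logger LOGGER = Logger.getLogger(CloseLoggingSketch.class.getName());

    /** Hypothetical stand-in for the SSH session that the launcher closes. */
    public interface Session {
        boolean isConnected();
        void disconnect();
    }

    /**
     * Logs why and from where a disconnect was requested, then performs it.
     * Returns the logged message so callers can inspect it.
     */
    public static String closeWithLogging(Session session, String agentName, String reason) {
        String message = String.format("Closing session for agent %s (connected=%s): %s",
                agentName, session.isConnected(), reason);
        // The Throwable is never thrown; it only carries the stack trace into the log.
        LOGGER.log(Level.INFO, message, new Throwable("close requested here"));
        session.disconnect();
        return message;
    }

    public static void main(String[] args) {
        Session fake = new Session() {
            private boolean connected = true;
            @Override public boolean isConnected() { return connected; }
            @Override public void disconnect() { connected = false; }
        };
        closeWithLogging(fake, "example-agent", "launcher finished");
    }
}
```

In the plugin itself the equivalent would be a single `LOGGER.log(...)` call placed just before the existing close, using whatever agent and session variables are in scope there.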

> Also are you suggesting I use a JNLP connection instead? I can try it tomorrow and see if I have better luck just making sure I understand what you are suggesting.

Yes, JNLP although it has been renamed to Inbound agent. Example init scripts are in these folders: https://github.com/jenkinsci/azure-vm-agents-plugin/tree/master/docs/init-scripts

limeman40 commented 10 months ago

Sorry, I am a little confused about the inbound agent. I would think that if you tell it to use an inbound agent, it would just use that init script to set it up.

I will give both things a try today and see if I can get more details on this issue.

timja commented 10 months ago

> Sorry little confused on inbound agent I would think if you tell it you want it to use inbound agent it would like just use that init script to set it up.
>
> I will give both things a try to day and see if I can gain more details on this issue.

Maybe it could. Currently it's set up to be quite flexible so you can configure the agent however you like, and an example is given to make it easy for you to set up.

limeman40 commented 9 months ago

Please correct me if I am wrong but I would think selecting this option would automatically have it just use the PS1 scripts to connect right? If you tell it to use SSH it does all that script init stuff for you:

Screenshot from 2023-11-30 14-16-59

Is this not the case if I choose this option? I am a little confused and need more details on how to properly set up inbound connections for this.

timja commented 9 months ago

No, inbound is more complicated, as Jenkins doesn't reach out to your agent at all. The help for the launch method should explain it more.

The init script is uploaded to a storage account and then run on VM startup. Either that script or something in the VM image needs to do things like provide the remoting jar file and a service.

Are you using Windows agents btw? (Just from looking at that screenshot.) I don't have much experience with them, although the Jenkins project does use them quite a lot without this issue as far as I know. I think they use them as 'one-shot' agents, though, and don't do multiple builds on them.

limeman40 commented 9 months ago

I have both Windows and Linux agents. Most of my stuff runs on Windows.

I had another question. I chose "Idle Retention Strategy" per your suggestion. However, even if I set the timeout to 0, the VMs only last around 2 hours and still get removed.

Is it also possible there is something wrong with the way this timeout is being set in the code?

Currently we are using a mix of Windows and Linux agents, all set up via SSH connection. All was working fine for 2 years; now all of a sudden this issue has come up and I cannot for the life of me figure out why.

I have looked through the Activity logs in Azure as well as the various logs in Jenkins, and they are not showing me much as to why this happens.

If you have any other ideas please let me know; I am at a loss right now. It is also getting tiresome having to babysit the Jenkins server all day long.

timja commented 9 months ago

0 will mean it won't go idle: https://github.com/jenkinsci/azure-vm-agents-plugin/blob/master/src/main/java/com/microsoft/azure/vmagent/AzureVMCloudRetensionStrategy.java#L77

You would see this log line anyway: https://github.com/jenkinsci/azure-vm-agents-plugin/blob/master/src/main/java/com/microsoft/azure/vmagent/AzureVMCloudRetensionStrategy.java#L88-L89
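To make the "0 means it won't go idle" behavior concrete, here is a minimal sketch of an idle-retention check. It is an assumption about the general shape of such a check, not a copy of the linked `AzureVMCloudRetensionStrategy` code; the class name, method name, and parameters are hypothetical.

```java
import java.time.Duration;

public class IdleRetentionSketch {
    /**
     * Returns true when an idle agent has exceeded its retention time.
     * A retention time of zero is treated as "never retire for idleness",
     * matching the behavior described in this thread.
     */
    public static boolean shouldRetire(Duration retention, Duration idleFor, boolean isIdle) {
        if (retention.isZero()) {
            return false; // 0 disables idle-based retirement
        }
        if (!isIdle) {
            return false; // busy agents are never retired by this check
        }
        return idleFor.compareTo(retention) > 0;
    }

    public static void main(String[] args) {
        // Retention 0, idle for 3 hours: not retired.
        System.out.println(shouldRetire(Duration.ZERO, Duration.ofHours(3), true));
        // Retention 60 minutes, idle for 61 minutes: retired.
        System.out.println(shouldRetire(Duration.ofMinutes(60), Duration.ofMinutes(61), true));
        // Retention 48 hours, idle for 2 hours: not retired.
        System.out.println(shouldRetire(Duration.ofHours(48), Duration.ofHours(2), true));
    }
}
```

Under logic of this shape, a VM removed after about 2 hours with a retention of 0 or 48 hours would not be explained by the idle check alone, which is consistent with the retention log line not appearing.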

You should be able to get help from MS Support on this as these are officially supported (for another 2 months anyway)

limeman40 commented 9 months ago

As I have previously stated, I have a support ticket open. They have looked through the logs on their end and they are saying the Enterprise Application ID that is asking for the VM deletions is the one we have set up as the service principal for this plugin.

The support person has agreed to have a Teams meeting with me to discuss this issue.

You are correct, it is curious to me that I am not seeing anything in the plugin's logs in Jenkins saying it is going to clean up a VM. So I am not totally believing what support is saying.

I am going to try to get an inbound agent configuration working today and see if it performs any better. I have not gotten around to putting a try/catch around the function call. I will see if I can do this today.

limeman40 commented 9 months ago

I tried everything to get this inbound connection set up and nothing works.

Would it be possible for you to try the steps on your end, figure out what they are, and report back? Nothing I am trying is working and I am starting to get very frustrated with this issue.

Also, what about issue #484? It seems possibly related to my issue.

timja commented 9 months ago

Have you tried logging into the VM that's connecting to the Jenkins controller as an inbound agent? Linux agents log to this path: https://github.com/jenkinsci/azure-vm-agents-plugin/blob/master/docs/init-scripts/linux-inbound-agent.sh#L38

Basically you:

  1. set the VM template to inbound agent
  2. Configure an init script with this: https://raw.githubusercontent.com/jenkinsci/azure-vm-agents-plugin/master/docs/init-scripts/linux-inbound-agent.sh

Then log in to the agent before it gets deleted and check the logs for any issues.


I don't think #484 is related, but I'm unsure without looking closer.

limeman40 commented 9 months ago

@timja Not sure if you saw my note a couple weeks back.

We had an issue where I deleted some files needed for Jenkins, so I restored the Jenkins controller from an Azure snapshot. It all seems to work fine.

However, could this have caused any issues with this plugin? I have even tried to pull out the plugin completely and add it back in. I am just curious whether this could have any bearing on the issue I am facing.

Just trying to go over all avenues to try to figure out what is going on here.

timja commented 9 months ago

Not really sure, maybe another plugin or library could be conflicting.

You could try updating all plugins / removing some, only a guess though based on never having seen this before

limeman40 commented 9 months ago

I tried to reinstall each plugin via the HPI file, and that seems not to have done anything. This evening I see this error in the log, which is perplexing to me:

[a1be7590-9, L:/10.188.0.7:41090 - R:management.azure.com/4.150.241.10:443] The connection observed an error, the request cannot be retried as the headers/body were sent io.netty.channel.unix.Errors$NativeIoException: recvAddress(..) failed: Connection reset by peer

My setup uses NGINX as a reverse proxy, so I am wondering if this is an error from NGINX.

limeman40 commented 9 months ago

I am still really confused why the plugin would tell it to spin down a VM when it has not even reached what I have set for the time the VM should stay up:

"properties": { "title": "Stopping and deallocating", "details": "This virtual machine is stopped and deallocated as requested by an authorized user or process.", "currentHealthStatus": "Available", "previousHealthStatus": "Available", "type": "Downtime", "cause": "UserInitiated" }, "relatedEvents": [] }

You can see here I am telling it to keep the VMs up for 48 hours, yet it removes them anyway:

Screenshot from 2023-12-08 16-13-56

From my research and the ticket I have open with Azure support, it really seems like it is the plugin telling Azure to spin down these VMs even though I have told them to stay up for 2 days. I would really like some help figuring this out.

limeman40 commented 9 months ago

Forget it, we are just going to stop using this plugin.