microsoft / azure-pipelines-agent

Azure Pipelines Agent 🚀

[BUG] VMSS extension Microsoft.Azure.DevOps.Pipelines.Agent fails with Permission denied ./env.sh #4699

Open wkostn opened 3 months ago

wkostn commented 3 months ago

Describe your question

We are running into an issue on our self-hosted VMSS agents running an Ubuntu 22.04 CIS image. The extension Microsoft.Azure.DevOps.Pipelines.Agent no longer works since release v3.236.1.

We already tried changing the permissions on the ./agent folder in the bootcmd, without any luck.
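
For context, a minimal sketch of what such a bootcmd tweak might look like (illustrative only; the exact commands, paths, and ownership below are assumptions, and the AzDevOps account may not exist yet at first boot):

    # illustrative only: bootcmd-style commands to pre-create the agent folder with open permissions
    mkdir -p /agent
    chmod -R 755 /agent
    # the AzDevOps account is normally created later by the extension, so this may be a no-op at first boot
    chown -R AzDevOps:AzDevOps /agent || true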

VM has reported a failure when processing extension 'Microsoft.Azure.DevOps.Pipelines.Agent' (publisher 'Microsoft.VisualStudio.Services' and type 'TeamServicesAgentLinux'). Error message: 'The Extension failed to execute: Pipeline script execution failed with exit code 100.

    2024-03-11 09:49:43 version 15
    2024-03-11 09:49:43 Url is https://XXXX.visualstudio.com/
    2024-03-11 09:49:43 Pool is xxx
    2024-03-11 09:49:43 RunArgs is
    2024-03-11 09:49:43 Directory is /agent
    2024-03-11 09:49:43 Creating AzDevOps account
    2024-03-11 09:49:44 Giving AzDevOps user access to the '/home' directory
    2024-03-11 09:49:44 Zipfile is /agent/vsts-agent-linux-x64-3.236.1.tar.gz
    2024-03-11 09:49:44 Unzipping agent
    2024-03-11 09:49:47
    2024-03-11 09:49:48 Installing dependencies
    ++ id -u
    + user_id=0
    + '[' 0 -ne 0 ']'
    + '[' -e /etc/os-release ']'
    + filepath=/etc/os-release
    + '[' -e /etc/os-release ']'
    + echo '--------OS Information--------'
    + cat /etc/os-release
    + echo ------------------------------
    + '[' -e /etc/debian_version ']'
    + echo 'The current OS is Debian based'
    + echo '--------Debian Version--------'
    + cat /etc/debian_version
    + echo ------------------------------
    + command -v apt
    + '[' 0 -eq 0 ']'
    + apt update
    WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
    + apt install -y libkrb5-3 zlib1g debsums
    WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
    debconf: unable to initialize frontend: Dialog
    debconf: (TERM is not set, so the dialog frontend is not usable.)
    debconf: falling back to frontend: Readline
    debconf: unable to initialize frontend: Readline
    debconf: (This frontend requires a controlling tty.)
    debconf: falling back to frontend: Teletype
    dpkg-preconfigure: unable to re-open stdin:
    + apt install -y liblttng-ust1
    WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
    debconf: unable to initialize frontend: Dialog
    debconf: (TERM is not set, so the dialog frontend is not usable.)
    debconf: falling back to frontend: Readline
    debconf: unable to initialize frontend: Readline
    debconf: (This frontend requires a controlling tty.)
    debconf: falling back to frontend: Teletype
    dpkg-preconfigure: unable to re-open stdin:
    + '[' 0 -ne 0 ']'
    + apt install -y libssl3
    WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
    debconf: unable to initialize frontend: Dialog
    debconf: (TERM is not set, so the dialog frontend is not usable.)
    debconf: falling back to frontend: Readline
    debconf: unable to initialize frontend: Readline
    debconf: (This frontend requires a controlling tty.)
    debconf: falling back to frontend: Teletype
    dpkg-preconfigure: unable to re-open stdin:
    + '[' 0 -ne 0 ']'
    + apt install -y libicu70
    WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
    + '[' 0 -ne 0 ']'
    + echo -----------------------------
    + echo ' Finish Install Dependencies'
    + echo -----------------------------
    2024-03-11 09:50:13 Dependencies installation succeeded
    2024-03-11 09:50:13 Configuring build agent
    2024-03-11 09:50:13 Configuring agent
    2024-03-11 09:50:14 touch: cannot touch '.env': Permission denied
    ./env.sh: line 40: .path: Permission denied
    ./env.sh: line 35: .env: Permission denied
    Unhandled exception. System.UnauthorizedAccessException: Access to the path '/agent/_diag' is denied.
     ---> System.IO.IOException: Permission denied
     --- End of inner exception stack trace ---
       at System.IO.FileSystem.CreateDirectory(String fullPath)
       at System.IO.Directory.CreateDirectory(String path)
       at Microsoft.VisualStudio.Services.Agent.HostTraceListener..ctor(String logFileDirectory, String logFilePrefix, Int32 pageSizeLimit, Int32 retentionDays) in /mnt/vss/_work/1/s/src/Microsoft.VisualStudio.Services.Agent/HostTraceListener.cs:line 35
       at Microsoft.VisualStudio.Services.Agent.HostContext..ctor(HostType hostType, String logFile) in /mnt/vss/_work/1/s/src/Microsoft.VisualStudio.Services.Agent/HostContext.cs:line 135
       at Microsoft.VisualStudio.Services.Agent.Listener.Program.Main(String[] args) in /mnt/vss/_work/1/s/src/Agent.Listener/Program.cs:line 28
    /agent/config.sh: line 93: 3233 Aborted ./bin/Agent.Listener configure "$@"
    2024-03-11 09:50:14 Build agent configuration failed'

More information on troubleshooting is available at https://aka.ms/vmextensionlinuxtroubleshoot.

Bootcmd: a fix to make sure cloud-final.service has run before the agent reports the instance status as up and running.

Versions

Environment type (Please select at least one environment where you face this issue)

Azure DevOps Server type

dev.azure.com (formerly visualstudio.com)

Operating system

Ubuntu 22.04

Version control system

Azure DevOps

Azure DevOps Server Version (if applicable)

No response

revaido commented 3 months ago

Just adding that we seem to be seeing the same problem with our VMSS agents, which are also built from CIS Level 1 images, although we're using RHEL 8 per the following spec:

                "publisher": "center-for-internet-security-inc",
                "offer": "cis-rhel",
                "sku": "cis-redhat8-l1-gen1",
                "version": "latest"

Looking back at our logs, the errors began last Tuesday at 15:16; it wasn't picked up sooner here because we still had 2 working agents.

Otherwise it's pretty much the exact same config as you're using above.

After running the three commands mentioned earlier, I can confirm that the agent comes online here as well (just make sure you run them via sudo).

Thank you for the fix commands above; hopefully the next agent version will resolve the issue we've seen here.

revaido commented 3 months ago

Just to add: I updated our VMSS to use the previous version of the CIS L1 RHEL 8 image that has been stable for us (3.0.2), and the same error occurs when it tries to deploy the agent. So this is definitely something wrong in the latest agent version, as we see the same behaviour on image versions 3.0.2 and 3.0.3 (latest).

JeffreyH89 commented 3 months ago

We are experiencing the same issue on an Ubuntu 22.04 LTS image. Image details: "name": "cis-ubuntu-linux-2204-l1", "product": "cis-ubuntu-linux-2204-l1", "publisher": "center-for-internet-security-inc"

DmitriiBobreshev commented 3 months ago

Hi @wkosten1982, thank you for the feedback. It's a known issue which is currently being investigated. We'll try to keep you up to date!

JonRlofty commented 3 months ago

We have seen this issue over the last week on all of our Ubuntu based VMSS.

After investigation we were able to narrow down the cause on our systems to a failure in creating an AzDevOps group. The enableagent.sh script runs:

chown -R AzDevOps:AzDevOps $dir

It appears to fail silently (nothing in the script log). Thus the AzDevOps user doesn't have permissions to modify the agent folder.

We were beginning to test mitigating this with a Custom Script extension that creates the group before the Pipelines.Agent extension runs; a rough sketch of that idea is below.
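
A sketch of what such a pre-creation script might look like (illustrative only, not the script we actually tested; the account name and /agent path are assumptions based on the log above):

    #!/bin/bash
    # Hypothetical Custom Script Extension payload: make sure the AzDevOps user and
    # group exist and own the agent directory before the Pipelines.Agent extension runs.
    set -euo pipefail
    getent group AzDevOps >/dev/null || groupadd AzDevOps
    getent passwd AzDevOps >/dev/null || useradd -m -g AzDevOps AzDevOps
    mkdir -p /agent
    chown -R AzDevOps:AzDevOps /agent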

It appears however that Microsoft have rolled back the associated version of the tarball to 3.236.0 and this issue has gone away.

wkostn commented 3 months ago

@JonRlofty thanks for the update! I have just deployed a new version to see whether the previous agent version gets used. The failing tarball it was using was https://vstsagentpackage.azureedge.net/agent/3.236.1/vsts-agent-linux-x64-3.236.1.tar.gz

JonRlofty commented 3 months ago

Yep, at about midday today our extension rolled back to 3.236.0 and all our issues went away

revaido commented 3 months ago

Thanks for that update Jon - I'm guessing UK South looking at your naming convention?

We're still getting https://vstsagentpackage.azureedge.net/agent/3.236.1/vsts-agent-linux-x64-3.236.1.tar.gz over in West Europe, but fingers crossed that if the change is still replicating out we'll pick it up soon too!

merlynomsft commented 3 months ago

Thank you for the reports. We updated the configuration so the extension will reference https://vstsagentpackage.azureedge.net/agent/3.236.0/vsts-agent-linux-x64-3.236.0.tar.gz across all instances. The change is being propagated and should reach all hosts within 12 hours. We are working on isolating the root cause for this issue and will provide updates as they are available. Please let us know if you continue to experience issues.

wkostn commented 3 months ago

@merlynomsft, thank you for this feedback.

The updated configuration has not propagated to our environment (yet), but I will retry by uninstalling the extension from the VMSS every hour. That should be sufficient, right?
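
For anyone scripting that retry, a minimal sketch using the Azure CLI (assuming the extension instance is literally named Microsoft.Azure.DevOps.Pipelines.Agent; the resource group and VMSS names are placeholders):

    # remove the extension so it gets re-applied with the updated settings
    az vmss extension delete \
      --resource-group <resource-group> \
      --vmss-name <vmss-name> \
      --name Microsoft.Azure.DevOps.Pipelines.Agent
    # push the updated model out to the existing instances
    az vmss update-instances \
      --resource-group <resource-group> \
      --name <vmss-name> \
      --instance-ids "*"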

JeffreyH89 commented 3 months ago

Hello,

Thank you for the update. This would be a good time to reconsider whether pre-release versions should be used for the Azure DevOps agent extension. As customers, we have no influence over which agent version is used, and we've had unusable Linux pools since last week.

wkostn commented 3 months ago

> Thank you for the reports. We updated the configuration so the extension will reference https://vstsagentpackage.azureedge.net/agent/3.236.0/vsts-agent-linux-x64-3.236.0.tar.gz across all instances. The change is being propagated and should reach all hosts within 12 hours. We are working on isolating the root cause for this issue and will provide updates as they are available. Please let us know if you continue to experience issues.

I redeployed a new scale set several times, but it is still using the latest version and failing, even just now. Is there some kind of caching that is causing this problem, @merlynomsft?

revaido commented 3 months ago

Same here in the West Europe region: the Microsoft.Azure.DevOps.Pipelines.Agent extension is still pointing to https://vstsagentpackage.azureedge.net/agent/3.236.1/vsts-agent-linux-x64-3.236.1.tar.gz for now, and new VMSS build agents are all failing to deploy as a result.

Maybe add some logic where, after three consecutive failures, a fallback URL pointing to the previous agent tarball version is attempted (or even make the target URL/version user-definable so we can change it ourselves if need be); a rough sketch of the idea is below.
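
A simplified sketch of that fallback idea applied to the tarball download (not the actual extension code; the URLs are the current and previous tarballs mentioned in this thread):

    # hypothetical fallback around the tarball download in the extension script
    primary="https://vstsagentpackage.azureedge.net/agent/3.236.1/vsts-agent-linux-x64-3.236.1.tar.gz"
    fallback="https://vstsagentpackage.azureedge.net/agent/3.236.0/vsts-agent-linux-x64-3.236.0.tar.gz"
    for attempt in 1 2 3; do
        curl -fsSL -o /agent/agent.tar.gz "$primary" && break
        echo "Download attempt $attempt failed"
        sleep 10
    done
    # after three consecutive failures, try the previous known-good version instead
    [ -s /agent/agent.tar.gz ] || curl -fsSL -o /agent/agent.tar.gz "$fallback"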

Without the manual intervention using the commands @wkostn posted, we'd be left with no build agents available at all.

JonRlofty commented 3 months ago

I am assuming that this rollback isn't controlled by region but by Azure DevOps organisation. We have different organisations in the same tenant and region that are seeing different versions of the VSTS agent being applied.

revaido commented 3 months ago

@merlynomsft We're still seeing v3.236.1 in the extension configs here unfortunately, so no joy with our VMSS just yet.

JeffreyH89 commented 3 months ago

Good morning,

The change still hasn't rolled out to our organization (West Europe). Is any ETA available?

Thank you

revaido commented 3 months ago

@merlynomsft As JeffreyH89 says above, we're still seeing the problem here in West Europe, with the extension still pointing to https://vstsagentpackage.azureedge.net/agent/3.236.1/vsts-agent-linux-x64-3.236.1.tar.gz

wkostn commented 3 months ago

As a temporary workaround you can run the following script to patch the extension version. Note that this is only a temporary solution, as the extension configuration will be overwritten after a few hours. But if you keep a few agents on standby/idle, they won't restart and will keep running with this extension version.

Script to run in Cloud Shell:

$ss = Get-AzResource -Id "<vmss resource id>" -ExpandProperties
$ss.Properties.virtualMachineProfile.extensionProfile.extensions.properties.settings[0].agentDownloadUrl = 'https://vstsagentpackage.azureedge.net/agent/3.236.0/vsts-agent-linux-x64-3.236.0.tar.gz'
$ss | Set-AzResource
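
For anyone who prefers the Azure CLI over Az PowerShell, something along these lines should be roughly equivalent (a sketch only; the extension index and exact property path are assumptions and may differ if the scale set has more than one extension):

    # patch agentDownloadUrl in the VMSS model via a generic resource update
    az resource update \
      --ids "<vmss resource id>" \
      --set "properties.virtualMachineProfile.extensionProfile.extensions[0].properties.settings.agentDownloadUrl=https://vstsagentpackage.azureedge.net/agent/3.236.0/vsts-agent-linux-x64-3.236.0.tar.gz"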

revaido commented 3 months ago

That's brilliant, thank you. I'd assumed that was a read-only setting, so that's great news. That comment needs a bigger thumbs-up button @wkostn :)

We've put in a support request as well to see what comes back via that route, but that's really helpful, thank you. I've made a note of this in case it ever happens again.

In case anyone else tries this and gets "the operation failed because the resource is in the Failed state": just delete the failed instances and re-run the script quickly, and it should update the setting.

merlynomsft commented 3 months ago

The mitigation to pin 3.236.0 has been rolled out to all affected organizations, and you should no longer see the configuration overwritten if you apply the mitigation above.

We are very sorry for the delay and for this issue in the first place. We are monitoring usage of 3.236.0 vs 3.236.1 across all hosts to ensure the mitigation is fully applied. The delay in the mitigation was due to a mismatch between our configuration store and the code that reads the agent URL setting; we have filed a bug and will fix that soon so future mitigations roll out faster. We are continuing to troubleshoot and root-cause the failure behind this issue.

Thank you so much @wkostn for posting the workaround and everyone else for the reports and confirmations. Please continue to post with any issues.

merlynomsft commented 3 months ago

We wanted to provide you with an update on this issue. We are updating enableagent.sh (the script that the Azure DevOps VMSS extension runs on the VM to configure the agent) to be more robust and to add additional tracing. The updated script will better accommodate customized images and Custom Script Extensions. While we are still looking into what triggered this issue with the Agent 3.236.1 deployment, we believe the improvements to the enableagent.sh script will prevent this issue going forward.

Examples of updates we’re making to enableagent.sh:
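
As a purely illustrative sketch (the concrete changes aren't listed in this thread, so the commands below are assumptions about what "more robust, with additional tracing" typically looks like):

    # hypothetical hardening of the account-creation/ownership steps, with explicit tracing
    set -euo pipefail
    dir=/agent   # assumed agent directory, as seen in the logs above
    log() { echo "$(date '+%Y-%m-%d %H:%M:%S') $*"; }

    log "Ensuring AzDevOps group and user exist"
    getent group AzDevOps >/dev/null || groupadd AzDevOps
    getent passwd AzDevOps >/dev/null || useradd -m -g AzDevOps AzDevOps

    log "Setting ownership on $dir"
    if ! chown -R AzDevOps:AzDevOps "$dir"; then
        log "ERROR: chown on $dir failed"
        exit 1
    fi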

Thank you again -- we will keep you posted as we make progress.

revaido commented 2 months ago

Morning @merlynomsft, I hope you're well.

Looking at the release history, I noticed that v3.238.0 was made the latest release a couple of weeks ago, but our VMSS agents are still pointing to v3.236.0 after the mitigation for the previous v3.236.1 issue, and so are coming up on v3.236.0.

Do we need to do anything manually on our end to update the config, or should the new release be pushed to our extension configs in Azure automatically soon?

With thanks in advance

darrenhull commented 1 month ago

I am still getting this issue with version 3.239.1 on a Linux Ubuntu CIS Level 1 image. Using the above workaround to roll back worked.

pranavkarthik223 commented 5 days ago

Hello @darrenhull. We are facing the same issue with 3.240.1. Could you please share the previous version that doesn't have this error, so that we can roll back using the workaround?

darrenhull commented 3 days ago

> Hello @darrenhull. We are facing the same issue with 3.240.1. Could you please share the previous version that doesn't have this error, so that we can roll back using the workaround?

The answer by @wkostn is above:

$ss = Get-AzResource -Id "<vmss resource id>" -ExpandProperties
$ss.Properties.virtualMachineProfile.extensionProfile.extensions.properties.settings[0].agentDownloadUrl = 'https://vstsagentpackage.azureedge.net/agent/3.236.0/vsts-agent-linux-x64-3.236.0.tar.gz'
$ss | Set-AzResource