microsoft / azure-pipelines-agent

Azure Pipelines Agent 🚀
MIT License
1.72k stars 866 forks source link

Linux agent went suddenly offline #1614

Closed carct closed 6 years ago

carct commented 6 years ago

Hi,

Today, out of the blue, the linux vsts-agent just went offline. The VM didn't seem to be the problem as I was able to SSH into it (Azure VM).

What I tried:

The reconfiguration went through and the agent does show up in the proper Agent Pool, but with offline status.

When trying ./svc.sh status i get output: [root@vstsagent VSTSAgent]# ./svc.sh status

/etc/systemd/system/vsts.agent.blabla.vstsagent.service ● vsts.agent.blabla.vstsagent.service - VSTS Agent (blabla.vstsagent) Loaded: loaded (/etc/systemd/system/vsts.agent.blabla.vstsagent.service; enabled; vendor preset: disabled) Active: active (running) since Fri 2018-06-08 15:42:01 EEST; 3min 21s ago Main PID: 916 (runsvc.sh) CGroup: /system.slice/vsts.agent.blabla.vstsagent.service ├─916 /bin/bash /VSTSAgent/runsvc.sh └─921 ./externals/node/bin/node ./bin/AgentService.js

Jun 08 15:45:11 vstsagent runsvc.sh[916]: <Buffer 53 63 61 6e 6e 69 6e 67 20 66 6f 72 20 74 6f 6f 6c 20 63 61 70 61 62 69 6c 69 74 69 65 73 2e 0a> Jun 08 15:45:12 vstsagent runsvc.sh[916]: <Buffer 41 6e 20 65 72 72 6f 72 20 6f 63 63 75 72 72 65 64 3a 20 41 63 63 65 73 73 20 74 6f 20 74 68 65 20 70 61 74 68 20 27 2f 56 53 54 53 41 67 65 6e 74 2f ... > Jun 08 15:45:12 vstsagent runsvc.sh[916]: Agent listener exited with error code 2 Jun 08 15:45:12 vstsagent runsvc.sh[916]: Agent listener exit with retryable error, re-launch agent in 5 seconds. Jun 08 15:45:17 vstsagent runsvc.sh[916]: Starting Agent listener Jun 08 15:45:17 vstsagent runsvc.sh[916]: started listener process Jun 08 15:45:18 vstsagent runsvc.sh[916]: <Buffer 53 63 61 6e 6e 69 6e 67 20 66 6f 72 20 74 6f 6f 6c 20 63 61 70 61 62 69 6c 69 74 69 65 73 2e 0a> Jun 08 15:45:19 vstsagent runsvc.sh[916]: <Buffer 41 6e 20 65 72 72 6f 72 20 6f 63 63 75 72 72 65 64 3a 20 41 63 63 65 73 73 20 74 6f 20 74 68 65 20 70 61 74 68 20 27 2f 56 53 54 53 41 67 65 6e 74 2f ... > Jun 08 15:45:19 vstsagent runsvc.sh[916]: Agent listener exited with error code 2 Jun 08 15:45:19 vstsagent runsvc.sh[916]: Agent listener exit with retryable error, re-launch agent in 5 seconds.

Went to _diag folder and latest log states this: [2018-06-08 12:54:53Z INFO AgentCapabilitiesProvider] Adding 'Agent.Name': 'vstsagent' [2018-06-08 12:54:53Z INFO AgentCapabilitiesProvider] Adding 'Agent.OS': 'Linux' [2018-06-08 12:54:53Z INFO AgentCapabilitiesProvider] Adding 'InteractiveSession': 'False' [2018-06-08 12:54:53Z INFO AgentCapabilitiesProvider] Adding 'Agent.Version': '2.134.2' [2018-06-08 12:54:53Z INFO AgentCapabilitiesProvider] Adding 'Agent.ComputerName': 'vstsagent' [2018-06-08 12:54:53Z INFO HostContext] Well known directory 'Bin': '/VSTSAgent/bin' [2018-06-08 12:54:53Z INFO HostContext] Well known directory 'Root': '/VSTSAgent' [2018-06-08 12:54:53Z INFO AgentCapabilitiesProvider] Adding 'Agent.HomeDirectory': '/VSTSAgent' [2018-06-08 12:54:53Z INFO ConfigurationStore] HasCredentials() [2018-06-08 12:54:53Z INFO ConfigurationStore] stored True [2018-06-08 12:54:53Z INFO CredentialManager] GetCredentialProvider [2018-06-08 12:54:53Z INFO CredentialManager] Creating type OAuth [2018-06-08 12:54:53Z INFO CredentialManager] Creating credential type: OAuth [2018-06-08 12:54:53Z INFO RSAFileKeyManager] Loading RSA key parameters from file /VSTSAgent/.credentials_rsaparams [2018-06-08 12:54:53Z ERR Terminal] WRITE ERROR: An error occurred: Access to the path '/VSTSAgent/.credentials_rsaparams' is denied. [2018-06-08 12:54:53Z ERR AgentProcess] System.UnauthorizedAccessException: Access to the path '/VSTSAgent/.credentials_rsaparams' is denied. ---> System.IO.IOException: Permission denied --- End of inner exception stack trace --- at Interop.ThrowExceptionForIoErrno(ErrorInfo errorInfo, String path, Boolean isDirectory, Func2 errorRewriter) at Microsoft.Win32.SafeHandles.SafeFileHandle.Open(String path, OpenFlags flags, Int32 mode) at System.IO.FileStream..ctor(String path, FileMode mode, FileAccess access, FileShare share, Int32 bufferSize, FileOptions options) at System.IO.StreamReader..ctor(String path, Encoding encoding, Boolean detectEncodingFromByteOrderMarks, Int32 bufferSize) at System.IO.StreamReader..ctor(String path, Encoding encoding, Boolean detectEncodingFromByteOrderMarks) at System.IO.File.InternalReadAllText(String path, Encoding encoding) at System.IO.File.ReadAllText(String path, Encoding encoding) at Microsoft.VisualStudio.Services.Agent.Util.IOUtil.LoadObject[T](String path) at Microsoft.VisualStudio.Services.Agent.Listener.Configuration.RSAFileKeyManager.GetKey() at Microsoft.VisualStudio.Services.WebApi.VssSigningCredentials.Create(Func1 factory) at Microsoft.VisualStudio.Services.Agent.Listener.Configuration.OAuthCredential.GetVssCredentials(IHostContext context) at Microsoft.VisualStudio.Services.Agent.Listener.MessageListener.d11.MoveNext() --- End of stack trace from previous location where exception was thrown --- at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) at Microsoft.VisualStudio.Services.Agent.Listener.Agent.d8.MoveNext() --- End of stack trace from previous location where exception was thrown --- at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) at Microsoft.VisualStudio.Services.Agent.Listener.Agent.d5.MoveNext() --- End of stack trace from previous location where exception was thrown --- at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) at Microsoft.VisualStudio.Services.Agent.Listener.Program.d1.MoveNext()

Naturally, this pops to my eye: [2018-06-08 12:54:53Z INFO RSAFileKeyManager] Loading RSA key parameters from file /VSTSAgent/.credentials_rsaparams [2018-06-08 12:54:53Z ERR Terminal] WRITE ERROR: An error occurred: Access to the path '/VSTSAgent/.credentials_rsaparams' is denied.

So i went and changed the permissions, now there is no more error and actually a release that I had previously queued started, but I manually canceled it. Refreshed the VSTS Agent Pool page and my agent still appears as offline. Queued a new release, got warning that my agent is offline, clicked OK to go through anyway and of course it is not working and stating "Agent queue: agents-w-docker | Agent: not allocated". On The VM, this process is running: /VSTSAgent/bin/Agent.Listener run --startuptype service Also, I don't seem to have any more errors, see below output of svc.sh status command: [root@vstsagent VSTSAgent]# ./svc.sh status

/etc/systemd/system/vsts.agent.blabla.vstsagent.service ● vsts.agent.blabla.vstsagent.service - VSTS Agent (blabla.vstsagent) Loaded: loaded (/etc/systemd/system/vsts.agent.blabla.vstsagent.service; enabled; vendor preset: disabled) Active: active (running) since Fri 2018-06-08 16:06:43 EEST; 5min ago Main PID: 3724 (runsvc.sh) CGroup: /system.slice/vsts.agent.blabla.vstsagent.service ├─3724 /bin/bash /VSTSAgent/runsvc.sh ├─3727 ./externals/node/bin/node ./bin/AgentService.js └─3733 /VSTSAgent/bin/Agent.Listener run --startuptype service

Jun 08 16:06:43 vstsagent systemd[1]: Starting VSTS Agent (blabla.vstsagent)... Jun 08 16:06:43 vstsagent runsvc.sh[3724]: .path=/usr/lib/jvm/java-1.8.0-openjdk:/opt/ant/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/git/bin:/home/wonder/.local/bin:/home/wonder/bin Jun 08 16:06:43 vstsagent runsvc.sh[3724]: Starting Agent listener Jun 08 16:06:43 vstsagent runsvc.sh[3724]: started listener process Jun 08 16:06:43 vstsagent runsvc.sh[3724]: started running service Jun 08 16:06:45 vstsagent runsvc.sh[3724]: <Buffer 53 63 61 6e 6e 69 6e 67 20 66 6f 72 20 74 6f 6f 6c 20 63 61 70 61 62 69 6c 69 74 69 65 73 2e 0a> Jun 08 16:06:45 vstsagent runsvc.sh[3724]: <Buffer 43 6f 6e 6e 65 63 74 69 6e 67 20 74 6f 20 74 68 65 20 73 65 72 76 65 72 2e 0a> Jun 08 16:06:50 vstsagent runsvc.sh[3724]: <Buffer 32 30 31 38 2d 30 36 2d 30 38 20 31 33 3a 30 36 3a 35 30 5a 3a 20 4c 69 73 74 65 6e 69 6e 67 20 66 6f 72 20 4a 6f 62 73 0a> Jun 08 16:06:52 vstsagent runsvc.sh[3724]: <Buffer 32 30 31 38 2d 30 36 2d 30 38 20 31 33 3a 30 36 3a 35 32 5a 3a 20 52 75 6e 6e 69 6e 67 20 6a 6f 62 3a 20 52 65 6c 65 61 73 65 0a> Jun 08 16:07:44 vstsagent runsvc.sh[3724]: <Buffer 32 30 31 38 2d 30 36 2d 30 38 20 31 33 3a 30 37 3a 34 34 5a 3a 20 4a 6f 62 20 52 65 6c 65 61 73 65 20 63 6f 6d 70 6c 65 74 65 64 20 77 69 74 68 20 72 ... >

In the end I managed to get in back online (green) in the Agent Pool by using: svc.sh stop, uninstall, install, start

I would very much like to know why this thing happened in the first place since nobody touched the agent or VM and how can this be avoided in the future as it caused disruption and it took a bit of sweat to solve it.

Appreciate any input on the matter and wish everyone a nice weekend ahead :)

Kindest regards, Andrei

TingluoHuang commented 6 years ago

@carct can you send your agent diag logs to me? you can send me the whole _diag folder, i would like to check and understand why the agent goes offline at the first place. my email: tihuang @ microsoft.com

TingluoHuang commented 6 years ago

@carct i have read through your logs, here is what i found. the log file name is the UTC time of the file been created.

  1. the first log file Agent_20180608-101602-utc.log shows you configure the agent successfully, and you configure the agent run as service.
  2. log file Agent_20180608-101611-utc.log shows your agent is online from [2018-06-08 10:16:17Z INFO Terminal] WRITE LINE: 2018-06-08 10:16:17Z: Listening for Jobs to [2018-06-08 11:48:59Z INFO Terminal] WRITE LINE: Exiting..., the agent get a Ctrl-C to exit. I don't know what send the Ctrl-C to the agent.
  3. log file Agent_20180608-114936-utc.log shows your agent is online from [2018-06-08 11:49:41Z INFO Terminal] WRITE LINE: 2018-06-08 11:49:41Z: Listening for Jobs to [2018-06-08 11:55:15Z INFO Terminal] WRITE LINE: Exiting..., again the agent get a Ctrl-C to exit and I don't know what send the Ctrl-C to the agent.
  4. log file Agent_20180608-115525-utc.log shows your agent is online from [2018-06-08 11:55:30Z INFO Terminal] WRITE LINE: 2018-06-08 11:55:30Z: Listening for Jobs to [2018-06-08 12:10:15Z INFO Terminal] WRITE LINE: Exiting..., again the agent get a Ctrl-C to exit and I don't know what send the Ctrl-C to the agent. 5.log file Agent_20180608-120947-utc.log shows you start unconfigure the agent.
  5. log file Agent_20180608-123732-utc.log shows you start configure the agent, but you configure the agent as ROOT, when you configure the agent through config.sh, the script should block you from configuring as root. So how did you configure the agent, why you need to configure it as ROOT?
  6. after you configure the agent, you try to configure the agent as service, but when you run ./svc.sh install you installed the service as you, so when the agent actually running, it run as you, and it doesn't have permission to read file created by ROOT.

Are you running the agent inside a docker container? otherwise i can't think any environment that will cause agent shutdown so frequently.

carct commented 6 years ago

Hi,

No special need of root, I got to use it like that as it didn't work without (I know from previous setup it actually doesn't require the root, but another user). used command: sudo ./bin/Agent.Listener configure --url https://*my-url*.visualstudio.com/ --agent myagentname --pool mypool --auth PAT --token mytoken --acceptTeeEula --work _work --nostart

In the past I have used this without root as required - too bad I didn't save the output that made my try this time with root. :(

Agent is inside a VM direclty, it isn't containerized and the VM is not crashing or anything like that. I have confirmation that nobody interacted with the VM / agent.

Btw, is it possible that the agent auto-updated without sending willingly from VSTS Portal (Agent Pool page) the updates on it ? I didn't see any update-related settings over there, just to send the update over, did i miss anything ?

Should I try to re-download and re-configure entirely the VSTS agent ? I see that over the weekend the agent is still green and online in the pool as expected.

PS: we're now experiencing some docker related issues, might be due to this set-up with root thing(?!): 2018-06-11T08:49:37.0684402Z ##[error]Got permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Post http://%2Fvar%2Frun%2Fdocker.sock/v1.26/build?buildargs=%7B%7D&buildbinds=null&cachefrom=%5B]&cgroupparent=&cpuperiod=0&cpuquota=0&cpusetcpus=&cpusetmems=&cpushares=0&dockerfile=Dockerfile-spec-env&labels=%7B%7D&memory=0&memswap=0&networkmode=default&rm=1&shmsize=0&t=artifactory.*mydomain*.com%3A6000%2F*myimage*%3A*myimageversion*&t=artifactory.*mydomain*.com%3A6000%2F*myimage*&ulimits=null: dial unix /var/run/docker.sock: connect: permission denied 2018-06-11T08:49:37.0785856Z ##[debug]Processed: ##vso[task.issue type=error;]Got permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Post http://%2Fvar%2Frun%2Fdocker.sock/v1.26/build?buildargs=%7B%7D&buildbinds=null&cachefrom=%5B%5D&cgroupparent=&cpuperiod=0&cpuquota=0&cpusetcpus=&cpusetmems=&cpushares=0&dockerfile=Dockerfile-spec-env&labels=%7B%7D&memory=0&memswap=0&networkmode=default&rm=1&shmsize=0&t=artifactory.*mydomain*.com%3A6000%2F*myimage*%3A*myimageversion*&t=artifactory.*mydomain*.com%3A6000%2F*myimage*k&ulimits=null: dial unix /var/run/docker.sock: connect: permission denied 2018-06-11T08:49:37.0879954Z ##[debug]task result: Failed

Thank you, A.

bryanmacfarlane commented 6 years ago

The agent will update if (1) invoked from portal or (2) a feature is used that demands a new agent. For example, if the service (which always moves forward) adds a feature which required changed in the agent, when it's used, we will pull the agent forward to that baseline. The automation workflow runs on the service and uses the agent worker as an extension to run nodes of the graph.

carct commented 6 years ago

ok, that would indeed explain the newer version that i saw - strange that it kinda broke it though

TingluoHuang commented 6 years ago

@carct for the docker error, if you are using the one of the inbox task, you can log issue at the task repo: https://github.com/microsoft/vsts-tasks

when you setup machine, have you follow this instruction? https://docs.docker.com/install/linux/linux-postinstall/

BTW, is your original question got answered, are we good on closing the issue?

TingluoHuang commented 6 years ago

i am going to close this issue, please reopen if you need more help.

alexvy86 commented 6 years ago

@TingluoHuang I think I experienced the same thing, with agents in two different servers. Probably relevant to say that both VSTS agents were installed through the Azure VM extension. Apparently the VSTS Agents were updated at the same time today (08/16 04:52 UTC), which is made very clear by an empty file called EXTENSIONUPDATE inside the VSTSAgent#### folders, with the same timestamp as the newest VSTSAgent#### folders.

image

I did not trigger the update manually, and I can't think of what could have happened that required a new agent feature, as described by @bryanmacfarlane . I thought the numbers at the end of the VSTSAgent folder might be a version identifier but the two folders created today are different, so maybe some sort of timestamp...?

Now, in both cases I see this at the end of the latest Agent_ log file inside the _diag folder:

[2018-08-16 04:52:31Z INFO PersonalAccessToken] EnsureCredential
[2018-08-16 04:52:31Z INFO CommandSettings] Arg 'token': ''
[2018-08-16 04:52:31Z INFO CommandSettings] Flag 'unattended': 'True'
[2018-08-16 04:52:31Z INFO PromptManager] ReadValue
[2018-08-16 04:52:31Z INFO Terminal] WRITE LINE: Failed: Removing agent from the server
[2018-08-16 04:52:32Z ERR  Agent] System.Exception: Invalid configuration provided for token. Terminating unattended configuration.
   at Microsoft.VisualStudio.Services.Agent.Listener.Configuration.PromptManager.ReadValue(String argName, String description, Boolean secret, String defaultValue, Func`2 validator, Boolean unattended)
   at Microsoft.VisualStudio.Services.Agent.Listener.CommandSettings.GetArgOrPrompt(String name, String description, String defaultValue, Func`2 validator)
   at Microsoft.VisualStudio.Services.Agent.Listener.Configuration.PersonalAccessToken.EnsureCredential(IHostContext context, CommandSettings command, String serverUrl)
   at Microsoft.VisualStudio.Services.Agent.Listener.Configuration.ConfigurationManager.GetCredentialProvider(CommandSettings command, String serverUrl)
   at Microsoft.VisualStudio.Services.Agent.Listener.Configuration.ConfigurationManager.<UnconfigureAsync>d__7.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.VisualStudio.Services.Agent.Listener.Agent.<ExecuteCommand>d__5.MoveNext()

So evidently the PAT that would allow the agent to communicate with VSTS is not there, but I imagine that's unavoidable...? I provided one to the Azure VM extension when setting up the agent, but I assume it doesn't store it. So any update to the agents will break them and require that we restart or reconfigure them manually, providing another PAT?

And if that is the proper workaround for this issue, how would I have to about it?

alexvy86 commented 6 years ago

And this is what I get if I try to start the service:

<my-user>@<my-server>:/VSTSAgent1534395152.047264$ sudo ./svc.sh start
Failed to start vsts.agent.<REDACTED>.service: Unit vsts.agent.<REDACTED>.service not found.
Failed: failed to start vsts.agent.<REDACTED>.service
bryanmacfarlane commented 6 years ago

Right, the PAT is used to interactively configure and unconfigure. It is not used to communicate with VSTS. You have a half unconfigured service - it removed it as a service but couldn't remove it from the service.

By design, we don't persist the very elevated PAT which equates to your account to register an agent.

This doesn't look like a bug in the agent - it looks like an issue with the azure VM extension if it's trying to unconfigure.

bryanmacfarlane commented 6 years ago

Registering/unregistering requires a super elevate PAT, but that interactive process downloads a limited access key/token using your authorization to listen to the queue: https://github.com/Microsoft/vsts-agent/blob/master/docs/design/auth.md

alexvy86 commented 6 years ago

That was my understanding of PAT usage. It does look like an issue with the extension, the last line in the logs of the VM Agent (which manages VM extension updates) is an error.

2018/08/16 04:52:35.286929 ERROR Event: name=Microsoft.VisualStudio.Services.TeamServicesAgentLinux, op=Enable, message=[ExtensionError] Non-zero exit code: 5, AzureRM.py -enable, duration=0

I'm trying to find a way to report it to them but can't find where, if either of you can point me in the right direction I'd appreciate it.

~Now, just so I understand... what is it that stays in the VM/Agent that allows it to communicate with VSTS once the PAT is no longer needed? Updating the agent in place is obviously possible, I assume that's what updating from the VSTS portal would do, so maybe in this case that information is still there somewhere (the .credentials and .credentials_rsaparams files?), and the fix that the VM extension needs to implement is just trying to restart the agent instead of "reconfiguring" it?~ Nevermind, found it in the link you provided. Thanks @bryanmacfarlane .

bryanmacfarlane commented 6 years ago

@TingluoHuang - do you know where to report issues with VM extension?

abhiramdas99 commented 1 year ago

And this is what I get if I try to start the service:

<my-user>@<my-server>:/VSTSAgent1534395152.047264$ sudo ./svc.sh start
Failed to start vsts.agent.<REDACTED>.service: Unit vsts.agent.<REDACTED>.service not found.
Failed: failed to start vsts.agent.<REDACTED>.service

try following command before start
sudo ./svc.sh install sudo ./svc.sh start sudo ./svc.sh status