microsoft / azure-pipelines-agent


Agent.Listener consumes 100% CPU on Linux during builds #454

Closed: ppanyukov closed this issue 7 years ago

ppanyukov commented 8 years ago

Agent version and platform

Version of your agent? 2.104.0
OS of the machine running the agent?

Linux. Two setups:

- CentOS 7 container on an Ubuntu Xenial host
- CentOS 7 container on a CentOS 7 host


VSTS type and version

VisualStudio.com

What's not working?

When I run a build, I notice that the Agent.Listener process is at 100% CPU almost all the time.

This is very surprising; surely this should not happen?

Here are some stats.

The build ran for 35 minutes (this is a normal build time in our case).

[screenshot, 2016-07-27 at 17:47:39]

Here you can see that Agent.Listener consumed 25 minutes of CPU during the build, which implies it was near 100% CPU most of the time (if my interpretation is correct).

[screenshot, 2016-07-27 at 17:46:21]
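
If it helps with reproducing this outside the portal, the same cumulative CPU figure can be read with plain ps; the PID here is the Agent.Listener PID from the top output below, so adjust as needed.

# TIME is cumulative CPU time and ELAPSED is wall-clock time since start;
# TIME growing almost as fast as ELAPSED means the process sits near 100%
# of one core.
$ ps -o pid,etime,time,pcpu,comm -p 57256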

During the build the top command showed pretty much this picture:

top - 16:00:25 up  3:22,  2 users,  load average: 1.96, 1.98, 1.41
Tasks: 344 total,   2 running, 342 sleeping,   0 stopped,   0 zombie
%Cpu(s): 52.5 us, 47.3 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.2 si,  0.0 st
KiB Mem :  7134548 total,   324348 free,   390568 used,  6419632 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  6331304 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
57256 philip    20   0 3342472  76984  36944 S  98.7  1.1  14:07.53 Agent.Listener
35561 philip    20   0  105248  81984  10824 R  12.0  1.1   0:00.36 cc1plus
    7 root      20   0       0      0      0 S   3.3  0.0   0:08.94 rcu_sched
57332 philip    20   0 3613416  80888  37080 S   1.7  1.1   0:28.16 Agent.Worker
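
As a side note, the strace output below suggests a single busy thread (the [pid 57279] entries are one such thread), and a per-thread view is a quick way to find the thread ID to look at; these are standard top/ps options, nothing agent-specific.

# -H lists the individual threads of the process rather than just the process.
$ top -H -p 57256

# One-shot equivalent with ps; TID is what strace -f reports as [pid NNNNN].
$ ps -L -o tid,pcpu,time,comm -p 57256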

I ran strace on it, and it showed output like this:

$ sudo strace -s 99 -ffp 57256

[pid 57279] poll([{fd=127, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, 0) = 0 (Timeout)
[pid 57279] clock_gettime(CLOCK_MONOTONIC, {12261, 652366193}) = 0
[pid 57279] clock_gettime(CLOCK_MONOTONIC, {12261, 652401693}) = 0
[pid 57279] clock_gettime(CLOCK_MONOTONIC, {12261, 652428994}) = 0
[pid 57279] clock_gettime(CLOCK_MONOTONIC, {12261, 652495895}) = 0
[pid 57279] clock_gettime(CLOCK_MONOTONIC, {12261, 652525995}) = 0
[pid 57279] poll([{fd=127, events=POLLIN}, {fd=121, events=POLLIN}], 2, 1000) = 1 ([{fd=121, revents=POLLIN}])
[pid 57279] clock_gettime(CLOCK_MONOTONIC, {12261, 652607496}) = 0
[pid 57279] poll([{fd=127, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, 0) = 0 (Timeout)
[pid 57279] clock_gettime(CLOCK_MONOTONIC, {12261, 652674297}) = 0
[pid 57279] clock_gettime(CLOCK_MONOTONIC, {12261, 652710097}) = 0
[pid 57279] clock_gettime(CLOCK_MONOTONIC, {12261, 652737998}) = 0
[pid 57279] clock_gettime(CLOCK_MONOTONIC, {12261, 652768598}) = 0
[pid 57279] clock_gettime(CLOCK_MONOTONIC, {12261, 652806199}) = 0
[pid 57279] poll([{fd=127, events=POLLIN}, {fd=121, events=POLLIN}], 2, 1000) = 1 ([{fd=121, revents=POLLIN}])
[pid 57279] clock_gettime(CLOCK_MONOTONIC, {12261, 652878200}) = 0
[pid 57279] poll([{fd=127, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, 0) = 0 (Timeout)

.... (a fast-scrolling stream of lines like the above)

Looks like there are a lot of calls to clock_gettime.

When I ran strace with counters for a few seconds, it showed this:

$ sudo strace -s 99 -c -f -p 57256

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 89.39   28.184496        5627      5009       957 futex
 10.52    3.316000      829000         4         2 restart_syscall
  0.08    0.023929           0    405812           poll
  0.02    0.005094           0   1219155           clock_gettime
  0.00    0.000017           0       834           mprotect
  0.00    0.000000           0         4           read
  0.00    0.000000           0         1           write
  0.00    0.000000           0         2           open
  0.00    0.000000           0         4           close
  0.00    0.000000           0         2           socket
  0.00    0.000000           0         2         2 connect
  0.00    0.000000           0         1           sendto
  0.00    0.000000           0         2           recvfrom
  0.00    0.000000           0       435           wait4
  0.00    0.000000           0         4           fcntl
  0.00    0.000000           0        95           gettimeofday
  0.00    0.000000           0        91           getrusage
  0.00    0.000000           0         2           getuid
  0.00    0.000000           0         2           getppid
  0.00    0.000000           0       419           gettid
------ ----------- ----------- --------- --------- ----------------
100.00   31.529536               1631880       961 total

In a few seconds we accumulated 1,219,155 calls to clock_gettime and 405,812 calls to poll.

I'm not sure about poll, but that many calls to clock_gettime does not feel right?

The above was from the CentOS 7 container running on an Ubuntu Xenial host.

Running the CentOS 7 container on a CentOS 7 host showed a similar pattern, but less severe: Agent.Listener consumed 11 minutes of CPU during a 30-minute build.

Is there a busy loop somewhere that shouldn't be there? Is it a bug in the .NET CLR? Is it due to containerisation?
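
In case it helps whoever picks this up: sampling over a fixed window makes the rates easier to compare across setups. A sketch, with an arbitrary 10-second window:

# Count syscalls across all threads for 10 seconds; SIGINT makes strace
# detach cleanly and print its -c summary table.
$ sudo timeout --signal=INT 10 strace -c -f -p 57256

# Dividing each call count by the window length gives calls per second,
# which is what makes the clock_gettime and poll numbers stand out.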


Agent and Worker's diag log

There is nothing in the logs that looks relevant.

TingluoHuang commented 8 years ago

@ppanyukov This might be a .NET Core problem. There is an issue in the dotnet/corefx repo that looks similar: https://github.com/dotnet/corefx/issues/9751

ppanyukov commented 8 years ago

@TingluoHuang is that the correct issue? It does not look related.

TingluoHuang commented 8 years ago

https://github.com/dotnet/corefx/issues/9751 my bad.

ppanyukov commented 8 years ago

Looks pretty much spot on. The poll syscalls in my trace are on fd 127, which is a socket to our company.visualstudio.com.
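
For the record, one way to confirm that fd-to-socket mapping from /proc and ss (the PID and fd numbers are the ones from my trace; these are standard Linux tools, nothing agent-specific):

# For a socket fd the symlink target looks like "socket:[<inode>]".
$ sudo ls -l /proc/57256/fd/127

# ss -p shows which process and fd own each TCP connection, so the remote
# address behind that fd is visible here.
$ sudo ss -tnp | grep 'pid=57256'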

TingluoHuang commented 8 years ago

@ppanyukov I would like to link your issue for the dotnet core guys; I want to let them know that the issue is affecting the vsts-agent.

ppanyukov commented 8 years ago

@TingluoHuang definitely, it would be nice to fix this. If there is anything I can help with (repro, debugging, etc.), let me know; I will be glad to help.

BTW, if you look at the issue now you will see it's already linked (unless you meant a different kind of link).

TingluoHuang commented 8 years ago

@ppanyukov I will leave a message there to let them know. :)

TingluoHuang commented 8 years ago

Closing since https://github.com/dotnet/corefx/issues/9751 has been resolved.

ppanyukov commented 8 years ago

@TingluoHuang Do you know when the fix from corefx will be available in the agent? Or is it already done?

TingluoHuang commented 8 years ago

@ppanyukov It will take a while; we won't consume the latest corefx until they have a plan for shipping 1.1.0.

ppanyukov commented 8 years ago

@TingluoHuang I guess you meant 1.2.0, not 1.1.0? If so, I see that it's due by April 30, 2017, which is 6 months away; is that correct?

In any case, even though this is an external issue, the agent problem itself is not actually resolved yet, so can we perhaps keep this issue open until the agent ships with the fixed corefx?

TingluoHuang commented 8 years ago

Reopening for tracking.

TingluoHuang commented 7 years ago

The 2.114.0 agent is built on .NET Core 1.1, which contains the fix.
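
For anyone who wants to double-check after upgrading: re-running the earlier counter trace against the new listener process should show the clock_gettime and poll counts dropping to sane levels (pgrep is just a convenience for finding the new PID).

# Same measurement as before, pointed at the upgraded agent's listener.
$ sudo strace -c -f -p "$(pgrep -f Agent.Listener | head -n 1)"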