microsoft / azure-pipelines-agent


Agent.Listener consumes 100% CPU on Linux during builds #454

Closed: ppanyukov closed this issue 7 years ago

ppanyukov commented 8 years ago

Agent version and platform

Version of your agent? 2.104.0
OS of the machine running the agent?

Linux. Two setups:

- CentOS 7 container on an Ubuntu Xenial host
- CentOS 7 container on a CentOS 7 host


VSTS type and version

VisualStudio.com

What's not working?

When I run a build, I notice that the Agent.Listener process is at 100% CPU almost all the time.

This is very surprising; surely this should not happen?

Here are some stats.

The build ran for 35 minutes (this is a normal build time in our case).

[screenshot, 2016-07-27 at 17:47:39]

Here you can see that Agent.Listener consumed 25 minutes of CPU during the build, which implies it was near 100% CPU most of the time (if my interpretation is correct).

[screenshot, 2016-07-27 at 17:46:21]
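
If it helps with reproducing this outside the portal, the same cumulative CPU figure can be read with plain ps; the PID here is the Agent.Listener PID from the top output below, so adjust as needed.

# TIME is cumulative CPU time and ELAPSED is wall-clock time since start;
# TIME growing almost as fast as ELAPSED means the process sits near 100%
# of one core.
$ ps -o pid,etime,time,pcpu,comm -p 57256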

During the build the top command showed pretty much this picture:

top - 16:00:25 up  3:22,  2 users,  load average: 1.96, 1.98, 1.41
Tasks: 344 total,   2 running, 342 sleeping,   0 stopped,   0 zombie
%Cpu(s): 52.5 us, 47.3 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.2 si,  0.0 st
KiB Mem :  7134548 total,   324348 free,   390568 used,  6419632 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  6331304 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
57256 philip    20   0 3342472  76984  36944 S  98.7  1.1  14:07.53 Agent.Listener
35561 philip    20   0  105248  81984  10824 R  12.0  1.1   0:00.36 cc1plus
    7 root      20   0       0      0      0 S   3.3  0.0   0:08.94 rcu_sched
57332 philip    20   0 3613416  80888  37080 S   1.7  1.1   0:28.16 Agent.Worker
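
As a side note, the strace output below suggests a single busy thread (the [pid 57279] entries are one such thread), and a per-thread view is a quick way to find the thread ID to look at; these are standard top/ps options, nothing agent-specific.

# -H lists the individual threads of the process rather than just the process.
$ top -H -p 57256

# One-shot equivalent with ps; TID is what strace -f reports as [pid NNNNN].
$ ps -L -o tid,pcpu,time,comm -p 57256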

I ran strace on it, and it showed output like this:

$ sudo strace -s 99 -ffp 57256

[pid 57279] poll([{fd=127, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, 0) = 0 (Timeout)
[pid 57279] clock_gettime(CLOCK_MONOTONIC, {12261, 652366193}) = 0
[pid 57279] clock_gettime(CLOCK_MONOTONIC, {12261, 652401693}) = 0
[pid 57279] clock_gettime(CLOCK_MONOTONIC, {12261, 652428994}) = 0
[pid 57279] clock_gettime(CLOCK_MONOTONIC, {12261, 652495895}) = 0
[pid 57279] clock_gettime(CLOCK_MONOTONIC, {12261, 652525995}) = 0
[pid 57279] poll([{fd=127, events=POLLIN}, {fd=121, events=POLLIN}], 2, 1000) = 1 ([{fd=121, revents=POLLIN}])
[pid 57279] clock_gettime(CLOCK_MONOTONIC, {12261, 652607496}) = 0
[pid 57279] poll([{fd=127, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, 0) = 0 (Timeout)
[pid 57279] clock_gettime(CLOCK_MONOTONIC, {12261, 652674297}) = 0
[pid 57279] clock_gettime(CLOCK_MONOTONIC, {12261, 652710097}) = 0
[pid 57279] clock_gettime(CLOCK_MONOTONIC, {12261, 652737998}) = 0
[pid 57279] clock_gettime(CLOCK_MONOTONIC, {12261, 652768598}) = 0
[pid 57279] clock_gettime(CLOCK_MONOTONIC, {12261, 652806199}) = 0
[pid 57279] poll([{fd=127, events=POLLIN}, {fd=121, events=POLLIN}], 2, 1000) = 1 ([{fd=121, revents=POLLIN}])
[pid 57279] clock_gettime(CLOCK_MONOTONIC, {12261, 652878200}) = 0
[pid 57279] poll([{fd=127, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, 0) = 0 (Timeout)

.... (a fast-scrolling stream of lines like the above)

Looks like there are a lot of calls to clock_gettime.

When I ran strace with counters for a few seconds, it showed this:

$ sudo strace -s 99 -c -f -p 57256

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 89.39   28.184496        5627      5009       957 futex
 10.52    3.316000      829000         4         2 restart_syscall
  0.08    0.023929           0    405812           poll
  0.02    0.005094           0   1219155           clock_gettime
  0.00    0.000017           0       834           mprotect
  0.00    0.000000           0         4           read
  0.00    0.000000           0         1           write
  0.00    0.000000           0         2           open
  0.00    0.000000           0         4           close
  0.00    0.000000           0         2           socket
  0.00    0.000000           0         2         2 connect
  0.00    0.000000           0         1           sendto
  0.00    0.000000           0         2           recvfrom
  0.00    0.000000           0       435           wait4
  0.00    0.000000           0         4           fcntl
  0.00    0.000000           0        95           gettimeofday
  0.00    0.000000           0        91           getrusage
  0.00    0.000000           0         2           getuid
  0.00    0.000000           0         2           getppid
  0.00    0.000000           0       419           gettid
------ ----------- ----------- --------- --------- ----------------
100.00   31.529536               1631880       961 total

In a few seconds we accumulated 1,219,155 calls to clock_gettime and 405,812 calls to poll.

I'm not sure about poll, but that many calls to clock_gettime does not feel right?

The above was from the CentOS 7 container running on an Ubuntu Xenial host.

Running the CentOS 7 container on a CentOS 7 host showed a similar pattern, but less severe: Agent.Listener consumed 11 minutes of CPU during a 30-minute build.

Is there a busy loop somewhere that shouldn't be there? Is it a bug in the .NET CLR? Is it due to containerisation?
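
In case it helps whoever picks this up: sampling over a fixed window makes the rates easier to compare across setups. A sketch, with an arbitrary 10-second window:

# Count syscalls across all threads for 10 seconds; SIGINT makes strace
# detach cleanly and print its -c summary table.
$ sudo timeout --signal=INT 10 strace -c -f -p 57256

# Dividing each call count by the window length gives calls per second,
# which is what makes the clock_gettime and poll numbers stand out.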


Agent and Worker's diag log

There is nothing in the logs that looks relevant.

TingluoHuang commented 8 years ago

@ppanyukov This might be a .NET Core problem. There is an issue in the dotnet/corefx repo that looks similar: https://github.com/dotnet/corefx/issues/9751

ppanyukov commented 8 years ago

@TingluoHuang is that the correct issue? It does not look related.

TingluoHuang commented 8 years ago

https://github.com/dotnet/corefx/issues/9751 my bad.

ppanyukov commented 8 years ago

Looks pretty much spot on. The poll syscalls in my trace are on fd 127, which is a socket to our company.visualstudio.com.
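
For the record, one way to confirm that fd-to-socket mapping from /proc and ss (the PID and fd numbers are the ones from my trace; these are standard Linux tools, nothing agent-specific):

# For a socket fd the symlink target looks like "socket:[<inode>]".
$ sudo ls -l /proc/57256/fd/127

# ss -p shows which process and fd own each TCP connection, so the remote
# address behind that fd is visible here.
$ sudo ss -tnp | grep 'pid=57256'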

TingluoHuang commented 8 years ago

@ppanyukov I would like to link your issue for the dotnet core guys; I want to let them know that the issue is affecting the vsts-agent.

ppanyukov commented 8 years ago

@TingluoHuang definitely, it would be nice to fix this. If there is anything I can help with (repro, debugging, etc.), let me know; I will be glad to help.

BTW, if you look at the issue now you will see it's already linked (unless you meant a different kind of link).

TingluoHuang commented 8 years ago

@ppanyukov I will leave a message there to let them know. :)

TingluoHuang commented 8 years ago

Closing since https://github.com/dotnet/corefx/issues/9751 has been resolved.

ppanyukov commented 8 years ago

@TingluoHuang Do you know when the fix from corefx will be available in the agent? Or is it already done?

TingluoHuang commented 8 years ago

@ppanyukov It will take a while; we won't consume the latest corefx until they have a plan for shipping 1.1.0.

ppanyukov commented 8 years ago

@TingluoHuang I guess you meant 1.2.0, not 1.1.0? If so, I see that it's due by April 30, 2017, which is 6 months away; is that correct?

In any case, even though this is an external issue, the agent problem itself is not actually resolved yet, so can we perhaps keep this issue open until the agent ships with the fixed corefx?

TingluoHuang commented 8 years ago

Reopening for tracking.

TingluoHuang commented 7 years ago

The 2.114.0 agent is built on .NET Core 1.1, which contains the fix.
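
For anyone who wants to double-check after upgrading: re-running the earlier counter trace against the new listener process should show the clock_gettime and poll counts dropping to sane levels (pgrep is just a convenience for finding the new PID).

# Same measurement as before, pointed at the upgraded agent's listener.
$ sudo strace -c -f -p "$(pgrep -f Agent.Listener | head -n 1)"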