nodejs / build

Better build and test infra for Node.

CITGM runs taking too long on AIX #546

Closed mhdawson closed 6 years ago

mhdawson commented 8 years ago

@georgeadams95 is reporting that CITGM runs on AIX are taking multiple hours, whereas other platforms only take 30-50m.

The regular regression runs on AIX are in the 20m range, which is more than Linux but in line with Windows/ARM. So we might expect CITGM to take a bit longer, but not as long as we are seeing.

This issue is to track our progress in narrowing down what the bottleneck might be.

mhdawson commented 8 years ago

Snapshot of perf data while CITGM was running:

Topas Monitor for host:    power8-nodejs2       EVENTS/QUEUES    FILE/TTY
Thu Nov 17 14:41:33 2016   Interval:  2         Cswitch     383  Readch    13887
                                                Syscall     277  Writech     231
CPU  User%  Kern%  Wait%  Idle%  Physc   Entc   Reads        47  Rawin         0
ALL    1.5    2.8    0.0   95.7   0.08    7.8   Writes       45  Ttyout       76
                                                Forks         0  Igets         0
Network  KBPS   I-Pack  O-Pack   KB-In  KB-Out  Execs         0  Namei        73
Total     1.9     18.5    16.5     0.9     1.0  Runqueue    0.6  Dirblk        0
                                                Waitqueue   0.0
Disk    Busy%     KBPS     TPS KB-Read KB-Writ                   MEMORY

Total    32.0    356.0    89.0     0.0   356.0  PAGING           Real,MB   32768
                                                Faults       22  % Comp     14
FileSystem        KBPS     TPS KB-Read KB-Writ  Steals        0  % Noncomp   9
Total             13.5     4.2   13.5    0.0    PgspIn        0  % Client    9
                                                PgspOut       0
Name            PID  CPU%  PgSp Owner           PageIn        0  PAGING SPACE
node        7274518   4.2 453.3 iojs            PageOut       0  Size,MB     512
java        6619362   0.0  48.0 iojs            Sios          0  % Used      3
syncd       2752684   0.0   0.6 root                             % Free     97
getty       6422728   0.0   0.7 root            NFS (calls/sec)
gil         2293830   0.0   0.9 root            SerV2         0  WPAR Activ     0
IBM.Mgmt    7405802   0.0   5.8 root            CliV2         0  WPAR Total     0
j2pg        2818138   0.0   8.1 root            SerV3         0  Press: "h"-help
rmcd        5243066   0.0  12.2 root            CliV3         0         "q"-quit
xmgc         917532   0.0   0.4 root            SerV4         0
clcomd      5701812   0.0   1.8 root            CliV4         0
topasrec    6357190   0.0   1.4 root
rpc.lock    5046456   0.0   1.2 root
mhdawson commented 8 years ago

Does not look to be CPU constrained, at least in the snapshot I grabbed and from having watched the output for a few minutes.

mhdawson commented 8 years ago

Nothing looks constrained, in that the disk busy%, network numbers, etc. don't seem excessive.

@georgeadams95 I think the next step is to figure out whether it's slow consistently across the whole test run or just while testing specific modules.

Once we know that, we can narrow down what we use as a recreate to help with the investigation.
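
As a rough sketch (the module names here are just illustrative examples, not the actual slow ones), timing a handful of individual module runs on the AIX machine and comparing against a Linux host would show whether the slowdown is uniform or concentrated:

  # Hypothetical spot-check: time a few individual CITGM module runs on AIX
  # and compare the same modules on a Linux host.
  for mod in lodash express through2; do
    echo "=== $mod ==="
    time citgm "$mod"
  done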

gdams commented 8 years ago

@mhdawson I'll take a look tomorrow and see if I can spot anything obvious.

gibfahn commented 8 years ago

vmstat output:

bash-4.3$ vmstat -v
              8388608 memory pages
              8133856 lruable pages
              6306949 free pages
                    4 memory pools
               788729 pinned pages
                 80.0 maxpin percentage
                  3.0 minperm percentage
                 90.0 maxperm percentage
                 11.8 numperm percentage
               961457 file pages
                  0.0 compressed percentage
                    0 compressed pages
                 11.8 numclient percentage
                 90.0 maxclient percentage
               961457 client pages
                    0 remote pageouts scheduled
                12889 pending disk I/Os blocked with no pbuf
                    0 paging space I/Os blocked with no psbuf
                 2228 filesystem I/Os blocked with no fsbuf
                 3093 client filesystem I/Os blocked with no fsbuf
               220085 external pager filesystem I/Os blocked with no fsbuf
                 13.4 percentage of memory used for computational pages

df output (in MB):

bash-4.3$ df -m
Filesystem    MB blocks      Free %Used    Iused %Iused Mounted on
/dev/hd4         512.00    223.35   57%    17716    26% /
/dev/hd2        2880.00     65.11   98%    52092    70% /usr
/dev/hd9var      576.00    220.30   62%     6739    12% /var
/dev/hd3         128.00     76.95   40%     7592    21% /tmp
/dev/hd1          64.00     63.63    1%        7     1% /home2
/dev/hd11admin    128.00    127.63    1%        5     1% /admin
/proc                 -         -    -         -     -  /proc
/dev/hd10opt    1792.00    889.16   51%    17190     8% /opt
/dev/livedump    256.00    255.64    1%        4     1% /var/adm/ras/livedump
/dev/fslv00    57216.00  47078.34   18%   317263     3% /home
/aha                  -         -    -        40     1% /aha
gibfahn commented 8 years ago

/tmp on both machines is 128MB, which seems very small.

ccache is definitely installed and in the path. In this build I'm using export CC=$(which gcc) and export CXX=$(which g++), which means builds look like:

  /opt/freeware/bin/ccache/g++  -I../deps/gtest -I../deps/gtest/include  -pthread -Wall -Wextra -Wno-unused-parameter -maix64 -Wno-missing-field-initializers -O3 -fno-omit-frame-pointer -fno-rtti -fno-exceptions -std=gnu++0x -MMD -MF /home/iojs/build/workspace/gibfahn-citgm-smoker-more-platforms/node/aix61-ppc64/out/Release/.deps//home/iojs/build/workspace/gibfahn-citgm-smoker-more-platforms/node/aix61-ppc64/out/Release/obj.target/gtest/deps/gtest/src/gtest-death-test.o.d.raw   -c -o /home/iojs/build/workspace/gibfahn-citgm-smoker-more-platforms/node/aix61-ppc64/out/Release/obj.target/gtest/deps/gtest/src/gtest-death-test.o ../deps/gtest/src/gtest-death-test.cc

On the main CI build the build commands look like:

  g++  -I../deps/gtest -I../deps/gtest/include  -pthread -Wall -Wextra -Wno-unused-parameter -maix64 -Wno-missing-field-initializers -O3 -fno-omit-frame-pointer -fno-rtti -fno-exceptions -std=gnu++0x -MMD -MF /home/iojs/build/workspace/node-test-commit-aix/nodes/aix61-ppc64/out/Release/.deps//home/iojs/build/workspace/node-test-commit-aix/nodes/aix61-ppc64/out/Release/obj.target/gtest/deps/gtest/src/gtest-death-test.o.d.raw   -c -o /home/iojs/build/workspace/node-test-commit-aix/nodes/aix61-ppc64/out/Release/obj.target/gtest/deps/gtest/src/gtest-death-test.o ../deps/gtest/src/gtest-death-test.cc

So I'm not sure whether they're using ccache or not (I can't currently see the configuration).
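
One quick check, as a sketch: ccache can zero its statistics, so comparing the counters before and after a CI build would show whether the build is going through it:

  # Zero the ccache statistics, run a build, then re-check the counters;
  # if the hit/miss numbers move, the build is going through ccache.
  /opt/freeware/bin/ccache/ccache -z
  # ... run the build ...
  /opt/freeware/bin/ccache/ccache -s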

@GeorgeAdams95 suggested that this vmstat configuration might not be optimal:

                  3.0 minperm percentage
                 90.0 maxperm percentage
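
For reference, a sketch of how those tunables could be inspected (and, if we decide to, adjusted) with vmo on AIX; the example value is purely illustrative, not a recommendation:

  # Show the current VMM file-cache tunables.
  vmo -a | egrep 'minperm%|maxperm%|maxclient%'

  # Illustrative change only; add -p to persist across reboots.
  vmo -o maxperm%=80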
gibfahn commented 8 years ago

Running ccache manually on the machine, the cache directory looks odd:

bash-4.3$ /opt/freeware/bin/ccache/ccache -s
cache directory                     //.ccache
primary config                      //.ccache/ccache.conf
secondary config      (readonly)    /usr/local/etc/ccache.conf
cache hit (direct)                     0
cache hit (preprocessed)               0
cache miss                             0
files in cache                         0
cache size                           0.0 kB
max cache size                       5.0 GB
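
The //.ccache path suggests $HOME was empty or unset in that shell, so "$HOME/.ccache" degenerated to "//.ccache". A sketch of how to take HOME out of the equation, using ccache's standard CCACHE_DIR override:

  # Pin the cache location explicitly rather than deriving it from $HOME.
  export CCACHE_DIR=/home/iojs/.ccache
  /opt/freeware/bin/ccache/ccache -s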

Actually, it looks fine from within a Jenkins job:

+ /opt/freeware/bin/ccache/ccache -s
cache directory                     /home/iojs/.ccache
primary config                      /home/iojs/.ccache/ccache.conf
secondary config      (readonly)    /usr/local/etc/ccache.conf
cache hit (direct)               1882768
cache hit (preprocessed)           20257
cache miss                         37345
called for link                    22387
called for preprocessing            2712
compile failed                       167
preprocessor error                    26
couldn't find the compiler             5
bad compiler arguments                39
autoconf compile/link                624
no input file                       1047
files in cache                    129510
cache size                           3.3 GB
max cache size                       5.0 GB
jbergstroem commented 8 years ago

Can confirm that ccache works as intended. Is /tmp/ used for something? If the disks are slow, one way to solve it would be to use a tmpfs for the npm stuff.

jbergstroem commented 8 years ago

Thinking this output is related 😄

verbose:                     | Error: ENOSPC: no space left on device, write       
gdams commented 8 years ago

@jbergstroem that is just because the /tmp drive is running low. It shouldn't be affecting the job speed, should it?

jbergstroem commented 8 years ago

@GeorgeAdams95 I just noticed that the test was stalling and that was the last error. Perhaps unrelated.

Anyway, npm's tempdir is usually $HOME/tmp. Can we perhaps make sure we use that instead? That way we'll avoid the disk-full errors.

gibfahn commented 8 years ago

@jbergstroem Would you want to use $HOME/tmp or something in the workspace? I normally just use $WORKSPACE/npm-tmp and $WORKSPACE/npm-cache.
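
For reference, a sketch of how that could be wired into a Jenkins shell step; npm picks these up through its npm_config_* environment variables (the directory names follow the ones above):

  # Keep npm's scratch space and cache inside the Jenkins workspace
  # instead of the tiny /tmp filesystem.
  export npm_config_tmp="$WORKSPACE/npm-tmp"
  export npm_config_cache="$WORKSPACE/npm-cache"
  mkdir -p "$npm_config_tmp" "$npm_config_cache"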

jbergstroem commented 8 years ago

@gibfahn Both sound good; I was just thinking we should make sure it doesn't use /tmp (or any other disk that's small).

mhdawson commented 8 years ago

Agreed, it's best if we can make sure not to use /tmp and instead use one of the other, larger disks. If there's no other option, however, we can increase the size of /tmp.
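
If it comes to that, a sketch of how /tmp could be grown on AIX (assumes the volume group has free partitions; the increment is illustrative):

  # Grow the /tmp filesystem by 512MB; chfs extends the underlying
  # logical volume automatically if the volume group has free space.
  chfs -a size=+512M /tmp
  df -m /tmp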

gdams commented 7 years ago

This has now been rectified by skipping the time module, which was taking nearly 45 minutes to run on AIX.

gdams commented 7 years ago

This is still an issue; the machine is still getting a lot of npm install timeouts when running CITGM.

maclover7 commented 6 years ago

Seems to be largely fixed in latest CITGM builds, closing for now. Please reopen if still needed.