Snapshot of perf data while CITGM was running:
Topas Monitor for host: power8-nodejs2 EVENTS/QUEUES FILE/TTY
Thu Nov 17 14:41:33 2016 Interval: 2 Cswitch 383 Readch 13887
Syscall 277 Writech 231
CPU User% Kern% Wait% Idle% Physc Entc Reads 47 Rawin 0
ALL 1.5 2.8 0.0 95.7 0.08 7.8 Writes 45 Ttyout 76
Forks 0 Igets 0
Network KBPS I-Pack O-Pack KB-In KB-Out Execs 0 Namei 73
Total 1.9 18.5 16.5 0.9 1.0 Runqueue 0.6 Dirblk 0
Waitqueue 0.0
Disk Busy% KBPS TPS KB-Read KB-Writ MEMORY
Total 32.0 356.0 89.0 0.0 356.0 PAGING Real,MB 32768
Faults 22 % Comp 14
FileSystem KBPS TPS KB-Read KB-Writ Steals 0 % Noncomp 9
Total 13.5 4.2 13.5 0.0 PgspIn 0 % Client 9
PgspOut 0
Name PID CPU% PgSp Owner PageIn 0 PAGING SPACE
node 7274518 4.2 453.3 iojs PageOut 0 Size,MB 512
java 6619362 0.0 48.0 iojs Sios 0 % Used 3
syncd 2752684 0.0 0.6 root % Free 97
getty 6422728 0.0 0.7 root NFS (calls/sec)
gil 2293830 0.0 0.9 root SerV2 0 WPAR Activ 0
IBM.Mgmt 7405802 0.0 5.8 root CliV2 0 WPAR Total 0
j2pg 2818138 0.0 8.1 root SerV3 0 Press: "h"-help
rmcd 5243066 0.0 12.2 root CliV3 0 "q"-quit
xmgc 917532 0.0 0.4 root SerV4 0
clcomd 5701812 0.0 1.8 root CliV4 0
topasrec 6357190 0.0 1.4 root
rpc.lock 5046456 0.0 1.2 root
Does not look to be CPU constrained, at least in the snapshot I grabbed and from having watched the output for a few minutes.
Nothing looks constrained: the disk busy%, network numbers, etc. don't seem excessive.
@georgeadams95 I think the next step is to figure out whether we believe it's slow consistently across the test or just while testing specific modules.
Once we know that, we can shorten up what we use as a recreate to help with the investigation.
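As a rough sketch of what a shortened recreate could look like (the module names here are only placeholders, not modules we know to be slow), timing individual CITGM runs per module should show whether the cost is spread evenly or concentrated in a few modules:
# Time a single-module citgm run; repeat for a handful of modules
# (lodash/express are placeholders; substitute whatever the full run flags as slow)
time citgm lodash
time citgm express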
@mhdawson I'll take a look tomorrow and see if I can see anything obvious.
bash-4.3$ vmstat -v
8388608 memory pages
8133856 lruable pages
6306949 free pages
4 memory pools
788729 pinned pages
80.0 maxpin percentage
3.0 minperm percentage
90.0 maxperm percentage
11.8 numperm percentage
961457 file pages
0.0 compressed percentage
0 compressed pages
11.8 numclient percentage
90.0 maxclient percentage
961457 client pages
0 remote pageouts scheduled
12889 pending disk I/Os blocked with no pbuf
0 paging space I/Os blocked with no psbuf
2228 filesystem I/Os blocked with no fsbuf
3093 client filesystem I/Os blocked with no fsbuf
220085 external pager filesystem I/Os blocked with no fsbuf
13.4 percentage of memory used for computational pages
bash-4.3$ df -m
Filesystem MB blocks Free %Used Iused %Iused Mounted on
/dev/hd4 512.00 223.35 57% 17716 26% /
/dev/hd2 2880.00 65.11 98% 52092 70% /usr
/dev/hd9var 576.00 220.30 62% 6739 12% /var
/dev/hd3 128.00 76.95 40% 7592 21% /tmp
/dev/hd1 64.00 63.63 1% 7 1% /home2
/dev/hd11admin 128.00 127.63 1% 5 1% /admin
/proc - - - - - /proc
/dev/hd10opt 1792.00 889.16 51% 17190 8% /opt
/dev/livedump 256.00 255.64 1% 4 1% /var/adm/ras/livedump
/dev/fslv00 57216.00 47078.34 18% 317263 3% /home
/aha - - - 40 1% /aha
/tmp on both machines is 128MB, which seems very small.
ccache is definitely installed and in the path; in this build I'm using export CC=$(which gcc) and export CXX=$(which g++), which means builds look like:
/opt/freeware/bin/ccache/g++ -I../deps/gtest -I../deps/gtest/include -pthread -Wall -Wextra -Wno-unused-parameter -maix64 -Wno-missing-field-initializers -O3 -fno-omit-frame-pointer -fno-rtti -fno-exceptions -std=gnu++0x -MMD -MF /home/iojs/build/workspace/gibfahn-citgm-smoker-more-platforms/node/aix61-ppc64/out/Release/.deps//home/iojs/build/workspace/gibfahn-citgm-smoker-more-platforms/node/aix61-ppc64/out/Release/obj.target/gtest/deps/gtest/src/gtest-death-test.o.d.raw -c -o /home/iojs/build/workspace/gibfahn-citgm-smoker-more-platforms/node/aix61-ppc64/out/Release/obj.target/gtest/deps/gtest/src/gtest-death-test.o ../deps/gtest/src/gtest-death-test.cc
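For context, that presumably works via the usual ccache masquerade setup: /opt/freeware/bin/ccache/ would be a directory of compiler-named links to the ccache binary, so with it ahead of the real compilers on PATH, which gcc resolves to the wrapper. A minimal sketch of that assumed layout:
# Assumed layout: compiler-named links pointing at the ccache binary
ls -l /opt/freeware/bin/ccache/   # expect entries like gcc -> ccache, g++ -> ccache
export PATH=/opt/freeware/bin/ccache:$PATH
export CC=$(which gcc)            # now resolves to the ccache wrapper
export CXX=$(which g++)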
On the main CI build the build commands look like:
g++ -I../deps/gtest -I../deps/gtest/include -pthread -Wall -Wextra -Wno-unused-parameter -maix64 -Wno-missing-field-initializers -O3 -fno-omit-frame-pointer -fno-rtti -fno-exceptions -std=gnu++0x -MMD -MF /home/iojs/build/workspace/node-test-commit-aix/nodes/aix61-ppc64/out/Release/.deps//home/iojs/build/workspace/node-test-commit-aix/nodes/aix61-ppc64/out/Release/obj.target/gtest/deps/gtest/src/gtest-death-test.o.d.raw -c -o /home/iojs/build/workspace/node-test-commit-aix/nodes/aix61-ppc64/out/Release/obj.target/gtest/deps/gtest/src/gtest-death-test.o ../deps/gtest/src/gtest-death-test.cc
So I'm not sure whether they're using ccache or not (I can't currently see the configuration).
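One quick way to check, assuming shell access on the build machine, is to snapshot the ccache counters around a main-CI build and see whether the hit/miss numbers move:
# Snapshot the counters, let a node-test-commit-aix build run, then compare
/opt/freeware/bin/ccache/ccache -s > "$HOME/ccache-before.txt"
# ... wait for the build to finish ...
/opt/freeware/bin/ccache/ccache -s > "$HOME/ccache-after.txt"
diff "$HOME/ccache-before.txt" "$HOME/ccache-after.txt"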
@GeorgeAdams95 suggested that this vmstat configuration might not be optimal:
3.0 minperm percentage
90.0 maxperm percentage
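If we do decide to adjust those, the AIX way is normally vmo; a hedged sketch (the values shown are just the current ones, not a recommendation):
# Inspect the current VMM tunables
vmo -a | grep -E 'minperm%|maxperm%|maxclient%'
# Example of changing them persistently (illustrative values only)
vmo -p -o minperm%=3 -o maxperm%=90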
Running ccache manually on the machine, the cache directory looks odd:
bash-4.3$ /opt/freeware/bin/ccache/ccache -s
cache directory //.ccache
primary config //.ccache/ccache.conf
secondary config (readonly) /usr/local/etc/ccache.conf
cache hit (direct) 0
cache hit (preprocessed) 0
cache miss 0
files in cache 0
cache size 0.0 kB
max cache size 5.0 GB
Actually it looks fine from within a jenkins job:
+ /opt/freeware/bin/ccache/ccache -s
cache directory /home/iojs/.ccache
primary config /home/iojs/.ccache/ccache.conf
secondary config (readonly) /usr/local/etc/ccache.conf
cache hit (direct) 1882768
cache hit (preprocessed) 20257
cache miss 37345
called for link 22387
called for preprocessing 2712
compile failed 167
preprocessor error 26
couldn't find the compiler 5
bad compiler arguments 39
autoconf compile/link 624
no input file 1047
files in cache 129510
cache size 3.3 GB
max cache size 5.0 GB
Can confirm that ccache works as intended. Is /tmp/ used for something? If disks are slow, one way to solve it would be using a tmpfs for the npm stuff.
Thinking this output is related 😄
verbose: | Error: ENOSPC: no space left on device, write
@jbergstroem that is just because the /tmp drive is running low on space. It shouldn't be affecting the job speed, should it?
@GeorgeAdams95 I just noticed that the test was stalling and that was the last error. Perhaps unrelated.
Anyway, npm's tempdir is usually $HOME/tmp. Can we perhaps make sure we use that instead? That way we'll avoid the disk-full errors.
@jbergstroem Would you want to use $HOME/tmp or something in the workspace? I normally just use $WORKSPACE/npm-tmp and $WORKSPACE/npm-cache.
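In the Jenkins job that could look something like the sketch below, assuming the usual npm_config_* environment overrides (where exactly this goes in our job configuration is an open question):
# Point npm's cache and temp directories at the large workspace disk instead of /tmp
mkdir -p "$WORKSPACE/npm-tmp" "$WORKSPACE/npm-cache"
export npm_config_tmp="$WORKSPACE/npm-tmp"
export npm_config_cache="$WORKSPACE/npm-cache"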
@gibfahn both sound good; I was just thinking we should make sure it doesn't use /tmp (or any other disk that's small).
Agreed, it's best if we can make sure not to use /tmp and instead use one of the other, larger disks. If there is no other option, however, we can increase the size of /tmp.
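If growing /tmp does end up being the fallback, on AIX that should just be a chfs call against the filesystem; a sketch, with the increment purely as an example:
# Grow /tmp by 512 MB online (example increment only), then confirm
chfs -a size=+512M /tmp
df -m /tmp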
This has now been rectified by skipping the time module, which was taking nearly 45 minutes to run on AIX.
This is still an issue; the machine is still getting a lot of npm install timeouts running citgm.
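As a possible stopgap while we narrow down the cause, giving npm more retry headroom on this machine might reduce the timeout failures; a hedged sketch with illustrative values, and whether it helps depends on whether the timeouts are network-side or just the machine being slow:
# Loosen npm's fetch retry settings for the iojs user (values are illustrative)
npm config set fetch-retries 5
npm config set fetch-retry-maxtimeout 120000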
Seems to be largely fixed in the latest CITGM builds, closing for now. Please reopen if still needed.
@georgeadams95 is reporting that CITGM runs on AIX are taking multiple hours, whereas other platforms only take 30-50 minutes.
The regular regression runs on AIX are in the 20-minute range, which is more than Linux but in line with Windows/ARM. So we might expect CITGM to take a bit longer, but not as much longer as we are seeing.
This issue is to track our progress in narrowing down what the bottleneck might be.