Ditto. If we could identify the component that is responsible for the breakage, that would help a lot.
OK, what I see on build #95, which is the first one that really went wrong (well, it didn't fail the metrics bounds check - it failed to generate any metrics results.....) is: http://jenkins.katacontainers.io/job/kata-metrics-agent-ubuntu-16-04-PR/95/console
18:32:20 Error response from daemon: Could not kill running container 49ab826ba574979ae216f2ba05ff906edc60ba9eca2bb34a181a523b7458b3fd, cannot remove - Cannot kill container 49ab826ba574979ae216f2ba05ff906edc60ba9eca2bb34a181a523b7458b3fd: unknown error after kill: /usr/local/bin/kata-runtime did not terminate sucessfully: context deadline exceeded
If I then look on the physical build machine:
root@kata-metric1:~# ps -ef | fgrep containerd-shim | wc
63 1189 24513
root@kata-metric1:~# ps -ef | fgrep -i 49ab
root 7495 6903 0 11:35 pts/0 00:00:00 grep -F --color=auto -i 49ab
root 17294 1 0 Dec03 ? 00:00:06 docker-containerd-shim -namespace moby -workdir /var/lib/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/49ab826ba574979ae216f2ba05ff906edc60ba9eca2bb34a181a523b7458b3fd -address /var/run/docker/containerd/docker-containerd.sock -containerd-binary /usr/bin/docker-containerd -runtime-root /var/run/docker/runtime-kata-runtime -debug
So, we have a whole bunch of containerd-shims left around from that build over 2 days ago. It's not exactly clear to me if they are the root issue. docker shows no containers active or stopped.
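For reference, a rough way to map each leftover shim back to its container (only a sketch, based on the -workdir argument visible in the ps output above - not part of the CI scripts):
# sketch: list leftover docker-containerd-shim processes and the container
# ID each one belongs to, pulled out of the -workdir argument
for pid in $(pgrep -f docker-containerd-shim); do
    args=$(ps -o args= -p "${pid}")
    workdir=$(echo "${args}" | grep -o -- '-workdir [^ ]*' | awk '{print $2}')
    echo "shim pid ${pid} -> container $(basename "${workdir}")"
done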
I can't see that this would change the boot time, but the shim has grown quite a bit in size recently:
# Build the shim at each of the last few commits, keeping a copy of each
# binary named after the commit hash and author date.
i=1
while [ $i -lt 10 ]
do
    make clean && make
    version=$(git show -s HEAD --format="%h %aI" | sed 's/T.*$//g' | tr ' ' '-')
    mv kata-shim "kata-shim-${version}"
    git reset --hard HEAD^
    i=$((i+1))
done
Having run the above:
$ ls -hl kata-shim-*|awk '{print $5, $NF}' | sort -t- -k5,5n -k6,6n
15M kata-shim-41b0fe8-2018-09-06
15M kata-shim-9b2891c-2018-09-12
15M kata-shim-5fbf1f0-2018-09-27
15M kata-shim-7816c4e-2018-11-02
18M kata-shim-efe185f-2018-11-06
18M kata-shim-5609963-2018-11-13
18M kata-shim-b02868b-2018-11-23
19M kata-shim-3b9408a-2018-12-03
21M kata-shim-6266dea-2018-12-05
Note the combined jump in December caused by the revendoring and trace support commits (last 2 in the list above).
Errm, I'm confused - I thought we cleaned the environment between metrics PR runs?
Rebooted - came up fine. I have nudged a rebuild of the last failed metrics test: http://jenkins.katacontainers.io/job/kata-metrics-runtime-ubuntu-16-04-PR/198/
(Cleaned, as opposed to creating an entirely new environment - that was due to system performance issues, wasn't it?)
I wonder if we could get some sort of monitoring system set up on that machine. We could run sar or nagios or similar before each metrics run and, if they find anything unusual, ping a message to IRC/ML?
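Something along these lines, perhaps - purely a sketch, assuming the sysstat package (for sar) is installed; the load and shim thresholds below are invented and would need tuning per machine:
# illustrative pre-run health check - thresholds and paths are made up
load=$(awk '{print $1}' /proc/loadavg)
shims=$(pgrep -c -f docker-containerd-shim || true)
# keep a short CPU utilisation snapshot around for later comparison
sar -u 1 3 > "/tmp/pre-run-sar-$(date +%s).log"
if [ "${shims}" -gt 0 ] || awk -v l="${load}" 'BEGIN{exit !(l > 2.0)}'; then
    echo "WARNING: machine not idle (load ${load}, ${shims} stale shims) - ping IRC/ML" >&2
fi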
@jodh-intel - we try very hard to clean the environment, yes - see https://github.com/kata-containers/tests/blob/master/.ci/x86_64/clean_up_x86_64.sh - but it is pretty hard to guarantee we have cleaned everything. Looks like we have found the next case that we don't currently handle - a hung docker component... @chavafg @Pennyzct for any more ideas...
As for monitoring - it is probably a question of how much we monitor. We try to get everything clean, and we have some 'sanity checks' in the tests now, so we might in theory spot and fail if there are kata items still around that should not be. But where do we draw the line? We were not expecting docker components, for instance...
We may be moving some of this infra to Zuul, and there I believe we can have it deploy the packet.net bare machines on demand - so, fingers crossed, maybe this pain goes away... maybe.
The reasons we clean instead of using a new clean machine/environment today are twofold:
And metrics builds look to be back to passing
http://jenkins.katacontainers.io/computer/x86_packet01/builds
Seems those containerd-shims or something related were the issue.
I'll have a look/think about whether we can handle that case in cleanup before I close this Issue.
Nice. I'm tempted to suggest that between runs (and after checking for rogue processes / stale mounts) we:
We try a bunch of stuff - we run this before each PR run on the machine: https://github.com/kata-containers/tests/blob/master/.ci/lib.sh#L249-L271
gen_clean_arch() {
    # Set up some vars
    stale_process_union=( "docker-containerd-shim" )
    #docker supports different storage driver, such like overlay2, aufs, etc.
    docker_storage_driver=$(timeout ${KATA_DOCKER_TIMEOUT} docker info --format='{{.Driver}}')
    stale_docker_mount_point_union=( "/var/lib/docker/containers" "/var/lib/docker/${docker_storage_driver}" )
    stale_docker_dir_union=( "/var/lib/docker" )
    stale_kata_dir_union=( "/var/lib/vc" "/run/vc" )

    info "kill stale process"
    kill_stale_process
    info "delete stale docker resource under ${stale_docker_dir_union[@]}"
    delete_stale_docker_resource
    info "delete stale kata resource under ${stale_kata_dir_union[@]}"
    delete_stale_kata_resource
    info "Remove installed kata packages"
    ${GOPATH}/src/${tests_repo}/cmd/kata-manager/kata-manager.sh remove-packages
    info "Remove installed kubernetes packages and configuration"
    if [ "$ID" == ubuntu ]; then
        sudo rm -rf /etc/systemd/system/kubelet.service.d
        sudo apt-get purge kubeadm kubelet kubectl -y
    fi
}
but every now and then we find a case we don't cover - this seems to be one. I think from that code, we are even trying to kill those shim processes already - maybe they were really really stuck and zombied or similar :-( I have a feeling we can never be 100% certain we can always get a machine back to a fully clean state without a re-boot or re-install/deploy.
Agreed. Oh for a quick-to-reboot system and a snapshotting FS, so you could just "undo" all the changes the last PR run made.
There might be something not quite right with our cleanup script. It tries to sudo kill -9 the dangling processes - doesn't get much more brutal than that - but I see this in the logs on the failed jobs:
10:47:50 INFO: kill stale process
10:47:50 kill: failed to parse argument: '12847
10:47:50 13024
10:47:50 13205
10:47:50 13428
10:47:50 13604
10:47:50 13716
10:47:50 13819
10:47:50 13911
10:47:50 14030
10:47:50 14107
10:47:50 14242
10:47:50 14292
10:47:50 14303
10:47:50 14457
10:47:50 14491
10:47:50 14493
10:47:50 14666
10:47:50 14674
10:47:50 14734
10:47:50 14857
10:47:50 14859
10:47:50 14908
10:47:50 15037
10:47:50 15099
10:47:50 15105
10:47:50 15235
10:47:50 15303
10:47:50 15317
10:47:50 15420
10:47:50 15517
10:47:50 15522
10:47:50 15591
10:47:50 15726
10:47:50 15736
10:47:50 15800
10:47:50 15911
10:47:50 15960
10:47:50 15977
10:47:50 16108
10:47:50 16151
10:47:50 16202
10:47:50 16338
10:47:50 16342
10:47:50 16418
10:47:50 16532
10:47:50 16548
10:47:50 16647
10:47:50 16712
10:47:50 16739
10:47:50 16857
10:47:50 16952
10:47:50 16969
10:47:50 17103
10:47:50 17142
10:47:50 17294
10:47:50 17349
10:47:50 17553
10:47:50 17779
10:47:50 17986
10:47:50 18184
10:47:50 21506
10:47:50 28309'
10:47:50 INFO: delete stale docker resource under /var/lib/docker
10:47:53 INFO: delete stale kata resource under /var/lib/vc /run/vc
10:47:53 INFO: Remove installed kata packages
10:47:53 Reading package lists...
So, something is not quite right already. I also think we can make it a little more robust - by checking, after a short nap, that the nominally killed processes have actually gone away, and aborting if they have not. At least that way we might realise earlier that the slave is borked.
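Something like this, for instance - a rough sketch of the kill/nap/re-check idea against the existing stale_process_union list, not the actual kill_stale_process() code:
# sketch only: kill the known stale processes, take a short nap, then
# verify they are really gone and bail out loudly if not
for process in "${stale_process_union[@]}"; do
    pids=$(pgrep -f "${process}" || true)
    [ -n "${pids}" ] && sudo kill -9 ${pids}
done
sleep 5
for process in "${stale_process_union[@]}"; do
    if pgrep -f "${process}" > /dev/null; then
        echo "ERROR: ${process} still alive after kill -9 - slave needs attention" >&2
        exit 1
    fi
done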
I think if I flip the pgrep delimiter to a space, that might help... let me raise an Issue and test that out.
That looks like a string quoting issue - possibly a missing eval?
Maybe it can be fixed with an eval and some adding/removing of quoting - but I think it is easier to space-delimit the PIDs in the first place and drop the quotes around the pid list expansion (it will always just be a list of space-separated numbers). See https://github.com/kata-containers/tests/pull/975 and see what you think.
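To make the failure mode concrete (an illustration, not the exact code in the tests repo):
# broken: pgrep's default delimiter is a newline, and quoting the expansion
# hands kill one giant multi-line argument - hence "failed to parse argument"
pids=$(pgrep -f docker-containerd-shim)
sudo kill -9 "${pids}"

# roughly the fix in kata-containers/tests#975: space-delimit the list and
# drop the quotes, so each pid becomes its own argument
pids=$(pgrep -d ' ' -f docker-containerd-shim)
[ -n "${pids}" ] && sudo kill -9 ${pids}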
This is stale. We'll open a new one if we need to tweak and track more.
Metrics CI is failing almost constantly: http://jenkins.katacontainers.io/computer/x86_packet01/builds and all failures appear to be similar:
(Note, it also sometimes fails the memory-footprint check - that 86.6% is pretty near the 85% cutoff...)
This smells like something having regressed (rather than the build machine being broken). I have a local script (which I'll PR one day) that tries to gather up the git history of key repos to help see what happened when. Here is the script as it stands today:
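(What follows is only a rough sketch of the idea, not the actual script - the repo list, date, and GOPATH layout are assumptions.)
#!/bin/bash
# sketch: print a date-sorted list of recent commits across the key kata
# repos so they can be lined up against the CI failure timeline
repos="runtime agent shim proxy tests"
since="2018-11-25"
for r in ${repos}; do
    dir="${GOPATH}/src/github.com/kata-containers/${r}"
    [ -d "${dir}" ] || continue
    git -C "${dir}" log --since="${since}" --pretty="tformat:%aI ${r} %h %s"
done | sort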
From the jenkins slave status page, it looks like things have been very broken since around 03-Dec-2018 18:24:19. Here is the top of the sorted output... Nothing there is really leaping out at me as a major change - not even the re-vendoring or addition of tracing, in my mind.
I'll have a peek at the JSON results files from the fails to see if they give any clues...
/cc @jodh-intel @sboeuf @bergwolf @GabyCT @chavafg