metrics: CI failing, may need bounds re-adjustment #94

Closed grahamwhaley closed 5 years ago

grahamwhaley commented 5 years ago

The metrics CI is failing almost constantly (see http://jenkins.katacontainers.io/computer/x86_packet01/builds), and all the failures look similar:

09:48:13 Report Summary:
09:48:13 +-----+----------------------+-------+--------+--------+-------+--------+--------+------+------+-----+
09:48:13 | P/F |         NAME         |  FLR  |  MEAN  |  CEIL  |  GAP  |  MIN   |  MAX   | RNG  | COV  | ITS |
09:48:13 +-----+----------------------+-------+--------+--------+-------+--------+--------+------+------+-----+
09:48:13 | *F* | boot-times           | 95.0% | 106.7% | 105.0% | 10.0% | 103.5% | 109.0% | 5.3% | 1.4% |  20 |
09:48:13 | P   | memory-footprint     | 85.0% | 86.6%  | 115.0% | 30.0% | 86.6%  | 86.6%  | 0.0% | 0.0% |   1 |
09:48:13 | P   | memory-footprint-ksm | 95.0% | 101.2% | 105.0% | 10.0% | 101.2% | 101.2% | 0.0% | 0.0% |   1 |
09:48:13 +-----+----------------------+-------+--------+--------+-------+--------+--------+------+------+-----+

(note, it sometimes fails the memory-footprint check as well - that 86.6% is pretty near the 85% cutoff...)
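As a rough illustration of how the P/F column relates to the FLR, MEAN and CEIL columns, here is a minimal sketch. It assumes the check is simply "floor <= mean <= ceil", which is consistent with the three rows above but is not taken from the checker's source. The helper name `check_bounds` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper (not part of the metrics tooling): print "P" if
# floor <= mean <= ceil, otherwise "*F*". All values are percentages of
# the baseline, as in the FLR / MEAN / CEIL report columns.
check_bounds() {
        local floor="$1" mean="$2" ceil="$3"
        # awk is used because the shell itself has no floating point comparison
        awk -v f="$floor" -v m="$mean" -v c="$ceil" \
                'BEGIN { if (m >= f && m <= c) print "P"; else print "*F*" }'
}

check_bounds 95.0 106.7 105.0   # boot-times row: prints *F*
check_bounds 85.0 86.6 115.0    # memory-footprint row: prints P
```

Under that assumption, the memory-footprint mean of 86.6% passes with only 1.6 points of margin above the 85% floor, which is why it tips over into a fail now and then.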

This smells like something regressed (rather than the build machine being broken). I have a local script (which I'll PR one day) that tries to gather up the git history of key repos to help see what happened when. Here is the script as it stands today:


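#!/bin/bash
#set -x
set -e

# NOTE: repo_base, REPORTFILE and repopath were not defined in the snippet as
# posted. The values used here are assumptions; override them as needed.
repo_base="${repo_base:-github.com/kata-containers}"
REPORTFILE="${REPORTFILE:-commits.txt}"

# Date range to report on (set START in the environment to override)
START="${START:-$(date -I -d "last week")}"
END=$(date -I)

repos="${repo_base}/runtime \
        ${repo_base}/proxy \
        ${repo_base}/shim \
        ${repo_base}/tests \
        ${repo_base}/agent"

# Print a message to stdout and append it to the report file
msg() {
        local msg="$*"
        echo "${msg}" | tee -a "${REPORTFILE}"
}

# Blank the file
echo "" > "${REPORTFILE}"

msg "===================================="
msg "Commits from $START to $END"
msg "===================================="
for repo in $repos; do
        # Assumption: the repos are checked out under the GOPATH source tree
        repopath="${GOPATH}/src/${repo}"
        (go get -d -u ${repo} || true)
        git -C "${repopath}" checkout master
        git -C "${repopath}" pull >/dev/null
        msg "---------- $repo ---------------"
        TZ=UTC git -C "${repopath}" log --since "$START" --until "$END" --pretty="%h: %cd: ${repo##*/}: %s" --date=format:"%Y-%m-%dT%H:%M:%S" --no-merges | tee -a "${REPORTFILE}"
        msg ""
done

# Sort all commits, newest first, by the timestamp in the second field
sort -r -k 2 < "${REPORTFILE}" > "sorted_${REPORTFILE}"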

From the jenkins slave status page, it looks like things have been very broken since around 03-Dec-2018 18:24:19. Here is the top of the sorted output...

Nothing there is really leaping out at me as a major change - not even the re-vendoring or addition of tracing in my mind.

I'll have a peek at the JSON results files from the fails to see if they give any clues...

/cc @jodh-intel @sboeuf @bergwolf @GabyCT @chavafg

jodh-intel commented 5 years ago

Ditto. If we could identify the component that is responsible for the breakage, that would help a lot.

grahamwhaley commented 5 years ago

OK, what I see on build #95, which is the first one that really went wrong (well, it didn't fail the metrics bounds check - it failed to generate any metrics results.....) is: http://jenkins.katacontainers.io/job/kata-metrics-agent-ubuntu-16-04-PR/95/console

18:32:20 Error response from daemon: Could not kill running container 49ab826ba574979ae216f2ba05ff906edc60ba9eca2bb34a181a523b7458b3fd, cannot remove - Cannot kill container 49ab826ba574979ae216f2ba05ff906edc60ba9eca2bb34a181a523b7458b3fd: unknown error after kill: /usr/local/bin/kata-runtime did not terminate sucessfully: context deadline exceeded

If I then look on the physical build machine:

root@kata-metric1:~# ps -ef | fgrep containerd-shim | wc
     63    1189   24513
root@kata-metric1:~# ps -ef | fgrep -i 49ab
root      7495  6903  0 11:35 pts/0    00:00:00 grep -F --color=auto -i 49ab
root     17294     1  0 Dec03 ?        00:00:06 docker-containerd-shim -namespace moby -workdir /var/lib/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/49ab826ba574979ae216f2ba05ff906edc60ba9eca2bb34a181a523b7458b3fd -address /var/run/docker/containerd/docker-containerd.sock -containerd-binary /usr/bin/docker-containerd -runtime-root /var/run/docker/runtime-kata-runtime -debug

So, we have a whole bunch of containerd-shims left around from that build over 2 days ago. Not exactly clear to me if they are the root issue. docker shows no containers active or stopped.

jodh-intel commented 5 years ago

I can't see that this would change the boot time, but the shim has grown quite a bit in size recently:

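# Build the shim at each of the last 10 commits, keeping every binary
i=0
while [ $i -lt 10 ]
do
        make clean && make
        # --no-patch: print only the formatted header, not the commit diff
        version=$(git show --no-patch HEAD --format="%h %aI"|sed 's/T.*$//g'|tr ' ' '-')
        mv kata-shim "kata-shim-${version}"

        git reset --hard HEAD^
        i=$((i+1))
done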


Having run the above:

$ ls -hl kata-shim-*|awk '{print $5, $NF}' | sort -t- -k5,5n -k6,6n
15M kata-shim-41b0fe8-2018-09-06
15M kata-shim-9b2891c-2018-09-12
15M kata-shim-5fbf1f0-2018-09-27
15M kata-shim-7816c4e-2018-11-02
18M kata-shim-efe185f-2018-11-06
18M kata-shim-5609963-2018-11-13
18M kata-shim-b02868b-2018-11-23
19M kata-shim-3b9408a-2018-12-03
21M kata-shim-6266dea-2018-12-05

Note the combined jump in December caused by the revendoring and trace support commits (last 2 in the list above).

jodh-intel commented 5 years ago

Errm, I'm confused - I thought we cleaned the environment between metrics PR runs?

grahamwhaley commented 5 years ago

Rebooted - came up fine. I have nudged a rebuild of the last failed metrics test: http://jenkins.katacontainers.io/job/kata-metrics-runtime-ubuntu-16-04-PR/198/

jodh-intel commented 5 years ago

(Cleaned, that is, as opposed to creating an entirely new environment. That was due to system performance issues, wasn't it?)

jodh-intel commented 5 years ago

I wonder if we could get some sort of monitoring system set up on that machine. We could run sar or nagios or similar before each metrics run and, if it finds anything unusual, ping a message to irc/ML?

grahamwhaley commented 5 years ago

@jodh-intel - we try very hard to clean the environment yes - see https://github.com/kata-containers/tests/blob/master/.ci/x86_64/clean_up_x86_64.sh - but, it is pretty hard to guarantee we have cleaned everything. Looks like we have found the next case that we don't currently handle - a hung up docker component.... @chavafg @Pennyzct for any more ideas...

As for monitoring - it is probably a question of how much we monitor. We try to get everything clean, and we have some 'sanity checks' in the tests now, so in theory we might spot and fail if there are kata items still around that should not be. But where do we draw the line? We were not expecting docker components, for instance...
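To make the 'sanity check' idea concrete, here is a minimal sketch of a pre-run check that fails when a process listing still mentions names that should be gone. The function name `check_no_stale` is hypothetical and this is not the check the tests repo actually uses. It reads the listing from stdin so that it can be fed from `ps` or from a saved log.

```shell
#!/bin/bash
# Hypothetical pre-run check: read a process listing on stdin and return
# non-zero if any line mentions one of the names given as arguments.
check_no_stale() {
        local listing name count status=0
        listing="$(cat)"
        for name in "$@"; do
                # grep -c prints 0 and exits 1 on no match, hence the || true
                count=$(printf '%s\n' "$listing" | grep -c -F -- "$name" || true)
                if [ "$count" -gt 0 ]; then
                        echo "ERROR: ${count} stale '${name}' process(es) found" >&2
                        status=1
                fi
        done
        return "$status"
}

# Example use before a metrics run (the list of names is illustrative):
#   ps -eo args= | check_no_stale docker-containerd-shim kata-shim
```

The list of names is the weak point: a check like this only catches what we already thought to look for, which is exactly the "where do we draw the line" problem.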

We may be moving some of this infra to Zuul, and there I believe we can have it deploy the packet.net bare machines on demand - so, fingers crossed, maybe this pain goes away... maybe.

grahamwhaley commented 5 years ago

The reasons we clean today, instead of using a new clean machine/environment, are twofold:

grahamwhaley commented 5 years ago

And metrics builds look to be back to passing: http://jenkins.katacontainers.io/computer/x86_packet01/builds. It seems those containerd-shims, or something related, were the issue. I'll have a look/think about whether we can handle that case in cleanup before I close this Issue.

jodh-intel commented 5 years ago

Nice. I'm tempted to suggest that between runs (and after checking for rogue processes / stale mounts) we:

grahamwhaley commented 5 years ago

We try a bunch of stuff - we run this before each PR run on the machine: https://github.com/kata-containers/tests/blob/master/.ci/lib.sh#L249-L271

gen_clean_arch() {
    # Set up some vars
    stale_process_union=( "docker-containerd-shim" )
    #docker supports different storage driver, such like overlay2, aufs, etc.
    docker_storage_driver=$(timeout ${KATA_DOCKER_TIMEOUT} docker info --format='{{.Driver}}')
    stale_docker_mount_point_union=( "/var/lib/docker/containers" "/var/lib/docker/${docker_storage_driver}" )
    stale_docker_dir_union=( "/var/lib/docker" )
    stale_kata_dir_union=( "/var/lib/vc" "/run/vc" )

    info "kill stale process"
    info "delete stale docker resource under ${stale_docker_dir_union[@]}"
    info "delete stale kata resource under ${stale_kata_dir_union[@]}"
    info "Remove installed kata packages"
    ${GOPATH}/src/${tests_repo}/cmd/kata-manager/kata-manager.sh remove-packages
    info "Remove installed kubernetes packages and configuration"
    if [ "$ID" == ubuntu ]; then
        sudo rm -rf /etc/systemd/system/kubelet.service.d
        sudo apt-get purge kubeadm kubelet kubectl -y
    fi
}

but every now and then we find a case we don't cover - this seems to be one. I think from that code, we are even trying to kill those shim processes already - maybe they were really really stuck and zombied or similar :-( I have a feeling we can never be 100% certain we can always get a machine back to a fully clean state without a re-boot or re-install/deploy.

jodh-intel commented 5 years ago

Agreed. Oh for a quick-to-reboot system and a snapshotting FS so that you can just "undo" all the changes the last PR run made.

grahamwhaley commented 5 years ago

There might be something not quite right with our cleanup script. It tries to sudo kill -9 the dangling processes - doesn't get much more brutal than that - but I see this in the logs on the failed jobs:

10:47:50 INFO: kill stale process
10:47:50 INFO: delete stale docker resource under /var/lib/docker
10:47:53 INFO: delete stale kata resource under /var/lib/vc /run/vc
10:47:53 INFO: Remove installed kata packages
10:47:53 Reading package lists...

So, something not quite right already. Then, I think we can try to make it a little more robust as well - by maybe checking after a short nap that the nominally killed processes have actually gone away - and abort if they have not - at least that way we might find out or realise earlier the slave is borked.
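A minimal sketch of that "kill, nap, then verify" idea follows. The function name `kill_and_verify` and the `NAP_SECONDS` variable are hypothetical, and the sketch omits the `sudo` that the real cleanup script uses. It relies on `kill -0`, which sends no signal and only tests whether the process still exists.

```shell
#!/bin/bash
# Hypothetical helper: SIGKILL the given PIDs, wait briefly, then return
# non-zero if any of them still exists (a zombie also counts as existing).
kill_and_verify() {
        local nap="${NAP_SECONDS:-2}"
        local pid survivors=""

        # Nothing to do if no PIDs were passed
        [ "$#" -gt 0 ] || return 0

        kill -9 "$@" 2>/dev/null || true
        sleep "$nap"

        for pid in "$@"; do
                # kill -0 delivers no signal, it only checks the PID is still there
                if kill -0 "$pid" 2>/dev/null; then
                        survivors="${survivors} ${pid}"
                fi
        done

        if [ -n "$survivors" ]; then
                echo "ERROR: processes still alive after kill:${survivors}" >&2
                return 1
        fi
}
```

A cleanup step could then abort the job as soon as `kill_and_verify` returns non-zero, rather than carrying on and producing skewed metrics.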

grahamwhaley commented 5 years ago

I think if I flip the pgrep delimiter to a (space), that might help... let me raise an Issue and test that out

jodh-intel commented 5 years ago

That looks like a string quoting issue - possibly a missing eval?

grahamwhaley commented 5 years ago

Maybe it can be fixed with an eval and some add/remove of quoting - but I think it is easier to space delimit the PIDs in the first place and drop the quotes around the pid list expansion (it will always just be a list of space separated numbers). See https://github.com/kata-containers/tests/pull/975 and see what you think.
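The sketch below shows why the quoting matters once the PIDs are space delimited (which `pgrep -d ' '` can produce). The exact failing line from the cleanup script is not shown in this thread, so this illustrates shell word splitting in general rather than reproducing the bug. `count_args` is a hypothetical stand-in for `kill`, and the PIDs are made up.

```shell
#!/bin/bash
# Hypothetical stand-in for kill: report how many arguments were received.
count_args() {
        echo "$#"
}

# Made-up PIDs, in the shape `pgrep -d ' ' <name>` would print them
pids="101 102 103"

count_args "$pids"      # quoted: one argument, the whole string (prints 1)
count_args $pids        # unquoted: split on spaces into three PIDs (prints 3)
```

Leaving an expansion unquoted is normally something to avoid, but here the value can only ever be digits and spaces, so word splitting is exactly the behaviour wanted.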

grahamwhaley commented 5 years ago

This is stale. We'll open a new one if we need to tweak and track more.