Ditto. If we could identify the component that is responsible for the breakage, that would help a lot.
OK, what I see on build #95, which is the first one that really went wrong (well, it didn't fail the metrics bounds check - it failed to generate any metrics results.....) is: http://jenkins.katacontainers.io/job/kata-metrics-agent-ubuntu-16-04-PR/95/console
18:32:20 Error response from daemon: Could not kill running container 49ab826ba574979ae216f2ba05ff906edc60ba9eca2bb34a181a523b7458b3fd, cannot remove - Cannot kill container 49ab826ba574979ae216f2ba05ff906edc60ba9eca2bb34a181a523b7458b3fd: unknown error after kill: /usr/local/bin/kata-runtime did not terminate sucessfully: context deadline exceeded
If I then look on the physical build machine:
root@kata-metric1:~# ps -ef | fgrep containerd-shim | wc
63 1189 24513
root@kata-metric1:~# ps -ef | fgrep -i 49ab
root 7495 6903 0 11:35 pts/0 00:00:00 grep -F --color=auto -i 49ab
root 17294 1 0 Dec03 ? 00:00:06 docker-containerd-shim -namespace moby -workdir /var/lib/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/49ab826ba574979ae216f2ba05ff906edc60ba9eca2bb34a181a523b7458b3fd -address /var/run/docker/containerd/docker-containerd.sock -containerd-binary /usr/bin/docker-containerd -runtime-root /var/run/docker/runtime-kata-runtime -debug
So, we have a whole bunch of containerd-shims left around from that build over 2 days ago. It's not exactly clear to me if they are the root issue. docker shows no containers active or stopped.
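For reference, a rough way to map each leftover shim back to its container (only a sketch, based on the -workdir argument visible in the ps output above - not part of the CI scripts):
# sketch: list leftover docker-containerd-shim processes and the container
# ID each one belongs to, pulled out of the -workdir argument
for pid in $(pgrep -f docker-containerd-shim); do
    args=$(ps -o args= -p "${pid}")
    workdir=$(echo "${args}" | grep -o -- '-workdir [^ ]*' | awk '{print $2}')
    echo "shim pid ${pid} -> container $(basename "${workdir}")"
done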
I can't see that this would change the boot time, but the shim has grown quite a bit in size recently:
# Build the shim at each of the last few commits, keeping a copy of each
# binary named after the commit hash and author date.
i=1
while [ $i -lt 10 ]
do
    make clean && make
    version=$(git show -s HEAD --format="%h %aI" | sed 's/T.*$//g' | tr ' ' '-')
    mv kata-shim "kata-shim-${version}"
    git reset --hard HEAD^
    i=$((i+1))
done
Having run the above:
$ ls -hl kata-shim-*|awk '{print $5, $NF}' | sort -t- -k5,5n -k6,6n
15M kata-shim-41b0fe8-2018-09-06
15M kata-shim-9b2891c-2018-09-12
15M kata-shim-5fbf1f0-2018-09-27
15M kata-shim-7816c4e-2018-11-02
18M kata-shim-efe185f-2018-11-06
18M kata-shim-5609963-2018-11-13
18M kata-shim-b02868b-2018-11-23
19M kata-shim-3b9408a-2018-12-03
21M kata-shim-6266dea-2018-12-05
Note the combined jump in December caused by the revendoring and trace support commits (last 2 in the list above).
Errm, I'm confused - I thought we cleaned the environment between metrics PR runs?
Rebooted - came up fine. I have nudged a rebuild of the last failed metrics test: http://jenkins.katacontainers.io/job/kata-metrics-runtime-ubuntu-16-04-PR/198/
(Cleaned, as opposed to creating an entirely new environment - that was due to system performance issues, wasn't it?)
I wonder if we could get some sort of monitoring system set up on that machine. We could run sar or nagios or similar before each metrics run and, if they find anything unusual, ping a message to IRC/ML?
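Something along these lines, perhaps - purely a sketch, assuming the sysstat package (for sar) is installed; the load and shim thresholds below are invented and would need tuning per machine:
# illustrative pre-run health check - thresholds and paths are made up
load=$(awk '{print $1}' /proc/loadavg)
shims=$(pgrep -c -f docker-containerd-shim || true)
# keep a short CPU utilisation snapshot around for later comparison
sar -u 1 3 > "/tmp/pre-run-sar-$(date +%s).log"
if [ "${shims}" -gt 0 ] || awk -v l="${load}" 'BEGIN{exit !(l > 2.0)}'; then
    echo "WARNING: machine not idle (load ${load}, ${shims} stale shims) - ping IRC/ML" >&2
fi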
@jodh-intel - we try very hard to clean the environment, yes - see https://github.com/kata-containers/tests/blob/master/.ci/x86_64/clean_up_x86_64.sh - but it is pretty hard to guarantee we have cleaned everything. Looks like we have found the next case that we don't currently handle - a hung docker component... @chavafg @Pennyzct for any more ideas...
As for monitoring - it is probably a question of how much we monitor. We try to get everything clean, and we have some 'sanity checks' in the tests now, so we might in theory spot and fail if there are kata items still around that should not be. But where do we draw the line? We were not expecting docker components, for instance...
We may be moving some of this infra to Zuul, and there I believe we can have it deploy the packet.net bare machines on demand - so, fingers crossed, maybe this pain goes away... maybe.
The reasons we clean instead of using a new clean machine/environment today are twofold:
And metrics builds look to be back to passing
http://jenkins.katacontainers.io/computer/x86_packet01/builds
Seems those containerd-shims or something related were the issue.
I'll have a look/think about whether we can handle that case in cleanup before I close this Issue.
Nice. I'm tempted to suggest that between runs (and after checking for rogue processes / stale mounts) we:
We try a bunch of stuff - we run this before each PR run on the machine: https://github.com/kata-containers/tests/blob/master/.ci/lib.sh#L249-L271
gen_clean_arch() {
    # Set up some vars
    stale_process_union=( "docker-containerd-shim" )
    #docker supports different storage driver, such like overlay2, aufs, etc.
    docker_storage_driver=$(timeout ${KATA_DOCKER_TIMEOUT} docker info --format='{{.Driver}}')
    stale_docker_mount_point_union=( "/var/lib/docker/containers" "/var/lib/docker/${docker_storage_driver}" )
    stale_docker_dir_union=( "/var/lib/docker" )
    stale_kata_dir_union=( "/var/lib/vc" "/run/vc" )

    info "kill stale process"
    kill_stale_process
    info "delete stale docker resource under ${stale_docker_dir_union[@]}"
    delete_stale_docker_resource
    info "delete stale kata resource under ${stale_kata_dir_union[@]}"
    delete_stale_kata_resource
    info "Remove installed kata packages"
    ${GOPATH}/src/${tests_repo}/cmd/kata-manager/kata-manager.sh remove-packages
    info "Remove installed kubernetes packages and configuration"
    if [ "$ID" == ubuntu ]; then
        sudo rm -rf /etc/systemd/system/kubelet.service.d
        sudo apt-get purge kubeadm kubelet kubectl -y
    fi
}
but every now and then we find a case we don't cover - this seems to be one. I think from that code, we are even trying to kill those shim processes already - maybe they were really really stuck and zombied or similar :-( I have a feeling we can never be 100% certain we can always get a machine back to a fully clean state without a re-boot or re-install/deploy.
Agreed. Oh for a quick-to-reboot system and a snapshotting FS, so you could just "undo" all the changes the last PR run made.
There might be something not quite right with our cleanup script. It tries to sudo kill -9 the dangling processes - doesn't get much more brutal than that - but I see this in the logs on the failed jobs:
10:47:50 INFO: kill stale process
10:47:50 kill: failed to parse argument: '12847
10:47:50 13024
10:47:50 13205
10:47:50 13428
10:47:50 13604
10:47:50 13716
10:47:50 13819
10:47:50 13911
10:47:50 14030
10:47:50 14107
10:47:50 14242
10:47:50 14292
10:47:50 14303
10:47:50 14457
10:47:50 14491
10:47:50 14493
10:47:50 14666
10:47:50 14674
10:47:50 14734
10:47:50 14857
10:47:50 14859
10:47:50 14908
10:47:50 15037
10:47:50 15099
10:47:50 15105
10:47:50 15235
10:47:50 15303
10:47:50 15317
10:47:50 15420
10:47:50 15517
10:47:50 15522
10:47:50 15591
10:47:50 15726
10:47:50 15736
10:47:50 15800
10:47:50 15911
10:47:50 15960
10:47:50 15977
10:47:50 16108
10:47:50 16151
10:47:50 16202
10:47:50 16338
10:47:50 16342
10:47:50 16418
10:47:50 16532
10:47:50 16548
10:47:50 16647
10:47:50 16712
10:47:50 16739
10:47:50 16857
10:47:50 16952
10:47:50 16969
10:47:50 17103
10:47:50 17142
10:47:50 17294
10:47:50 17349
10:47:50 17553
10:47:50 17779
10:47:50 17986
10:47:50 18184
10:47:50 21506
10:47:50 28309'
10:47:50 INFO: delete stale docker resource under /var/lib/docker
10:47:53 INFO: delete stale kata resource under /var/lib/vc /run/vc
10:47:53 INFO: Remove installed kata packages
10:47:53 Reading package lists...
So, something is not quite right already. I also think we can make it a little more robust - by checking, after a short nap, that the nominally killed processes have actually gone away, and aborting if they have not. At least that way we might realise earlier that the slave is borked.
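Something like this, for instance - a rough sketch of the kill/nap/re-check idea against the existing stale_process_union list, not the actual kill_stale_process() code:
# sketch only: kill the known stale processes, take a short nap, then
# verify they are really gone and bail out loudly if not
for process in "${stale_process_union[@]}"; do
    pids=$(pgrep -f "${process}" || true)
    [ -n "${pids}" ] && sudo kill -9 ${pids}
done
sleep 5
for process in "${stale_process_union[@]}"; do
    if pgrep -f "${process}" > /dev/null; then
        echo "ERROR: ${process} still alive after kill -9 - slave needs attention" >&2
        exit 1
    fi
done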
I think if I flip the pgrep delimiter to a space, that might help... let me raise an Issue and test that out.
That looks like a string quoting issue - possibly a missing eval?
Maybe it can be fixed with an eval and some adding/removing of quoting - but I think it is easier to space-delimit the PIDs in the first place and drop the quotes around the pid list expansion (it will always just be a list of space-separated numbers). See https://github.com/kata-containers/tests/pull/975 and see what you think.
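To make the failure mode concrete (an illustration, not the exact code in the tests repo):
# broken: pgrep's default delimiter is a newline, and quoting the expansion
# hands kill one giant multi-line argument - hence "failed to parse argument"
pids=$(pgrep -f docker-containerd-shim)
sudo kill -9 "${pids}"

# roughly the fix in kata-containers/tests#975: space-delimit the list and
# drop the quotes, so each pid becomes its own argument
pids=$(pgrep -d ' ' -f docker-containerd-shim)
[ -n "${pids}" ] && sudo kill -9 ${pids}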
This is stale. We'll open a new one if we need to tweak and track more.
Metrics CI is failing almost constantly: http://jenkins.katacontainers.io/computer/x86_packet01/builds and all failures appear to be similar:
(Note, it also sometimes fails the memory-footprint check - that 86.6% is pretty near the 85% cutoff...)
This smells like something having regressed (rather than the build machine being broken). I have a local script (which I'll PR one day) that tries to gather up the git history of key repos to help see what happened when. Here is the script as it stands today:
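(What follows is only a rough sketch of the idea, not the actual script - the repo list, date, and GOPATH layout are assumptions.)
#!/bin/bash
# sketch: print a date-sorted list of recent commits across the key kata
# repos so they can be lined up against the CI failure timeline
repos="runtime agent shim proxy tests"
since="2018-11-25"
for r in ${repos}; do
    dir="${GOPATH}/src/github.com/kata-containers/${r}"
    [ -d "${dir}" ] || continue
    git -C "${dir}" log --since="${since}" --pretty="tformat:%aI ${r} %h %s"
done | sort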
From the jenkins slave status page, it looks like things have been very broken since around 03-Dec-2018 18:24:19. Here is the top of the sorted output... Nothing there is really leaping out at me as a major change - not even the re-vendoring or addition of tracing, in my mind.
I'll have a peek at the JSON results files from the fails to see if they give any clues...
/cc @jodh-intel @sboeuf @bergwolf @GabyCT @chavafg