rohitsakala opened this issue 4 years ago
Hello @rohitsakala.
To maybe reproduce that issue, please let me know:
- o
Thanks!
Hi @hoegaarden,
[0] https://github.com/cloudfoundry-incubator/cf-operator-ci/blob/master/docs/concourse-deployment-steps.md
[1] https://ci.flintstone.cf.cloud.ibm.com/teams/quarks/pipelines/kind-test/jobs/kind/builds/1
[2] https://github.com/pivotal-k8s/kind-on-c#build-and-run-your-own-kubernetes-
Mh ... one thing I found is that running a cluster inside a task requires quite a lot of resources, so when the workers are not beefy enough and other pipelines are running on the same Concourse, I have seen flakes similar to that.
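One way to see how loaded your workers are (assuming your fly target is already set up):

# list all workers with their active container counts
fly -t <flyTarget> workers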
Could you maybe test with a single-node cluster? You could do that either by setting KIND_CONFIG on your task, or with a one-off task like this:
# write a single-node kind config
cat <<'EOF' > /tmp/kind.yml
kind: Cluster
apiVersion: kind.sigs.k8s.io/v1alpha3
nodes:
- role: control-plane
EOF

# run the kind-on-c task once with that config and a simple smoke test;
# fly execute picks up task params from the local environment
KIND_CONFIG="$(</tmp/kind.yml)" KIND_TESTS='kubectl get nodes -o wide' \
  fly -t <flyTarget> execute \
    --config kind.yaml \
    --privileged \
    --inputs-from 'kind-test/kind'
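For the first variant, setting KIND_CONFIG on the task in your pipeline, a minimal sketch could look like the following; the resource and job names are illustrative and the resource definitions are left out, see the kind-on-c README for the full wiring:

jobs:
- name: kind-single-node
  plan:
  - in_parallel:
    - get: kind-on-c      # git resource pointing at the kind-on-c repo
    - get: kind-release   # github-release resource for kind itself
  - task: run-kind
    privileged: true
    file: kind-on-c/kind.yaml
    params:
      KIND_TESTS: kubectl get nodes -o wide
      KIND_CONFIG: |
        kind: Cluster
        apiVersion: kind.sigs.k8s.io/v1alpha3
        nodes:
        - role: control-plane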
Currently, we run kind-on-c tests only on k8s (here) and it seems pretty stable (all of the recent failed & aborted runs are either fixed in kind-on-c or were infrastructure flakes). We also ran the same pipeline on a BOSH-deployed Concourse for a while, but we didn't see any fundamental difference and decommissioned that recently.
So I am not sure where your issue comes from. In the logs you provided I don't see anything standing out, but it might be useful to inspect the kubelet logs of the failed cluster (something like: docker exec -ti kind-control-plane journalctl -f -u kubelet).
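If the cluster is already gone by the time you get to look at it, kind itself can help keep the evidence around (a generic kind suggestion, nothing kind-on-c specific):

# keep the node containers after a failed create, with verbose output
kind create cluster --retain -v 6

# copy kubelet/containerd/serial logs from all nodes to a local directory
kind export logs /tmp/kind-logs

# or tail the kubelet unit directly on the retained control-plane node
docker exec kind-control-plane journalctl -u kubelet --no-pager | tail -n 100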
Hi @hoegaarden, I have also encountered this same issue in my Concourse deployment. To jump-start this discussion again, here is some additional context and logging from my environment.
kind create cluster fails with the following errors. It fails at the control-plane creation step, indicating that kubelet is not up and running:
 ✗ Starting control-plane 🕹️
ERROR: failed to create cluster: failed to init node with kubeadm: command "docker exec --privileged kind-control-plane kubeadm init --ignore-preflight-errors=all --config=/kind/kubeadm.conf --skip-token-print --v=6" failed with error: exit status 1
...
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get http://localhost:10248/healthz: dial tcp 127.0.0.1:10248: connect: connection refused.
systemctl indicates that the kubelet process is running:
docker exec -ti kind-control-plane systemctl status
● kind-control-plane
    State: running
     Jobs: 0 queued
   Failed: 0 units
    Since: Thu 2020-06-04 23:25:09 UTC; 2min 45s ago
   CGroup: /docker/9cdfcd0517038a3885423d143b6e23b7038d4fab39654f3c5650a4d35efa0b1c
           ├─539 systemctl status
           ├─544 pager
           ├─init.scope
           │ └─1 /sbin/init
           └─system.slice
             ├─systemd-journald.service
             │ └─69 /lib/systemd/systemd-journald
             ├─kubelet.service
             │ └─524 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --container-runtime=remote --container-runtime-endpoint=/run/containerd/containerd.sock --fail-swap-on=false --node-ip=172.17.0.2 --fail-swap-on=false
             └─containerd.service
               └─79 /usr/local/bin/containerd
This is also indicated by the kubelet's state directory under /var/lib, which has been modified recently:
# stat /var/lib/kubelet
File: /var/lib/kubelet
Size: 204 Blocks: 0 IO Block: 4096 directory
Device: 2000beh/2097342d Inode: 35912 Links: 1
Access: (0700/drwx------) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2020-06-05 00:08:51.298551513 +0000
Modify: 2020-06-05 00:09:02.118571364 +0000
Change: 2020-06-05 00:09:02.118571364 +0000
Birth: -
journalctl -f -u kubelet shows the following log snippet:
Jun 05 01:09:56 kind-control-plane kubelet[15769]: I0605 01:09:56.659121 15769 factory.go:170] Factory "raw" can handle container "/docker/fade85e441a38c12f4541f4ede401da21129e7864485d1060f8de7ca811f349b/system.slice/run-r5229eff0ab4f46ce8b0ee03fa4417af3.scope", but ignoring.
Jun 05 01:09:56 kind-control-plane kubelet[15769]: I0605 01:09:56.659130 15769 manager.go:908] ignoring container "/docker/fade85e441a38c12f4541f4ede401da21129e7864485d1060f8de7ca811f349b/system.slice/run-r5229eff0ab4f46ce8b0ee03fa4417af3.scope"
Jun 05 01:09:56 kind-control-plane kubelet[15769]: I0605 01:09:56.659142 15769 factory.go:177] Factory "containerd" was unable to handle container "/docker/fade85e441a38c12f4541f4ede401da21129e7864485d1060f8de7ca811f349b/system.slice/run-r744d16eb23134609a5f07d2fe3a37df7.scope"
Jun 05 01:09:56 kind-control-plane kubelet[15769]: I0605 01:09:56.659156 15769 factory.go:166] Error trying to work out if we can handle /docker/fade85e441a38c12f4541f4ede401da21129e7864485d1060f8de7ca811f349b/system.slice/run-r744d16eb23134609a5f07d2fe3a37df7.scope: /docker/fade85e441a38c12f4541f4ede401da21129e7864485d1060f8de7ca811f349b/system.slice/run-r744d16eb23134609a5f07d2fe3a37df7.scope not handled by systemd handler
Jun 05 01:09:56 kind-control-plane kubelet[15769]: I0605 01:09:56.659160 15769 factory.go:177] Factory "systemd" was unable to handle container "/docker/fade85e441a38c12f4541f4ede401da21129e7864485d1060f8de7ca811f349b/system.slice/run-r744d16eb23134609a5f07d2fe3a37df7.scope"
Jun 05 01:09:56 kind-control-plane kubelet[15769]: I0605 01:09:56.659167 15769 factory.go:170] Factory "raw" can handle container "/docker/fade85e441a38c12f4541f4ede401da21129e7864485d1060f8de7ca811f349b/system.slice/run-r744d16eb23134609a5f07d2fe3a37df7.scope", but ignoring.
Jun 05 01:09:56 kind-control-plane kubelet[15769]: I0605 01:09:56.659175 15769 manager.go:908] ignoring container "/docker/fade85e441a38c12f4541f4ede401da21129e7864485d1060f8de7ca811f349b/system.slice/run-r744d16eb23134609a5f07d2fe3a37df7.scope"
Jun 05 01:09:56 kind-control-plane kubelet[15769]: I0605 01:09:56.659181 15769 factory.go:177] Factory "containerd" was unable to handle container "/docker/fade85e441a38c12f4541f4ede401da21129e7864485d1060f8de7ca811f349b/system.slice/run-re8f9ae6db0a945e99650562669d98d37.scope"
Jun 05 01:09:56 kind-control-plane kubelet[15769]: I0605 01:09:56.659188 15769 factory.go:166] Error trying to work out if we can handle /docker/fade85e441a38c12f4541f4ede401da21129e7864485d1060f8de7ca811f349b/system.slice/run-re8f9ae6db0a945e99650562669d98d37.scope: /docker/fade85e441a38c12f4541f4ede401da21129e7864485d1060f8de7ca811f349b/system.slice/run-re8f9ae6db0a945e99650562669d98d37.scope not handled by systemd handler
Jun 05 01:09:56 kind-control-plane kubelet[15769]: I0605 01:09:56.659195 15769 factory.go:177] Factory "systemd" was unable to handle container "/docker/fade85e441a38c12f4541f4ede401da21129e7864485d1060f8de7ca811f349b/system.slice/run-re8f9ae6db0a945e99650562669d98d37.scope"
Jun 05 01:09:56 kind-control-plane kubelet[15769]: I0605 01:09:56.659203 15769 factory.go:170] Factory "raw" can handle container "/docker/fade85e441a38c12f4541f4ede401da21129e7864485d1060f8de7ca811f349b/system.slice/run-re8f9ae6db0a945e99650562669d98d37.scope", but ignoring.
Jun 05 01:09:56 kind-control-plane kubelet[15769]: I0605 01:09:56.659210 15769 manager.go:908] ignoring container "/docker/fade85e441a38c12f4541f4ede401da21129e7864485d1060f8de7ca811f349b/system.slice/run-re8f9ae6db0a945e99650562669d98d37.scope"
Jun 05 01:09:56 kind-control-plane kubelet[15769]: I0605 01:09:56.659215 15769 factory.go:177] Factory "containerd" was unable to handle container "/system.slice/run-r589264aff0c046e5879f457098db6c92.scope"
Jun 05 01:09:56 kind-control-plane kubelet[15769]: I0605 01:09:56.659221 15769 factory.go:166] Error trying to work out if we can handle /system.slice/run-r589264aff0c046e5879f457098db6c92.scope: /system.slice/run-r589264aff0c046e5879f457098db6c92.scope not handled by systemd handler
Jun 05 01:09:56 kind-control-plane kubelet[15769]: I0605 01:09:56.659224 15769 factory.go:177] Factory "systemd" was unable to handle container "/system.slice/run-r589264aff0c046e5879f457098db6c92.scope
Full log can be found here: https://publicly-exposed.s3-us-west-2.amazonaws.com/exit
It appears that none of the supported container handlers, systemd, containerd, or raw, can handle creating the necessary containers.
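Note that the quoted lines are all I-prefixed (info-level) and the "Factory ... unable to handle" / "ignoring container" messages come from cAdvisor's container watcher, so they may be noise rather than the root cause. Two things that might narrow it down further (just suggestions):

# show only error-level (E-prefixed) kubelet log lines
docker exec kind-control-plane journalctl -u kubelet --no-pager | grep -E ' E[0-9]{4} ' | tail -n 20

# probe the same endpoint kubeadm's kubelet-check uses, from inside the node
# (assuming curl is present in the node image; wget would do as well)
docker exec kind-control-plane curl -sS http://127.0.0.1:10248/healthz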
kind version: kind v0.8.0 go1.14.2 linux/amd64
concourse version: 5.8.0
concourse deployment: https://runway-ci.svc-stage.eng.vmware.com/teams/tkg/pipelines/kindonc
docker info:
Client:
Debug Mode: true
Server:
Containers: 2
Running: 1
Paused: 0
Stopped: 1
Images: 3
Server Version: 19.03.9
Storage Driver: btrfs
Build Version: Btrfs v4.4
Library Version: 101
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 7ad184331fa3e55e52b890ea95e65ba581ae3429
runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd
init version: fec3683
Security Options:
seccomp
Profile: default
Kernel Version: 4.4.0-142-generic
Operating System: Ubuntu 16.04.6 LTS (containerized)
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 7.796GiB
Name: 23b9a397-3424-4d30-7e46-ee78908c227a
ID: 5ATL:2CW7:AABY:DXVD:7SN3:WTTH:YGE3:S4IH:ZS7E:AAVF:JMFH:QFOY
Docker Root Dir: /var/lib/docker
Debug Mode: true
File Descriptors: 37
Goroutines: 55
System Time: 2020-06-05T18:26:08.892984508Z
EventsListeners: 0
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
WARNING: No swap limit support
WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled
git branch: https://github.com/fyangd/kind-on-c/tree/etcd_failure_repro
cc. @mauilion @figo
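One more data point: the WARNING lines at the end of the docker info above (no swap limit support, bridge-nf-call disabled) might also be relevant. If you have access to whatever host ultimately owns the kernel, it could be worth checking (purely a suggestion):

# check whether bridge traffic traverses iptables (kube-proxy cares about this)
sysctl net.bridge.bridge-nf-call-iptables

# if it reports 0, load the module and enable it
modprobe br_netfilter
sysctl -w net.bridge.bridge-nf-call-iptables=1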
@fyangd -- sorry for the late reply.
One thing I found is that kind-on-c / kind had issues on btrfs. A workaround was implemented in be0268d05cb4c4910b554de6b198801c2094a1ce; with that, kind-on-c generally works on runway. However, especially compared to hush-house, kind-on-c is veeeerry flaky on runway. I didn't get around to digging deeper into why exactly that is.
If you are still interested, can you try to run your test with a recent version of kind-on-c?
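As a quick check whether a given worker even hits the btrfs case that commit targets, something like this from inside the task should tell you which storage driver docker is using:

# prints e.g. "btrfs" or "overlay2"
docker info --format '{{ .Driver }}'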
hey @hoegaarden, thanks for getting back to me.
yes, I have seen the btrfs fix and tried it on runway; it appears to be failing for some other reason. Good to know that it's very flaky on runway.
I installed Concourse and then used the example job in the README. It seems like the kubelet is not running.