sieve-project / sieve

Automatic Reliability Testing for Kubernetes Controllers and Operators
BSD 2-Clause "Simplified" License

Cluster creation fails while running Sieve with kapp-controller #117

Open jerrinsg opened 1 year ago

jerrinsg commented 1 year ago

I am hitting issues when trying to run Sieve with kapp-controller.

I am able to build the controller image successfully:

$ python3 build.py -c examples/kapp-controller -m all
...

Succeeded
kapp-controller-sha256-47c5a7b5df0fc9142e825b6ce5d767760db91b7d381bd0c2ce4b7fc05256c8ee
Untagged: kbld:kapp-controller-sha256-47c5a7b5df0fc9142e825b6ce5d767760db91b7d381bd0c2ce4b7fc05256c8ee

But running Sieve with kapp-controller in learn mode fails:

$ python3 sieve.py -c examples/kapp-controller -w create -m learn --build-oracle
...
ERROR: failed to create cluster: failed to init node with kubeadm: command "docker exec --privileged kind-control-plane kubeadm init --skip-phases=preflight --config=/kind/kubeadm.conf --skip-token-print --v=6" failed with error: exit status 1
Command Output: I0309 00:17:00.337861     217 initconfiguration.go:255] loading configuration from "/kind/kubeadm.conf"
...

[FAIL] kind create cluster --image ghcr.io/sieve-project/action/node:v1.24.10-learn --config kind_configs/kind-1a-2w.yaml
Traceback (most recent call last):
  File "/Users/jshajigeorge/work/sieve/sieve.py", line 264, in setup_kind_cluster
    os_system(
  File "/Users/jshajigeorge/work/sieve/sieve_common/common.py", line 181, in os_system
    raise Exception(
Exception: Failed to execute kind create cluster --image ghcr.io/sieve-project/action/node:v1.24.10-learn --config kind_configs/kind-1a-2w.yaml with return code 1

(full logs attached in kapp-learn.err.txt)

See kubelet-log.txt for the logs exported by kind (kind export logs).

I'm trying this on a Mac:

$ sw_vers
ProductName:        macOS
ProductVersion:     13.0.1
BuildVersion:       22A400

jerrinsg commented 1 year ago

Hitting the same issue on an Ubuntu VM as well:

$ python3 sieve.py -c examples/kapp-controller -w create -m learn --build-oracle
...
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused.

Unfortunately, an error has occurred:
    timed out waiting for the condition

This error is likely caused by:
    - The kubelet is not running
    - The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)

If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
    - 'systemctl status kubelet'
    - 'journalctl -xeu kubelet'

kapp-learn-err.txt

Kubelet log:

Mar 10 20:19:55 kind-control-plane kubelet[283]: I0310 20:19:55.315958     283 dynamic_cafile_content.go:157] "Starting controller" name="client-ca-bundle::/etc/kubernetes/pki/ca.crt"
Mar 10 20:19:55 kind-control-plane kubelet[283]: E0310 20:19:55.317086     283 certificate_manager.go:471] kubernetes.io/kube-apiserver-client-kubelet: Failed while requesting a signed certificate from the control plane: cannot create certificate signing request: Post "https://kind-control-plane:6443/apis/certificates.k8s.io/v1/certificatesigningrequests": dial tcp 172.18.0.3:6443: connect: connection refused
Mar 10 20:19:55 kind-control-plane kubelet[283]: W0310 20:19:55.320524     283 sysinfo.go:203] Nodes topology is not available, providing CPU topology
Mar 10 20:19:55 kind-control-plane kubelet[283]: Error: failed to run Kubelet: invalid configuration: cgroup ["kubelet"] has some missing paths: /sys/fs/cgroup/cpuacct/kubelet.slice, /sys/fs/cgroup/hugetlb/kubelet.slice, /sys/fs/cgroup/pids/kubelet.slice, /sys/fs/cgroup/cpuset/kubelet.slice, /sys/fs/cgroup/memory/kubelet.slice, /sys/fs/cgroup/cpu/kubelet.slice, /sys/fs/cgroup/systemd/kubelet.slice

Host details:

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.2 LTS
Release:    22.04
Codename:   jammy

$ uname -a
Linux jerrin-virtual-machine 5.19.0-35-generic #36~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Feb 17 15:17:25 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
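
The kubelet error above names cgroup v1 style controller paths. For anyone triaging a similar failure, here is a minimal sketch that pulls the missing paths out of such a log line and reports which cgroup hierarchy a Linux host exposes. It assumes cgroups are mounted at /sys/fs/cgroup; the helper names are hypothetical and are not part of Sieve, kind, or the kubelet. It only gathers facts and does not by itself identify the root cause.

```python
import re
from pathlib import Path
from typing import List


def missing_cgroup_paths(log_line: str) -> List[str]:
    """Return the paths listed after 'has some missing paths:' in a kubelet error line."""
    match = re.search(r"has some missing paths:\s*(.+)$", log_line)
    if not match:
        return []
    return [p.strip() for p in match.group(1).split(",") if p.strip()]


def cgroup_mode(root: str = "/sys/fs/cgroup") -> str:
    """Classify the cgroup mount at `root` (assumed mount point)."""
    base = Path(root)
    if not base.is_dir():
        # For example on macOS, where the host has no cgroup filesystem.
        return "unavailable"
    # cgroup.controllers exists at the root of a cgroup v2 (unified) mount.
    if (base / "cgroup.controllers").is_file():
        return "v2"
    return "v1 or hybrid"
```

Calling missing_cgroup_paths on the "Error: failed to run Kubelet" line from the log and cgroup_mode() on the host (or inside the kind node container) gives two data points to attach to a report.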

lalithsuresh commented 1 year ago

It's worth sharing a note about the workaround here too (to rebuild the image).

jerrinsg commented 1 year ago

On Mac, building the Kind image locally and running Sieve again fixed this issue:

$ python3 build.py -v v1.24.10 -m learn
..
Image "kindest/node:latest" build completed.

$ python3 build.py -v v1.24.10 -m test
..
Image "kindest/node:latest" build completed.

$ python3 sieve.py -c examples/kapp-controller -w create -m learn --build-oracle
...
Generated 8 intermediate-state test plan(s) in sieve_learn_results/kapp-controller/create/learn/intermediate-state
Total time: 410.3174147605896 seconds
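
The workaround above can be scripted. This is a rough sketch that simply replays the three commands from this comment in order and stops at the first failure. It assumes it is run from the root of the Sieve checkout; the function names and default arguments are illustrative, not part of Sieve.

```python
import subprocess
from typing import Callable, List


def workaround_commands(
    k8s_version: str = "v1.24.10",
    controller: str = "examples/kapp-controller",
    workload: str = "create",
) -> List[List[str]]:
    """Build the command lines used in the workaround, in order."""
    return [
        # Rebuild the kind node images locally for both modes.
        ["python3", "build.py", "-v", k8s_version, "-m", "learn"],
        ["python3", "build.py", "-v", k8s_version, "-m", "test"],
        # Re-run Sieve in learn mode against the controller.
        ["python3", "sieve.py", "-c", controller, "-w", workload,
         "-m", "learn", "--build-oracle"],
    ]


def run_all(commands: List[List[str]], runner: Callable = subprocess.run) -> None:
    """Run each command; with the default runner, check=True raises on a non-zero exit."""
    for argv in commands:
        print("$ " + " ".join(argv))
        runner(argv, check=True)
```

Calling run_all(workaround_commands()) executes the sequence. The runner parameter exists only so the sequence can be dry-run or tested without invoking Docker.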

kapilagrawal95 commented 11 months ago

When I run the following command:

$ python3 sieve.py -c examples/kapp-controller -w create -m learn --build-oracle

I get the following error:

ERROR: image: "ghcr.io/sieve-project/action/kapp-controller:learn" not present locally
Cannot load image ghcr.io/sieve-project/action/kapp-controller:learn locally, try to pull from remote
Error response from daemon: Head "https://ghcr.io/v2/sieve-project/action/kapp-controller/manifests/learn": denied
[FAIL] docker pull ghcr.io/sieve-project/action/kapp-controller:learn

marshtompsxd commented 11 months ago

@kapilagrawal95 The kapp-controller image is not in our GitHub repo. You might need to build it and push it to your own repo first. You can configure the repo name here: https://github.com/sieve-project/sieve/blob/main/config.json#L2
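
The image names in the logs in this thread follow the pattern registry/name:tag, for example ghcr.io/sieve-project/action/kapp-controller:learn. As a small sketch of what changing the repo name affects, assuming Sieve composes the reference from that prefix (inferred from the log output, not checked against the Sieve source), the helper below is hypothetical:

```python
def image_ref(registry: str, name: str, tag: str) -> str:
    """Compose an image reference as <registry>/<name>:<tag>."""
    return f"{registry.rstrip('/')}/{name}:{tag}"


# Prefix seen in the logs in this thread.
default_ref = image_ref("ghcr.io/sieve-project/action", "kapp-controller", "learn")

# With a registry you can push to (placeholder value), the pull target changes accordingly.
custom_ref = image_ref("ghcr.io/your-user/your-repo", "kapp-controller", "learn")
```

With the prefix pointing at a registry that actually holds the image you built and pushed, the docker pull that failed above would target custom_ref instead of default_ref.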