kubernetes-sigs / kubespray

Deploy a Production Ready Kubernetes Cluster
Apache License 2.0

Certificate Issue #10937

Closed RaMaTHA closed 8 months ago

RaMaTHA commented 8 months ago

What happened?

I'm trying to set up kubespray on bare metal. I have two servers to add to my cluster (one master, one client). The setup went smoothly the first time I ran the script (last week). Unfortunately, I had to run the setup a second time on the same server.

Now I keep getting an error, which is different not only on the latest release (release-2.24), but also on earlier builds (e.g. release-2.20 or the current master branch).

The error I get is related to openssl, where it tries to check an x509 certificate for the API server (see error description).

What did you expect to happen?

I would have expected the script to run normally as it did before.

How can we reproduce it (as minimally and precisely as possible)?

I just followed the instructions in the readme (README.md).

OS

Linux 5.15.0-94-generic x86_64
PRETTY_NAME="Ubuntu 22.04.4 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.4 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy

Version of Ansible

ansible [core 2.15.9]
  config file = /home/kubespray/kubespray/ansible.cfg
  configured module search path = ['/home/kubespray/kubespray/library']
  ansible python module location = /home/kubespray/kubespray/kubespray-venv/lib/python3.10/site-packages/ansible
  ansible collection location = /home/kubespray/.ansible/collections:/usr/share/ansible/collections
  executable location = /home/kubespray/kubespray/kubespray-venv/bin/ansible
  python version = 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (/home/kubespray/kubespray/kubespray-venv/bin/python3)
  jinja version = 3.1.2
  libyaml = True

Version of Python

Python 3.10.12

Version of Kubespray (commit)

aeaa04ca8

Network plugin used

calico

Full inventory with variables

My inventory file:

# ## Configure 'ip' variable to bind kubernetes services on a
# ## different ip than the default iface
# ## We should set etcd_member_name for etcd cluster. The node that is not a etcd member do not need to set the value, or can set the empty string value.
[all]
node1 ansible_host=<ip-address of the host>  # ip=10.3.0.1 etcd_member_name=etcd1
# node2 ansible_host=<ip-address of the client>  # ip=10.3.0.2 etcd_member_name=etcd2
# node3 ansible_host=95.54.0.14  # ip=10.3.0.3 etcd_member_name=etcd3
# node4 ansible_host=95.54.0.15  # ip=10.3.0.4 etcd_member_name=etcd4
# node5 ansible_host=95.54.0.16  # ip=10.3.0.5 etcd_member_name=etcd5
# node6 ansible_host=95.54.0.17  # ip=10.3.0.6 etcd_member_name=etcd6

# ## configure a bastion host if your nodes are not directly reachable
# [bastion]
# bastion ansible_host=x.x.x.x ansible_user=some_user

[kube_control_plane]
node1
# node2
# node3

[etcd]
node1
# node2
# node3

[kube_node]
node1
# node3
# node4
# node5
# node6

[calico_rr]

[k8s_cluster:children]
kube_control_plane
kube_node
calico_rr

Command used to invoke ansible

ansible-playbook -i inventory/mycluster/hosts.yaml --become --become-user=root cluster.yml

Output of ansible run

[...]
TASK [kubernetes/control-plane : Kubeadm | Check apiserver.crt SAN IPs] ***************************************************************

failed: [node1] (item=10.233.0.1) => {"ansible_loop_var": "item", "changed": true, "cmd": ["openssl", "x509", "-noout", "-in", "/etc/kubernetes/ssl/apiserver.crt", "-checkip", "10.233.0.1"], "delta": "0:00:00.003842", "end": "2024-02-19 21:38:09.295726", "item": "10.233.0.1", "msg": "non-zero return code", "rc": 1, "start": "2024-02-19 21:38:09.291884", "stderr": "Could not open file or uri for loading certificate from /etc/kubernetes/ssl/apiserver.crt\n801B2EF0F57F0000:error:16000069:STORE routines:ossl_store_get0_loader_int:unregistered scheme:../crypto/store/store_register.c:237:scheme=file\n801B2EF0F57F0000:error:80000002:system library:file_open:No such file or directory:../providers/implementations/storemgmt/file_store.c:267:calling stat(/etc/kubernetes/ssl/apiserver.crt)\nUnable to load certificate", "stderr_lines": ["Could not open file or uri for loading certificate from /etc/kubernetes/ssl/apiserver.crt", "801B2EF0F57F0000:error:16000069:STORE routines:ossl_store_get0_loader_int:unregistered scheme:../crypto/store/store_register.c:237:scheme=file", "801B2EF0F57F0000:error:80000002:system library:file_open:No such file or directory:../providers/implementations/storemgmt/file_store.c:267:calling stat(/etc/kubernetes/ssl/apiserver.crt)", "Unable to load certificate"], "stdout": "", "stdout_lines": []}
failed: [node1] (item=127.0.0.1) => {"ansible_loop_var": "item", "changed": true, "cmd": ["openssl", "x509", "-noout", "-in", "/etc/kubernetes/ssl/apiserver.crt", "-checkip", "127.0.0.1"], "delta": "0:00:00.003949", "end": "2024-02-19 21:38:09.437272", "item": "127.0.0.1", "msg": "non-zero return code", "rc": 1, "start": "2024-02-19 21:38:09.433323", "stderr": "Could not open file or uri for loading certificate from /etc/kubernetes/ssl/apiserver.crt\n80AB6796B17F0000:error:16000069:STORE routines:ossl_store_get0_loader_int:unregistered scheme:../crypto/store/store_register.c:237:scheme=file\n80AB6796B17F0000:error:80000002:system library:file_open:No such file or directory:../providers/implementations/storemgmt/file_store.c:267:calling stat(/etc/kubernetes/ssl/apiserver.crt)\nUnable to load certificate", "stderr_lines": ["Could not open file or uri for loading certificate from /etc/kubernetes/ssl/apiserver.crt", "80AB6796B17F0000:error:16000069:STORE routines:ossl_store_get0_loader_int:unregistered scheme:../crypto/store/store_register.c:237:scheme=file", "80AB6796B17F0000:error:80000002:system library:file_open:No such file or directory:../providers/implementations/storemgmt/file_store.c:267:calling stat(/etc/kubernetes/ssl/apiserver.crt)", "Unable to load certificate"], "stdout": "", "stdout_lines": []}
failed: [node1] (item=x.x.x.x) => {"ansible_loop_var": "item", "changed": true, "cmd": ["openssl", "x509", "-noout", "-in", "/etc/kubernetes/ssl/apiserver.crt", "-checkip", "x.x.x.x"], "delta": "0:00:00.003810", "end": "2024-02-19 21:38:09.578985", "item": "x.x.x.x", "msg": "non-zero return code", "rc": 1, "start": "2024-02-19 21:38:09.575175", "stderr": "Could not open file or uri for loading certificate from /etc/kubernetes/ssl/apiserver.crt\n80EB8FBB0D7F0000:error:16000069:STORE routines:ossl_store_get0_loader_int:unregistered scheme:../crypto/store/store_register.c:237:scheme=file\n80EB8FBB0D7F0000:error:80000002:system library:file_open:No such file or directory:../providers/implementations/storemgmt/file_store.c:267:calling stat(/etc/kubernetes/ssl/apiserver.crt)\nUnable to load certificate", "stderr_lines": ["Could not open file or uri for loading certificate from /etc/kubernetes/ssl/apiserver.crt", "80EB8FBB0D7F0000:error:16000069:STORE routines:ossl_store_get0_loader_int:unregistered scheme:../crypto/store/store_register.c:237:scheme=file", "80EB8FBB0D7F0000:error:80000002:system library:file_open:No such file or directory:../providers/implementations/storemgmt/file_store.c:267:calling stat(/etc/kubernetes/ssl/apiserver.crt)", "Unable to load certificate"], "stdout": "", "stdout_lines": []}

NO MORE HOSTS LEFT ********************************************************************************************************************

PLAY RECAP ****************************************************************************************************************************
node1                      : ok=522  changed=5    unreachable=0    failed=1    skipped=660  rescued=0    ignored=3

Monday 19 February 2024  21:38:09 +0100 (0:00:00.482)       0:02:33.268 *******
===============================================================================
container-engine/runc : Download_file | Download item -------------------------------------------------------------------------- 2.99s
container-engine/crictl : Download_file | Download item ------------------------------------------------------------------------ 2.93s
download : Download_file | Download item --------------------------------------------------------------------------------------- 2.87s
container-engine/containerd : Download_file | Download item -------------------------------------------------------------------- 2.86s
container-engine/nerdctl : Download_file | Download item ----------------------------------------------------------------------- 2.83s
etcdctl_etcdutl : Download_file | Download item -------------------------------------------------------------------------------- 2.83s
container-engine/crictl : Extract_file | Unpacking archive --------------------------------------------------------------------- 2.79s
etcdctl_etcdutl : Extract_file | Unpacking archive ----------------------------------------------------------------------------- 2.71s
container-engine/containerd : Containerd | Unpack containerd archive ----------------------------------------------------------- 2.50s
container-engine/nerdctl : Extract_file | Unpacking archive -------------------------------------------------------------------- 2.32s
container-engine/validate-container-engine : Populate service facts ------------------------------------------------------------ 1.93s
etcdctl_etcdutl : Copy etcd binary --------------------------------------------------------------------------------------------- 1.82s
kubernetes/node : Enable bridge-nf-call tables --------------------------------------------------------------------------------- 1.44s
download : Download | Download files / images ---------------------------------------------------------------------------------- 1.21s
download : Extract_file | Unpacking archive ------------------------------------------------------------------------------------ 1.17s
download : Extract_file | Unpacking archive ------------------------------------------------------------------------------------ 1.10s
kubernetes/control-plane : Backup old certs and keys --------------------------------------------------------------------------- 1.00s
container-engine/nerdctl : Download_file | Create dest directory on node ------------------------------------------------------- 0.99s
container-engine/containerd : Download_file | Create dest directory on node ---------------------------------------------------- 0.97s
container-engine/runc : Download_file | Create dest directory on node ---------------------------------------------------------- 0.96s

Anything else we need to know

I have found that the error only occurs when I add my master to the cluster (where I run the script). If I only add my second server (client node) to the cluster, it still works. Does this have anything to do with my setup?
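
The stderr in the log above points at a missing file rather than a bad certificate: OpenSSL reports `No such file or directory` for `/etc/kubernetes/ssl/apiserver.crt`. A minimal sketch for telling the two cases apart by hand, assuming `openssl` is on the PATH. The function name `check_apiserver_san` is hypothetical and not part of Kubespray; it only wraps the same `openssl x509 -checkip` call that the failing task runs.

```shell
#!/bin/sh
# Hypothetical helper (not part of Kubespray): report whether a certificate
# file exists and, if it does, whether it covers the given IP address.
check_apiserver_san() {
  san_cert="$1"
  san_ip="$2"
  if [ ! -f "$san_cert" ]; then
    # This is the situation in the log: the file is not there at all.
    echo "missing: $san_cert"
    return 2
  fi
  # Same command as the failing task; openssl prints
  # "IP <addr> does match certificate" or "... does NOT match certificate".
  if openssl x509 -noout -in "$san_cert" -checkip "$san_ip" | grep -q "does match"; then
    echo "ok: $san_ip"
  else
    echo "mismatch: $san_ip"
    return 1
  fi
}
```

For example, `check_apiserver_san /etc/kubernetes/ssl/apiserver.crt 10.233.0.1` on the control-plane node would print `missing: ...` in the state shown in the log.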

VannTen commented 8 months ago

> I'm trying to set up kubespray on bare metal. I have two servers to add to my cluster (one master, one client).

Your inventory shows a single host. Where is the second one?

> Now I keep getting an error, which is different not only on the latest release (release-2.24), but also on earlier builds (e.g. release-2.20 or the current master branch).

What do you mean? Which versions have the problem, and which versions don't?

VannTen commented 8 months ago

/triage needs-information

RaMaTHA commented 8 months ago

> Your inventory shows a single host. Where is the second one?

Currently, and in my report, I have only one node enabled at a time (here, node1). If I disable node1 and enable node2 in my inventory.ini, I can run the script without any errors.

For example, right now it is (simplified, short view):

[all]
node1 ansible_host=<host-ip>
# node2 ansible_host=<client-ip> 

[kube_control_plane]
node1

[etcd]
node1

[kube_node]
node1

This example doesn't work. But when I run the script from my host with only the client (node2) enabled, it works.

[all]
# node1 ansible_host=<host-ip>
node2 ansible_host=<client-ip> 

[kube_control_plane]
node2

[etcd]
node2

[kube_node]
node2

However, when I add both servers (host-node and client) to the setup, I also get the error as shown in the error description.

My inventory with both nodes enabled:

[all]
node1 ansible_host=<host-ip>
node2 ansible_host=<client-ip> 

[kube_control_plane]
node1

[etcd]
node1

[kube_node]
node2

So the question here is, why did it work perfectly the first time, but not the second time?

> What do you mean? Which versions have the problem, and which versions don't?

I could not find a version that does not have the problem. The error occurs on everything from release-2.20 to master.

VannTen commented 8 months ago

If you're running with these three inventories in order, you'll be building two separate one-node clusters, then trying to build a two-node cluster. This probably does not work at all.

> For example, right now it is (simplified, short view):

I think you simplified too much. Could you give the full list of commands you ran to get the error, starting from a blank state, and all the files involved (including your inventory variables)?
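
Since each of the three inventories above describes a different cluster layout, it can help to see at a glance which hosts are active in which group before each run. A small sketch, assuming an INI-style inventory like the ones shown in this thread. `active_hosts` is a hypothetical helper; Ansible's own `ansible-inventory -i <file> --graph` gives a more authoritative view.

```shell
#!/bin/sh
# Hypothetical helper: print "group: host" for every uncommented host line
# in an INI-style inventory. The [all] section and ":children" groups are
# skipped so only direct group membership is listed.
active_hosts() {
  awk '
    /^[[:space:]]*#/ || /^[[:space:]]*$/ { next }            # comments, blanks
    /^\[/ { group = $0; gsub(/\[|\]/, "", group); next }     # section header
    group != "" && group != "all" && group !~ /:/ { print group ": " $1 }
  ' "$1"
}
```

Running `active_hosts inventory/mycluster/inventory.ini` before and after editing the file makes it easier to spot when a host has moved between the control plane and the worker group.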

RaMaTHA commented 8 months ago

I reset the server and removed the repo, so I have a clean new system. But I am still struggling with the same error. Just to be clear, I didn't misconfigure anything. These are the exact steps I did:

[kube_control_plane]
node1

[etcd]
node1

[kube_node]
node1


- `ansible-playbook -i inventory/mycluster/hosts.yaml  --become --become-user=root cluster.yml -kK` 

And nearly at the end of the script, I ran into the openssl error. 

My openssl version:
- `openssl version`: `OpenSSL 3.0.2 15 Mar 2022 (Library: OpenSSL 3.0.2 15 Mar 2022)`

RaMaTHA commented 8 months ago

I figured out what caused the error. After removing all Docker containers (running and stopped) with `sudo docker rm -f $(sudo docker ps -qa)`, it worked again.

@VannTen Thanks for your help.
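
Note that `docker rm -f $(docker ps -qa)` removes every container on the host, which is destructive if anything unrelated runs there. A more cautious sketch, assuming Docker is the engine holding the leftover containers and that the current user can reach the Docker socket (the original command used `sudo`). The helper names `count_containers` and `remove_all_containers` are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print how many containers (running or stopped) exist.
count_containers() {
  # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
  docker ps -qa | grep -c . || true
}

# Hypothetical helper: remove all containers, but only when called with "yes".
remove_all_containers() {
  rm_confirm="$1"
  rm_count="$(count_containers)"
  if [ "$rm_count" -eq 0 ]; then
    echo "nothing to remove"
    return 0
  fi
  if [ "$rm_confirm" != "yes" ]; then
    echo "would remove $rm_count container(s); pass 'yes' to proceed"
    return 0
  fi
  # Unquoted on purpose so each container ID becomes its own argument.
  docker rm -f $(docker ps -qa)
}
```

Calling `remove_all_containers` without an argument only reports what would be deleted; `remove_all_containers yes` performs the removal. Kubespray also ships a `reset.yml` playbook intended for tearing down a previous install, which may be a more complete way to return to a clean state before rerunning `cluster.yml`.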