Closed neiromc closed 10 months ago
The main problem is that the groups['gen_node_certs_True'] array contains only master hosts, and when it is intersected with groups['k8s_cluster'] the result has only master hosts because groups['k8s_cluster'] does not contain masters. But here we expect the reverse behavior: all nodes except masters should be returned. When I replaced ansible.builtin.intersect with ansible.builtin.symmetric_difference, the HOSTS variable had the expected result (all nodes except master nodes) and everything worked as expected.
Group k8s_cluster is expected to contain all nodes in the cluster, control plane included, etcd excluded (see https://github.com/kubernetes-sigs/kubespray/blob/master/docs/ansible.md#inventory).
Has this started happening recently (since https://github.com/kubernetes-sigs/kubespray/commit/0fb404c775aff76916945ffb3d83dc77059ed7da specifically)? As you can tell from the commit, intersect here only replaces a manual intersect (if in groups['k8s_cluster'] + if in gen_node_certs).
So if there is a bug, I believe it's more likely in the group_by creating gen_node_certs_True.
Could you provide the output of your ansible command as asked by the template?
The main problem is that the groups['gen_node_certs_True'] array contains only master hosts
That would be the source of the problem, I think.
Yes, in my case the k8s_cluster group contains all nodes (etcd nodes too, because they are the same as the masters).
I think there is a bug in the dynamic list gen_node_certs_True (in roles/etcd/tasks/check_certs.yml), because it contains only master nodes in my case.
Moreover, I think the problem is with cert_files in 0fb404c.
I'll add more debug output later. Thank you!
I am experiencing the same issue.
My etcd group is the same as kube_control_plane, so it is included in k8s_cluster.
$ ansible -i ${AI_KTST} etcd --list-hosts
hosts (3):
k8ststmaster-1
k8ststmaster-2
k8ststmaster-3
$ ansible -i ${AI_KTST} k8s_cluster --list-hosts
hosts (6):
k8ststworker-1
k8ststworker-2
k8ststworker-3
k8ststmaster-1
k8ststmaster-2
k8ststmaster-3
I get this error, about a path not existing for worker nodes, when running the cluster.yml or install_etcd.yml playbooks:
The full traceback is:
File "/tmp/ansible_slurp_payload_hbuwzm_j/ansible_slurp_payload.zip/ansible/modules/slurp.py", line 102, in main
failed: [k8ststmaster-2 -> k8ststmaster-1(141.94.2.22)] (item=/etc/ssl/etcd/ssl/node-k8ststworker-3-key.pem) => {
"ansible_loop_var": "item",
"changed": false,
"invocation": {
"module_args": {
"src": "/etc/ssl/etcd/ssl/node-k8ststworker-3-key.pem"
}
},
"item": "/etc/ssl/etcd/ssl/node-k8ststworker-3-key.pem",
"msg": "file not found: /etc/ssl/etcd/ssl/node-k8ststworker-3-key.pem"
}
The full traceback is:
File "/tmp/ansible_slurp_payload_563bmjpg/ansible_slurp_payload.zip/ansible/modules/slurp.py", line 102, in main
failed: [k8ststmaster-3 -> k8ststmaster-1(141.94.2.22)] (item=/etc/ssl/etcd/ssl/node-k8ststworker-3-key.pem) => {
"ansible_loop_var": "item",
"changed": false,
"invocation": {
"module_args": {
"src": "/etc/ssl/etcd/ssl/node-k8ststworker-3-key.pem"
}
},
"item": "/etc/ssl/etcd/ssl/node-k8ststworker-3-key.pem",
"msg": "file not found: /etc/ssl/etcd/ssl/node-k8ststworker-3-key.pem"
}
Using this version :
$ git log roles/etcd/tasks/gen_certs_script.yml
commit 0fb404c775aff76916945ffb3d83dc77059ed7da
Author: Max Gautier <mg@max.gautier.name>
Date: Tue Dec 12 11:22:29 2023 +0100
etcd: use dynamic group for certs generation check (#10610)
Any suggestion ?
Edit: I just checked out the files from the commit before the change (0fb404c775aff76916945ffb3d83dc77059ed7da^), which reverted both files:
roles/etcd/tasks/check_certs.yml
roles/etcd/tasks/gen_certs_script.yml
Works for me.
@ledroide If you could add the context of the traceback (in particular, in which tasks it occurs), it would be helpful.
So this happens only when etcd = kube_control_plane (and I guess when we need etcd certs, so calico in etcd mode or similar, which would explain why this wasn't caught in CI...). I'll take a (more precise) look when I'm back (~ end of week).
Yes, I confirm. The issue is in task Gen_certs | run cert generation script for all clients.
Is there some commit-id that I could cherry-pick to test a fix ?
Here is my inventory :
all:
  children:
    kubernetes:
      children:
        k8s_cluster:
        etcd:
    k8s_cluster:
      children:
        kube_control_plane:
        kube_node:
    etcd:
      children:
        kube_control_plane:
    kube_control_plane:
      hosts:
        k8ststmaster-1:
        k8ststmaster-2:
        k8ststmaster-3:
    kube_node:
      hosts:
        k8ststworker-1:
        k8ststworker-2:
        k8ststworker-3:
Not currently (working on it) /assign
Some debug below from TASK [etcd]
Gen_certs:
debug: msg="debug_1 : HOSTS (all-gen-node) : {{ groups['gen_node_certs_True'] }}"
['dc1-master-001.domain.local', 'dc1-master-002.domain.local', 'dc1-master-003.domain.local', 'dc1-master-004.domain.local', 'dc1-master-005.domain.local']
debug: msg="debug_2 : HOSTS (all-k8s_cluster) : {{ groups['k8s_cluster'] }}"
['dc1-master-001.domain.local', 'dc1-master-002.domain.local', 'dc1-master-003.domain.local', 'dc1-master-004.domain.local', 'dc1-master-005.domain.local', 'dc1-istio-001.domain.local', 'dc1-istio-002.domain.local', 'dc1-istio-003.domain.local', 'dc1-istio-004.domain.local']
debug: msg="debug_3 : HOSTS (all-intersect) : {{ groups['gen_node_certs_True'] | ansible.builtin.intersect(groups['k8s_cluster']) | join(' ') }}"
['dc1-master-001.domain.local', 'dc1-master-002.domain.local', 'dc1-master-003.domain.local', 'dc1-master-004.domain.local', 'dc1-master-005.domain.local']
debug: msg="debug_4 : HOSTS (all-symmetric_difference) : {{ groups['gen_node_certs_True'] | ansible.builtin.symmetric_difference(groups['k8s_cluster']) | join(' ') }}"
['dc1-istio-001.domain.local', 'dc1-istio-002.domain.local', 'dc1-istio-003.domain.local', 'dc1-istio-004.domain.local']
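The two results above can be reproduced outside Ansible. The following is a minimal Python sketch, not Kubespray code: the helpers intersect and symmetric_difference are hypothetical, order-preserving stand-ins for the Ansible filters of the same name, and the host lists are copied from the debug output.

```python
# Host lists copied from the debug_1 and debug_2 output above.
gen_node_certs_true = [f"dc1-master-00{i}.domain.local" for i in range(1, 6)]
k8s_cluster = gen_node_certs_true + [
    f"dc1-istio-00{i}.domain.local" for i in range(1, 5)
]


def intersect(a, b):
    """Hypothetical stand-in for the intersect filter: items of a also in b."""
    return [x for x in a if x in b]


def symmetric_difference(a, b):
    """Hypothetical stand-in: items that are in exactly one of the two lists."""
    in_both = intersect(a, b)
    union = a + [x for x in b if x not in a]
    return [x for x in union if x not in in_both]


hosts_intersect = intersect(gen_node_certs_true, k8s_cluster)
hosts_symdiff = symmetric_difference(gen_node_certs_true, k8s_cluster)

print("intersect:", " ".join(hosts_intersect))
print("symmetric_difference:", " ".join(hosts_symdiff))
```

Note that the symmetric difference yields "all nodes except masters" here only because every host in gen_node_certs_True is also in k8s_cluster. If the dynamic group contained every node that needs a certificate, the same filter would return an empty list, which suggests the filter swap hides the symptom rather than fixing the group contents.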
I think I've figured it out. The install_etcd playbook actually plays the etcd role twice, once for kube_control_plane:etcd and once for k8s_cluster (to avoid generating client certs when they are not needed, i.e. if network_plugin does not talk to etcd).
Thus the cluster nodes are not in the hosts of the first play, and group_by does not apply to them when creating gen_node_certs_True. Oddly enough, it seems this needs at least 3 etcd+master hosts (1 does not cut it) with 1 node. I'm going to try making the calico_etcd_datastore CI job a reproducer for that case as well, and I should have a patch shortly.
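To make that explanation concrete, here is a simplified Python model of it, using the inventory posted earlier in this thread. It is only a sketch under stated assumptions: the group_by function below is a hypothetical stand-in for the Ansible module (it groups only the hosts taking part in the play), and needs_cert assumes that every host still needs a certificate.

```python
# Inventory groups as posted earlier in this thread.
masters = ["k8ststmaster-1", "k8ststmaster-2", "k8ststmaster-3"]
workers = ["k8ststworker-1", "k8ststworker-2", "k8ststworker-3"]
k8s_cluster = workers + masters


def needs_cert(host):
    """Assumption for this sketch: every host still needs a certificate."""
    return True


def group_by(play_hosts, key):
    """Hypothetical model of group_by: only hosts in the play get grouped."""
    groups = {}
    for host in play_hosts:
        groups.setdefault(f"gen_node_certs_{key(host)}", []).append(host)
    return groups


# The play targets kube_control_plane:etcd, which here is just the masters.
groups = group_by(masters, needs_cert)

# HOSTS is the dynamic group intersected with k8s_cluster.
hosts = [h for h in groups["gen_node_certs_True"] if h in k8s_cluster]
missing = [h for h in k8s_cluster if h not in hosts]

print("HOSTS:", " ".join(hosts))
print("no client cert generated for:", " ".join(missing))
```

In this model the workers never enter gen_node_certs_True, so their node-<hostname>-key.pem files are never created, which matches the "file not found" errors reported for the worker nodes.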
The linked PR (#10769) should fix the issue, if you can test it...
Yes, it works like a charm! All necessary certificates are generated as expected.
I just ran into the same issue and can also confirm, that the linked PR fixes the issue. Thanks!
@VannTen Thank you!
I found a problem with the same behavior while scaling the cluster, in task "TASK [etcd : Gen_certs | Gather node certs from first etcd node]".
I tried to scale the current cluster with the new node node-dc1-worker-001.domain.local (node added to the kube-node group in the inventory): ansible-playbook --become --become-user=root scale.yml
...
failed: [dc1-master-005.domain.local -> dc1-master-001.domain.local] (item=/etc/ssl/etcd/ssl/node-dc1-worker-001.domain.local-key.pem) => {"ansible_loop_var": "item", "changed": false, "item": "/etc/ssl/etcd/ssl/node-dc1-worker-001.domain.local-key.pem", "msg": "file not found: /etc/ssl/etcd/ssl/node-dc1-worker-001.domain.local-key.pem"}
failed: [dc1-master-004.domain.local -> dc1-master-001.domain.local] (item=/etc/ssl/etcd/ssl/node-dc1-worker-001.domain.local-key.pem) => {"ansible_loop_var": "item", "changed": false, "item": "/etc/ssl/etcd/ssl/node-dc1-worker-001.domain.local-key.pem", "msg": "file not found: /etc/ssl/etcd/ssl/node-dc1-worker-001.domain.local-key.pem"}
failed: [dc1-master-003.domain.local -> dc1-master-001.domain.local] (item=/etc/ssl/etcd/ssl/node-dc1-worker-001.domain.local-key.pem) => {"ansible_loop_var": "item", "changed": false, "item": "/etc/ssl/etcd/ssl/node-dc1-worker-001.domain.local-key.pem", "msg": "file not found: /etc/ssl/etcd/ssl/node-dc1-worker-001.domain.local-key.pem"}
...
The regression should not be present in 2.23.x, this was introduced recently.
Hi, @neiromc I also encounter this problem when adding a worker node. Have you managed to solve the problem?
Hi, @VannTen Maybe you can suggest something?
When debugging, my HOSTS variable only contains master nodes, therefore client certificates are not generated for the new worker node.
You should probably open a new bug report with all the info
Hello @KleinenberG, I've got exactly the same problem when adding new worker nodes with scale.yml. Certificates for these new hosts are not generated, and because of that the whole process fails due to missing cert files. Did you open a new bug report already?
I found wrong data in the environment variable HOSTS in the task Gen_certs | run cert generation script for all clients in file roles/etcd/tasks/gen_certs_script.yml (line 54). This stops cluster creation because the next task can't find certificates in /etc/ssl/etcd/ for worker nodes.
The problem is the use of ansible.builtin.intersect in this line:
The main problem is that the groups['gen_node_certs_True'] array contains only master hosts, and when it is intersected with groups['k8s_cluster'] the result has only master hosts because groups['k8s_cluster'] does not contain masters. But here we expect the reverse behavior: all nodes except masters should be returned. When I replaced ansible.builtin.intersect with ansible.builtin.symmetric_difference, the HOSTS variable had the expected result (all nodes except master nodes) and everything worked as expected.
Environment:
Cloud provider or hardware configuration: self-hosted virtual machines
OS (printf "$(uname -srm)\n$(cat /etc/os-release)\n"): Ubuntu 22.04.3 LTS
Version of Ansible (ansible --version): ansible [core 2.15.8]
Version of Python (python --version): Python 3.10.12
Kubespray version (commit) (git rev-parse --short HEAD): aea150e5d
Network plugin used: cilium
Command used to invoke ansible: ansible-playbook --become --become-user=root cluster.yml
My Inventory file: