kubernetes-sigs / kubespray

Deploy a Production Ready Kubernetes Cluster
Apache License 2.0

Wrong data for environment variable HOSTS in task Gen_certs - run cert generation script for all clients #10757

neiromc closed this issue 10 months ago

neiromc commented 10 months ago

I found wrong data in the environment variable HOSTS in the task Gen_certs | run cert generation script for all clients in the file roles/etcd/tasks/gen_certs_script.yml (line 54). This behavior stops cluster creation, because the next task can't find the certificates for the worker nodes in /etc/ssl/etcd/.

The problem is the use of the ansible.builtin.intersect filter in this line:

HOSTS: "{{ groups['gen_node_certs_True'] | ansible.builtin.intersect(groups['k8s_cluster']) | join(' ') }}"

The main problem is that the groups['gen_node_certs_True'] array contains only the master hosts, so when it is intersected with groups['k8s_cluster'] the result again contains only the master hosts, because the worker nodes are missing from groups['gen_node_certs_True']. But in this place we expect the reverse behavior: all nodes except the masters should be returned. When I replaced ansible.builtin.intersect with ansible.builtin.symmetric_difference, the HOSTS variable had the expected result (all nodes except the master nodes) and everything worked as expected.
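
To illustrate the difference, a minimal debug task (assuming the same group names as above; not part of Kubespray) shows what each filter returns:

# Sketch only: compare what the two filters produce for HOSTS.
# Group names are taken from this report; the values depend on your inventory.
- name: Debug | compare intersect and symmetric_difference for HOSTS
  ansible.builtin.debug:
    msg:
      intersect: "{{ groups['gen_node_certs_True'] | ansible.builtin.intersect(groups['k8s_cluster']) | join(' ') }}"
      symmetric_difference: "{{ groups['gen_node_certs_True'] | ansible.builtin.symmetric_difference(groups['k8s_cluster']) | join(' ') }}"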

Environment:

Kubespray version (commit) (git rev-parse --short HEAD): aea150e5d

Network plugin used: cilium

Command used to invoke ansible: ansible-playbook --become --become-user=root cluster.yml

My Inventory file:

[all]
dc1-master-001.domain.local etcd_member_name=etcd1
dc1-master-002.domain.local etcd_member_name=etcd2
dc1-master-003.domain.local etcd_member_name=etcd3
dc1-master-004.domain.local etcd_member_name=etcd4
dc1-master-005.domain.local etcd_member_name=etcd5
dc1-worker-[001:004].domain.local

[kube-master]
dc1-master-[001:005].domain.local

[etcd:children]
kube-master

[kube-node]
dc1-worker-[001:004].domain.local

[k8s-cluster:children]
kube-master
kube-node

[calico_rr]
VannTen commented 10 months ago

The main problem is that the groups['gen_node_certs_True'] array contains only the master hosts, so when it is intersected with groups['k8s_cluster'] the result again contains only the master hosts, because the worker nodes are missing from groups['gen_node_certs_True']. But in this place we expect the reverse behavior: all nodes except the masters should be returned. When I replaced ansible.builtin.intersect with ansible.builtin.symmetric_difference, the HOSTS variable had the expected result (all nodes except the master nodes) and everything worked as expected.


Group k8s_cluster is expected to contain all nodes in the cluster, control plane included, etcd excluded (see https://github.com/kubernetes-sigs/kubespray/blob/master/docs/ansible.md#inventory).

Did this start happening recently (since https://github.com/kubernetes-sigs/kubespray/commit/0fb404c775aff76916945ffb3d83dc77059ed7da specifically)? As you can tell from the commit, intersect here only replaces a manual intersection (if in groups['k8s_cluster'] + if in gen_node_certs). So if there is a bug, I believe it's more likely in the group_by creating gen_node_certs_True.
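
(For context: gen_node_certs_True is a dynamic group created at runtime with the group_by module, roughly along the lines of the sketch below. This is a simplified illustration, not the exact Kubespray task, and the gen_node_certs variable stands in for whatever boolean roles/etcd/tasks/check_certs.yml actually computes.)

# Simplified sketch of the group_by pattern that produces dynamic groups such as
# gen_node_certs_True / gen_node_certs_False. 'gen_node_certs' is assumed here to be
# a per-host boolean set earlier in check_certs.yml.
- name: Check_certs | Set gen_node_certs dynamic group
  ansible.builtin.group_by:
    key: "gen_node_certs_{{ gen_node_certs | default(false) }}"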

Could you provide the output of your ansible command as asked by the template?

The main problem is that the groups['gen_node_certs_True'] array contains only the master hosts

That would be the source of the problem, I think.

neiromc commented 10 months ago

Yes, in my case the k8s_cluster group contains all nodes (the etcd nodes too, because they are the same as the masters). I think the bug is in the dynamic group gen_node_certs_True (in roles/etcd/tasks/check_certs.yml), because in my case it contains only the master nodes. Moreover, I think the problem is with cert_files in 0fb404c.

I'll add more debug output later. Thank you!

ledroide commented 10 months ago

I am experiencing the same issue. My etcd group is the same as kube_control_plane, so it is included in k8s_cluster.

$ ansible -i ${AI_KTST} etcd --list-hosts
  hosts (3):
    k8ststmaster-1
    k8ststmaster-2
    k8ststmaster-3

$ ansible -i ${AI_KTST} k8s_cluster --list-hosts
  hosts (6):
    k8ststworker-1
    k8ststworker-2
    k8ststworker-3
    k8ststmaster-1
    k8ststmaster-2
    k8ststmaster-3

I get this error about a path not existing for worker nodes when running the cluster.yml or install_etcd.yml playbooks:

The full traceback is:
  File "/tmp/ansible_slurp_payload_hbuwzm_j/ansible_slurp_payload.zip/ansible/modules/slurp.py", line 102, in main
failed: [k8ststmaster-2 -> k8ststmaster-1(141.94.2.22)] (item=/etc/ssl/etcd/ssl/node-k8ststworker-3-key.pem) => {
    "ansible_loop_var": "item",
    "changed": false,
    "invocation": {
        "module_args": {
            "src": "/etc/ssl/etcd/ssl/node-k8ststworker-3-key.pem"
        }
    },
    "item": "/etc/ssl/etcd/ssl/node-k8ststworker-3-key.pem",
    "msg": "file not found: /etc/ssl/etcd/ssl/node-k8ststworker-3-key.pem"
}
The full traceback is:
  File "/tmp/ansible_slurp_payload_563bmjpg/ansible_slurp_payload.zip/ansible/modules/slurp.py", line 102, in main
failed: [k8ststmaster-3 -> k8ststmaster-1(141.94.2.22)] (item=/etc/ssl/etcd/ssl/node-k8ststworker-3-key.pem) => {
    "ansible_loop_var": "item",
    "changed": false,
    "invocation": {
        "module_args": {
            "src": "/etc/ssl/etcd/ssl/node-k8ststworker-3-key.pem"
        }
    },
    "item": "/etc/ssl/etcd/ssl/node-k8ststworker-3-key.pem",
    "msg": "file not found: /etc/ssl/etcd/ssl/node-k8ststworker-3-key.pem"
}

Using this version:

$ git log roles/etcd/tasks/gen_certs_script.yml
commit 0fb404c775aff76916945ffb3d83dc77059ed7da
Author: Max Gautier <mg@max.gautier.name>
Date:   Tue Dec 12 11:22:29 2023 +0100

    etcd: use dynamic group for certs generation check (#10610)

Any suggestions?


Edit: I just checked out the files as they were before commit 0fb404c775aff76916945ffb3d83dc77059ed7da (i.e. from 0fb404c775aff76916945ffb3d83dc77059ed7da^), reverting both files:

roles/etcd/tasks/check_certs.yml
roles/etcd/tasks/gen_certs_script.yml

Works for me.
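
For reference, reverting just those two files to their state before that commit can be done with something like the following (a sketch; the hash is the one quoted above):

# Restore both files as they were in the parent of 0fb404c (adjust as needed).
git checkout 0fb404c775aff76916945ffb3d83dc77059ed7da^ -- \
    roles/etcd/tasks/check_certs.yml \
    roles/etcd/tasks/gen_certs_script.yml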

VannTen commented 10 months ago

@ledroide If you could add the context of the traceback (in particular, in which task it occurs), that would be helpful.

So this happens only when etcd = kube_control_plane (and I guess when we need etcd certs, so calico in etcd mode or similar, which would explain why this wasn't caught in CI...) I'll take a (more precise) look when I'm back (~ end of week)

ledroide commented 10 months ago

So this happens only when etcd = kube_control_plane (and I guess when we need etcd certs, so calico in etcd mode or similar, which would explain why this wasn't caught in CI...) I'll take a (more precise) look when I'm back (~ end of week)

Yes, I confirm. The issue is in the task Gen_certs | run cert generation script for all clients. Is there some commit id that I could cherry-pick to test a fix?

Here is my inventory :

all:
  children:
    kubernetes:
      children:
        k8s_cluster:
        etcd:
    k8s_cluster:
      children:
        kube_control_plane:
        kube_node:
    etcd:
      children:
        kube_control_plane:
    kube_control_plane:
      hosts:
        k8ststmaster-1:
        k8ststmaster-2:
        k8ststmaster-3:
    kube_node:
      hosts:
        k8ststworker-1:
        k8ststworker-2:
        k8ststworker-3:
VannTen commented 10 months ago

Not currently (working on it) /assign

neiromc commented 10 months ago

Some debug output from the etcd role's Gen_certs task below:

debug: msg="debug_1 : HOSTS (all-gen-node) : {{ groups['gen_node_certs_True'] }}"

['dc1-master-001.domain.local', 'dc1-master-002.domain.local', 'dc1-master-003.domain.local', 'dc1-master-004.domain.local', 'dc1-master-005.domain.local']

debug: msg="debug_2 : HOSTS (all-k8s_cluster) : {{ groups['k8s_cluster'] }}"

['dc1-master-001.domain.local', 'dc1-master-002.domain.local', 'dc1-master-003.domain.local', 'dc1-master-004.domain.local', 'dc1-master-005.domain.local', 'dc1-istio-001.domain.local', 'dc1-istio-002.domain.local', 'dc1-istio-003.domain.local', 'dc1-istio-004.domain.local']

debug: msg="debug_3 : HOSTS (all-intersect) : {{ groups['gen_node_certs_True'] | ansible.builtin.intersect(groups['k8s_cluster']) | join(' ') }}"

['dc1-master-001.domain.local', 'dc1-master-002.domain.local', 'dc1-master-003.domain.local', 'dc1-master-004.domain.local', 'dc1-master-005.domain.local']

debug: msg="debug_4 : HOSTS (all-symmetric_difference) : {{ groups['gen_node_certs_True'] | ansible.builtin.symmetric_difference(groups['k8s_cluster']) | join(' ') }}"

['dc1-istio-001.domain.local', 'dc1-istio-002.domain.local', 'dc1-istio-003.domain.local', 'dc1-istio-004.domain.local']
VannTen commented 10 months ago

I think I've figured it out. The install_etcd playbook actually plays the etcd role two times, once for kube_control_plane:etcd and once for k8s_cluster (to avoid client cert generation when it is not needed, i.e. if the network_plugin does not talk to etcd).

Thus cluster nodes are not in the play's hosts, and group_by does not apply to them when creating gen_node_certs_True. Oddly enough, it seems this needs at least 3 etcd+master hosts (1 does not cut it) plus 1 node. I'm going to try making the calico_etcd_datastore CI job a reproducer for that case as well, and I should have a patch shortly.
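
(Roughly speaking: group_by only adds the hosts of the current play to the dynamic group, so a playbook layout like the sketch below leaves plain worker nodes out of gen_node_certs_True, even though the later play needs them in HOSTS. This is an illustration of the mechanism, not the actual install_etcd.yml.)

# Illustration only, not the real playbook. The group_by in check_certs.yml runs in the
# first play, which only targets etcd/control-plane hosts, so worker nodes never join
# gen_node_certs_True; the second play then computes HOSTS without them.
- name: Install etcd
  hosts: etcd:kube_control_plane
  roles:
    - role: etcd

- name: Install etcd certs on cluster nodes
  hosts: k8s_cluster
  roles:
    - role: etcd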

VannTen commented 10 months ago

The linked PR (#10769 ) should fix the issue, if you can test...

neiromc commented 10 months ago

The linked PR (#10769 ) should fix the issue, if you can test...

Yes, it works like a charm! All the necessary certificates are generated as expected.

derselbst commented 10 months ago

I just ran into the same issue and can also confirm that the linked PR fixes it. Thanks!

neiromc commented 10 months ago

@VannTen Thank you!

neiromc commented 10 months ago

I found a problem with the same behavior while scaling the cluster, in the task "TASK [etcd : Gen_certs | Gather node certs from first etcd node]". I tried to scale the current cluster with a new node node-dc1-worker-001.domain.local (added to the kube-node group in the inventory): ansible-playbook --become --become-user=root scale.yml

...
failed: [dc1-master-005.domain.local -> dc1-master-001.domain.local] (item=/etc/ssl/etcd/ssl/node-dc1-worker-001.domain.local-key.pem) => {"ansible_loop_var": "item", "changed": false, "item": "/etc/ssl/etcd/ssl/node-dc1-worker-001.domain.local-key.pem", "msg": "file not found: /etc/ssl/etcd/ssl/node-dc1-worker-001.domain.local-key.pem"}
failed: [dc1-master-004.domain.local -> dc1-master-001.domain.local] (item=/etc/ssl/etcd/ssl/node-dc1-worker-001.domain.local-key.pem) => {"ansible_loop_var": "item", "changed": false, "item": "/etc/ssl/etcd/ssl/node-dc1-worker-001.domain.local-key.pem", "msg": "file not found: /etc/ssl/etcd/ssl/node-dc1-worker-001.domain.local-key.pem"}
failed: [dc1-master-003.domain.local -> dc1-master-001.domain.local] (item=/etc/ssl/etcd/ssl/node-dc1-worker-001.domain.local-key.pem) => {"ansible_loop_var": "item", "changed": false, "item": "/etc/ssl/etcd/ssl/node-dc1-worker-001.domain.local-key.pem", "msg": "file not found: /etc/ssl/etcd/ssl/node-dc1-worker-001.domain.local-key.pem"}
...
VannTen commented 10 months ago

The regression should not be present in 2.23.x; it was introduced recently.

KleinenberG commented 2 weeks ago

I found a problem with the same behavior while scaling the cluster, in the task "TASK [etcd : Gen_certs | Gather node certs from first etcd node]". I tried to scale the current cluster with a new node node-dc1-worker-001.domain.local (added to the kube-node group in the inventory): ansible-playbook --become --become-user=root scale.yml

...
failed: [dc1-master-005.domain.local -> dc1-master-001.domain.local] (item=/etc/ssl/etcd/ssl/node-dc1-worker-001.domain.local-key.pem) => {"ansible_loop_var": "item", "changed": false, "item": "/etc/ssl/etcd/ssl/node-dc1-worker-001.domain.local-key.pem", "msg": "file not found: /etc/ssl/etcd/ssl/node-dc1-worker-001.domain.local-key.pem"}
failed: [dc1-master-004.domain.local -> dc1-master-001.domain.local] (item=/etc/ssl/etcd/ssl/node-dc1-worker-001.domain.local-key.pem) => {"ansible_loop_var": "item", "changed": false, "item": "/etc/ssl/etcd/ssl/node-dc1-worker-001.domain.local-key.pem", "msg": "file not found: /etc/ssl/etcd/ssl/node-dc1-worker-001.domain.local-key.pem"}
failed: [dc1-master-003.domain.local -> dc1-master-001.domain.local] (item=/etc/ssl/etcd/ssl/node-dc1-worker-001.domain.local-key.pem) => {"ansible_loop_var": "item", "changed": false, "item": "/etc/ssl/etcd/ssl/node-dc1-worker-001.domain.local-key.pem", "msg": "file not found: /etc/ssl/etcd/ssl/node-dc1-worker-001.domain.local-key.pem"}
...

Hi @neiromc, I also encountered this problem when adding a worker node. Have you managed to solve it?

Hi @VannTen, maybe you can suggest something?

When debugging, my HOSTS variable contains only master nodes, so client certificates are not generated for the new worker node.

VannTen commented 2 weeks ago

You should probably open a new bug report with all the info.

trickyut commented 2 weeks ago

I found a problem with the same behavior while scaling the cluster, in the task "TASK [etcd : Gen_certs | Gather node certs from first etcd node]". I tried to scale the current cluster with a new node node-dc1-worker-001.domain.local (added to the kube-node group in the inventory): ansible-playbook --become --become-user=root scale.yml

...
failed: [dc1-master-005.domain.local -> dc1-master-001.domain.local] (item=/etc/ssl/etcd/ssl/node-dc1-worker-001.domain.local-key.pem) => {"ansible_loop_var": "item", "changed": false, "item": "/etc/ssl/etcd/ssl/node-dc1-worker-001.domain.local-key.pem", "msg": "file not found: /etc/ssl/etcd/ssl/node-dc1-worker-001.domain.local-key.pem"}
failed: [dc1-master-004.domain.local -> dc1-master-001.domain.local] (item=/etc/ssl/etcd/ssl/node-dc1-worker-001.domain.local-key.pem) => {"ansible_loop_var": "item", "changed": false, "item": "/etc/ssl/etcd/ssl/node-dc1-worker-001.domain.local-key.pem", "msg": "file not found: /etc/ssl/etcd/ssl/node-dc1-worker-001.domain.local-key.pem"}
failed: [dc1-master-003.domain.local -> dc1-master-001.domain.local] (item=/etc/ssl/etcd/ssl/node-dc1-worker-001.domain.local-key.pem) => {"ansible_loop_var": "item", "changed": false, "item": "/etc/ssl/etcd/ssl/node-dc1-worker-001.domain.local-key.pem", "msg": "file not found: /etc/ssl/etcd/ssl/node-dc1-worker-001.domain.local-key.pem"}
...

Hi @neiromc, I also encountered this problem when adding a worker node. Have you managed to solve it?

Hi @VannTen, maybe you can suggest something?

When debugging, my HOSTS variable contains only master nodes, so client certificates are not generated for the new worker node.

Hello @KleinenberG, I've got exactly the same problem when adding new worker nodes with scale.yml. Certificates for these new hosts are not generated, and because of that the whole process fails due to missing cert files. Did you already open a new bug?