kubernetes-sigs / kubespray

Deploy a Production Ready Kubernetes Cluster
Apache License 2.0

Connection reset on cluster.yml nodes #11412

Open bsiara opened 4 months ago

bsiara commented 4 months ago

What happened?

I am running Kubespray v2.25.0 with Kubernetes 1.29.5 on a cluster of 3 master/etcd nodes and 9 worker nodes on AWS EC2 instances. On this cluster I deploy Vault, which is unsealed using KMS keys. When Vault tries to connect to KMS, I get the following error:

2024-08-01T06:55:20.976Z [WARN]  core.autoseal: failed to encrypt seal health test value, seal backend may be unreachable:
  error=
  | error encrypting data: RequestError: send request failed
  | caused by: Post "https://kms-fips.eu-central-1.amazonaws.com/": read tcp 10.233.127.156:59670->52.94.205.84:443: read: connection reset by peer

2024-08-01T06:56:20.986Z [WARN]  core.autoseal: failed to encrypt seal health test value, seal backend may be unreachable:
  error=
  | error encrypting data: RequestError: send request failed
  | caused by: Post "https://kms-fips.eu-central-1.amazonaws.com/": read tcp 10.233.127.156:48680->52.94.205.84:443: read: connection reset by peer

2024-08-01T06:57:20.985Z [WARN]  core.autoseal: failed to encrypt seal health test value, seal backend may be unreachable:
  error=
  | error encrypting data: RequestError: send request failed
  | caused by: Post "https://kms-fips.eu-central-1.amazonaws.com/": read tcp 10.233.127.156:49594->52.94.204.124:443: read: connection reset by peer

2024-08-01T06:58:20.982Z [WARN]  core.autoseal: failed to encrypt seal health test value, seal backend may be unreachable:
  error=
  | error encrypting data: RequestError: send request failed
  | caused by: Post "https://kms-fips.eu-central-1.amazonaws.com/": read tcp 10.233.127.156:54478->52.94.204.124:443: read: connection reset by peer

2024-08-01T06:59:20.980Z [WARN]  core.autoseal: failed to encrypt seal health test value, seal backend may be unreachable:
  error=
  | error encrypting data: RequestError: send request failed
  | caused by: Post "https://kms-fips.eu-central-1.amazonaws.com/": read tcp 10.233.127.156:51028->52.94.205.84:443: read: connection reset by peer
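
To make the pattern in these warnings easier to see, here is a minimal sketch that tallies the resets per source and destination IP. It only parses the "caused by" lines quoted above; the names `RESET_RE`, `summarize_resets`, and `SAMPLE` are illustrative and are not part of Vault or Kubespray.

```python
import re
from collections import Counter

# Matches the "read tcp SRC->DST: read: connection reset by peer" fragment
# that appears in the Vault warnings quoted in this report.
RESET_RE = re.compile(
    r"read tcp (?P<src_ip>[\d.]+):(?P<src_port>\d+)"
    r"->(?P<dst_ip>[\d.]+):(?P<dst_port>\d+): "
    r"read: connection reset by peer"
)


def summarize_resets(log_text):
    """Count connection resets per (source IP, destination IP) pair."""
    counts = Counter()
    for match in RESET_RE.finditer(log_text):
        counts[(match["src_ip"], match["dst_ip"])] += 1
    return counts


# The five "caused by" lines copied from the log excerpt above.
SAMPLE = """\
  | caused by: Post "https://kms-fips.eu-central-1.amazonaws.com/": read tcp 10.233.127.156:59670->52.94.205.84:443: read: connection reset by peer
  | caused by: Post "https://kms-fips.eu-central-1.amazonaws.com/": read tcp 10.233.127.156:48680->52.94.205.84:443: read: connection reset by peer
  | caused by: Post "https://kms-fips.eu-central-1.amazonaws.com/": read tcp 10.233.127.156:49594->52.94.204.124:443: read: connection reset by peer
  | caused by: Post "https://kms-fips.eu-central-1.amazonaws.com/": read tcp 10.233.127.156:54478->52.94.204.124:443: read: connection reset by peer
  | caused by: Post "https://kms-fips.eu-central-1.amazonaws.com/": read tcp 10.233.127.156:51028->52.94.205.84:443: read: connection reset by peer
"""

for (src, dst), n in summarize_resets(SAMPLE).items():
    print(f"{src} -> {dst}: {n} reset(s)")
```

On the excerpt above, all five resets come from the same pod IP (10.233.127.156) and are spread across two KMS endpoint addresses, so the failure does not appear to be tied to a single remote IP.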

When I add a new node using the scale.yml playbook and deploy Vault on that node, Vault works properly and does not throw the connection reset error.
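
One way to narrow this down might be to run the same connectivity check from a pod on a node created by cluster.yml and from a pod on the node added with scale.yml. The sketch below is an assumption about how such a check could look, not something taken from Vault or Kubespray: `probe_tls` is a hypothetical helper that opens a TCP connection, attempts a TLS handshake, and returns a short outcome label. It does not send a KMS request, so it only shows whether the reset already happens at the connection or handshake stage.

```python
import socket
import ssl


def probe_tls(host, port=443, timeout=5.0):
    """Try a TCP connect plus TLS handshake and return a short outcome label.

    Hypothetical diagnostic helper; it does not call the KMS API itself.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            ctx = ssl.create_default_context()
            with ctx.wrap_socket(raw, server_hostname=host) as tls:
                return f"ok ({tls.version()})"
    except ConnectionResetError:
        # The same condition Vault reports as "connection reset by peer".
        return "reset"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except (ssl.SSLError, OSError) as exc:
        # Anything else (DNS failure, certificate error, ...).
        return f"error: {exc.__class__.__name__}"
```

For example, calling `probe_tls("kms-fips.eu-central-1.amazonaws.com")` on both kinds of node and comparing the labels would show whether the difference is visible without Vault in the picture. If the handshake succeeds on both, the reset likely occurs later, once request data is sent.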

What did you expect to happen?

A proper response from the KMS API.

How can we reproduce it (as minimally and precisely as possible)?

Deploy a cluster on AWS EC2 and deploy Vault with the DynamoDB and KMS backends.

OS

Linux 6.5.0-1020-aws x86_64
PRETTY_NAME="Ubuntu 22.04.4 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.4 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy

Version of Ansible

ansible [core 2.16.9]
  config file = /home/user/repo/kubespray/ansible.cfg
  configured module search path = ['/home/user/repo/kubespray/library']
  ansible python module location = /home/user/repo/kubespray/venv/lib/python3.12/site-packages/ansible
  ansible collection location = /home/user/.ansible/collections:/usr/share/ansible/collections
  executable location = /home/user/repo/kubespray/venv/bin/ansible
  python version = 3.12.4 (main, Jul 15 2024, 12:17:32) [GCC 13.3.0] (/home/user/repo/kubespray/venv/bin/python3)
  jinja version = 3.1.4
  libyaml = True

Version of Python

Python 3.12.4

Version of Kubespray (commit)

7e0a40725

Network plugin used

calico

Full inventory with variables

https://gist.github.com/bsiara/8f82a41aeececc9a77f1586de37bfe3d

Command used to invoke ansible

ansible-playbook -v -i inventory/nonprd-1.29.5/inventory.ini -b -u ubuntu cluster.yml

Output of ansible run

If needed, I can reproduce this on a new cluster and provide the output...

Anything else we need to know

No response

tico88612 commented 1 month ago

We don't have an AWS environment to reproduce this issue. Do you have an architecture diagram? We are also not familiar with how DynamoDB and KMS work.

BTW, this doesn't look like our problem.

/remove-kind bug
/kind support