kubernetes-sigs / kubespray

Deploy a Production Ready Kubernetes Cluster
Apache License 2.0

EBS-CSI controller fails to provision volumes with authorization failure #11495

Open akini-wso2 opened 2 months ago

akini-wso2 commented 2 months ago

What happened?

I have provisioned a Kubernetes cluster on EC2 instances in AWS using Kubespray. After the cluster was successfully provisioned and all nodes were healthy and running, I installed the EBS CSI driver by following the recommended steps and then running the cluster.yml Ansible playbook.

Initially, the ebs-csi-controller pod was in a CrashLoopBackOff state, with the ebs-plugin container inside the pod failing. The error was 'CSI-NODE NAME NOT SET'. I was able to fix this by editing the ebs-csi-controller deployment and adding an environment variable. The storage class was created as expected.

When I ran the sample PVC and pod from the official Kubespray GitHub repo, the PVC stayed in the Pending state.

https://github.com/kubernetes-sigs/kubespray/blob/master/docs/CSI/aws-ebs-csi.md

Error log of ebs-csi-controller pod:

Warning ProvisioningFailed 15m ebs.csi.aws.com_ebs-csi-controller-75d79769b8-bbftz_1cfa04d6-8ed3-42f2-9834-1dfaa7687054 failed to provision volume with StorageClass "ebs-sc-new": rpc error: code = Internal desc = AuthFailure: AWS was not able to validate the provided access credentials status code: 401, request id: 16eac760-e2e5-4182-a2c1-89cff669f3bd
Warning ProvisioningFailed 14m ebs.csi.aws.com_ebs-csi-controller-75d79769b8-bbftz_1cfa04d6-8ed3-42f2-9834-1dfaa7687054 failed to provision volume with StorageClass "ebs-sc-new": rpc error: code = Internal desc = RequestCanceled: request context canceled caused by: context deadline exceeded
Normal Provisioning 95s (x12 over 16m) ebs.csi.aws.com_ebs-csi-controller-75d79769b8-bbftz_1cfa04d6-8ed3-42f2-9834-1dfaa7687054 External provisioner is provisioning volume for claim "default/ebs-pvc"
Warning ProvisioningFailed 85s (x7 over 15m) ebs.csi.aws.com_ebs-csi-controller-75d79769b8-bbftz_1cfa04d6-8ed3-42f2-9834-1dfaa7687054 failed to provision volume with StorageClass "ebs-sc-new": rpc error: code = DeadlineExceeded desc = context deadline exceeded
Normal ExternalProvisioning 57s (x62 over 16m) persistentvolume-controller Waiting for a volume to be created either by the external provisioner 'ebs.csi.aws.com' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered.
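The events above contain two kinds of failure: one AWS authentication rejection (status code 401), followed by timeouts. A minimal Python sketch that pulls the error out of each ProvisioningFailed message makes that split easier to see. The sample strings are shortened copies of the events above; the `classify` helper and its labels are illustrative only and not part of any Kubernetes tooling.

```python
import re

# Shortened copies of the ProvisioningFailed messages from the events above.
EVENTS = [
    'failed to provision volume with StorageClass "ebs-sc-new": rpc error: '
    'code = Internal desc = AuthFailure: AWS was not able to validate the '
    'provided access credentials status code: 401',
    'failed to provision volume with StorageClass "ebs-sc-new": rpc error: '
    'code = Internal desc = RequestCanceled: request context canceled '
    'caused by: context deadline exceeded',
    'failed to provision volume with StorageClass "ebs-sc-new": rpc error: '
    'code = DeadlineExceeded desc = context deadline exceeded',
]

# Matches "code = <gRPC code> desc = <description>" in a provisioner message.
RPC_ERROR = re.compile(r"code = (?P<code>\w+) desc = (?P<desc>.*)")


def classify(message: str) -> str:
    """Return a coarse label for one message (hypothetical helper)."""
    match = RPC_ERROR.search(message)
    if match is None:
        return "unknown"
    desc = match.group("desc")
    if desc.startswith("AuthFailure"):
        return "aws-auth"
    if "deadline exceeded" in desc:
        return "timeout"
    # Fall back to the raw gRPC code for anything not recognised above.
    return match.group("code")


for event in EVENTS:
    print(classify(event))
```

Under this rough grouping, the only message that names a concrete cause is the AuthFailure one, which suggests looking at the credentials the controller is using before anything else.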

What did you expect to happen?

The PVC to be bound and a volume to be created in AWS for the pod.

How can we reproduce it (as minimally and precisely as possible)?

Provision a Kubernetes cluster on AWS with EC2 instances using Kubespray.

To install the ebs-csi-driver:

1. Uncommented the aws_ebs_csi_enabled option in group_vars/all/aws.yml and set it to true.
2. Set persistent_volumes_enabled in group_vars/k8s_cluster/k8s_cluster.yml to true.
3. Attached a role to all the EC2 instances to allow all EBS actions.
4. Created and applied a secret to provide AWS credentials (access token and key).
5. Ran the cluster.yml playbook.
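Of the steps above, the credentials secret is the one most directly related to an AuthFailure response. As a rough sketch, such a Secret could be generated as shown below. The Secret name `aws-secret` and the keys `key_id` and `access_key` are assumptions taken from the upstream aws-ebs-csi-driver install instructions, not from this report, and the manifest deployed by Kubespray may expect something different. The credential values are placeholders.

```python
import base64
import json


def build_aws_secret(key_id: str, access_key: str) -> dict:
    """Build a Secret manifest holding AWS credentials (names are assumptions)."""

    def encode(value: str) -> str:
        # Strip stray whitespace (for example a trailing newline picked up
        # by copy/paste) so the stored value is exactly the credential.
        return base64.b64encode(value.strip().encode("utf-8")).decode("ascii")

    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": "aws-secret", "namespace": "kube-system"},
        "type": "Opaque",
        "data": {
            "key_id": encode(key_id),
            "access_key": encode(access_key),
        },
    }


# Placeholder values only; real credentials should never be hard-coded.
manifest = build_aws_secret("AKIAEXAMPLEKEYID\n", "example-secret-access-key")
print(json.dumps(manifest, indent=2))
```

Because kubectl accepts JSON manifests, the printed output could be piped to `kubectl apply -f -`. Decoding the stored values afterwards and comparing them with the intended credentials is a cheap way to rule out an encoding mistake.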

To fix CSI_NODE_NAME env var not set:

kubectl edit deployment.apps/ebs-csi-controller -n kube-system
env:
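The env entry in the report is cut off, so the exact change that was made is not known. One plausible form, shown here purely as an assumption, sets CSI_NODE_NAME from the pod's node name through the Kubernetes downward API. The sketch below builds a strategic merge patch for the ebs-plugin container; the deployment, namespace, and container names come from the report, while the `fieldRef` source is a guess at what the edit contained.

```python
import json

# Strategic merge patch: containers are matched by "name", and env entries
# are merged by "name", so only CSI_NODE_NAME is added or updated.
patch = {
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "ebs-plugin",
                        "env": [
                            {
                                "name": "CSI_NODE_NAME",
                                # Assumption: take the value from the node the
                                # pod is scheduled on (downward API).
                                "valueFrom": {
                                    "fieldRef": {"fieldPath": "spec.nodeName"}
                                },
                            }
                        ],
                    }
                ]
            }
        }
    }
}

print(json.dumps(patch))
```

The printed JSON could be passed to `kubectl patch deployment ebs-csi-controller -n kube-system -p '<json>'`, which applies a strategic merge patch by default. A patch is repeatable in a way that an interactive `kubectl edit` is not, although re-running the cluster.yml playbook may still overwrite it.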

OS

Linux 6.5.0-1022-aws x86_64
PRETTY_NAME="Ubuntu 22.04.4 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.4 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy

Version of Ansible

ansible [core 2.14.17]
  config file = /etc/ansible/ansible.cfg
  configured module search path = ['/root/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/local/lib/python3.10/dist-packages/ansible
  ansible collection location = /root/.ansible/collections:/usr/share/ansible/collections
  executable location = /usr/local/bin/ansible
  python version = 3.10.12 (main, Jul 29 2024, 16:56:48) [GCC 11.4.0] (/usr/bin/python3)
  jinja version = 3.1.2
  libyaml = True

Version of Python

Python 3.10.12

Version of Kubespray (commit)

kubespray:v2.25.0

Network plugin used

cilium

Full inventory with variables

[all]
master1 ansible_host=10.0.0.101 ip=10.0.0.101
master2 ansible_host=10.0.4.70 ip=10.0.4.70
master3 ansible_host=10.0.15.218 ip=10.0.15.218
worker1 ansible_host=10.0.21.128 ip=10.0.21.128
worker2 ansible_host=10.0.24.96 ip=10.0.24.96
etcd1 ansible_host=10.0.5.14 ip=10.0.5.14

[kube_control_plane]
master2
master1

[etcd]
etcd1

[kube_node]
worker1
worker2

[calico_rr]

[k8s_cluster:children]
kube_control_plane
kube_node
calico_rr
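One detail in this inventory is easy to miss: a host can be declared under [all] without being placed in any group. The sketch below checks for that with a deliberately simple parser written for this example. It is not Ansible's own inventory loader and ignores variables and `:children` groups; the function name is hypothetical.

```python
INVENTORY = """\
[all]
master1 ansible_host=10.0.0.101 ip=10.0.0.101
master2 ansible_host=10.0.4.70 ip=10.0.4.70
master3 ansible_host=10.0.15.218 ip=10.0.15.218
worker1 ansible_host=10.0.21.128 ip=10.0.21.128
worker2 ansible_host=10.0.24.96 ip=10.0.24.96
etcd1 ansible_host=10.0.5.14 ip=10.0.5.14

[kube_control_plane]
master2
master1

[etcd]
etcd1

[kube_node]
worker1
worker2

[calico_rr]

[k8s_cluster:children]
kube_control_plane
kube_node
calico_rr
"""


def parse_groups(text: str) -> dict:
    """Map each [section] to the first token of every line beneath it."""
    groups = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            groups[current] = []
        elif current is not None:
            groups[current].append(line.split()[0])
    return groups


groups = parse_groups(INVENTORY)

# Collect hosts that belong to a real host group (skip [all] and :children).
assigned = set()
for name, members in groups.items():
    if name != "all" and ":" not in name:
        assigned.update(members)

unassigned = [host for host in groups["all"] if host not in assigned]
print(unassigned)  # ['master3']
```

Here master3 is listed under [all] but in no group, which is consistent with it not appearing in the PLAY RECAP. That is a separate observation from the provisioning failure and is not presented as its cause.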

Command used to invoke ansible

sudo docker run --rm -it --mount type=bind,source=/home/ubuntu/kubespray/inventory/mycluster/,dst=/inventory --mount type=bind,source=/home/ubuntu/.ssh/id_rsa,dst=/root/.ssh/id_rsa --mount type=bind,source=/home/ubuntu/.ssh/id_rsa,dst=/home/ubuntu/.ssh/id_rsa quay.io/kubespray/kubespray:v2.25.0 bash
ansible-playbook -i /inventory/inventory.ini cluster.yml --user=ubuntu --become --become-user=root --private-key=/home/ubuntu/.ssh/id_rsa -e kube_network_plugin=cilium --flush-cache

Output of ansible run

PLAY RECAP *****
etcd1 : ok=137 changed=11 unreachable=0 failed=0 skipped=340 rescued=0 ignored=0
master1 : ok=491 changed=14 unreachable=0 failed=0 skipped=950 rescued=0 ignored=1
master2 : ok=540 changed=22 unreachable=0 failed=0 skipped=1040 rescued=0 ignored=1
worker1 : ok=412 changed=15 unreachable=0 failed=0 skipped=638 rescued=0 ignored=1
worker2 : ok=412 changed=15 unreachable=0 failed=0 skipped=633 rescued=0 ignored=1

Thursday 29 August 2024 00:22:10 +0000 (0:00:00.302) 0:08:02.550 ***

container-engine/runc : Download_file | Download item ---------------------------------- 10.46s
container-engine/containerd : Download_file | Download item ---------------------------- 10.18s
container-engine/crictl : Download_file | Download item -------------------------------- 10.09s
container-engine/nerdctl : Download_file | Download item -------------------------------- 9.99s
container-engine/crictl : Extract_file | Unpacking archive ------------------------------ 7.79s
kubernetes/preinstall : Update package management cache (APT) --------------------------- 7.64s
container-engine/nerdctl : Extract_file | Unpacking archive ----------------------------- 6.92s
kubernetes-apps/ansible : Kubernetes Apps | Start Resources ----------------------------- 5.63s
download : Download_file | Download item ------------------------------------------------ 5.38s
kubernetes-apps/ansible : Kubernetes Apps | Lay Down CoreDNS templates ------------------ 4.90s
etcdctl_etcdutl : Download_file | Download item ----------------------------------------- 4.84s
kubernetes-apps/ingress_controller/ingress_nginx : NGINX Ingress Controller | Create manifests --- 4.83s
download : Download | Download files / images ------------------------------------------- 4.57s
kubernetes-apps/ingress_controller/ingress_nginx : NGINX Ingress Controller | Apply manifests --- 4.44s
network_plugin/cilium : Cilium | Create Cilium node manifests --------------------------- 4.28s
container-engine/containerd : Containerd | Unpack containerd archive -------------------- 4.10s
etcdctl_etcdutl : Extract_file | Unpacking archive -------------------------------------- 3.94s
kubernetes-apps/metrics_server : Metrics Server | Create manifests ---------------------- 3.75s
network_plugin/cilium : Cilium | Start Resources ---------------------------------------- 3.69s
container-engine/containerd : Download_file | Create dest directory on node ------------- 3.61s

Anything else we need to know

No response

tico88612 commented 2 months ago

The EBS CSI driver bundled with Kubespray is a bit old. If you need a fix urgently, you can refer to the aws-ebs-csi-driver repo.