okozachenko1203 closed this issue 10 months ago
@mnaser what is your opinion?
@okozachenko1203 I think there is a CSI driver for cert-manager, but I don't believe this would work well because we need the pod name in order to be able to generate the certificate.
Can cert-manager give us a dynamic certificate based on the pod name using the CSI module? If so, we can get rid of the initContainer...
The bad thing is the cert-manager csi-driver doesn't support POD_IP:
Warning FailedMount 35s (x8 over 100s) kubelet MountVolume.SetUp failed for volume "api-tls" : rpc error: code = Unknown desc = generating certificate signing request: "csi.cert-manager.io/common-name": undefined variable "POD_IP", known variables: [POD_NAME POD_NAMESPACE POD_UID SERVICE_ACCOUNT_NAME]
@okozachenko1203 do we need the IP address for the cert? could we start using hostname instead?
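For reference, a minimal sketch of what a hostname-based inline CSI volume could look like, assuming the cert-manager csi-driver is installed; the issuer name (`libvirt-ca`), its kind, and the `openstack` namespace are hypothetical placeholders. Per the error above, only POD_NAME, POD_NAMESPACE, POD_UID and SERVICE_ACCOUNT_NAME are available as substitution variables, so the common name is keyed on the pod name rather than the pod IP:

```bash
# Sketch only: request a per-pod certificate keyed on the pod name via the
# cert-manager CSI driver. The heredoc delimiter is quoted so the shell does
# not expand ${POD_NAME} etc.; the csi-driver substitutes them at mount time.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: csi-tls-demo
  namespace: openstack
spec:
  containers:
    - name: demo
      image: busybox
      command: ["sleep", "3600"]
      volumeMounts:
        - name: api-tls
          mountPath: /etc/pki/demo
          readOnly: true
  volumes:
    - name: api-tls
      csi:
        driver: csi.cert-manager.io
        readOnly: true
        volumeAttributes:
          csi.cert-manager.io/issuer-name: libvirt-ca
          csi.cert-manager.io/issuer-kind: ClusterIssuer
          csi.cert-manager.io/common-name: "${POD_NAME}.${POD_NAMESPACE}"
          csi.cert-manager.io/dns-names: "${POD_NAME},${POD_NAME}.${POD_NAMESPACE}"
EOF
```

If I read the csi-driver docs correctly, it also re-issues the certificate in the mounted volume before expiry, though libvirt/QEMU would still need to re-read the new files.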
Hi -- I would prioritize this. We just got a cloud where no VMs were being created due to this.
The errors:
| fault | {'code': 500, 'created': '2024-01-07T07:37:33Z', 'message': 'Exceeded maximum number of retries. Exceeded max scheduling attempts 3 for instance d873e0c5-6ba3-4f84-972f-b62041a873ba. Last exception: internal error: process exited while connecting to monitor: 2024-01-07T07:37:27.940999Z qemu-system-x86_64: The serve', 'details': 'Traceback (most recent call last):\n File "/var/lib/openstack/lib/python3.10/site-packages/nova/conductor/manager.py", line 688, in build_instances\n scheduler_utils.populate_retry(\n File "/var/lib/openstack/lib/python3.10/site-packages/nova/scheduler/utils.py", line 998, in populate_retry\n raise exception.MaxRetriesExceeded(reason=msg)\nnova.exception.MaxRetriesExceeded: Exceeded maximum number of retries. Exceeded max scheduling attempts 3 for instance d873e0c5-6ba3-4f84-972f-b62041a873ba. Last exception: internal error: process exited while connecting to monitor: 2024-01-07T07:37:27.940999Z qemu-system-x86_64: The server certificate /etc/pki/libvirt-vnc/server-cert.pem has expired\n'} |
Context
We use TLS for VNC session encryption and auth, and for the libvirt API. All of these TLS certificates are generated per libvirt pod (issued inside the initContainer) and managed by cert-manager. Cert-manager does its job well, so all certs are renewed automatically, but the renewed certs are not reflected to libvirt. So after 90 days (the default certificate duration in cert-manager), all libvirt pods are stuck with expired certificates. This is a pretty serious incident because we cannot perform any actions on the hypervisors while it lasts.
Reason
We fetch the cert content inside the initContainer of the libvirt pods using these scripts: https://github.com/vexxhost/atmosphere/blob/main/charts/libvirt/values.yaml#L276-L321 https://github.com/vexxhost/atmosphere/blob/main/charts/libvirt/templates/bin/_libvirt.sh.tpl#L21-L42 As you can see, the cert contents are fetched only once, at generation time, and then copied to the desired paths, so the renewed certs are never reflected (see the sketch below).
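To illustrate the point, here is a rough sketch of the effect only, not the exact script from the chart (the secret and file names below are made up):

```bash
# Runs once in the initContainer: read the issued cert/key out of the
# cert-manager Secret and write them to the paths QEMU/libvirt expect.
kubectl -n openstack get secret libvirt-vnc-example \
  -o jsonpath='{.data.tls\.crt}' | base64 -d > /etc/pki/libvirt-vnc/server-cert.pem
kubectl -n openstack get secret libvirt-vnc-example \
  -o jsonpath='{.data.tls\.key}' | base64 -d > /etc/pki/libvirt-vnc/server-key.pem

# Nothing runs again after init, so when cert-manager later renews the Secret,
# the files above keep their original (eventually expired) contents.
```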
Workaround for now
Just roll out libvirt so that new pods are created and new certs are issued for them, e.g. with the command below.
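For example, assuming libvirt runs as a DaemonSet named `libvirt` in the `openstack` namespace (names may differ per deployment):

```bash
# Recreate the libvirt pods so the init step runs again and fresh
# certificates are issued, then wait for the rollout to finish.
kubectl -n openstack rollout restart daemonset/libvirt
kubectl -n openstack rollout status daemonset/libvirt
```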
We need to find a way to reflect the renewed certs.
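One conceivable direction, purely as a sketch (the paths, the Secret mount, and the assumption that QEMU re-reads the files when a new instance starts are mine, not something the chart does today): a small loop, e.g. in a sidecar, that re-copies the certificate whenever cert-manager renews the mounted Secret.

```bash
# Hypothetical watcher: whenever the mounted Secret content changes
# (i.e. cert-manager has renewed it), copy the new files into place.
SRC=/mnt/libvirt-vnc-secret     # Secret mounted as a volume (assumed path)
DST=/etc/pki/libvirt-vnc        # path QEMU reads the VNC server cert from

while true; do
  if ! cmp -s "${SRC}/tls.crt" "${DST}/server-cert.pem"; then
    cp "${SRC}/tls.crt" "${DST}/server-cert.pem"
    cp "${SRC}/tls.key" "${DST}/server-key.pem"
    cp "${SRC}/ca.crt"  "${DST}/ca-cert.pem"
  fi
  sleep 60
done
```

Whether already-running libvirt/QEMU processes pick the new files up without a reload is a separate question that would still need answering.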