okd-project / okd

The self-managing, auto-upgrading, Kubernetes distribution for everyone
https://okd.io
Apache License 2.0

Can't install on vSphere using IPI if vSphere is in thumbprint certificate mode #1895

Closed psy-q closed 4 months ago

psy-q commented 4 months ago

Describe the bug

The installer terminates after an initial Terraform stage with the error:

ERROR Error: failed to upload: Post "https://one.of.our.esxi.hosts/nfc/5207c2bc-9893-3709-9706-dee0ed675d9d/disk-0.vmdk": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "VMware Installer")

This happens when the VMware Installer CA isn't in the trust store of the machine running the installer. Extracting the castore.pem CA certificate from each individual ESXi host and adding it to the installing machine's trust store allows the installation to continue.
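To see which hosts are affected before retrying the installer, a check along these lines might help. It is a minimal sketch using only the Python standard library; the function names and the `extra_ca_file` parameter are made up for illustration and are not part of openshift-install.

```python
import re
import socket
import ssl
from urllib.parse import urlparse


def host_from_installer_error(line):
    """Return the hostname of the first quoted https URL in an error line, or None."""
    match = re.search(r'"(https://[^"]+)"', line)
    return urlparse(match.group(1)).hostname if match else None


def tls_verifies(host, port=443, extra_ca_file=None, timeout=5.0):
    """Try a TLS handshake against host and report (ok, reason).

    Verification uses the system trust store, plus extra_ca_file if given
    (a hypothetical PEM file holding additional CA certificates).
    """
    context = ssl.create_default_context()
    if extra_ca_file:
        context.load_verify_locations(cafile=extra_ca_file)
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host):
                return True, ""
    except ssl.SSLCertVerificationError as exc:
        # Certificate chain or hostname did not verify
        return False, exc.verify_message
    except OSError as exc:
        # Connection refused, timeout, DNS failure, other TLS errors
        return False, str(exc)
```

Feeding the installer's error line to `host_from_installer_error` yields the ESXi host to test, and `tls_verifies(host)` then shows whether that host's certificate chains to anything the installer machine currently trusts.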

Our setup seems typical: the ESXi hosts show up fine in vSphere, and the installer host trusts vSphere's CA. Why is it necessary to also trust each individual ESXi host's VMware Installer CA? In other words, why does openshift-install need to upload disk images there via HTTP POST?

It seems there's no option in the installer to trust unknown authorities/disable certificate validation for the vSphere/VMware components during installation.

Version: 4.14.0-0.okd-2024-01-26-175629, IPI on vSphere

How reproducible: 100%

Log bundle: The gather-logs stage is never reached, so there is no log bundle.

vrutkovs commented 4 months ago

> Extracting the castore.pem CA certificate from each individual ESXi host and adding it to the installing machine's trust store allows the installation to continue.

Yes, this step is documented in https://docs.okd.io/4.14/installing/installing_vsphere/installing-vsphere-installer-provisioned-customizations.html#installation-adding-vcenter-root-certificates_installing-vsphere-installer-provisioned-customizations

psy-q commented 4 months ago

I think there's a misunderstanding. As I mentioned, we did extract the vSphere CA certificate, and the failure is not because of that. It's because the individual ESXi hosts are not trusted. Their certificates are issued by a different CA, one that is not part of the bundle vSphere offers for download.

The issuer there is "VMware Installer" and the only way to get to those CA certs seems to be to SSH into each ESXi host individually and get it from /etc/vmware/ssl.
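Once the castore.pem files have been copied off the hosts, they could be combined into one bundle for the installer machine's trust store. The following is only a sketch of that step: `merge_pem_bundles` and the file names in the comments are hypothetical, and copying the files from /etc/vmware/ssl on each host is assumed to have been done already.

```python
import re

# Matches one PEM-encoded certificate, including its BEGIN/END markers
PEM_BLOCK = re.compile(
    r"-----BEGIN CERTIFICATE-----.*?-----END CERTIFICATE-----", re.DOTALL
)


def merge_pem_bundles(pem_texts):
    """Concatenate the certificates found in several PEM texts, dropping exact duplicates."""
    seen = set()
    merged = []
    for text in pem_texts:
        for block in PEM_BLOCK.findall(text):
            # Strip all whitespace so the same certificate compares equal
            # regardless of line endings or trailing spaces
            key = "".join(block.split())
            if key not in seen:
                seen.add(key)
                merged.append(block.strip() + "\n")
    return "".join(merged)


# Example use (file names are placeholders):
#   texts = [open(p).read() for p in ("esxi1-castore.pem", "esxi2-castore.pem")]
#   open("esxi-hosts-ca.pem", "w").write(merge_pem_bundles(texts))
```

On a Fedora or RHEL-like installer machine, the resulting file would presumably be installed the same way the OKD documentation describes for the vCenter bundle, by placing it under /etc/pki/ca-trust/source/anchors and running `update-ca-trust extract`.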

vrutkovs commented 4 months ago

Oh, hmm, the ESXi host certs are not signed by a CA in the vSphere CA bundle?

jcpowermac commented 4 months ago

@psy-q your machine certificates are out of sync. The CA was updated, but the ESXi host certificates don't use the updated CA. You need to renew each ESXi host cert.

https://docs.vmware.com/en/VMware-vSphere/7.0/com.vmware.vsphere.security.doc/GUID-ECFD1A29-0534-4118-B762-967A113D5CAA.html

psy-q commented 4 months ago

Thanks, we will give that a go, remove the individual ESXi CAs from the installer machine's trust store and retry. If it keeps working then that was the problem, I'll report as soon as we know.

jcpowermac commented 4 months ago

> Thanks, we will give that a go, remove the individual ESXi CAs from the installer machine's trust store and retry. If it keeps working then that was the problem, I'll report as soon as we know.

You should only be trusting the file downloaded from vCenter's welcome page: "Download trusted root CA certificates". This is the first time I have ever heard of someone taking certificates from ESXi hosts and importing them.
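For reference, that download is a zip archive, usually served at https://<vcenter>/certs/download.zip. A rough sketch of pulling the Linux CA certificates out of it follows. The `certs/lin/` layout and the `.0` (certificate) versus `.r0` (revocation list) suffixes are assumptions based on how the bundle typically looks, and `linux_ca_certs` is a made-up helper name.

```python
import io
import zipfile


def linux_ca_certs(zip_bytes):
    """Return {file name: PEM text} for CA certificates under certs/lin/ in the bundle."""
    certs = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as bundle:
        for name in bundle.namelist():
            # Assumed layout: *.0 files are certificates, *.r0 files are CRLs
            if name.startswith("certs/lin/") and name.endswith(".0"):
                short_name = name.rsplit("/", 1)[-1]
                certs[short_name] = bundle.read(name).decode("ascii")
    return certs
```

Given the bytes of the downloaded archive, the function returns only the certificate files, which are the ones meant to go into the installer machine's trust store.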

psy-q commented 4 months ago

I was surprised as well. I've done a few installs on vSphere (OCP, though) and never had to do this before.

I was told that this particular vSphere infrastructure was set up using thumbprint mode (vpxd.certmgmt.mode=thumbprint) and it can't be changed to proper certificate mode. So I won't be able to test whether switching to VMCA helps. But I'm pretty confident that if VMCA is enabled from the start, you'd never encounter the problem.

Maybe people running in thumbprint mode who stumble upon this issue will find this information useful.

I'm not sure how the cloud controller works for vSphere but I presume we will have to get it to trust each ESXi host's CA as well, unless it exclusively communicates with vCenter (we have that bit covered already).

jcpowermac commented 4 months ago

> vpxd.certmgmt.mode=thumbprint

But wouldn't you have the public key of the root CA to add to the trust store?

As this is OKD there is another solution, cough cough: it's only a single-line change to ignore the certificates: https://github.com/openshift/installer/blob/d8a8d2b5969701413d532d12371439a0d63033ce/data/data/vsphere/pre-bootstrap/main.tf#L17

Then just rebuild the installer for the release you are installing. I suppose you would also need to set OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE= to the release image.
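Running a rebuilt installer with that override could look roughly like this. It is a sketch only: the binary path, install directory, and release image pull spec are placeholders to be filled in, and the helper names are invented here.

```python
import os
import subprocess


def installer_env(release_image, base_env=None):
    """Return a copy of the environment with the release image override set."""
    env = dict(os.environ if base_env is None else base_env)
    env["OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE"] = release_image
    return env


def run_installer(installer_path, install_dir, release_image):
    """Run `create cluster` with a (hypothetical) locally rebuilt installer binary."""
    return subprocess.run(
        [installer_path, "create", "cluster", "--dir", install_dir],
        env=installer_env(release_image),
        check=True,  # raise CalledProcessError if the installer exits non-zero
    )
```

The release image passed in should be the pull spec of the release being installed, so that the locally built binary deploys the intended OKD version.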

Otherwise for OCP this would need to be an RFE.

Documentation regarding using your own CA and ESXi host certificates (as previously mentioned, vpxd.certmgmt.mode=thumbprint): https://docs.vmware.com/en/VMware-vSphere/7.0/com.vmware.vsphere.security.doc/GUID-122A4236-9696-4E1F-B9E8-738855946A93.html

> I'm not sure how the cloud controller works for vSphere but I presume we will have to get it to trust each ESXi host's CA as well, unless it exclusively communicates with vCenter (we have that bit covered already).

The other components do not check the certificate.

psy-q commented 4 months ago

> > vpxd.certmgmt.mode=thumbprint
>
> but wouldn't you have the public key of the root ca to add to the trust?

So far what I see is that there are multiple CAs in use:

I'm clueless about VMware but I'd presume that with VMCA instead of thumbprint, the vCenter CA would reissue certs for each of the ESXi hosts and thus trusting that CA from the OKD installer host would be enough to also trust all the ESXis.

> As this is OKD there is another solution cough cough it's only a single line change to ignore the certificates: https://github.com/openshift/installer/blob/d8a8d2b5969701413d532d12371439a0d63033ce/data/data/vsphere/pre-bootstrap/main.tf#L17

Oh, okay, that's adventurous :grin: I hope we can get vCenter set up properly instead of having to go this route, but it's good to know we could try it.

> documentation regarding using your own CA and ESXi host certificates (as previously mentioned vpxd.certmgmt.mode=thumbprint) https://docs.vmware.com/en/VMware-vSphere/7.0/com.vmware.vsphere.security.doc/GUID-122A4236-9696-4E1F-B9E8-738855946A93.html

Thank you. I had forwarded that earlier but was told that it would probably be too much effort to switch. We don't even need to use our own CA; VMCA with its own CA would be fine. From what I'm reading, thumbprint mode seems to be meant only for experiments and debugging, not for production use, so if we could get away from it that would be great, and it would simplify OKD installation a lot.