stormshift / support

This repo should serve as a central source for reporting issues with stormshift
GNU General Public License v3.0

Failed to run VMs for OCP4 cluster #186

Closed Javatar81 closed 2 months ago

Javatar81 commented 2 months ago

Failed to start VMs from RHEV. For example, ocp4bastion does not start, and neither do the other nodes.

github-actions[bot] commented 2 months ago

Heads up @cluster/ocp4-admin - the "cluster/ocp4" label was applied to this issue.

DanielFroehlich commented 2 months ago

Very strange. There is only a simple error message with no further details: [image]

knumskull commented 2 months ago

Issue found: the networks ocp4-network and ocp6-odf are not configured on any host, which is why the VMs can't start. Removing the network from the VM configuration lets the VMs start. I don't know where those networks have to be configured from the host's point of view, so I can't fix it.

DanielFroehlich commented 2 months ago

@knumskull thx for the quick look! ocp4-network is a "virtual network" using the ovirt-provider-ovn, not a physical one. It should be available on all nodes. It's used to create a totally private network for all the ocp4... cluster VMs.

It's showing as operational in the cluster: [image]

Any ideas on how to revive that?

knumskull commented 2 months ago

Good point. OVN is probably causing some trouble. I will investigate in that direction.

knumskull commented 2 months ago

It might be related to a certificate issue

[root@rhev ~]# openssl verify -CAfile /etc/pki/ovirt-engine/apache-ca.pem /etc/pki/ovirt-engine/certs/apache.cer
O = Red Hat, OU = prod, CN = 2023 Certificate Authority RHCSv2
error 2 at 1 depth lookup: unable to get issuer certificate
error /etc/pki/ovirt-engine/certs/apache.cer: verification failed

Are all CA and intermediate CA included in /etc/pki/ovirt-engine/apache-ca.pem ?
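One way to answer that question is to list the subject and issuer of every certificate in the bundle. The sketch below assumes a PEM bundle and uses only standard openssl subcommands; the function name list_bundle_certs is made up for illustration.

```shell
#!/usr/bin/env bash
# Print subject and issuer of every certificate contained in a PEM bundle.
# "openssl x509" only reads the first certificate in a file, so the bundle is
# first wrapped into a PKCS#7 structure, which can print all of them.
list_bundle_certs() {
    local bundle="$1"
    openssl crl2pkcs7 -nocrl -certfile "$bundle" \
        | openssl pkcs7 -print_certs -noout
}

# Example (path taken from the thread):
# list_bundle_certs /etc/pki/ovirt-engine/apache-ca.pem
```

A complete chain should show one entry per CA, with each issuer matching the subject of the next entry up to a self-signed root. A bundle that contains only the server certificate shows a single entry whose subject is the host name.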

DanielFroehlich commented 2 months ago

Ah! Good catch! My bad - we got new certs earlier this year, which were created by a new Red Hat internal CA certificate. Looks like I added the wrong cert to the apache-ca file: it contains not the cert chain but the server cert itself:

openssl x509 -in apache-ca.pem -noout -dates -subject
notBefore=Jan  2 16:42:32 2024 GMT
notAfter=Dec 27 16:42:32 2024 GMT
subject=O = Red Hat, OU = SolutionArchitectsDach, CN = *.stormshift.coe.muc.redhat.com

I dropped the correct root CA chain into /root/2023CertificateAuthorityRHCSv2_Chain.pem on the rhev host. With that, the verify checks out:

[root@rhev ovirt-engine]#  openssl verify -CAfile /root/2023CertificateAuthorityRHCSv2_Chain.pem /etc/pki/ovirt-engine/certs/apache.cer
/etc/pki/ovirt-engine/certs/apache.cer: OK

Can I simply replace the apache-ca.pem or does this require a special procedure?

knumskull commented 2 months ago

It might be sufficient to replace apache-ca.pem and then restart the engine and ovirt-provider-ovn.

But I can check in a couple of minutes again and follow up.
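A minimal sketch of such a replacement, assuming it is enough to overwrite the file in place: the candidate chain is verified against the server certificate first, and the old file is backed up. The function name replace_ca_bundle and the backup naming are hypothetical, and this is not a documented oVirt procedure.

```shell
#!/usr/bin/env bash
# Verify a candidate CA chain against the server certificate, then install it
# over the existing CA file, keeping a timestamped backup of the old file.
replace_ca_bundle() {
    local new_chain="$1" ca_file="$2" server_cert="$3"

    # Refuse to install a chain that does not validate the server certificate.
    if ! openssl verify -CAfile "$new_chain" "$server_cert" >/dev/null 2>&1; then
        echo "chain $new_chain does not verify $server_cert" >&2
        return 1
    fi

    # Back up the old file, then overwrite its content. Overwriting an
    # existing file with cp keeps the target's owner and permissions.
    cp -p "$ca_file" "$ca_file.bak.$(date +%Y%m%d%H%M%S)" || return 1
    cp "$new_chain" "$ca_file"
}

# Example with the paths from the thread. The systemd unit names are assumed
# to match the component names mentioned above:
# replace_ca_bundle /root/2023CertificateAuthorityRHCSv2_Chain.pem \
#     /etc/pki/ovirt-engine/apache-ca.pem \
#     /etc/pki/ovirt-engine/certs/apache.cer \
#   && systemctl restart ovirt-engine ovirt-provider-ovn
```

Verifying before copying means a wrong file (such as the server certificate itself) is rejected instead of being installed a second time.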

knumskull commented 2 months ago

I replaced the CA certificate, and the system in question is up and running again. [image]

Due to other changes, I had to re-create the OVN networks, and they now show MTU 1500 in the UI. This is only a display issue in the UI. They're operating at 1442 inside the VMs.
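The effective MTU can be checked from inside a VM rather than trusted from the UI. The sketch below assumes a Linux guest; the helper names are hypothetical. The 1442 figure is consistent with a 1500-byte underlay minus tunnel encapsulation overhead, though the thread does not state which encapsulation is in use.

```shell
#!/usr/bin/env bash
# Largest ICMP echo payload that fits into one unfragmented IPv4 packet for a
# given MTU: MTU minus 20 bytes IPv4 header minus 8 bytes ICMP header.
max_ping_payload() {
    local mtu="$1"
    echo $(( mtu - 28 ))
}

# MTU the kernel reports for an interface (Linux sysfs).
iface_mtu() {
    cat "/sys/class/net/$1/mtu"
}

# Example: probe with the "don't fragment" bit set (iputils ping).
# PEER is a placeholder for another VM on the same OVN network.
# ping -M do -c 3 -s "$(max_ping_payload 1442)" "$PEER"
```

If the probe at the computed payload size succeeds and a probe one byte larger fails, the path is operating at the expected MTU.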

DanielFroehlich commented 2 months ago

Thx! I started the remaining VMs, too. They are all up and running now.

@Javatar81, the cluster is now suffering from expired certs. I approved all pending CSRs, so the cluster should recover now. Please check again in an hour or so. This issue is resolved, I am closing it.
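For reference, approving pending CSRs can be scripted roughly as follows. This is a sketch that assumes a logged-in oc client with cluster-admin rights; the function name approve_pending_csrs is hypothetical, and it approves every pending request without inspecting it, which is only appropriate on a cluster you trust.

```shell
#!/usr/bin/env bash
# Approve every CSR that has no status yet, i.e. is still pending.
approve_pending_csrs() {
    local csr
    oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' \
        | while read -r csr; do
            if [ -n "$csr" ]; then
                # Detach stdin so the client cannot consume the CSR list.
                oc adm certificate approve "$csr" </dev/null
            fi
        done
}

# Example:
# approve_pending_csrs
# oc get csr    # new requests may show up a few minutes later
```

After certificates have expired, new requests can appear once the first batch has been approved, so it may be necessary to run this more than once.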

Javatar81 commented 2 months ago

Thanks all for your great support. I still had to approve some missing CSRs, but now the cluster is healthy. I will post an update soon, once the cluster has fully recovered.