okd-project / okd

The self-managing, auto-upgrading, Kubernetes distribution for everyone
https://okd.io
Apache License 2.0

tls: failed to verify certificate: x509: certificate is valid for <DOMAIN>, not console.redhat.com #1917

Closed CamZie closed 2 months ago

CamZie commented 2 months ago

Describe the bug

After installing OKD 4.15, the insights and authentication operators report the following errors:

NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.15.3    True        False         True       10h     OAuthServerConfigObservationDegraded: failed to apply IDP Login config: tls: failed to verify certificate: x509: certificate is valid for *.apps.oc.domain.tld, *.oc.domain.tld, oc.domain.tld, not login.domain.tld  
insights                                   4.15.3    False       False         True       2d23h   Unable to report: unable to build request to connect to Insights server: Post "https://console.redhat.com/api/ingress/v1/upload": tls: failed to verify certificate: x509: certificate is valid for *.apps.oc.domain.tld, *.oc.domain.tld, oc.domain.tld, not console.redhat.com

Version

ClusterVersion: Stable at "4.15.3"

Log bundle

ClusterVersion: Stable at "4.15.3"
ClusterOperators:
        clusteroperator/authentication is degraded because OAuthServerConfigObservationDegraded: failed to apply IDP Login config: tls: failed to verify certificate: x509: certificate is valid for *.apps.oc.domain.tld, *.oc.domain.tld, oc.domain.tld, not login.domain.tld
        clusteroperator/insights is not available (Unable to report: unable to build request to connect to Insights server: Post "https://console.redhat.com/api/ingress/v1/upload": tls: failed to verify certificate: x509: certificate is valid for *.apps.oc.domain.tld, *.oc.domain.tld, oc.domain.tld, not console.redhat.com) because Unable to report: unable to build request to connect to Insights server: Post "https://console.redhat.com/api/ingress/v1/upload": tls: failed to verify certificate: x509: certificate is valid for *.apps.oc.domain.tld, *.oc.domain.tld, oc.domain.tld, not console.redhat.com

Does anyone have an idea what could be the issue?

codespearhead commented 2 months ago

The log says:

[tls] certificate is valid for *.apps.oc.domain.tld, *.oc.domain.tld, oc.domain.tld, not login.domain.tld

So you have to generate a new SSL certificate that includes login.domain.tld
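
For context, the x509 error comes from comparing the requested hostname with the names listed in the certificate. Here is a rough Python sketch of that comparison, assuming the usual rule that a wildcard covers exactly one leftmost label. The function names are made up for illustration, and real TLS libraries perform more checks than this:

```python
# Names taken from the error message in this issue.
CERT_NAMES = ["*.apps.oc.domain.tld", "*.oc.domain.tld", "oc.domain.tld"]


def hostname_matches(hostname: str, pattern: str) -> bool:
    """Return True if hostname matches one certificate name.

    Assumption: a leading "*" stands for exactly one label.
    """
    host_labels = hostname.lower().rstrip(".").split(".")
    pattern_labels = pattern.lower().rstrip(".").split(".")
    if len(host_labels) != len(pattern_labels):
        return False
    if pattern_labels[0] == "*":
        # Wildcard: only the labels after the first must be identical.
        return host_labels[1:] == pattern_labels[1:]
    return host_labels == pattern_labels


def covered(hostname: str, names=CERT_NAMES) -> bool:
    """Return True if any certificate name covers the hostname."""
    return any(hostname_matches(hostname, name) for name in names)
```

With these names, `covered("api.oc.domain.tld")` is true, while `covered("login.domain.tld")` and `covered("console.redhat.com")` are false. Both failing hosts are being answered with the cluster's own certificate, which hints that the requests may be landing on the wrong server.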

CamZie commented 2 months ago

@codespearhead thanks for the tip. However "login.domain.tld" is the URL of our SSO. This is running on an external server with its own Let's Encrypt certificate.

I think the problem is most likely in how the domain is being resolved, because we are seeing the following errors in the DNS logs:

dns-default-9x8qn       linux/amd64, go1.20.12 X:strictfipsruntime, 
dns-default-9x8qn       [INFO] 10.128.0.100:39274 - 64042 "A IN login.domain.tld.oc.domain.tld. udp 63 false 1232" - - 0 6.002317855s
dns-default-9x8qn       [ERROR] plugin/errors: 2 login.domain.tld.oc.domain.tld. A: read udp 10.128.0.39:46929->9.9.9.9:53: i/o timeout
dns-default-9x8qn       [INFO] 10.128.0.121:32824 - 32981 "A IN infogw.api.openshift.com.oc.domain.tld. udp 74 false 1232" - - 0 6.001218683s
dns-default-9x8qn       [ERROR] plugin/errors: 2 infogw.api.openshift.com.oc.domain.tld. A: read udp 10.128.0.39:39464->9.9.9.9:53: i/o timeout
dns-default-9x8qn       [INFO] 10.128.0.47:53644 - 40718 "A IN api.oc.domain.tld. udp 53 false 1232" - - 0 6.002321304s
dns-default-9x8qn       [ERROR] plugin/errors: 2 api.oc.domain.tld. A: read udp 10.128.0.39:38144->9.9.9.9:53: i/o timeout

Somehow, whenever the cluster looks up an external FQDN, e.g. infogw.api.openshift.com or login.domain.tld, it queries infogw.api.openshift.com.oc.domain.tld or login.domain.tld.oc.domain.tld instead. The base domain of our cluster is always appended to the name.
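
To illustrate how such names can come about, here is a minimal Python sketch of the search-list expansion a stub resolver performs, following the rules described in resolv.conf(5). The function name is hypothetical, and the `ndots` value of 5 is an assumption based on what Kubernetes normally sets for pods, not something read from this cluster:

```python
def candidate_names(name: str, search: list[str], ndots: int = 1) -> list[str]:
    """List the fully qualified names a resolver would try, in order."""
    if name.endswith("."):
        # A trailing dot marks the name as absolute; no search list is used.
        return [name]
    absolute = name + "."
    expanded = [f"{name}.{domain}." for domain in search]
    if name.count(".") >= ndots:
        # Enough dots: try the name as-is first, then the search list.
        return [absolute] + expanded
    # Too few dots: the search list is tried before the name itself.
    return expanded + [absolute]


# Assumed values: search domain from this thread, ndots as typically set in pods.
print(candidate_names("login.domain.tld", ["oc.domain.tld"], ndots=5))
```

Under these assumptions the first name tried is `login.domain.tld.oc.domain.tld.`, which is the query that shows up in the DNS log above.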

codespearhead commented 2 months ago

Show us what base domain is set in your cluster.

  1. Log into the cluster:
     oc login --server=<your-cluster-api-url> -u <username> -p <password>
  2. Output its base domain:
     oc get ingresses.config.openshift.io cluster -o jsonpath='{.spec.domain}'

CamZie commented 2 months ago

This is the output of the command. The real domain has been replaced with domain.tld.

$ oc get ingresses.config.openshift.io cluster -o jsonpath='{.spec.domain}'

apps.oc.domain.tld

CamZie commented 2 months ago

I managed to find the cause. The host's /etc/resolv.conf had the search parameter configured, which is why the base domain of the cluster was being appended to every DNS query made in the cluster.

# Generated by NetworkManager
###search  oc.domain.tld
nameserver .....

I removed this parameter and it works.
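
For anyone who wants to check their own hosts, here is a small Python sketch that reports the active search domains in a resolv.conf-style file. The function name is hypothetical and the nameserver address in the sample is a placeholder, not our real one. It assumes comment lines start with `#` or `;` and that the last `search` line wins, as described in resolv.conf(5):

```python
def search_domains(resolv_conf_text: str) -> list[str]:
    """Return the search domains that are active in the given text."""
    domains: list[str] = []
    for raw in resolv_conf_text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;":
            continue  # skip blank lines and comments
        fields = line.split()
        if fields[0] == "search":
            domains = fields[1:]  # a later search line replaces an earlier one
    return domains


# Placeholder content modelled on the file above (192.0.2.53 is a dummy address).
before = "# Generated by NetworkManager\nsearch oc.domain.tld\nnameserver 192.0.2.53\n"
after = "# Generated by NetworkManager\n###search  oc.domain.tld\nnameserver 192.0.2.53\n"
print(search_domains(before))
print(search_domains(after))
```

On a real host you would pass in the contents of /etc/resolv.conf. An empty result means no search domain is being appended by that file.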

codespearhead commented 2 months ago

Nice!

Can you close this issue then?