redhat-cop / cert-utils-operator

Set of functionalities around certificates packaged in a Kubernetes operator
Apache License 2.0

ServiceMonitor contains a hard-coded serverName that assumes the operator namespace is cert-utils-operator #138

Open cigna-asoria opened 2 years ago

cigna-asoria commented 2 years ago

Hi - We are on OpenShift 4.8.35 and updated cert-utils to 1.3.10 in all our environments. Since then we have been getting an alert that the cert-utils metrics are down. cert-utils is installed in the openshift-operators namespace, not in cert-utils-operator.

The endpoint is the IP and I can get those metrics per the commands you specify in the wiki, even using the service name. But I'm getting this error: Get "https://x.x.x.x:8443/metrics": x509: certificate is valid for cert-utils-operator-controller-manager-metrics-service.openshift-operators.svc, cert-utils-operator-controller-manager-metrics-service.openshift-operators.svc.cluster.local, not cert-utils-operator-controller-manager-metrics-service.cert-utils-operator.svc

So I'm wondering whether the problem is in the Prometheus config for server_name.

tls_config:
  ca_file: /etc/prometheus/certs/secret_openshift-operators_cert-utils-operator-certs_tls.crt
  server_name: cert-utils-operator-controller-manager-metrics-service.cert-utils-operator.svc
  insecure_skip_verify: false

The server_name in the Prometheus config is not valid per the error message. Could this be the problem when trying to pull metrics?
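To illustrate the mismatch: the SANs quoted in the error message have the form <service>.<namespace>.svc, so the name Prometheus verifies has to be built from the namespace the operator actually runs in. A minimal sketch in shell, where expected_server_name is a hypothetical helper for illustration and not part of the operator:

```shell
#!/bin/sh
# Hypothetical helper: build the in-cluster DNS name <service>.<namespace>.svc,
# the form that appears in the certificate SANs quoted in the error above.
expected_server_name() {
  printf '%s.%s.svc\n' "$1" "$2"
}

service='cert-utils-operator-controller-manager-metrics-service'
namespace='openshift-operators'   # where the operator is actually installed
# server_name as it appears in the generated Prometheus config
configured="${service}.cert-utils-operator.svc"

expected=$(expected_server_name "$service" "$namespace")
if [ "$configured" = "$expected" ]; then
  echo "server_name matches the certificate"
else
  echo "mismatch: Prometheus verifies '$configured' but the certificate covers '$expected'"
fi
```

The comparison is purely a string check; it only mirrors what the TLS handshake rejects, it does not contact the cluster.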

cigna-asoria commented 2 years ago

Only 12 issues listed and yet no updates? Can someone please assist?

davgordo commented 2 years ago

The certificate being issued looks properly configured for an operator installed in the openshift-operators namespace. But the service monitor seems to target a service in a namespace called cert-utils-operator, so the DNS names do not match.

This shouldn't happen, because the template for the certificate resources takes into account the target namespace: https://github.com/redhat-cop/cert-utils-operator/blob/v1.3.10/config/helmchart/templates/certificate.yaml

In this case, it seems that {{ .Release.Namespace }} did indeed get populated, but with the wrong namespace, which makes me think Helm somehow determined the wrong value, and I'm not exactly sure how that happened.

A few assumptions to validate:

  1. I assume Helm is being used to provision the operator
  2. I assume enableCertManager=true and as a result cert-manager is providing the certificate
  3. I assume the Certificate custom resource contains dnsNames that include .openshift-operators.svc

If any of those assumptions are incorrect, please let me know. That will change my point of view.

And one (speculative) thing to try: assuming that Helm is confused about the target namespace, I'm curious what would happen if we were more explicit and used the --namespace flag when deploying. Perhaps that will result in the correct value substitution for {{ .Release.Namespace }}.

Thanks for your patience.

cigna-asoria commented 2 years ago

Hi @davgordo - I did not install the cert-utils operator through Helm. I actually installed it through OperatorHub UI via the OpenShift Console. Can I provide you with any additional information?

davgordo commented 2 years ago

Ah okay, thanks for the clarification, @cigna-asoria. I'm going to see if I can recreate the issue; it sounds like it should be pretty easy to recreate.

The only things that might be helpful for me to reference are:

  1. The yaml for cert-utils-operator-controller-manager-metrics-service
  2. The data from the certificate secret or, failing that, just a list of the DNS names from the issued certificate

I might discover that the problem is not challenging to recreate in which case I'll be able to reference these things in my own environment. But if you have time, it couldn't hurt to have more info.
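One way to list the DNS names without sharing the secret itself is to decode only the certificate and print its subject alternative names. A rough sketch, assuming the secret is named cert-utils-operator-certs, lives in openshift-operators, and stores the certificate under the tls.crt key; adjust these if your cluster differs. dns_sans is a hypothetical helper, not something the operator ships:

```shell
#!/bin/sh
# Hypothetical helper: read `openssl x509 -text` output on stdin and
# print one DNS subject alternative name per line.
dns_sans() {
  grep -o 'DNS:[^,]*' | sed 's/^DNS://'
}

# Assumed names; adjust to your cluster.
namespace='openshift-operators'
secret='cert-utils-operator-certs'

# Only query the cluster when the oc CLI is available.
if command -v oc >/dev/null 2>&1; then
  oc get secret "$secret" -n "$namespace" -o jsonpath='{.data.tls\.crt}' \
    | base64 -d \
    | openssl x509 -noout -text \
    | dns_sans
fi
```

Only the public certificate is decoded here; the private key in tls.key is never read.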

cigna-asoria commented 2 years ago

@davgordo - We do have cert-manager installed and I just checked, there is no certificate for cert-utils like the one provided in https://github.com/redhat-cop/cert-utils-operator/blob/v1.3.10/config/helmchart/templates/certificate.yaml

Let me get the data you requested.

davgordo commented 2 years ago

Yes, so for context: when installing via Helm, we provide cert-manager support because we're making an assumption (sometimes it's a bad assumption) that users using Helm are probably targeting plain k8s.

When the target platform is OpenShift, on the other hand, there are some built-in certificate capabilities that we can leverage instead. Specifically, you'll see this configured in the annotations of the cert-utils-operator-controller-manager-metrics-service Service. Those annotations essentially ask the platform to provide a certificate secret that matches the Service definition.

So with that background, I just used OLM to deploy this operator, and the result looked okay to me so far. If I decode the certificate, I see the following SANs:

Those look good because they reflect the cert-utils-operator namespace. So now I'm more curious about the certificate data and the service annotations that you are seeing in your environment.

cigna-asoria commented 2 years ago

Here is the service YAML. For the DNS names, how do I pull that information? I can't provide the secret since it contains certificates. I removed the UID and IPs below.

kind: Service
apiVersion: v1
metadata:
  annotations:
    service.alpha.openshift.io/serving-cert-secret-name: cert-utils-operator-certs
  resourceVersion: '974328279'
  name: cert-utils-operator-controller-manager-metrics-service
  managedFields:
    - manager: catalog
      operation: Update
      apiVersion: v1
      fieldsType: FieldsV1
      fieldsV1:
        'f:metadata':
          'f:annotations':
            .: {}
            'f:service.alpha.openshift.io/serving-cert-secret-name': {}
          'f:labels':
            .: {}
            'f:control-plane': {}
          'f:ownerReferences':
            .: {}
              .: {}
              'f:apiVersion': {}
              'f:blockOwnerDeletion': {}
              'f:controller': {}
              'f:kind': {}
              'f:name': {}
              'f:uid': {}
        'f:spec':
          'f:ports':
            .: {}
            'k:{"port":8443,"protocol":"TCP"}':
              .: {}
              'f:name': {}
              'f:port': {}
              'f:protocol': {}
              'f:targetPort': {}
          'f:selector':
            .: {}
            'f:control-plane': {}
          'f:sessionAffinity': {}
          'f:type': {}
    - manager: olm
      operation: Update
      apiVersion: v1
      time: '2022-05-11T16:59:42Z'
      fieldsType: FieldsV1
      fieldsV1:
        'f:metadata':
          'f:labels':
            'f:operators.coreos.com/cert-utils-operator.openshift-operators': {}
  namespace: openshift-operators
  ownerReferences:
    - apiVersion: operators.coreos.com/v1alpha1
      kind: ClusterServiceVersion
      name: cert-utils-operator.v1.3.10
      controller: false
      blockOwnerDeletion: false
  labels:
    control-plane: cert-utils-operator
    operators.coreos.com/cert-utils-operator.openshift-operators: ''
spec:
  ports:
    - name: https
      protocol: TCP
      port: 8443
      targetPort: https
  selector:
    control-plane: cert-utils-operator
  clusterIP: x.x.x.x
  clusterIPs:
    - x.x.x.x
  type: ClusterIP
  sessionAffinity: None
  ipFamilies:
    - IPv4
  ipFamilyPolicy: SingleStack
status:
  loadBalancer: {}

cigna-asoria commented 2 years ago

Here is the DNS output.

Downloads % openssl x509 -in cert.crt -text -noout | grep DNS
DNS:cert-utils-operator-controller-manager-metrics-service.openshift-operators.svc, DNS:cert-utils-operator-controller-manager-metrics-service.openshift-operators.svc.cluster.local
Downloads %

cigna-asoria commented 2 years ago

@davgordo - I provided the information above. All seems right so why did Prometheus use the wrong server_name?

davgordo commented 2 years ago

That doesn't look right to me. I thought this operator was installed in the cert-utils-operator namespace, but the DNS names on the certificate suggest that it is installed in the openshift-operators namespace.

The operator is deployed to the cert-utils-operator namespace, right? Or did I misunderstand?

cigna-asoria commented 2 years ago

@davgordo - cert-utils is installed under openshift-operators, not cert-utils-operator. That is why I think we are running into this issue.

davgordo commented 2 years ago

> @davgordo - cert-utils is installed under openshift-operators not cert-utils-operator - that is why I think we are running into this issue.

Ah hah! My apologies for misunderstanding. Prometheus usually finds services by label, and the ServiceMonitor configuration tells it which labels to look for. I would like to see that ServiceMonitor YAML if you can provide it.

My cluster spun down, but as soon as I spin back up, I will try to specify the openshift-operators namespace when I install with OLM and try again to recreate.

Wild guess but, you don't happen to have a namespace called cert-utils-operator on the same cluster, do you? Just eliminating some variables. I'm thinking a left-over Service that wasn't cleaned up from a previous installation could cause problems.

cigna-asoria commented 2 years ago

@davgordo No, we don't have a namespace called cert-utils-operator -- Let me check where I can pull the ServiceMonitor

cigna-asoria commented 2 years ago

@davgordo Found it and I think this might be the problem? I bolded it below.

Downloads># oc get ServiceMonitor cert-utils-operator-controller-manager-metrics-monitor -n openshift-operators -o yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  creationTimestamp: "2022-05-06T20:35:03Z"
  generation: 1
  labels:
    control-plane: cert-utils-operator
  managedFields:

davgordo commented 2 years ago

Now we're cookin'. Server name is wrong there. Thanks for all your help with the extra info. The problem is clear now. We'll have to do some brainstorming for a fix.

cigna-asoria commented 2 years ago

@davgordo - Yeah! Please do keep me informed. I have many clusters with this issue that I definitely want to fix.

davgordo commented 2 years ago

@cigna-asoria actually, I don't know for sure whether OLM creates that service monitor automatically... Did you all configure that, or was that provided by the operator provisioning?

cigna-asoria commented 2 years ago

@davgordo - No, we did not configure that. We only upgraded/installed cert-utils instances through OperatorHub UI via the OpenShift Console. My take is that OpenShift deployed it.

davgordo commented 2 years ago

> @davgordo - No, we did not configure that. We only upgraded/installed cert-utils instances through OperatorHub UI via the OpenShift Console. My take is that OpenShift deployed it.

Ah I see it in my environment too. Thanks again.

davgordo commented 2 years ago

@cigna-asoria FYI, I know it's not an ideal fix, but I am able to modify the serverName manually and this change does not get overwritten by the operator. This might help you temporarily until we make the next release.
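For anyone applying the same temporary workaround, here is a sketch of the manual change as a JSON patch that rewrites serverName to match the namespace the operator runs in. It assumes the ServiceMonitor is named cert-utils-operator-controller-manager-metrics-monitor (as in the oc output earlier in this thread) and that serverName sits in the tlsConfig of the first entry under spec.endpoints. That index is an assumption, so check the YAML before applying. server_name_patch is a hypothetical helper:

```shell
#!/bin/sh
# Hypothetical helper: print a JSON patch that sets tlsConfig.serverName on the
# first endpoint of a ServiceMonitor to <service>.<namespace>.svc.
server_name_patch() {
  printf '[{"op":"replace","path":"/spec/endpoints/0/tlsConfig/serverName","value":"%s.%s.svc"}]\n' "$1" "$2"
}

# Assumed names; adjust to your cluster.
namespace='openshift-operators'
service='cert-utils-operator-controller-manager-metrics-service'
monitor='cert-utils-operator-controller-manager-metrics-monitor'

patch=$(server_name_patch "$service" "$namespace")
echo "$patch"

# Apply only when the oc CLI is available.
if command -v oc >/dev/null 2>&1; then
  oc patch servicemonitor "$monitor" -n "$namespace" --type=json -p "$patch"
fi
```

Printing the patch before applying it makes it easy to review, and the same helper can be reused per cluster by changing only the namespace.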

cigna-asoria commented 2 years ago

@davgordo - Thanks, I will go that route until a fix is in place. Thanks again!

felixkrohn commented 1 year ago

This issue seems to persist, as the fix linked above apparently hasn't been merged. Could it be reopened?