Open cigna-asoria opened 2 years ago
Only 12 issues listed and yet no updates? Can someone please assist?
It seems like the certificate being issued looks properly configured if the operator was installed to the `openshift-operators` namespace. But given that the service monitor seems to target a service in a namespace called `cert-utils-operator`, the DNS is not matching.
This shouldn't happen, because the template for the certificate resources takes into account the target namespace: https://github.com/redhat-cop/cert-utils-operator/blob/v1.3.10/config/helmchart/templates/certificate.yaml
In this case, it seems that `{{ .Release.Namespace }}` did indeed get populated, but with the wrong namespace, which makes me think somehow Helm determined the wrong value, and I'm not exactly sure how that happened.
A few assumptions to validate:

- `enableCertManager=true`, and as a result cert-manager is providing the certificate
- the `Certificate` custom resource contains `dnsNames` that include `.openshift-operators.svc`
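For reference, the chart template linked above renders roughly a resource like the following. This is a hand-written sketch, not the rendered output — the metadata names and the `issuerRef` are illustrative assumptions, and the namespace is whatever `{{ .Release.Namespace }}` resolved to:

```yaml
# Sketch of the cert-manager Certificate the Helm chart template renders.
# Names and issuerRef are illustrative; the real resource comes from
# config/helmchart/templates/certificate.yaml.
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: cert-utils-operator-certs
  namespace: openshift-operators   # value of {{ .Release.Namespace }}
spec:
  secretName: cert-utils-operator-certs
  dnsNames:
    - cert-utils-operator-controller-manager-metrics-service.openshift-operators.svc
    - cert-utils-operator-controller-manager-metrics-service.openshift-operators.svc.cluster.local
  issuerRef:
    kind: Issuer
    name: selfsigned-issuer   # illustrative
```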
And one (speculative) thing to try: assuming that Helm is confused about the target namespace, I'm curious what would happen if we were more explicit and used the `--namespace` flag when deploying. Perhaps that will result in the correct value substitution for `{{ .Release.Namespace }}`.
Thanks for your patience.
Hi @davgordo - I did not install the cert-utils operator through Helm. I actually installed it through OperatorHub UI via the OpenShift Console. Can I provide you with any additional information?
Ah okay, thanks for the clarification then @cigna-asoria. I'm going to see if I can recreate the issue; sounds like it should be pretty easy to recreate.
The only things that might be helpful for me to reference are:

- `cert-utils-operator-controller-manager-metrics-service`
I might discover that the problem is not challenging to recreate in which case I'll be able to reference these things in my own environment. But if you have time, it couldn't hurt to have more info.
@davgordo - We do have cert-manager installed and I just checked, there is no certificate for cert-utils like the one provided in https://github.com/redhat-cop/cert-utils-operator/blob/v1.3.10/config/helmchart/templates/certificate.yaml
Let me get the data you requested.
Yes, so for context: when installing via Helm, we provide cert-manager support because we're making an assumption (sometimes it's a bad assumption) that users installing via Helm are probably targeting plain k8s.
When the target platform is OpenShift, on the other hand, there are some built-in certificate capabilities that we can leverage instead. Specifically, you'll see this config in the annotations of the `cert-utils-operator-controller-manager-metrics-service`. Those annotations essentially ask the platform to provide a certificate secret that matches up with the `Service` definition.
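In other words, the serving-cert machinery keys off an annotation like this (a minimal sketch to show the relevant piece; the full `Service` from the cluster appears further down in this thread):

```yaml
# Minimal sketch: the annotation asks OpenShift's service-ca controller
# to generate a TLS secret whose SANs match this Service's cluster DNS
# name (<service>.<namespace>.svc).
apiVersion: v1
kind: Service
metadata:
  name: cert-utils-operator-controller-manager-metrics-service
  namespace: openshift-operators
  annotations:
    service.alpha.openshift.io/serving-cert-secret-name: cert-utils-operator-certs
spec:
  ports:
    - name: https
      port: 8443
      targetPort: https
  selector:
    control-plane: cert-utils-operator
```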
So with that background, I just used OLM to deploy this operator, and the result looked okay to me so far. If I decode the certificate, I see the following SANs:

```
cert-utils-operator-controller-manager-metrics-service.cert-utils-operator.svc
cert-utils-operator-controller-manager-metrics-service.cert-utils-operator.svc.cluster.local
```

Those look good because they reflect the `cert-utils-operator` namespace. So now I'm more curious about the certificate data and the service annotations that you are seeing in your environment.
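For anyone following along, here is a minimal sketch of that SAN check. It generates a throwaway self-signed certificate carrying the SANs above (the real serving cert is issued by the platform, of course, not by hand) and then decodes it the same way. It assumes OpenSSL 1.1.1+ for the `-addext` flag:

```shell
# Generate a throwaway self-signed cert carrying the SANs we expect
# from the serving-cert controller (illustration only -- the real
# secret is created by OpenShift, not by hand).
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -keyout /tmp/sketch.key -out /tmp/sketch.crt \
  -subj "/CN=cert-utils-operator-metrics" \
  -addext "subjectAltName=DNS:cert-utils-operator-controller-manager-metrics-service.cert-utils-operator.svc,DNS:cert-utils-operator-controller-manager-metrics-service.cert-utils-operator.svc.cluster.local"

# Decode the certificate and list its SANs, same as with the real cert
openssl x509 -in /tmp/sketch.crt -text -noout | grep DNS
```

The same `openssl x509 ... | grep DNS` invocation works against the cert extracted from the serving-cert secret.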
Here is the Service YAML. For the DNS, how do I pull that information? I can't provide the secret since it contains certificates. I removed the UID and IPs below.
```yaml
kind: Service
apiVersion: v1
metadata:
  annotations:
    service.alpha.openshift.io/serving-cert-secret-name: cert-utils-operator-certs
  resourceVersion: '974328279'
  name: cert-utils-operator-controller-manager-metrics-service
  managedFields:
    - manager: catalog
      operation: Update
      apiVersion: v1
      fieldsType: FieldsV1
      fieldsV1:
        'f:metadata':
          'f:annotations':
            .: {}
            'f:service.alpha.openshift.io/serving-cert-secret-name': {}
          'f:labels':
            .: {}
            'f:control-plane': {}
          'f:ownerReferences':
            .: {}
            .: {}
            'f:apiVersion': {}
            'f:blockOwnerDeletion': {}
            'f:controller': {}
            'f:kind': {}
            'f:name': {}
            'f:uid': {}
        'f:spec':
          'f:ports':
            .: {}
            'k:{"port":8443,"protocol":"TCP"}':
              .: {}
              'f:name': {}
              'f:port': {}
              'f:protocol': {}
              'f:targetPort': {}
          'f:selector':
            .: {}
            'f:control-plane': {}
          'f:sessionAffinity': {}
          'f:type': {}
    - manager: olm
      operation: Update
      apiVersion: v1
      time: '2022-05-11T16:59:42Z'
      fieldsType: FieldsV1
      fieldsV1:
        'f:metadata':
          'f:labels':
            'f:operators.coreos.com/cert-utils-operator.openshift-operators': {}
  namespace: openshift-operators
  ownerReferences:
    - apiVersion: operators.coreos.com/v1alpha1
      kind: ClusterServiceVersion
      name: cert-utils-operator.v1.3.10
      controller: false
      blockOwnerDeletion: false
  labels:
    control-plane: cert-utils-operator
    operators.coreos.com/cert-utils-operator.openshift-operators: ''
spec:
  ports:
    - name: https
      protocol: TCP
      port: 8443
      targetPort: https
  selector:
    control-plane: cert-utils-operator
  clusterIP: x.x.x.x
  clusterIPs:
    - x.x.x.x
  type: ClusterIP
  sessionAffinity: None
  ipFamilies:
    - IPv4
  ipFamilyPolicy: SingleStack
status:
  loadBalancer: {}
```
Here is the DNS output.
```
Downloads % openssl x509 -in cert.crt -text -noout | grep DNS
DNS:cert-utils-operator-controller-manager-metrics-service.openshift-operators.svc, DNS:cert-utils-operator-controller-manager-metrics-service.openshift-operators.svc.cluster.local
Downloads %
```
@davgordo - I provided the information above. All seems right, so why did Prometheus use the wrong `server_name`?
So it doesn't look right to me, because I thought this operator was installed in the `cert-utils-operator` namespace, and the DNS names on the cert would lead me to believe that it is installed in the `openshift-operators` namespace.

The operator is deployed to the `cert-utils-operator` namespace, right? Or did I misunderstand?
@davgordo - cert-utils is installed under `openshift-operators`, not `cert-utils-operator` - that is why I think we are running into this issue.
Ah hah! My apologies for misunderstanding. So Prometheus is going to search for services, usually by label. We can tell it what labels to search for with `ServiceMonitor` configuration. I would like to see that `ServiceMonitor` YAML if you can provide it.
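For context, a `ServiceMonitor` for this metrics service typically looks something like the sketch below. The `tlsConfig.serverName` field is the piece I want to compare against the certificate SANs. Field values here are assumptions for illustration, not the resource from any actual cluster:

```yaml
# Sketch of a ServiceMonitor for the metrics Service; values are
# illustrative assumptions, not the resource from the cluster.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: cert-utils-operator-controller-manager-metrics-monitor
  namespace: openshift-operators
spec:
  selector:
    matchLabels:
      control-plane: cert-utils-operator   # Prometheus finds the Service by this label
  endpoints:
    - port: https
      scheme: https
      tlsConfig:
        # This must match one of the SANs on the serving certificate
        serverName: cert-utils-operator-controller-manager-metrics-service.openshift-operators.svc
```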
My cluster spun down, but as soon as I spin it back up, I will try to specify the `openshift-operators` namespace when I install with OLM and try again to recreate.
Wild guess, but you don't happen to have a namespace called `cert-utils-operator` on the same cluster, do you? Just eliminating some variables. I'm thinking a left-over `Service` that wasn't cleaned up from a previous installation could cause problems.
@davgordo No, we don't have a namespace called `cert-utils-operator` -- let me check where I can pull the `ServiceMonitor`.
@davgordo Found it and I think this might be the problem? I bolded it below.
```
Downloads># oc get ServiceMonitor cert-utils-operator-controller-manager-metrics-monitor -n openshift-operators -o yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  creationTimestamp: "2022-05-06T20:35:03Z"
  generation: 1
  labels:
    control-plane: cert-utils-operator
  managedFields:
```
Now we're cookin'. The server name is wrong there. Thanks for all your help with the extra info. The problem is clear now. We'll have to do some brainstorming for a fix.
@davgordo - Yeah! Please do keep me informed. I have many clusters with this issue that I definitely want to fix.
@cigna-asoria actually, I don't know for sure whether OLM creates that service monitor automatically... Did you all configure that, or was that provided by the operator provisioning?
@davgordo - No, we did not configure that. We only upgraded/installed cert-utils instances through OperatorHub UI via the OpenShift Console. My take is that OpenShift deployed it.
Ah I see it in my environment too. Thanks again.
@cigna-asoria FYI, I know it's not an ideal fix, but I am able to modify the `serverName` manually, and this change does not get overwritten by the operator. This might help you temporarily until we make the next release.
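For anyone else hitting this, the manual workaround amounts to editing that one field in the `ServiceMonitor` (e.g. via `oc edit servicemonitor cert-utils-operator-controller-manager-metrics-monitor -n openshift-operators`). A sketch of the relevant fragment after the edit, assuming `serverName` sits under the first endpoint's `tlsConfig`:

```yaml
# Workaround sketch: point serverName at the namespace the operator
# actually runs in, so it matches a SAN on the serving certificate.
spec:
  endpoints:
    - port: https
      scheme: https
      tlsConfig:
        serverName: cert-utils-operator-controller-manager-metrics-service.openshift-operators.svc
```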
@davgordo - Thanks, I will go that route until a fix is in place. Thanks again!
This issue seems to persist as the fix linked above apparently hasn't been merged, could it be re-opened?
Hi - We are on OpenShift 4.8.35 and updated our cert-utils to 1.3.10 in all our environments. But we are getting an alert that the cert-utils metrics endpoint is down. cert-utils is installed in the namespace `openshift-operators`, not `cert-utils-operator`.
The endpoint is the IP, and I can get those metrics per the commands you specify in the wiki, even using the service name. But I'm getting this error:
```
Get "https://x.x.x.x:8443/metrics": x509: certificate is valid for cert-utils-operator-controller-manager-metrics-service.openshift-operators.svc, cert-utils-operator-controller-manager-metrics-service.openshift-operators.svc.cluster.local, not cert-utils-operator-controller-manager-metrics-service.cert-utils-operator.svc
```
So, I'm wondering if the problem is in the Prometheus config for `server_name`.
```yaml
tls_config:
  ca_file: /etc/prometheus/certs/secret_openshift-operators_cert-utils-operator-certs_tls.crt
  server_name: cert-utils-operator-controller-manager-metrics-service.cert-utils-operator.svc
  insecure_skip_verify: false
```
The `server_name` in the Prometheus config is not valid per the error message. Can this be the problem when trying to pull metrics?
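The mismatch can be demonstrated off-cluster. This is just a sketch with a throwaway self-signed cert (assuming OpenSSL 1.1.1+ for `-addext` and `openssl verify -verify_hostname`): verification succeeds against a name that appears in the SAN list and fails against the `server_name` Prometheus is configured with, analogous to the x509 error above.

```shell
# Throwaway self-signed cert whose SAN names the openshift-operators
# namespace, mirroring the real serving cert in this thread.
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -keyout /tmp/mismatch.key -out /tmp/mismatch.crt \
  -subj "/CN=metrics" \
  -addext "subjectAltName=DNS:cert-utils-operator-controller-manager-metrics-service.openshift-operators.svc"

# Succeeds: the hostname matches the SAN
openssl verify -CAfile /tmp/mismatch.crt \
  -verify_hostname cert-utils-operator-controller-manager-metrics-service.openshift-operators.svc \
  /tmp/mismatch.crt

# Fails with a hostname mismatch: this server_name names the wrong
# namespace, like the Prometheus tls_config shown above
openssl verify -CAfile /tmp/mismatch.crt \
  -verify_hostname cert-utils-operator-controller-manager-metrics-service.cert-utils-operator.svc \
  /tmp/mismatch.crt || true
```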