Open Milstein opened 1 week ago
You can try creating an Ingress with our letsencrypt-staging-http01 ClusterIssuer, then switch it to letsencrypt-production-http01.
kind: Ingress
apiVersion: networking.k8s.io/v1
metadata:
  name: auth-blt-chrisproject-org
  namespace: hosting-of-medical-image-analysis-platform-ebf021
  annotations:
    acme.cert-manager.io/http01-ingress-class: openshift-default
    cert-manager.io/cluster-issuer: letsencrypt-staging-http01
spec:
  ingressClassName: openshift-default
  tls:
    - hosts:
        - auth-blt.chrisproject.org
      secretName: auth-blt-chrisproject-org-letsencrypt
  rules:
    - host: auth-blt.chrisproject.org
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: authentik-server
                port:
                  number: 80
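As a sketch of the second step described above, switching from staging to production should only require changing the cluster-issuer annotation; the rest of the Ingress stays the same. The fragment below shows only the metadata section and assumes the staging run already completed without errors.

```yaml
# Fragment of the same Ingress: only the cluster-issuer annotation changes.
metadata:
  name: auth-blt-chrisproject-org
  namespace: hosting-of-medical-image-analysis-platform-ebf021
  annotations:
    acme.cert-manager.io/http01-ingress-class: openshift-default
    # was: letsencrypt-staging-http01
    cert-manager.io/cluster-issuer: letsencrypt-production-http01
```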
@computate: I got the following follow-up:
Thank you so much for the very clear instructions on setting up the YAML configuration.
However, we are still experiencing the same issue even though we followed this YAML configuration.
I suspect that the IngressClass openshift-default is not granted to this new namespace (hosting-of-medical-image-analysis-platform-ebf021).
Best regards, Hsiao
Hi Milson and the NERC administrators,
Hopefully the following observations help with the issue.
I tried again in the old namespace (hosting-of-medical-image-analysis-platform-dcb83b) with ingress: auth-blt4-chrisproject-org
As expected, ingress: cm-acme-http-solver-ls6n4-zp2rc was created automatically, and its TLS setting was shown as "TLS is not enabled".
We then went through the acme-http01 setup successfully, and the cm-acme-http-solver ingress and pod disappeared.
However, when I tried again in the new namespace (hosting-of-medical-image-analysis-platform-ebf021) with ingress: auth-blt5-chrisproject-org,
ingress: cm-acme-http-solver-cnkg6 was created, and its TLS setting was shown as "Termination type: edge".
As a result, we can only access cm-acme-http-solver over HTTPS instead of HTTP, but HTTP is the protocol we need for HTTP-01 verification.
I have found a workaround for getting cert-manager to work. The nature of this workaround leads me to believe that there is a bug in how cert-manager is deployed in NERC-OCP and I encourage the NERC Engineering team to take a look. Please ping Chris Tate when he gets back!
In summary, the temporary ingress created by cert-manager for the ACME HTTP-01 challenge is incorrect. The workaround is to create the correct route manually.
First things first, I created a new issuer for our new namespace:
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: letsencrypt
  namespace: chris-3114b1
spec:
  acme:
    privateKeySecretRef:
      name: letsencrypt-key-3114b1 # must be something unique
    solvers:
      - http01:
          ingress:
            class: openshift-default
    server: 'https://acme-v02.api.letsencrypt.org/directory'
    email: dev@babymri.org
I created an ingress using the issuer:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: blt-chrisproject-org
  namespace: chris-3114b1
  annotations:
    cert-manager.io/issuer: letsencrypt
    acme.cert-manager.io/http01-ingress-class: openshift-default
spec:
  ingressClassName: openshift-default
  tls:
    - hosts:
        - blt.chrisproject.org
      secretName: blt-chrisproject-org-letsencrypt
  rules:
    - host: blt.chrisproject.org
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: test-chrisui
                port:
                  number: 8080
Once the ingress is created, the cert-manager.io operator creates the certificaterequests.cert-manager.io and challenges.acme.cert-manager.io CRD instances behind the scenes to obtain the certificates for HTTPS. The challenge tells us what the problem is:
$ oc describe challenge
Name: blt-chrisproject-org-letsencrypt-1-2491536524-792683465
Namespace: chris-3114b1
Labels:
Annotations:
API Version: acme.cert-manager.io/v1
Kind: Challenge
Metadata:
  Creation Timestamp: 2024-11-21T07:18:09Z
  Finalizers:
    finalizer.acme.cert-manager.io
  Generation: 1
  Owner References:
    API Version: acme.cert-manager.io/v1
    Block Owner Deletion: true
    Controller: true
    Kind: Order
    Name: blt-chrisproject-org-letsencrypt-1-2491536524
    UID: 050a53eb-0f53-40d9-9106-d0fe7d17d278
  Resource Version: 2745607849
  UID: c8bcffba-4a96-420c-bfd4-0b53b0117fc1
Spec:
  Authorization URL: https://acme-v02.api.letsencrypt.org/acme/authz-v3/433268275247
  Dns Name: blt.chrisproject.org
  Issuer Ref:
    Group: cert-manager.io
    Kind: Issuer
    Name: letsencrypt
  Key: rFhTjPsvGIbma7KxF85_okJfzHVlNUKENzYACAZ0Edw.zStWQ2TVIMH5e2Mvgsb2gPIX-AY4bcVrlIBTbTNjkwM
  Solver:
    http01:
      Ingress:
        Class: openshift-default
  Token: rFhTjPsvGIbma7KxF85_okJfzHVlNUKENzYACAZ0Edw
  Type: HTTP-01
  URL: https://acme-v02.api.letsencrypt.org/acme/chall-v3/433268275247/izr-QQ
  Wildcard: false
Status:
  Presented: true
  Processing: true
  Reason: Waiting for HTTP-01 challenge propagation: wrong status code '503', expected '200'
  State: pending
Events:
  Type    Reason     Age  From                     Message
  ----    ------     ---- ----                     -------
  Normal  Started    45s  cert-manager-challenges  Challenge scheduled for processing
  Normal  Presented  45s  cert-manager-challenges  Presented challenge using HTTP-01 challenge mechanism
The important line is "wrong status code '503', expected '200'". In the OpenShift console, I see that the ingress created for this challenge is https://blt.chrisproject.org/.well-known/acme-challenge/rFhTjPsvGIbma7KxF85_okJfzHVlNUKENzYACAZ0Edw . At this point I tried the https://letsdebug.net/ tool, which gave me similar error messages. The problem is a mismatch between the HTTP-01 challenge path presented on NERC-OCP and the path being checked by the Let's Encrypt ACME server.
By running oc get svc I found the name of the service for the HTTP-01 challenge: cm-acme-http-solver-slmcb. My workaround is to create a "catch-all" route for this service like this: https://github.com/FNNDSC/NERC/blob/2a01e9fffea3629d58e4b88b13f1af7b722df609/blt/workaround-http01challenge-route.yml
# Workaround for broken cert-manager.io operator on NERC-OCP
# See https://mghpcc.supportsystem.com/tickets.php?id=12850
kind: Route
apiVersion: route.openshift.io/v1
metadata:
  name: letshack
  namespace: chris-3114b1
  labels:
    app.kubernetes.io/name: letshack
spec:
  host: blt.chrisproject.org
  path: /.well-known/acme-challenge # <-- "catch-all" path
  to:
    kind: Service
    # hard-coded name of service for HTTP-01 solver
    name: cm-acme-http-solver-slmcb # <-- !!!CHANGE ME!!!
  tls:
    termination: edge
    insecureEdgeTerminationPolicy: Allow
    destinationCACertificate: ''
  port:
    targetPort: http
The execution of this workaround is time-sensitive because the Let's Encrypt server has punishing rate limits for invalid certificate requests. If done successfully, the challenge is solved via the workaround route and the certificate is granted successfully. At this point, the cert-manager.io operator performs clean-up and switches over to steady-state. svc/cm-acme-http-solver-slmcb was deleted automatically, then I manually deleted route/letshack.
As you can see, the workaround is inconvenient and brittle. It works for now, but I hope the problem can be fixed on the NERC side before certificates expire in February.
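Given the rate limits mentioned above, one option is to rehearse the workaround against the Let's Encrypt staging endpoint first. The following is a minimal sketch of a namespace-local staging Issuer, not something tested on NERC-OCP. The names letsencrypt-staging and letsencrypt-staging-key-3114b1 are hypothetical; the remaining fields mirror the production Issuer shown earlier.

```yaml
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: letsencrypt-staging            # hypothetical name
  namespace: chris-3114b1
spec:
  acme:
    privateKeySecretRef:
      name: letsencrypt-staging-key-3114b1   # hypothetical; must be unique
    solvers:
      - http01:
          ingress:
            class: openshift-default
    # Let's Encrypt staging endpoint, which has much looser rate limits
    server: 'https://acme-staging-v02.api.letsencrypt.org/directory'
    email: dev@babymri.org
```

An Ingress would select it with the annotation cert-manager.io/issuer: letsencrypt-staging. Certificates from the staging endpoint are not trusted by browsers, so this is only useful for checking that the challenge completes.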
Hi NERC Administrators and Jennings,
I believe the root cause is that the ingressClass behaves more like a per-namespace setting, or that there is some mapping from domain names to namespaces.
If ingressClass can be used across namespaces, then it is actually extremely dangerous because any user on NERC can register blt.chrisproject.org / app.chrisproject.org in their own namespace and hijack our websites.
Best regards, Hsiao
One thing they could try is changing the DNS CNAME record to something unique for your project, like blt-chrisproject-org.apps.shift.nerc.mghpcc.org, instead of router-default.apps.shift.nerc.mghpcc.org. I'm not sure whether that would change anything, or whether it matters that more than one domain configured in OpenShift uses the same CNAME target: an Ingress in the smart-village-faeeb6c namespace, which is no longer used, was also configured with the CNAME value router-default.apps.shift.nerc.mghpcc.org for www.smartabyarsmartvillage.org.
Confirming @computate's workaround works.
Do we know why router-default.apps.shift.nerc.mghpcc.org used to work, but doesn't work anymore? Is your best practice recommendation to use a universally unique subdomain name of .apps.shift.nerc.mghpcc.org for our CNAMEs from now on?
In summary, the temporary ingress created by cert-manager for the ACME HTTP-01 challenge is incorrect. The workaround is to create the correct route manually.
I don't think this is correct.
I was able to deploy this repository without any problems; after a few minutes, I had a valid certificate. This worked exactly as expected; it did not produce any errors nor require any workarounds.
Do we know why router-default.apps.shift.nerc.mghpcc.org used to work, but doesn't work anymore? Is your best practice recommendation to use a universally unique subdomain name of .apps.shift.nerc.mghpcc.org for our CNAMEs from now on?
@jennydaman I don't think this is necessary. If you have a hostname that is a CNAME to ANYTHING.apps.shift..., it will work just fine. Neither Let's Encrypt nor cert-manager cares about the specific target of the CNAME; they simply care that the hostname itself ultimately resolves to the address of the ingress service. I was able to successfully acquire certificates for the following hostnames:
cert-example.oddbit.com. 900 IN CNAME default-router.apps.shift.nerc.mghpcc.org.
cert-example-1.oddbit.com. 900 IN CNAME router-default.apps.shift.nerc.mghpcc.org.
cert-example-2.oddbit.com. 900 IN CNAME electric-purple-sheep.apps.shift.nerc.mghpcc.org.
They all resulted in successfully issuing a certificate for the requested name.
@larsks I looked at your repo and noticed differences in how things are being done.
You are creating a certificate explicitly in certificate.yaml, using ClusterIssuer/letsencrypt-production-http01. In my case, I am using a (namespace-local) issuer. The certificate is created for me because I use cert-manager.io annotations.
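For comparison, the explicit approach looks roughly like the following. This is a sketch of a generic cert-manager Certificate resource, not the contents of the certificate.yaml in @larsks's repo. The resource name is hypothetical, and the hostname, secret name, and namespace are borrowed from the Ingress earlier in this thread purely for illustration.

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: blt-chrisproject-org          # hypothetical name
  namespace: chris-3114b1
spec:
  # Secret that will hold the issued key pair
  secretName: blt-chrisproject-org-letsencrypt
  dnsNames:
    - blt.chrisproject.org
  issuerRef:
    group: cert-manager.io
    kind: ClusterIssuer               # cluster-wide issuer instead of a namespace-local Issuer
    name: letsencrypt-production-http01
```

With the annotation approach, cert-manager generates an equivalent Certificate from the tls section of the Ingress, with issuerRef pointing at the issuer named in the annotation.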
@larsks I forked your repo so that it uses a domain I control, and a namespace-local Issuer instead of ClusterIssuer. https://github.com/jennydaman/cert-example
This is running on NERC-OCP in the hosting-of-medical-image-analysis-platform-b9bc25 project. I was able to reproduce the problem for a bit but now it's working, so I'm not sure what's going on.
@jennydaman I'm glad it's working. I was unable to produce any errors.
If you see this behavior crop up again, please let me know and we'll see if we can track it down.