nerc-project / operations

Issues related to the operation of the NERC OpenShift environment

"ingress-class" settings for a user namespace #814

Open Milstein opened 1 week ago

Milstein commented 1 week ago

Hi NERC/MGHPCC administrators,

TL;DR

Can you help me with the "ingress-class" settings for hosting-of-medical-image-analysis-platform-ebf021?

===== main text =====

Previously, our team successfully set up the cert-manager operator in

hosting-of-medical-image-analysis-platform-dcb83b (NS: D)

Yesterday I tried to set up the cert-manager operator in

hosting-of-medical-image-analysis-platform-ebf021 (NS: E)

with

Issuer: letsencrypt-staging

When I tried to create ingress: auth-blt-chrisproject-org

I saw another ingress: cm-acme-http-solver-9xhfj

was generated, but unlike the ingress in NS: D,

the status of cm-acme-http-solver-9xhfj shows that it could not obtain a load balancer from router-default.apps.shift.nerc.mghpcc.org.

However, although this is not intended, we can still access this ingress through port 443 (https), even though

we specified only port 80 (http).

It seems that I need to set up the "ingress-class" correctly,

which may need your assistance.

Can you help me with the "ingress-class" settings for hosting-of-medical-image-analysis-platform-ebf021?

Thank you so much for your help!

Best regards,

Hsiao

computate commented 1 week ago

You can try creating an Ingress with our letsencrypt-staging-http01 ClusterIssuer, then switch it to letsencrypt-production-http01.

kind: Ingress
apiVersion: networking.k8s.io/v1
metadata:
  name: auth-blt-chrisproject-org
  namespace: hosting-of-medical-image-analysis-platform-ebf021
  annotations:
    acme.cert-manager.io/http01-ingress-class: openshift-default
    cert-manager.io/cluster-issuer: letsencrypt-staging-http01
spec:
  ingressClassName: openshift-default
  tls:
    - hosts:
        - auth-blt.chrisproject.org
      secretName: auth-blt-chrisproject-org-letsencrypt
  rules:
    - host: auth-blt.chrisproject.org
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: authentik-server
                port:
                  number: 80

Milstein commented 1 week ago

@computate: I received this follow-up:

Thank you so much for the very clear instructions for setting up the YAML configuration.

However, we are still experiencing the same issue even though we followed this YAML configuration.

https://console.apps.shift.nerc.mghpcc.org/k8s/ns/hosting-of-medical-image-analysis-platform-ebf021/ingresses/auth-blt-chrisproject-org/yaml

https://console.apps.shift.nerc.mghpcc.org/k8s/ns/hosting-of-medical-image-analysis-platform-ebf021/ingresses/cm-acme-http-solver-t57dd/yaml

I suspect that IngressClass: openshift-default is not granted to this new namespace (hosting-of-medical-image-analysis-platform-ebf021)

Best regards, Hsiao

Milstein commented 2 days ago

Hi Milson and the NERC administrators,

Hopefully the following observations help with the issue.

I tried again on the old namespace (hosting-of-medical-image-analysis-platform-dcb83b) with ingress: auth-blt4-chrisproject-org

As expected, ingress: cm-acme-http-solver-ls6n4-zp2rc was automatically created and TLS was set up as "TLS is not enabled".

Image

We then successfully completed the acme-http01 setup, and the cm-acme-http-solver ingress and pod disappeared.

However, when I tried again on the new namespace (hosting-of-medical-image-analysis-platform-ebf021) with ingress: auth-blt5-chrisproject-org

ingress: cm-acme-http-solver-cnkg6 was created, and TLS was set up as "Termination type: edge"

As a result, we can only access cm-acme-http-solver through https instead of http, but http is the protocol that we need to use for http01 verification.

https://console.apps.shift.nerc.mghpcc.org/k8s/ns/hosting-of-medical-image-analysis-platform-ebf021/routes/cm-acme-http-solver-cnkg6-2qddj
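The difference between the two namespaces comes down to the TLS settings on the generated route. As a rough sketch only (the function name and the two sample dictionaries are hypothetical, and it assumes the usual OpenShift Route semantics, where a TLS route answers plain HTTP only when insecureEdgeTerminationPolicy is Allow or Redirect), the following Python classifies a route by how it treats port 80 requests:

```python
def plain_http_behavior(route: dict) -> str:
    """Classify how a Route (parsed from YAML/JSON) treats plain-HTTP requests.

    Assumed semantics: without spec.tls the route serves HTTP directly; with TLS,
    insecureEdgeTerminationPolicy decides ("Allow", "Redirect", otherwise disabled).
    """
    tls = route.get("spec", {}).get("tls")
    if not tls:
        return "served"
    policy = tls.get("insecureEdgeTerminationPolicy") or "None"
    if policy == "Allow":
        return "served"
    if policy == "Redirect":
        return "redirected"
    return "disabled"


# Hypothetical stand-ins for the two solver routes described above.
old_ns_route = {"spec": {}}                               # "TLS is not enabled"
new_ns_route = {"spec": {"tls": {"termination": "edge"}}}  # "Termination type: edge"

print(plain_http_behavior(old_ns_route))
print(plain_http_behavior(new_ns_route))
```

Feeding it the output of oc get route NAME -o json (parsed with json.load) would show whether a given solver route can answer an http01 request at all.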

Image

Milstein commented 16 hours ago

I have found a workaround for getting cert-manager to work. The nature of this workaround leads me to believe that there is a bug in how cert-manager is deployed in NERC-OCP and I encourage the NERC Engineering team to take a look. Please ping Chris Tate when he gets back!

In summary, the temporary ingress created by cert-manager for the ACME HTTP-01 challenge is incorrect. The workaround is to create the correct route manually.

First things first, I created a new issuer for our new namespace:

apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: letsencrypt
  namespace: chris-3114b1
spec:
  acme:
    privateKeySecretRef:
      name: letsencrypt-key-3114b1  # must be something unique
    solvers:
      - http01:
          ingress:
            class: openshift-default
    server: 'https://acme-v02.api.letsencrypt.org/directory'
    email: dev@babymri.org

I created an ingress using the issuer:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: blt-chrisproject-org
  namespace: chris-3114b1
  annotations:
    cert-manager.io/issuer: letsencrypt
    acme.cert-manager.io/http01-ingress-class: openshift-default
spec:
  ingressClassName: openshift-default
  tls:
    - hosts:
        - blt.chrisproject.org
      secretName: blt-chrisproject-org-letsencrypt
  rules:
    - host: blt.chrisproject.org
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: test-chrisui
                port:
                  number: 8080

Once the ingress is created, behind the scenes the cert-manager.io operator creates the certificaterequests.cert-manager.io and challenges.acme.cert-manager.io CRD instances to obtain the certificates for HTTPS. The challenge tells us what the problem is:

$ oc describe challenge  
Name:         blt-chrisproject-org-letsencrypt-1-2491536524-792683465
Namespace:    chris-3114b1
Labels:       
Annotations:  
API Version:  acme.cert-manager.io/v1
Kind:         Challenge
Metadata:
  Creation Timestamp:  2024-11-21T07:18:09Z
  Finalizers:
    finalizer.acme.cert-manager.io
  Generation:  1
  Owner References:
    API Version:           acme.cert-manager.io/v1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  Order
    Name:                  blt-chrisproject-org-letsencrypt-1-2491536524
    UID:                   050a53eb-0f53-40d9-9106-d0fe7d17d278
  Resource Version:        2745607849
  UID:                     c8bcffba-4a96-420c-bfd4-0b53b0117fc1
Spec:
  Authorization URL:  https://acme-v02.api.letsencrypt.org/acme/authz-v3/433268275247
  Dns Name:           blt.chrisproject.org
  Issuer Ref:
    Group:  cert-manager.io
    Kind:   Issuer
    Name:   letsencrypt
  Key:      rFhTjPsvGIbma7KxF85_okJfzHVlNUKENzYACAZ0Edw.zStWQ2TVIMH5e2Mvgsb2gPIX-AY4bcVrlIBTbTNjkwM
  Solver:
    http01:
      Ingress:
        Class:  openshift-default
  Token:        rFhTjPsvGIbma7KxF85_okJfzHVlNUKENzYACAZ0Edw
  Type:         HTTP-01
  URL:          https://acme-v02.api.letsencrypt.org/acme/chall-v3/433268275247/izr-QQ
  Wildcard:     false
Status:
  Presented:   true
  Processing:  true
  Reason:      Waiting for HTTP-01 challenge propagation: wrong status code '503', expected '200'
  State:       pending
Events:
  Type    Reason     Age   From                     Message
  ----    ------     ----  ----                     -------
  Normal  Started    45s   cert-manager-challenges  Challenge scheduled for processing
  Normal  Presented  45s   cert-manager-challenges  Presented challenge using HTTP-01 challenge mechanism

The important line is "wrong status code '503', expected '200'". In the OpenShift console, I see that the ingress created for this challenge is https://blt.chrisproject.org/.well-known/acme-challenge/rFhTjPsvGIbma7KxF85_okJfzHVlNUKENzYACAZ0Edw . At this point I tried the https://letsdebug.net/ tool, which gave me similar error messages. The problem is a mismatch between the HTTP-01 challenge path presented on NERC-OCP and the path being checked by the Let's Encrypt ACME server.
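For anyone who wants to reproduce the check by hand: HTTP-01 validation, as specified by ACME (RFC 8555), starts with a plain-HTTP request to /.well-known/acme-challenge/TOKEN on the domain being validated. The sketch below is not part of cert-manager; challenge_url and probe are hypothetical helper names, and the live request is left commented out so the snippet runs offline.

```python
import urllib.error
import urllib.request


def challenge_url(dns_name: str, token: str) -> str:
    """Build the plain-HTTP URL that an ACME server requests for HTTP-01."""
    return f"http://{dns_name}/.well-known/acme-challenge/{token}"


def probe(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for url (redirects are followed).

    Connection-level failures (DNS, refused connection) still raise URLError.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code


# DNS name and token copied from the challenge output above.
url = challenge_url("blt.chrisproject.org", "rFhTjPsvGIbma7KxF85_okJfzHVlNUKENzYACAZ0Edw")
print(url)
# status = probe(url)  # uncomment to query the live route; the challenge expects 200
```

A 503 from probe would match the "wrong status code" reason reported by the challenge.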

By running oc get svc I found the name of the service for the HTTP-01 challenge: cm-acme-http-solver-slmcb. My workaround is to create a "catch-all" route for this service like this: https://github.com/FNNDSC/NERC/blob/2a01e9fffea3629d58e4b88b13f1af7b722df609/blt/workaround-http01challenge-route.yml

# Workaround for broken cert-manager.io operator on NERC-OCP
# See https://mghpcc.supportsystem.com/tickets.php?id=12850

kind: Route
apiVersion: route.openshift.io/v1
metadata:
  name: letshack
  namespace: chris-3114b1
  labels:
    app.kubernetes.io/name: letshack
spec:
  host: blt.chrisproject.org
  path: /.well-known/acme-challenge # <-- "catch-all" path
  to:
    kind: Service
    # hard-coded name of service for HTTP-01 solver
    name: cm-acme-http-solver-slmcb # <-- !!!CHANGE ME!!!
  tls:
    termination: edge
    insecureEdgeTerminationPolicy: Allow
    destinationCACertificate: ''
  port:
    targetPort: http

The execution of this workaround is time-sensitive because the Let's Encrypt server has punishing rate limits for invalid certificate requests. If done successfully, the challenge is solved via the workaround route and the certificate is granted successfully. At this point, the cert-manager.io operator performs clean-up and switches over to steady-state. svc/cm-acme-http-solver-slmcb was deleted automatically, then I manually deleted route/letshack.

As you can see, the workaround is inconvenient and brittle. It works for now, but I hope the problem can be fixed on the NERC-side of things before certificates expire in February.
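Because the workaround is time-sensitive and the solver service name changes with every challenge, it may help to generate the route instead of editing the YAML by hand. This is only a sketch under the assumption that the route above is what is wanted: workaround_route is a hypothetical helper that mirrors that manifest field for field and takes the service name as a parameter.

```python
import json


def workaround_route(namespace: str, host: str, solver_service: str) -> dict:
    """Build the "catch-all" Route from the workaround as a plain dict."""
    return {
        "kind": "Route",
        "apiVersion": "route.openshift.io/v1",
        "metadata": {
            "name": "letshack",
            "namespace": namespace,
            "labels": {"app.kubernetes.io/name": "letshack"},
        },
        "spec": {
            "host": host,
            "path": "/.well-known/acme-challenge",  # "catch-all" path
            # Name of the HTTP-01 solver service, as reported by `oc get svc`.
            "to": {"kind": "Service", "name": solver_service},
            "tls": {
                "termination": "edge",
                "insecureEdgeTerminationPolicy": "Allow",
                "destinationCACertificate": "",
            },
            "port": {"targetPort": "http"},
        },
    }


manifest = workaround_route("chris-3114b1", "blt.chrisproject.org", "cm-acme-http-solver-slmcb")
print(json.dumps(manifest, indent=2))
```

Since oc accepts JSON manifests, the printed output can be piped to oc apply -f - (for example from a script saved under a name of your choosing).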

Milstein commented 16 hours ago

Hi NERC Administrators and Jennings,

I believe the root cause is that the ingressClass behaves more like a per-namespace setting, or like some domain-name => namespace mapping.

If ingressClass can be used across namespaces, then it is actually extremely dangerous because any user on NERC can register blt.chrisproject.org / app.chrisproject.org in their own namespace and hijack our websites.

Best regards, Hsiao

computate commented 14 hours ago

One thing they could try is changing the DNS CNAME record to something unique for your project, like blt-chrisproject-org.apps.shift.nerc.mghpcc.org, instead of router-default.apps.shift.nerc.mghpcc.org. I'm not sure whether that would change anything, or whether it matters that more than one domain configured in OpenShift uses the same CNAME target. For example, an Ingress in the smart-village-faeeb6c namespace, which is no longer used, was also configured with the CNAME value router-default.apps.shift.nerc.mghpcc.org for www.smartabyarsmartvillage.org.

jennydaman commented 14 hours ago

Confirming @computate's workaround works.

Do we know why router-default.apps.shift.nerc.mghpcc.org used to work, but doesn't work anymore? Is your best practice recommendation to use a universally unique subdomain name of .apps.shift.nerc.mghpcc.org for our CNAMEs from now on?

larsks commented 12 hours ago

> In summary, the temporary ingress created by cert-manager for the ACME HTTP-01 challenge is incorrect. The workaround is to create the correct route manually.

I don't think this is correct.

I was able to deploy this repository without any problems; after a few minutes, I had a valid certificate. This worked exactly as expected; it did not produce any errors nor require any workarounds.

> Do we know why router-default.apps.shift.nerc.mghpcc.org used to work, but doesn't work anymore? Is your best practice recommendation to use a universally unique subdomain name of .apps.shift.nerc.mghpcc.org for our CNAMEs from

@jennydaman I don't think this is necessary. If you have a hostname that is a CNAME to ANYTHING.apps.shift..., it will work just fine. Neither LetsEncrypt nor cert-manager cares about the specific target of the CNAME; they simply care that the hostname itself ultimately resolves to the address of the ingress service. I was able to successfully acquire certificates for the following hostnames:

cert-example.oddbit.com.       900     IN  CNAME  default-router.apps.shift.nerc.mghpcc.org.
cert-example-1.oddbit.com.     900     IN  CNAME  router-default.apps.shift.nerc.mghpcc.org.
cert-example-2.oddbit.com.     900     IN  CNAME  electric-purple-sheep.apps.shift.nerc.mghpcc.org.

They all resulted in successfully issuing a certificate for the requested name.
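The point that only the final address matters can be checked with a short script. This is an illustrative sketch, not an official tool: resolve and shares_router_address are hypothetical names, and the demonstration uses a fake resolver with placeholder documentation addresses so that it runs without network access.

```python
import socket
from typing import Callable, Set


def resolve(hostname: str) -> Set[str]:
    """Resolve hostname (CNAMEs are followed by the resolver) to its IP addresses."""
    infos = socket.getaddrinfo(hostname, 80, proto=socket.IPPROTO_TCP)
    return {info[4][0] for info in infos}


def shares_router_address(hostname: str, router: str,
                          resolver: Callable[[str], Set[str]] = resolve) -> bool:
    """True if hostname and router resolve to at least one common address."""
    return bool(resolver(hostname) & resolver(router))


# Offline demonstration: the addresses below are placeholders, not NERC's real ones.
fake_dns = {
    "cert-example-2.oddbit.com": {"192.0.2.10"},
    "router-default.apps.shift.nerc.mghpcc.org": {"192.0.2.10"},
    "unrelated.example.org": {"198.51.100.7"},
}
router = "router-default.apps.shift.nerc.mghpcc.org"
print(shares_router_address("cert-example-2.oddbit.com", router, fake_dns.__getitem__))
print(shares_router_address("unrelated.example.org", router, fake_dns.__getitem__))
```

Dropping the third argument makes the function use real DNS lookups, which is the check that matters before requesting a certificate.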

jennydaman commented 10 hours ago

@larsks I looked at your repo and noticed differences in how things are being done.

You are creating a certificate explicitly in certificate.yaml, using ClusterIssuer/letsencrypt-production-http01. In my case, I am using a (namespace-local) issuer. The certificate is created for me because I use cert-manager.io annotations.

jennydaman commented 10 hours ago

@larsks I forked your repo so that it uses a domain I control, and a namespace-local Issuer instead of ClusterIssuer. https://github.com/jennydaman/cert-example

This is running on NERC-OCP in the hosting-of-medical-image-analysis-platform-b9bc25 project. I was able to reproduce the problem for a bit but now it's working, so I'm not sure what's going on.

larsks commented 9 hours ago

@jennydaman I'm glad it's working. I was unable to produce any errors:

If you see this behavior crop up again, please let me know and we'll see if we can track it down.