tnozicka / openshift-acme

ACME Controller for OpenShift and Kubernetes Cluster. (Supports e.g. Let's Encrypt)
Apache License 2.0

Temporary exposer route got stuck #140

Closed alex-sentsiurou closed 3 years ago

alex-sentsiurou commented 4 years ago

What happened: The certificate fails to get provisioned because the controller creates and deletes new exposer pods after a new route is added with the kubernetes.io/tls-acme=true annotation.

What you expected to happen: A valid ACME certificate should be assigned to the route (the live environment is deployed). Also, the exposer pod should be deleted after serving the HTTP challenge.

How to reproduce it (as minimally and precisely as possible): Install the cluster-wide live controller, then deploy a sample application (I tried with other routes and apps as well):

oc new-project example-project
oc create -fhttps://raw.githubusercontent.com/tnozicka/gohellouniverse/master/deploy/{deployment,service}.yaml
oc create route edge gohellouniverse --service=gohellouniverse
oc patch route gohellouniverse -p '{"metadata":{"annotations":{"kubernetes.io/tls-acme":"true"}}}'

Anything else we need to know?: Here is part of the controller logs (this repeats endlessly):

I0715 12:23:50.814190       1 route.go:496] Started syncing Route "example-project/gohellouniverse"
I0715 12:23:50.815605       1 route.go:563] Route "example-project/gohellouniverse" needs new certificate: Route is missing CertKey
I0715 12:23:50.817162       1 route.go:607] Using ACME client with DirectoryURL "https://acme-v02.api.letsencrypt.org/directory"
I0715 12:23:51.632110       1 route.go:650] Route "example-project/gohellouniverse": Order "https://acme-v02.api.letsencrypt.org/acme/order/91391673/4216781579" is in "pending" state
I0715 12:23:51.632183       1 route.go:655] Route "example-project/gohellouniverse": Order "https://acme-v02.api.letsencrypt.org/acme/order/91391673/4216781579" contains 1 authorization(s)
I0715 12:23:51.930592       1 route.go:663] Route "example-project/gohellouniverse": order "https://acme-v02.api.letsencrypt.org/acme/order/91391673/4216781579": authz "https://acme-v02.api.letsencrypt.org/acme/authz-v3/5895984458": is in "pending" state
I0715 12:23:51.930725       1 route.go:690] route "example-project/gohellouniverse": order "https://acme-v02.api.letsencrypt.org/acme/order/91391673/4216781579": authz "https://acme-v02.api.letsencrypt.org/acme/authz-v3/5895984458": challenge "pending" is in "pending" state
I0715 12:23:51.931290       1 route.go:1001] exposer Route example-project/exposer-a22m8b0algd0o7t8h3b7fct2mfkhro3ehmjaq5tn54e03rj9tcbg isn't admitted yet
I0715 12:23:51.932334       1 route.go:498] Finished syncing Route "example-project/gohellouniverse"
$ oc get routes -w
NAME                                                           HOST/PORT                                                    PATH                                                                      SERVICES                                                       PORT    TERMINATION   WILDCARD
exposer-a22m8b0algd0o7t8h3b7fct2mfkhro3ehmjaq5tn54e03rj9tcbg   HostAlreadyClaimed                                           /.well-known/acme-challenge/FrvghT_GdpJWJ0c0lDvh06LHO2iYc5yfsTV74E8_T0g   exposer-a22m8b0algd0o7t8h3b7fct2mfkhro3ehmjaq5tn54e03rj9tcbg   <all>   edge/Allow    None
exposer-gds3hchr7id62tm6vabsajrakkuq3prb5maj9uu06q2jgbmvcp6g   gohellouniverse-example-project.apps.okd4.okd.gomel.iba.by   /.well-known/acme-challenge/FrvghT_GdpJWJ0c0lDvh06LHO2iYc5yfsTV74E8_T0g   exposer-gds3hchr7id62tm6vabsajrakkuq3prb5maj9uu06q2jgbmvcp6g   <all>   edge/Allow    None
gohellouniverse                                                gohellouniverse-example-project.apps.okd4.okd.gomel.iba.by                                                                             gohellouniverse                                                <all>   edge          None

The temporary route created for http challenge is responsive and returns the secret:

curl -X GET gohellouniverse-example-project.apps.okd4.okd.gomel.iba.by/.well-known/acme-challenge/FrvghT_GdpJWJ0c0lDvh06LHO2iYc5yfsTV74E8_T0g
FrvghT_GdpJWJ0c0lDvh06LHO2iYc5yfsTV74E8_T0g.N-dKBfHTEleCbUiR3NqK18i8l92p44zfceKQO2ZUxL8
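The response above has the shape RFC 8555 (section 8.1) defines for an HTTP-01 key authorization: the token, a dot, and the base64url-encoded SHA-256 thumbprint of the account key. A minimal Go sketch of that shape check follows. The function name isKeyAuthorization is hypothetical and not part of openshift-acme, and the check only validates the format; it does not confirm that the thumbprint belongs to the controller's ACME account.

```go
package main

import (
	"encoding/base64"
	"fmt"
	"strings"
)

// isKeyAuthorization (hypothetical helper) reports whether body looks like
// token + "." + base64url(SHA-256 thumbprint), the HTTP-01 response format.
// It checks the shape only, not whether the thumbprint matches any account key.
func isKeyAuthorization(body, token string) bool {
	body = strings.TrimSpace(body)
	prefix := token + "."
	if !strings.HasPrefix(body, prefix) {
		return false
	}
	// The thumbprint is unpadded base64url and must decode to 32 bytes.
	raw, err := base64.RawURLEncoding.DecodeString(strings.TrimPrefix(body, prefix))
	return err == nil && len(raw) == 32
}

func main() {
	// Token and response body copied from the curl output in this report.
	token := "FrvghT_GdpJWJ0c0lDvh06LHO2iYc5yfsTV74E8_T0g"
	body := "FrvghT_GdpJWJ0c0lDvh06LHO2iYc5yfsTV74E8_T0g.N-dKBfHTEleCbUiR3NqK18i8l92p44zfceKQO2ZUxL8"
	fmt.Println(isKeyAuthorization(body, token)) // prints "true"
}
```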

Environment (OKD 4.4 on bare metal): Client Version: 4.4.0-0.okd-2020-05-23-055148-beta5 Server Version: 4.4.0-0.okd-2020-05-23-055148-beta5 Kubernetes Version: v1.17.1

tux-o-matic commented 4 years ago

Same experience on OCP 4.3

tux-o-matic commented 4 years ago

@tnozicka, this is a blocker. If you don't have time on your hands to fix it, could you maybe point us to the area of the code that needs attention for a solution?

JavierLeonPeris commented 4 years ago

Same experience on OCP 4.5

ggrames commented 3 years ago

Same experience on OCP 4.5. Is there a workaround yet?

JavierLeonPeris commented 3 years ago

We couldn't solve it. We also tried cert-manager, but that didn't work either.

tnozicka commented 3 years ago

Can you attach the full (redacted) yaml for those routes? I have tried with a few days old OCP 4.7-ci in AWS and the cert was provisioned and the temporary route was deleted.

Is this happening consistently for you? Does it recover when you delete the temporary routes? What version of openshift-acme are you using?

tux-o-matic commented 3 years ago

Hey @tnozicka. There is a race condition: the acme controller creates the same route again, and OCP rejects it because an older one already declared the same Route. It's nice to see you test this, but don't you have a GA release of OCP to compare with what we've been experiencing? I'll let somebody else give an example.

ggrames commented 3 years ago

Hi, these are the 2 exposer routes and the external route that should get the cert:

openshift-acme_notWorking_routes.txt

name: docpipe5-external

tnozicka commented 3 years ago

Hi,

I don't think this depends on the OCP version used, but rather on how slow your infra/informers are and whether the informers see the update before the next sync loop on the same item. The controller should generate the same name for the same challenge rather than create a new one, which is how it avoids the race. https://github.com/tnozicka/openshift-acme/blob/6955c94/pkg/controller/route/route.go#L694-L695 The dump suggests otherwise, though:

creationTimestamp: "2020-12-04T10:06:24Z"
name: exposer-1fv03q7jublbj4a1i50q53aub2g21brui5hbegkfofsphn7pp7h0
path: /.well-known/acme-challenge/o6IT_Jq0C3TTaCdtpbyVU_ce1dhDQgZrai7_uCGyVa0
---
creationTimestamp: "2020-12-04T10:06:25Z"
name: exposer-gcl2qecrlhckaicnst3fn1e4cne88op30m2fvtpjoc74ve60jlqg
path: /.well-known/acme-challenge/o6IT_Jq0C3TTaCdtpbyVU_ce1dhDQgZrai7_uCGyVa0

Including challengePath would seem like an easy way to avoid it, but I'd like to see the controller logs from around the time these 2 exposer routes were created, to check the order.URI, authzURL and challenge.URI values if possible.
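The idea of a deterministic exposer name can be sketched in Go as follows. This is an illustration, not the controller's actual code: the function exposerName, the choice of SHA-256 with lowercase base32hex, and the example order URLs are all assumptions. It shows how a name keyed on values that differ between syncs diverges for the same challenge path, while a name keyed on the challenge path alone stays stable.

```go
package main

import (
	"crypto/sha256"
	"encoding/base32"
	"fmt"
	"strings"
)

// exposerName (hypothetical) derives a stable object name from its key parts.
// Hash and encoding are assumptions chosen to yield a DNS-label-safe name.
func exposerName(parts ...string) string {
	sum := sha256.Sum256([]byte(strings.Join(parts, "\n")))
	enc := base32.HexEncoding.WithPadding(base32.NoPadding)
	return "exposer-" + strings.ToLower(enc.EncodeToString(sum[:]))
}

func main() {
	// Challenge path copied from the dump above.
	path := "/.well-known/acme-challenge/o6IT_Jq0C3TTaCdtpbyVU_ce1dhDQgZrai7_uCGyVa0"

	// Hypothetical order URLs standing in for inputs that changed between syncs.
	a := exposerName("https://acme.example/order/1", path)
	b := exposerName("https://acme.example/order/2", path)
	fmt.Println(a == b) // prints "false": same path, two different route names

	// Keyed on the challenge path alone, a repeated sync yields the same name,
	// so a second create would hit "already exists" instead of a duplicate route.
	fmt.Println(exposerName(path) == exposerName(path)) // prints "true"
}
```

With a stable name, a stale informer cache leads at worst to an "already exists" error on create, which the next sync can treat as success, rather than to a second route competing for the same host and path.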

tnozicka commented 3 years ago

/priority important-soon /assign

ggrames commented 3 years ago

Hi, I saw that you pushed a new image to Quay today at 13:00 Central European Time. I tried again, and the challenge route is now created correctly, but still no success so far.

Here is the output of an openshift-acme pod:

openshift-acme-pod-output.txt 10min_openshift-acme-pod-output.txt

tux-o-matic commented 3 years ago

Can confirm success on OCP 4.6.

ggrames commented 3 years ago

Have you reinstalled the whole openshift-acme app, including the service account, roles, ...? Is the route to be renewed in the same namespace in your setup? Thx

tux-o-matic commented 3 years ago

@ggrames. The PR only touches the Controller code, not the exposer. So if you used the example Deployment with pull: Always, forcing a redeploy of the controller should be enough. If you had a validation pending, I'm not sure whether the new Controller will pick that up or whether you should recreate the Route. I tested the cluster-wide setup, where the acme controller is in its own namespace.

ggrames commented 3 years ago

@tux-o-matic I have already recreated the route, and yes, I use pull Always. I have also scaled the openshift-acme pods down to 0 and back up to 2, so it should be up to date. There has to be another problem.