Closed: hansmi closed this issue 4 years ago.
Yeah, this is a hot fix to avoid getting rate-limited. It will be replaced by rate limits/backoff built into the controller, which is at the top of my list.
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen. If this issue is safe to close now, please do so with /close.

/lifecycle stale
/lifecycle frozen
I've just deployed this on a fresh OpenShift 3.11 cluster, and all the Routes I've annotated have failed verification and gone into this paused state. If I remove the paused annotation, the same problem reoccurs and the Route is paused again. Is there any progress on fixing this? As it stands it seems unusable, which is a real shame.
@jameseck Getting paused is not the error itself but a symptom of something being set up incorrectly. Let's Encrypt validation fails, and the Route gets paused so your account doesn't burn through the rate limit and you can fix the problem before the next try. What you need to solve is setting up the cluster correctly so the validation doesn't fail.
Make sure the Route can actually reach your app, that you can reach the temporary Route (/.well-known/...) manually from outside the cluster, and that your DNS record points to the router.
Thanks for the response and thanks for the project! I misunderstood the actual issue and solved it by fixing incorrect settings in the HAProxy LB that sits in front of this cluster. It would still be nice to have paused Routes retried periodically, but it's not urgent now.
Yeah, I was just trying to help you understand this is not a blocker for it to work. Glad you figured out the settings.
I agree this is something that needs to be addressed eventually and is #2 on my list when I get https://github.com/tnozicka/openshift-acme/pull/92 in.
To slightly ease the burden before we get rate limiting:
To list all the paused Routes:
oc get route -A -o json | jq -r '.items[] | select(.metadata.annotations."kubernetes.io/tls-acme-paused") | "-n \(.metadata.namespace) \(.metadata.name)"'
To retry all the paused Routes:
oc get route -A -o json | jq -r '.items[] | select(.metadata.annotations."kubernetes.io/tls-acme-paused") | "-n \(.metadata.namespace) \(.metadata.name)"' | xargs -r -n3 oc patch route -p='{"metadata": {"annotations": {"kubernetes.io/tls-acme-paused": null}}}'
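The two one-liners above share the same jq selection. As a sketch, that selection can be factored into a small shell function so it can be exercised against plain JSON without a cluster; the function name and the wrapper are illustrative, not part of openshift-acme:

```shell
#!/bin/sh
# Read a Route list (JSON) on stdin and print "-n <namespace> <name>" for
# every Route carrying the openshift-acme paused annotation.
paused_routes() {
  jq -r '.items[]
    | select(.metadata.annotations."kubernetes.io/tls-acme-paused")
    | "-n \(.metadata.namespace) \(.metadata.name)"'
}

# In-cluster usage (assumes oc is logged in):
#   oc get route -A -o json | paused_routes \
#     | xargs -r -n3 oc patch route -p='{"metadata": {"annotations": {"kubernetes.io/tls-acme-paused": null}}}'
```

The `xargs -r` flag keeps `oc patch` from running at all when no Route is paused.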
I imagine one could set up a CronJob running the retry script, say, once a week to force the retry until native support arrives.
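A minimal sketch of such a CronJob, assuming a service account with RBAC permission to get and patch Routes cluster-wide (the name, schedule, and image here are illustrative, and the image must ship both `oc` and `jq`):

```yaml
apiVersion: batch/v1beta1        # CronJob API group in the OpenShift 3.11 era
kind: CronJob
metadata:
  name: retry-paused-routes
spec:
  schedule: "0 3 * * 0"          # weekly, Sunday 03:00
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: route-retrier   # illustrative; needs get/patch on Routes
          restartPolicy: OnFailure
          containers:
          - name: retry
            image: quay.io/openshift/origin-cli:latest   # any image with oc and jq
            command:
            - /bin/sh
            - -c
            - |
              oc get route -A -o json \
                | jq -r '.items[] | select(.metadata.annotations."kubernetes.io/tls-acme-paused") | "-n \(.metadata.namespace) \(.metadata.name)"' \
                | xargs -r -n3 oc patch route -p='{"metadata": {"annotations": {"kubernetes.io/tls-acme-paused": null}}}'
```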
The RouteController.handle function sets a kubernetes.io/tls-acme-paused annotation if the API returned a status of invalid: https://github.com/tnozicka/openshift-acme/blob/f0608627f45f8cce432c1d6b6625d0add42a94c1/pkg/controllers/route/route.go#L572-L582
Once that annotation is set, the route is skipped indefinitely; there is no code removing the annotation, so manual intervention is necessary. One would expect such routes to be retried after a reasonable timeframe, e.g. a day.
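If the paused annotation carried an RFC 3339 timestamp of when the route was paused (it currently doesn't; this value shape, the function, and the one-day window are all hypothetical), even a script could implement the "retry after a day" behavior by selecting only routes paused before a cutoff:

```shell
#!/bin/sh
# Hypothetical: read a Route list (JSON) on stdin and print
# "-n <namespace> <name>" for every Route whose paused annotation holds an
# RFC 3339 timestamp earlier than the cutoff (seconds since epoch, $1).
# Assumes a timestamp-valued annotation, which is NOT current openshift-acme
# behavior.
stale_paused_routes() {
  cutoff="$1"
  jq -r --argjson cutoff "$cutoff" '.items[]
    | select(.metadata.annotations."kubernetes.io/tls-acme-paused" != null)
    | select((.metadata.annotations."kubernetes.io/tls-acme-paused" | fromdate) < $cutoff)
    | "-n \(.metadata.namespace) \(.metadata.name)"'
}
```

The same comparison done in-process (annotation age versus a retry window) is roughly what native controller support would need.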