ngrok / kubernetes-ingress-controller

The official ngrok Ingress Controller for Kubernetes
https://ngrok.com
MIT License

Controllers should retry and re-queue based on status codes #283

Closed · alex-bezek closed this issue 11 months ago

alex-bezek commented 12 months ago

What happened

Today, when API calls to ngrok fail in some of our controllers, we receive a generic ngrok API error and thus re-queue the event to be reconciled again immediately, regardless of the status code.

What you think should happen instead

We should instead inspect the ngrok API error's status code and re-queue events appropriately.
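
Roughly the behavior I have in mind, as a minimal sketch (the `ngrokAPIError` type and `reserveDomain` call are hypothetical stand-ins, not our actual client types):

```go
// Hypothetical sketch: inspect the ngrok API error's status code inside a
// reconcile loop and decide whether (and when) to re-queue.
package controllers

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// ngrokAPIError stands in for whatever error type the ngrok API client returns.
type ngrokAPIError struct {
	StatusCode int
	Msg        string
}

func (e *ngrokAPIError) Error() string { return fmt.Sprintf("HTTP %d: %s", e.StatusCode, e.Msg) }

func reconcileExample(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	err := reserveDomain(ctx, req.Name) // placeholder for a real ngrok API call
	if err == nil {
		return ctrl.Result{}, nil
	}

	var apiErr *ngrokAPIError
	if errors.As(err, &apiErr) {
		switch {
		case apiErr.StatusCode >= http.StatusInternalServerError:
			// 5xx: transient server-side failure; let controller-runtime retry with backoff.
			return ctrl.Result{}, err
		case apiErr.StatusCode == http.StatusBadRequest:
			// Most 400s are permanent configuration problems; don't re-queue them.
			return ctrl.Result{}, nil
		default:
			// Other 4xx (e.g. 429): wait before retrying rather than hammering the API.
			return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
		}
	}
	return ctrl.Result{}, err
}

func reserveDomain(ctx context.Context, name string) error { return nil } // stub
```

The exact mapping of status codes to retry behavior is up for discussion; the point is that the decision is made per status code rather than re-queueing every failure immediately.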

CK-Ward commented 11 months ago

@Wonderful-Space and @nikolay-ngrok are both interested; will prioritize accordingly.

alex-bezek commented 11 months ago

In the issue description I had noted:

for other 400 bad request errors, we should not re-queue the event

However, we recently found an issue in the httpsEdge controller where it currently does exactly that (does not retry on 400s), and it's causing a problem.

To recap how the Ingress controller works overall: there are multiple controllers all working asynchronously in concert to drive the system to an eventually consistent state. Specifically, when a user creates an Ingress object, the ingress controller converts that information into Domain, HttpsEdge, and Tunnel CRDs in the k8s API. Each of these respective controllers then works in parallel to reconcile those CRDs by reserving domains, creating edges (and the other related configurations), and spinning up ngrok-go tunnels.

This parallel and async nature means that the https_edge_controller could try to create the edge before the domain controller has reserved the domain. By default, if a controller returns an error during its reconcile loop, the event is re-queued to retry later. This can be seen in https://github.com/ngrok/kubernetes-ingress-controller/issues/195, where we consistently saw a slew of errors and retries.
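
For context, a minimal sketch (not our actual code) of the return shapes controller-runtime understands; returning a non-nil error is what triggers the automatic rate-limited re-queue behind the retries in #195:

```go
// Minimal sketch of how controller-runtime interprets Reconcile return values.
package controllers

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

func returnShapes(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// 1. Success: do not re-queue.
	//    return ctrl.Result{}, nil

	// 2. Return an error: controller-runtime re-queues the request with its
	//    default rate-limited (exponential) backoff. This is what happens today
	//    whenever an ngrok API call fails.
	//    return ctrl.Result{}, err

	// 3. Re-queue explicitly after a delay, without counting it as an error.
	return ctrl.Result{RequeueAfter: 10 * time.Second}, nil
}
```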

Recently we updated the edge controller to not retry on certain ngrok API errors, with the aim of not continually retrying bad route-module configurations. We now take the error and check whether it is "retryable" based on whether its status code is >= 500: https://github.com/ngrok/kubernetes-ingress-controller/blob/41940b5d11532eb6393e3396ed95fb62a0aa68dd/internal/controllers/httpsedge_controller.go#L129-L137 https://github.com/ngrok/kubernetes-ingress-controller/blob/b603c501e74f0b394e8ee1c3e0b9cd8ea3a6716e/internal/errors/errors.go#L157-L172
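
For anyone not following the links, the check is roughly of this shape (hypothetical type and function names; see the linked errors.go for the real implementation):

```go
// Rough sketch of the "is this retryable" check described above.
package errors

import "net/http"

// NgrokError is a stand-in for the error type returned by the ngrok API client.
type NgrokError struct {
	StatusCode int
	Message    string
}

func (e *NgrokError) Error() string { return e.Message }

// IsRetryable reports whether the controller should re-queue the event:
// only server-side (>= 500) failures are currently considered transient.
func IsRetryable(err error) bool {
	ngrokErr, ok := err.(*NgrokError)
	if !ok {
		// Unknown errors (network issues, etc.) are treated as retryable.
		return true
	}
	return ngrokErr.StatusCode >= http.StatusInternalServerError
}
```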

We realized, though, that during this async controller race, the edge controller can fail to create the edge because the domain isn't reserved yet. That failure is a 400 error, but it is one that should be retried soon:

"HTTP 400: The domain 'my-domain.ngrok.app:443' is not reserved.

This may be worth breaking out into its own PR or issue to resolve, but I thought it was worth leaving a note here so we take into account any other API error codes that should be retried.
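
One possible shape for the fix, assuming we can recognize that specific 400 (the `isDomainNotReserved` helper below is hypothetical and the delay is arbitrary):

```go
// Sketch: treat the "domain ... is not reserved" 400 as transient and
// re-queue after a short delay instead of dropping the event.
package controllers

import (
	"context"
	"strings"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// isDomainNotReserved is a placeholder; a real implementation would inspect the
// typed ngrok API error rather than the message text.
func isDomainNotReserved(err error) bool {
	return err != nil && strings.Contains(err.Error(), "is not reserved")
}

func handleEdgeCreateError(ctx context.Context, err error) (ctrl.Result, error) {
	switch {
	case err == nil:
		return ctrl.Result{}, nil
	case isDomainNotReserved(err):
		// The domain controller likely hasn't finished reserving the domain yet;
		// retry soon instead of treating this 400 as a permanent failure.
		return ctrl.Result{RequeueAfter: 20 * time.Second}, nil
	default:
		// Everything else goes through the existing retryable check (not shown).
		return ctrl.Result{}, err
	}
}
```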

@nikolay-ngrok also brought up the concern that if we just retry errors like this all the time, we could put unnecessary pressure on the API. With the controller-runtime library, we can control whether the event should re-queue and how long it should wait via https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/reconcile#Result. I think it also does some backoff internally, but we'd have to look into that. It may be worth looking into a more robust exponential backoff for particular errors, as discussed in these guides: https://stuartleeks.com/posts/error-back-off-with-controller-runtime/ and https://danielmangum.com/posts/controller-runtime-client-go-rate-limiting/
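
As one possible direction, here's a sketch of per-object exponential backoff along the lines of those guides, using client-go's workqueue rate limiter to turn repeated failures into a growing RequeueAfter (the delays are illustrative, not our current behavior):

```go
// Sketch: track failures per request key and translate them into an
// exponentially increasing RequeueAfter, resetting on success.
package controllers

import (
	"time"

	"k8s.io/client-go/util/workqueue"
	ctrl "sigs.k8s.io/controller-runtime"
)

// backoff grows the delay for a given key on every failure and resets on success.
var backoff = workqueue.NewItemExponentialFailureRateLimiter(5*time.Second, 5*time.Minute)

func requeueWithBackoff(req ctrl.Request, err error) (ctrl.Result, error) {
	key := req.NamespacedName.String()
	if err == nil {
		backoff.Forget(key) // success: reset the failure count for this object
		return ctrl.Result{}, nil
	}
	// Re-queue after an exponentially increasing delay (5s, 10s, 20s, ... up to 5m)
	// instead of returning the error and hammering the ngrok API.
	return ctrl.Result{RequeueAfter: backoff.When(key)}, nil
}
```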