Bug: Load Balancer Recreating

tonybolzan commented 1 year ago

In an unusual sequence of events in which Oracle reclaimed the preemptible nodes multiple times in a day, this ingress controller bugged and created a new Load Balancer. If the IP is dynamic, DNS will send traffic to the wrong IP; if the IP is reserved, the new LB cannot be created because of the conflict.

tonybolzan commented 1 year ago

Update: Even after switching to OnDemand, the Load Balancer was recreated, leaving 2 LBs with the same name, one unconfigured and the other configured.

tonybolzan commented 1 year ago

The log from a real event:

I0905 19:08:25.473783       1 routingpolicy.go:105] "Finished syncing routing policies for ingress class" ingressClass="test-ingress-class" duration="709.868µs"
I0905 19:08:35.473190       1 routingpolicy.go:103] "Started syncing routing policies for ingress class" ingressClass="test-ingress-class" startTime="2023-09-05 19:08:35.473166461 +0000 UTC m=+275713.251299440"
I0905 19:08:35.473287       1 util.go:97] Listener paths for routing policy: {...big json...}
I0905 19:08:35.473325       1 loadbalancer.go:143] Refreshing LB cache for lb ocid1.loadbalancer.oc1.sa-saopaulo-1.aaaaaaaapy5axf6sofjsgis3fjxdtda6caedrep526vrar6wknjsqszyteua 
I0905 19:08:42.825880       1 backend.go:113] "Finished syncing backends for ingress class" ingressClass="test-ingress-class" duration="17.352789779s"
I0905 19:08:42.826243       1 backend.go:486] Error syncing backends for ingress class test-ingress-class: unable to fetch backendset health: Error returned by LoadBalancer Service.
                              Http Status Code: 404.
                              Error Code: NotAuthorizedOrNotFound.
                              Opc request id: d6c135b589e008048cd49474bea5d0df/B96EBC8AAD6A1B7A6F9AAB51573F7F01/A2704054B20A333C54A2E86B074E86E5. 
                              Message: Authorization failed or requested resource not found.

                              Operation Name: GetBackendSetHealth
                              Timestamp: 2023-09-05 19:08:25 +0000 GMT
                              Client Version: Oracle-GoSDK/65.34.0
                              Request Endpoint: GET https://iaas.sa-saopaulo-1.oraclecloud.com/20170115/loadBalancers/ocid1.loadbalancer.oc1.sa-saopaulo-1.aaaaaaaapy5axf6sofjsgis3fjxdtda6caedrep526vrar6wknjsqszyteua/backendSets/bs_a784ef83523e5f6/health
                              Troubleshooting Tips: See https://docs.oracle.com/iaas/Content/API/References/apierrors.htm#apierrors_404__404_notauthorizedornotfound for more information about resolving this error.
                              Also see https://docs.oracle.com/iaas/api/#/en/loadbalancer/20170115/BackendSetHealth/GetBackendSetHealth for details on this operation's requirements.
                              To get more info on the failing request, you can set OCI_GO_SDK_DEBUG env var to info or higher level to log the request/response details.
                              If you are unable to resolve this LoadBalancer issue, please contact Oracle support and provide them this full error message.

I0905 19:08:42.826271       1 backend.go:111] "Started syncing backends for ingress class" ingressClass="test-ingress-class" startTime="2023-09-05 19:08:42.826262535 +0000 UTC m=+275720.604395515"
I0905 19:08:42.826314       1 loadbalancer.go:143] Refreshing LB cache for lb ocid1.loadbalancer.oc1.sa-saopaulo-1.aaaaaaaapy5axf6sofjsgis3fjxdtda6caedrep526vrar6wknjsqszyteua
I0905 19:08:50.724727       1 loadbalancer.go:143] Refreshing LB cache for lb ocid1.loadbalancer.oc1.sa-saopaulo-1.aaaaaaaapy5axf6sofjsgis3fjxdtda6caedrep526vrar6wknjsqszyteua
I0905 19:09:00.114667       1 webhook.go:59] "processing pod creation for pod readiness" pod="piperun/"
I0905 19:09:00.983975       1 reflector.go:559] /workspace/main.go:139: Watch close - *v1.Service total 10 items received
I0905 19:09:09.489571       1 reflector.go:281] /workspace/main.go:133: forcing resync
I0905 19:09:09.489974       1 ingressclass.go:108] "Updating ingress class" ingressClass="test-ingress-class"
I0905 19:09:09.489997       1 ingressclass.go:159] "Started syncing ingress class" ingressClass="test-ingress-class" startTime="2023-09-05 19:09:09.489987392 +0000 UTC m=+275747.268120372"
I0905 19:09:09.490037       1 loadbalancer.go:143] Refreshing LB cache for lb ocid1.loadbalancer.oc1.sa-saopaulo-1.aaaaaaaapy5axf6sofjsgis3fjxdtda6caedrep526vrar6wknjsqszyteua
I0905 19:09:09.513111       1 ingressclass.go:235] "Creating load balancer for ingress class" ingressClass="test-ingress-class"
I0905 19:09:09.513208       1 ingressclass.go:270] Create lb request: {...big json...}

Inbaraj-S commented 1 year ago

Hi @tonybolzan, backendset not found is a transient error; it should go away once the LB is created and the work requests succeed. For the bugged LB, which load balancer IP and ID were updated on the Ingress: those of the bugged LB or of the configured LB? Can you confirm?

sdominguez-quistor commented 1 year ago

Same here.

tonybolzan commented 1 year ago

These are the logs from the exact moment native-ingress-controller created a new load balancer when it shouldn't have. The backendset 404 may or may not be related to the bug that creates a new LB.

At this moment, the Native Ingress Load Balancer has been removed in order to create a new one using Nginx Ingress. I tried to keep Native Ingress and Nginx Ingress together, but Deployment rollouts get stuck and new deployments never complete successfully. This is probably related to the Namespace label podreadiness.ingress.oraclecloud.com/pod-readiness-gate-inject.

I filed SR 3-34189913991 in MOS with more information about the problem.

tonybolzan commented 1 year ago

Looking at the logs and the code:

ingressclass.go::ensureLoadBalancer() is called and then calls c.getLoadBalancer(ic), which returns lb == nil.
The nil comes from loadbalancer.go::getLoadBalancer(), which calls lbc.getLoadBalancerBustCache(ctx, lbID).
In loadbalancer.go::getLoadBalancerBustCache(), every HTTP error is treated the same way, returning nil.

This behavior means that a temporary unavailability or even a faulty request forces the creation of a new Load Balancer. Differentiating errors and handling each kind specifically should resolve the situation. Raising the tool's reliability with retries and exponential backoff should also help during network failures.

sdominguez-quistor commented 1 year ago

I have a question about this. When the new load balancer is created, is there any way to assign the same IP as the previous one?

Inbaraj-S commented 1 year ago

This will be taken care of in an upcoming release.

Inbaraj-S commented 1 year ago

Fixed as part of https://github.com/oracle/oci-native-ingress-controller/releases/tag/v1.2.0

oracle / oci-native-ingress-controller

Bug: Load Balancer Recreating #20