kubernetes-sigs / aws-load-balancer-controller

A Kubernetes controller for Elastic Load Balancers
https://kubernetes-sigs.github.io/aws-load-balancer-controller/
Apache License 2.0
3.92k stars 1.46k forks source link

Failed deploy model due to InternalFailure (500) #3773

Open RobinFrcd opened 3 months ago

RobinFrcd commented 3 months ago

Describe the bug ALB managed by the controller is completely unresponsive, every action on the listeners return a 500 (even on AWS console).

Only errors in the log are: {"level":"error","ts":"2024-07-16T19:51:59Z","msg":"Reconciler error","controller":"ingress","object":{"name":"alb-ext"},"namespace":"","name":"alb-ext","reconcileID":"81af4f98-71d3-4a2a-870a-b431053e6738","error":"InternalFailure: \n\tstatus code: 500, request id: 617ba366-572a-4da0-8f7f-0ef2ac3d110b"}

Steps to reproduce No idea, just tried to deploy a new ingress. The LB worked fined, I deployed an ingress which had an issue on with an invalid secretName in alb.ingress.kubernetes.io/auth-idp-oidc, but I quickly fixed it. The LB went to this degraded state just after that. So I guess the issue is closely related to https://github.com/kubernetes-sigs/aws-load-balancer-controller/issues/2819, but the related secret is now fixed. And even if it wasn't fix, the whole ALB shouldn't be down because of a single misconfigured Ingress, right ? I tried to delete the ALB and let the controller recreate it, but it recreated it in the same state.

Environment

Additional Context: I also opened a ticket in AWS support, because even if the ALB is misconfigured, it shouldn't raise 500. It should give more context about the misconfiguration, right now I'm just in the dark.

RobinFrcd commented 3 months ago

Maybe a first quick fix would be to prevent sending non-base64 secrets to the LB, I guess this is what leads to all the 500 afterwards ? The ideal behaviour would be for the LB to reject such malformed secrets, but as it requires changes from AWS I don't know if this is the place to handle this.

I'm not at all familiar with go, but maybe something like that could help to prevent the global issue regarding the load balancer ? Not sure where the base64 decode part is done, tho. In https://github.com/kubernetes-sigs/aws-load-balancer-controller/blob/main/pkg/ingress/model_build_actions.go#L187

import (
    "encoding/base64"
    "unicode/utf8"
)

func isBase64Encoded(data []byte) bool {
    decoded, err := base64.StdEncoding.DecodeString(string(data))
    if err != nil {
        return false
    }
    return utf8.Valid(decoded)
}
if !isBase64Encoded(rawClientID) {
    return elbv2model.Action{},  errors.Errorf("clientID is not a valid base64 string: %v", secretKey)
}
if !isBase64Encoded(rawClientSecret) {
    return elbv2model.Action{},  errors.Errorf("clientSecret is not a valid base64 string: %v", secretKey)
}
shraddhabang commented 3 months ago

Thanks for bringing this to our attention. We will try to reproduce it on our side. Will you be able to provide the logs that you captured when this happened as well so that we can deep dive on those?

RobinFrcd commented 3 months ago

I tried to recover the logs but they don't go enough in time to capture what really happened. But to reproduce, you'd need to set the clientID/clientSecret with a string that is NOT a valid base64 string in the secret provided in alb.ingress.kubernetes.io/auth-idp-oidc. This will raise 500 errors once the controller tries to list the rules of the ALB.

I also had a chat with AWS support, which said that it is a known issue in the ALB:

The root cause is CreateRules API is allowing invalid XML characters when creating an Authentication Rule via CLI/SDK/CDK. This causes the DescribeRules CLI and view/edit rules in the console to return an error.

So I guess as long as the ALB is not able to reject these invalid values by itself, it would be great that the controller checks this.

shraddhabang commented 3 months ago

Thank you for confirming this. I would prefer for AWS API to fix this issue rather than adding hacks on controller. Because if in future, this limitation goes away, we will need to update the controller too.

oliviassss commented 3 months ago

Looks like the API doc does not indicate it has to be encoded via base64 for clientID and secret. https://docs.aws.amazon.com/elasticloadbalancing/latest/APIReference/API_AuthenticateOidcActionConfig.html I'd prefer not to add the validation from controller side, and improve from ELB API side instead, like returning an error message by the API so that controller can surface it.

RobinFrcd commented 3 months ago

I agree it would for sure be the best solution, but it seems this issue has been there since at least Sept. 2022. So I know this is a known issue at AWS but doesn't seem to be a high priority at all. If you have any way to give the internal ticket at AWS a higher priority it would be awesome, as at first sight, adding a simple check on rule creation doesn't seem super complicated (but I don't have the full context here).

Anyway, thanks for your answers on this issue !