kubernetes-sigs / cluster-api-provider-aws

Kubernetes Cluster API Provider AWS provides consistent deployment and day 2 operations of "self-managed" and EKS Kubernetes clusters on AWS.
http://cluster-api-aws.sigs.k8s.io/
Apache License 2.0

Cannot remove subnets from NLB due to AWS restriction #4735

Closed schoeppi5 closed 2 weeks ago

schoeppi5 commented 9 months ago

/kind bug

What steps did you take and what happened: We are using NLBs for our control plane endpoints, and we recently had a case where a customer scaled down their control plane from three nodes in three separate AZs to one node in one AZ. This caused the CAPA controller to report the following error:

```
failed to reconcile load balancer" err=<
    failed to set subnets for apiserver load balancer '***': ValidationError: Subnet removal is not supported for Network Load Balancers. You must specify all existing subnets along with any new ones
        status code: 400, request id: ***
```

This concerns essentially the same code as issue #4357.
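
For context, the restriction comes from the ELBv2 `SetSubnets` API itself, not from CAPA. A minimal, hypothetical sketch with aws-sdk-go v1 that triggers the same `ValidationError` against an NLB currently spanning more subnets (the ARN and subnet ID below are placeholders):

```go
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/elbv2"
)

func main() {
	sess := session.Must(session.NewSession())
	client := elbv2.New(sess)

	// The NLB currently spans three subnets; passing only one of them is
	// rejected for load balancers of type "network" with:
	//   ValidationError: Subnet removal is not supported for Network Load Balancers.
	out, err := client.SetSubnets(&elbv2.SetSubnetsInput{
		LoadBalancerArn: aws.String("arn:aws:elasticloadbalancing:..."),            // placeholder
		Subnets:         aws.StringSlice([]string{"subnet-0123456789abcdef0"}),      // placeholder
	})
	if err != nil {
		log.Fatalf("SetSubnets failed: %v", err)
	}
	fmt.Printf("availability zones now attached: %v\n", out.AvailabilityZones)
}
```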

What did you expect to happen:

The only solution, as far as I can tell, would be to delete and recreate the NLB when the number of subnets decreases.

Anything else you would like to add:

https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/2d96288354dee44e60710048652267e2bf3e8c12/pkg/cloud/services/elb/loadbalancer.go#L113-L123

My suggested change would be:

```go
// Reconcile the subnets and availability zones from the spec
// and the ones currently attached to the load balancer.
if len(lb.SubnetIDs) != len(spec.SubnetIDs) {
	if lb.LoadBalancerType == infrav1.LoadBalancerTypeNLB && len(lb.SubnetIDs) > len(spec.SubnetIDs) {
		// NLBs do not support removing subnets, so delete the existing
		// load balancer and create a new one with the desired subnets.
		name := lb.Name // keep the old name for error messages; lb is replaced below and may be nil on failure
		err = s.deleteExistingNLBs()
		if err != nil {
			return errors.Wrapf(err, "failed to delete apiserver load balancer %q", name)
		}
		lb, err = s.createLB(spec)
		if err != nil {
			return errors.Wrapf(err, "failed to create apiserver load balancer %q", name)
		}
	} else {
		_, err := s.ELBV2Client.SetSubnets(&elbv2.SetSubnetsInput{
			LoadBalancerArn: &lb.ARN,
			Subnets:         aws.StringSlice(spec.SubnetIDs),
		})
		if err != nil {
			return errors.Wrapf(err, "failed to set subnets for apiserver load balancer '%s'", lb.Name)
		}
	}
}
```

Please keep in mind that this has not been tried yet.

Environment:

Philipp Schöppner [philipp.schoeppner@mercedes-benz.com](mailto:philipp.schoeppner@mercedes-benz.com), Mercedes-Benz Tech Innovation GmbH (Provider Information)

k8s-ci-robot commented 9 months ago

This issue is currently awaiting triage.

If CAPA/CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
nrb commented 7 months ago

NLBs are definitely difficult to modify, so deleting and recreating the LB makes sense. I'm concerned that this will be disruptive to the cluster overall, since the LB would disappear during normal operation.

Can you talk more about situations when you'd remove existing subnets from the LB?

nrb commented 7 months ago

> scaled down their control plane from three nodes in three separate AZs to one node in one AZ.

I had skipped past this while reading; given this scenario, I think it's reasonable to expect an interruption in the LB's availability. Does that make sense to you, @schoeppi5?

schoeppi5 commented 7 months ago

Hey @nrb, thanks for your question. Yes, unavailability in this scenario is expected and accepted behaviour, so we are fine with it. I also don't think there is much we could do about this issue, since it is a limitation of the NLB itself.

nrb commented 7 months ago

Cool, that makes sense. And you're right, we can't do a whole lot about it since it's how NLBs work.

I do think there's a concern that someone removes a subnet and _isn't_ doing such a downscaling operation. I'd like to give this a little more thought in terms of general risk. I'll also add it to the community meeting notes to highlight it for maintainers.

schoeppi5 commented 6 months ago

Hey @nrb, I wasn't able to join the community meeting on April 8th, but I can see that you added this issue to the meeting notes. Did you get a chance to discuss it and, if so, what was the outcome?

Were you able to give this topic some more thought? Is there anything I can do to assist?

nrb commented 6 months ago

Apologies - the community meeting did not happen on April 8 - we decided to resume the normal schedule on April 15.

I haven't given it a lot more thought, but I think this is fairly low risk given the scenarios where users would change things. I'll still bring it up and double check that others don't have objections.

nrb commented 6 months ago

From the community meeting:

- Would we be able to guarantee that the NLB will get the same IP?
- We may also want to make this an opt-in feature, since it is destructive and could surprise users.

Overall, no objections, though.
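
To make the opt-in idea concrete, one possible shape for such a guard is sketched below; the annotation key and helper are hypothetical, and nothing like this exists in CAPA today:

```go
// Hypothetical opt-in: only recreate the NLB when the user has explicitly
// acknowledged the disruption. The annotation key is made up for illustration.
const allowNLBRecreationAnnotation = "aws.cluster.x-k8s.io/allow-nlb-subnet-removal"

// shouldRecreateNLB returns true only when subnets are being removed and the
// owner has opted in to the destructive delete-and-recreate path.
func shouldRecreateNLB(annotations map[string]string, currentSubnets, desiredSubnets int) bool {
	optedIn := annotations[allowNLBRecreationAnnotation] == "true"
	return optedIn && currentSubnets > desiredSubnets
}
```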

k8s-triage-robot commented 2 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 1 month ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle rotten`
- Close this issue with `/close`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 2 weeks ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Reopen this issue with `/reopen`
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 2 weeks ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes-sigs/cluster-api-provider-aws/issues/4735#issuecomment-2420897711):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues according to the following rules:
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
> - Reopen this issue with `/reopen`
> - Mark this issue as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage][1]
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close not-planned
>
> [1]: https://www.kubernetes.dev/docs/guide/issue-triage/

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.