outscale / cluster-api-provider-outscale

[Bug]: Broken LBU impacts current prod #380

Open pierreozoux opened 1 week ago

pierreozoux commented 1 week ago

What happened

It seems to me that the LBU is somehow broken, and our TAM Ilane confirmed to me that this is currently a known bug on your side. However, I don't know the bug number in your tracker.

Basically, the security group on an LBU doesn't work for a new node until we attach a public IP.

Steps to reproduce

It currently impacts us in two ways:

Node reboot

The first time it happened was in April 2024. A control plane node rebooted and came back up, but it couldn't reach the kubeAPI LBU. To investigate, I attached a public IP to the node so I could debug it, and suddenly it worked again. To me, everything seems correct in terms of security groups, so I don't understand why attaching a public IP would solve the issue. But somehow it did.

Node upgrade

Now I want to upgrade from 1.27 to 1.28, but when the new control plane node appears, it can't reach the LBU. The same thing happens as before: as soon as I attach a public IP, it suddenly can.

Expected to happen

I'd like to be able to reboot a node or upgrade my cluster.

Add anything

The internal support tickets on your side for this are: 374117, 378531

The fact that I can't upgrade is already worrying, but the fact that my cluster will go down again if a node reboots worries me even more.

cluster-api output

NA

Environment

- Kubernetes version: (use `kubectl version`): 1.27.9
- OS (e.g. from `/etc/os-release`): NA
- Kernel (e.g. `uname -a`): NA
- cluster-api-provider-outscale version: v0.3.1
- cluster-api version: v1.5.5
- Install tools: NA
- Kubernetes Distribution: NA
- Kubernetes Distribution version: NA

outscale-hmi commented 4 days ago

Thank you for providing details on the connectivity issues with the load balancer after a node reboot or upgrade. Based on our review, here is a likely cause:

Attaching a public IP triggers a reconfiguration or refresh of the network settings on the node and potentially on the LBU. This refresh makes the necessary routes or security group rules effective, enabling the node to communicate through the LBU as expected. In other words, adding a public IP forces the LBU or the network layer to re-evaluate the connection, which temporarily resolves the issue.

To solve the issue permanently, we could implement `reconcileNetworkAttachment` logic that periodically checks the network attachment status of the load balancer. It would ensure that all necessary network attachments (e.g., routes, security group rules) are in place and consistent for each node, even as nodes are created, rebooted, or updated.

pierreozoux commented 1 day ago

@outscale-hmi thanks for your answer. I'm not sure I understand everything you said, and I'd love to spend 30 to 60 minutes with you to discuss this topic.

Do you have a Matrix account somewhere? Or an email?

It seems to me you are acknowledging the LBU bug but are not considering fixing it there. I think it was working at some point before April 2024, so I think it is a regression that could be fixed, and I think the best place to fix it is the LBU. But I don't have access to it; you do.

This bug, combined with https://github.com/outscale/cluster-api-provider-outscale/issues/383, means we are stuck with this Cluster API setup on Outscale.

pierreozoux commented 1 day ago

Actually, I think it is more the network infrastructure/security groups/VPC around the LBU than the LBU itself.