Node in cluster ended up duplicating an IP address

justmark commented 1 month ago

I followed your article on Medium - great writeup.

When I applied the YAML to my cluster (there are 9 nodes), two of the nodes ended up with the same IP address. I could see this with arp-scan:

192.*.*.* *:*:*:*:*:* (Unknown) (DUP: 2)

Any thoughts on what the issue might be?

Thanks, Mark

patnordstrom commented 1 month ago

@justmark the script makes a best-effort attempt to avoid collisions, based on these lines: https://github.com/patnordstrom/lke-vlan/blob/main/prod/main.sh#L49-L52

It is possible, especially with a smaller CIDR block, that two nodes coming online at around the same time generate the exact same IP address through the random selector in the code block above. I had thought about adding a check to the generate_ip() function that detects this collision and retries the IP selection if it finds a duplicate. If I get some time this week I can add that as an additional check. As a workaround, you can update the configuration profile on one of the LKE worker nodes that has a duplicate IP to remove the VLAN, and then recycle the node. It should come back online and generate a new VLAN IP.
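
For illustration, the check-and-retry idea described above could be sketched roughly as follows. This is not the code in main.sh. The helper names pick_candidate and ip_in_use, the 10.0.0.x range, and the USED_IPS list are all assumptions made for the example; a real check would have to discover which addresses are already taken on the VLAN rather than read them from a local variable.

```shell
#!/bin/sh
# Hypothetical sketch only: the helper names, the 10.0.0.x range and the
# USED_IPS list are assumptions, not taken from main.sh.

PREFIX="10.0.0"               # assumed /24 VLAN range
USED_IPS="${USED_IPS:-}"      # space-separated addresses already taken
MAX_TRIES="${MAX_TRIES:-20}"  # give up after this many attempts

# Pick a random host address between .2 and .254.
pick_candidate() {
  # Two random bytes read as an unsigned integer (slight modulo bias ignored).
  n=$(od -An -N2 -tu2 /dev/urandom | tr -dc '0-9')
  echo "${PREFIX}.$(( $n % 253 + 2 ))"
}

# Succeed (return 0) if the given address appears in USED_IPS.
ip_in_use() {
  case " ${USED_IPS} " in
    *" $1 "*) return 0 ;;
    *) return 1 ;;
  esac
}

# Keep picking candidates until one is unused, or fail after MAX_TRIES.
generate_ip() {
  try=1
  while [ "$try" -le "$MAX_TRIES" ]; do
    candidate=$(pick_candidate)
    if ! ip_in_use "$candidate"; then
      echo "$candidate"
      return 0
    fi
    try=$(( try + 1 ))
  done
  echo "no free address after ${MAX_TRIES} tries" >&2
  return 1
}
```

Even with a check like this, two nodes that pick and test the same address at the same instant can still collide, so a retry narrows the window rather than closing it.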

justmark commented 1 month ago

@patnordstrom Thanks for the reply. I will recycle the node, as suggested.

Am I following this correctly? It appears that the individual pod is part of the VLAN rather than the node itself. If this is correct, does this mean that I need to use your example code in the /dev folder to build out an application that will join the VLAN, and then launch my application within the same pod?

Thanks, Mark

patnordstrom commented 1 month ago

@justmark the deployment joins the worker nodes to the VLAN. The /dev folder is only used for local development of the script, as a convenience. The only real difference is that it can pull in values from a configuration file instead of relying on the configuration objects in the Kubernetes cluster (e.g. the ConfigMap and Secret values). The main.sh script runs in a container as a DaemonSet so that it can register each node to the VLAN.

justmark commented 1 month ago

@patnordstrom Ok. I will need to track this down further then. My app, which runs in its own pods, seems to be using the private IP address rather than the VLAN address picked up by the vlan-join-controller.

Mark

justmark commented 1 month ago

@patnordstrom I connected to one of the vlan-controllers directly and ran ip addr show, but I didn't see the VLAN listed there (just the internal IP). Clearly I am confused... I then ran kubectl describe node node-name and didn't see anything referencing the IP address for my VLAN. How/where can I see this level of detail?

Mark

patnordstrom commented 1 month ago

@justmark I did find a bug while testing this yesterday that prevented IPs from being generated properly for the VLAN; that has been fixed. You can pull the latest version of the codebase for testing. The VLAN IP address is not going to show up when you run node commands with kubectl because it isn't configured in the cluster networking by default. If you connect via SSH to the node itself, you can run ip addr show to see that eth1 is set up for VLAN connectivity. You can also see this in Cloud Manager by looking at the node details, under the "configuration" tab.
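
As a rough illustration of the SSH check above, a small helper could pull the IPv4 address out of the one-line output of ip. The function names first_ipv4 and vlan_ip are made up for this example, and eth1 is assumed to be the VLAN interface, as described above.

```shell
#!/bin/sh
# Hypothetical helpers: the names are illustrative, and eth1 is assumed
# to be the VLAN interface.

# Read `ip -4 -o addr show` output on stdin and print each address
# without its prefix length (the fourth field looks like 10.0.0.5/24).
first_ipv4() {
  awk '{ split($4, parts, "/"); print parts[1] }'
}

# Print the IPv4 address configured on an interface (default: eth1).
vlan_ip() {
  ip -4 -o addr show dev "${1:-eth1}" | first_ipv4
}
```

Running vlan_ip on the node prints nothing if eth1 has no IPv4 address, which would suggest the VLAN interface was not configured.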

I have an additional write-up here that I think might help you. Please have a look; hopefully it resolves your issues: https://akamai-presentations.us-southeast-1.linodeobjects.com/1f3c51e4-55d3-4474-bf12-a7af58ec096d_More_Configuration_and_Testing_Tips_for_LKE_VLAN.pdf

justmark commented 1 month ago

@patnordstrom Fantastic. I will try and give this a spin later today. I really appreciate your help!

Thanks :)

justmark commented 1 month ago

@patnordstrom Ok, so I pulled version 1.3. I can see the VLAN addresses now, and when I connect to a shell, I can see that eth1 is using the correct VLAN. I'll try spinning up my applications tomorrow, but everything looks to be working so far! This is fantastic, thanks!

Mark

patnordstrom / lke-vlan
