Ensure NodeIP doesn't overlap with the cluster network

openshift / openshift-sdn

Apache License 2.0

69 stars 63 forks source link

Ensure NodeIP doesn't overlap with the cluster network #245

Closed dcbw closed 8 years ago

dcbw commented 8 years ago

If a HostSubnet is edited or the node's IP address is changed such that the NodeIP is contained within its own SubnetCIDR, a routing loop will be created in the OVS bridge and cause a kernel panic by exhausing kernel stack space.

Prevent the cluster from starting when this happens by ensuring that the NodeIP is not contained within the cluster network CIDR.

https://bugzilla.redhat.com/show_bug.cgi?id=1295486

danwinship commented 8 years ago

Looks good, although should this be validated by the node rather than the master? I guess if the master fails then the node won't ever be able to finish starting up, so having the master validate it will ensure that the node doesn't create the routing loop anyway.

pravisankar commented 8 years ago

Yes, This will handle node IP conflicts with cluster network for exiting subnets but we also need to handle one more case. We allow node IP changes. We watch for this change and when this happens, we recreate the subnet. New subnet creation in this case should be only allowed if the new node IP doesn't conflict with cluster network.

dcbw commented 8 years ago

Fixed it up so that node events are now checked for validity too.

danwinship commented 8 years ago

~~LGTM~~

danwinship commented 8 years ago

should this be validated by the node rather than the master?

dcbw realized later that the answer to this is probably "yes", because if the master exits immediately when the data in etcd is invalid, then you can't use "oc" to fix it.

dcbw commented 8 years ago

I chose to do two sides to this. The master should warn about invalid Node IPs and refuse to create HostSubnet objects for them. The node should exit if its address is bad, and should ignore invalid HostSubnets from other nodes so we don't create flow rules for them.

dcbw commented 8 years ago

@danwinship @marun repushed; now the Registry caches the clusterNetwork so we can get rid of oc.clusterNetwork.

dcbw commented 8 years ago

And now broken up into separate commits for easier reviewing...

dcbw commented 8 years ago

@pravisankar fixed your comment, PTAL

danwinship commented 8 years ago

LGTM. Have you run origin's tests with this code yet?

pravisankar commented 8 years ago

Rest of the changes, LGTM

dcbw commented 8 years ago

Reworked to address Ravi's comments.

dcbw commented 8 years ago

@pravisankar PTAL

knobunc commented 8 years ago

LGTM

dcbw commented 8 years ago

@danwinship in passes extended tests in DIND at least.

pravisankar commented 8 years ago

LGTM