openshift / openshift-sdn

Apache License 2.0
69 stars 63 forks source link

[DO NOT MERGE] subnet: add locking to fix duplicate subnet allocations (rh #1323275) #282

Closed dcbw closed 8 years ago

dcbw commented 8 years ago

The subnet allocator's GetNetwork() function (which allocates and returns a subnet for a node) can be called from multiple goroutines and thus needs locking to prevent duplicate subnet allocations.

danwinship commented 8 years ago

How does it get called from multiple goroutines?

Given that the bz bug mentions a HA setup, I assumed the problem was that we were running the code on multiple masters at once... (though I had thought someone convinced me at one point that that couldn't actually happen, for some reason that I don't remember).

pravisankar commented 8 years ago

I do not understand why we need this change? watchNodes() waits for node events and these events are processed serially and may call addNode()/deleteNode() which in turn will use subnet allocator to allocate/release subnetIP. Synchronization is implicitly provided by the event queue.

Is the HA setup right? RunSDNController() as part of openshift startControllers should be run on only one of the HA masters otherwise we will have race conditions, even this mutex lock won't help as this will provide synchronization within a single master but not across masters.

dcbw commented 8 years ago

Is the HA setup right? RunSDNController() as part of openshift startControllers should be run on only one of the HA masters otherwise we will have race conditions, even this mutex lock won't help as this will provide synchronization within a single master but not across masters.

You are correct; I missed the HA angle here and the reporter is definitely running HA. Doesn't the requirement of a single SDN master kinda subvert HA though? Do we need to modify openshift-ansible to configure only the first master with the SDN plugin?