avoid reusing stopped network controllers

jcaamano commented 1 week ago

network controllers might not support being stopped and then started again as that was never part of their contract so avoid doing that. Also, for safety, ensure Cleanup for a network runs to completion before attempting to create a new controller for the same network.

cathy-zhou commented 1 week ago

@jcaamano the problem with this fix is that if a new NAD with same nadName comes in when the old NAD failed to be deleted, the current level driven retry mechanism will only keep trying readding the new NAD not retrying deleting the old NAD.

jcaamano commented 1 week ago

@jcaamano the problem with this fix is that if a new NAD with same nadName comes in when the old NAD failed to be deleted, the current level driven retry mechanism will only keep trying readding the new NAD not retrying deleting the old NAD.

I see. The real problem of that is if the new NAD has a different NetInfo than the old NAD. We could replace the old controller with a new one but we should not obviate the CleanUp of the old controller in that case because the new controller is not guaranteed to be able to or even attempt to cleanup stuff from the old controller, if, for example, the controllers are of different types. And while we could handle this at runtime, we won't be able to do it if it happens at a time where ovnk is not running because then we have lost the information of what type or NetInfo the old controller had. And that is why we should probably dump the NetInfo of a controller in OVN DB so we can restore that information on startup. I recall having this comment in the early days of multinet and I think is what we should be focusing on fixing.

jcaamano commented 1 week ago

@cathy-zhou

I fixed your concern by also checking the isDeleted flag in the top level sync. If a network cleanup fails forever then we will block new networks to be defined under the same name. We need to decide if we want a different behavior.

I also used the topology type that we were storing in OVN DB to consider old topologies stale and clean those old OVN networks. We can't do the same for now for cluster manager since we don't store any details about the old network.

jcaamano commented 1 week ago

@cathy-zhou

I fixed your concern by also checking the isDeleted flag in the top level sync. If a network cleanup fails forever then we will block new networks to be defined under the same name. We need to decide if we want a different behavior.

I also used the topology type that we were storing in OVN DB to consider old topologies stale and clean those old OVN networks. We can't do the same for now for cluster manager since we don't store any details about the old network.

https://github.com/ovn-org/ovn-kubernetes/pull/4276#discussion_r1644498499

Closing this since I became aware of an inherent problem for which we would need to change the approach

ovn-org / ovn-kubernetes

avoid reusing stopped network controllers #4449