openshift / openshift-sdn

Apache License 2.0
69 stars 63 forks source link

Fix SDN get and watch resource workflow #241

Closed pravisankar closed 8 years ago

pravisankar commented 8 years ago

Trello card: https://trello.com/c/hB3SyOLw Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1275904

pravisankar commented 8 years ago

Tested on multi-node dev cluster

danwinship commented 8 years ago

So I guess the overall idea is that NewListWatchFromClient() does what watchAndGetResource() was trying to do, except without the bug?

The patches seem good, though I want to look through some parts of it again (and other people looking at them would still be great).

dcbw commented 8 years ago

Looks like a good cleanup to me, with the exception of the 5 second wait for service watching. That I wish we could make more reliable.

danwinship commented 8 years ago

assuming it passes tests/extended/networking.sh, and fixes 1275904, then LGTM

pravisankar commented 8 years ago

Rebased and patched on top of master branch. Added some more changes to ditch ugly 5 second wait for service watching. Ready for review/merge. @openshift/networking PTAL

pravisankar commented 8 years ago

Tested and passed extended networking tests (test/extended/networking.sh)

dcbw commented 8 years ago

Good cleanups...

Should we log something in the Watch* functions when the eventQueue fails, as long as the failure isn't "I'm done"? Otherwise transient errors could quietly kill the event queues and we'll be none-the-wiser from the logs. Also, I assume that's how we know when to terminate, when eventQueue.Pop() returns some "done" error?

Next outside the registry, how does subnets.go::watchSubnets() or watchNodes() know when to break out of the for() loop now that 'oc.stop' is gone?

One behavioral change (though I'm not sure if it matters?) is that now on startup, the "get all the things" calls will no longer block the main goroutine but instead are now done from a goroutine. That might cause race issues, since I think most of the stuff is currently done synchronously on startup. Not sure though?

danwinship commented 8 years ago

One behavioral change (though I'm not sure if it matters?) is that now on startup, the "get all the things" calls will no longer block the main goroutine but instead are now done from a goroutine. That might cause race issues, since I think most of the stuff is currently done synchronously on startup. Not sure though?

I think this will cause problems with endpoint filtering: the first call to OnEndpointsUpdate() might happen before WatchPods() has completely filled in registry.namespaceOfPodIP, causing warnings and incorrect filtering. It will eventually recover (since it gets asked to filter the entire list of endpoints every time any service changes), but it might cause problems at startup.

danwinship commented 8 years ago

Other than that though, LGTM

pravisankar commented 8 years ago

On Wed, Apr 6, 2016 at 1:56 PM, Dan Williams notifications@github.com wrote:

Good cleanups...

Should we log something in the Watch* functions when the eventQueue fails, as long as the failure isn't "I'm done"? Otherwise transient errors could quietly kill the event queues and we'll be none-the-wiser from the logs. Also, I assume that's how we know when to terminate, when eventQueue.Pop() returns some "done" error?

eventQueue.Pop() is a blocking call and will wait if there are no events in the queue but yes, it could fail due to various transient errors. Logging the error will definitely help us during investigating an issue. We could as well log the error and restart the event queue to overcome any transient failures.

Next outside the registry, how does subnets.go::watchSubnets() or watchNodes() know when to break out of the for() loop now that 'oc.stop' is gone?

oc.Stop() was never called before and currently there is no good way to call this method. These goroutines will be terminated/killed when the openshift service is stopped (current behavior).

One behavioral change (though I'm not sure if it matters?) is that now on startup, the "get all the things" calls will no longer block the main goroutine but instead are now done from a goroutine. That might cause race issues, since I think most of the stuff is currently done synchronously on startup. Not sure though?

Yes, some of the synchronous stuff is done async now. We need to synchronize if async goroutines depend on each other. populateVNIDMap() is added to address dependency between watchNetNamespaces and watchServices. As Winship pointed out in the next comment, I missed on populating namespaceOfPodIP that will be needed by proxy. I will fix this issue.

More challenging issue is synchronizing sdn master/node, I have seen 2 bugs (https://bugzilla.redhat.com/show_bug.cgi?id=1323279 and https://bugzilla.redhat.com/show_bug.cgi?id=1322130) where the node is yet to receive NetNamespace event but the service or pod-setup checks for vnid and fails. This can happen when the master is overloaded or too many projects are created in a short period.

— You are receiving this because you authored the thread. Reply to this email directly or view it on GitHub https://github.com/openshift/openshift-sdn/pull/241#issuecomment-206564250

pravisankar commented 8 years ago

Updated, fixed endpoints filtering during startup. Prefer to tackle other suggestions in a separate PR. Mainly to make watch routines resilient to transient errors and more logging. Winship did some logging work in https://github.com/openshift/openshift-sdn/pull/286 @openshift/networking PTAL

danwinship commented 8 years ago

LGTM now I think

pravisankar commented 8 years ago

Rebased and resolved merge conflicts. @openshift/networking please review/merge