Make SDN watch resources more resilient

pravisankar commented 8 years ago

Depends on https://github.com/openshift/openshift-sdn/pull/293

danwinship commented 8 years ago

The commit "Refer resource names by constants" doesn't compile because you didn't change runEventQueue() to take a ResourceName instead of a string. (This gets fixed later on, but...)

pravisankar commented 8 years ago

On Mon, Apr 11, 2016 at 7:37 AM, Dan Winship notifications@github.com wrote:

The commit "Refer resource names by constants" doesn't compile because you didn't change runEventQueue() to take a ResourceName instead of a string. (This gets fixed later on, but...)

Ok, I will move the fix in the proper commit.

— You are receiving this because you authored the thread. Reply to this email directly or view it on GitHub https://github.com/openshift/openshift-sdn/pull/289#issuecomment-208378245

pravisankar commented 8 years ago

Extended networking tests passed. Manually simulated transient errors and repopulated the event queue every 3 mins during the testing for verification.

pravisankar commented 8 years ago

@openshift/networking PTAL

pravisankar commented 8 years ago

Rebased and resolved merge conflicts

danwinship commented 8 years ago

"Run Watch resources forever!" means that if there's bad data in etcd, we would get into an infinite loop spamming the log with error messages, right? (eg, if there's a node with IP "bob", we'd try to addNode() it, log "Invalid node IP 'bob'", then loop). In fact, it seems like in most cases, if we get an error once, we'd likely get it again after restarting (eg, "No subnets available")...

pravisankar commented 8 years ago

On Mon, Apr 25, 2016 at 1:19 PM, Dan Winship notifications@github.com wrote:

"Run Watch resources forever!" means that if there's bad data in etcd, we would get into an infinite loop spamming the log with error messages, right? (eg, if there's a node with IP "bob", we'd try to addNode() it, log "Invalid node IP 'bob'", then loop). In fact, it seems like in most cases, if we get an error once, we'd likely get it again after restarting (eg, "No subnets available")...

No, If we get an error from addNode(), we will log the error but we don't loop again. This change is mainly to ensure all our sdn watch routines are always up and running even in case of unforeseen errors (eventqueue pop returning an error or some panic while processing an event). Commit "SDN watch will try to fix transient errors" will try to process all the failed events periodically, events that are failed due to transient errors will be fixed but the ones that can't be fixed will spam the log.

— You are receiving this because you authored the thread. Reply to this email directly or view it on GitHub

https://github.com/openshift/openshift-sdn/pull/289#issuecomment-214507939

danwinship commented 8 years ago

generally lgtm though we need to finalize 293 first

dcbw commented 8 years ago

LGTM

openshift / openshift-sdn

Make SDN watch resources more resilient #289