pravisankar closed 8 years ago
@danwinship @dcbw PTAL
LGTM
What are the implications of this: you loop for 10 seconds, and then what happens? Does openshift exit in an error state, or does it go on to mark itself as ready without being on the SDN? Sorry if this is a dumb question.
It will exit with an error.
Maybe we need to distinguish "master is not running" (in which case keep waiting [but also handle SIGTERM]) from "master is running but it's not assigning a subnet to this host".
FWIW, the systemd service for the node is Restart=always (perhaps we should change that to on-failure?) so systemd will restart the node indefinitely. This is mainly because it's a distributed system and if you reboot an entire environment there's no guarantee that the master is up before your node tries to start.
A 'Master is not running' message would be helpful, but I don't think we want to keep looping in any case. There might be a bug in subnet allocation on the master, the subnet entry might no longer exist in etcd, or the API call might have failed for other reasons (network, auth...). Even if the master is not up before the node, I think it's okay for the node to fail and restart a few times.
Fix related to https://bugzilla.redhat.com/show_bug.cgi?id=1194467