bshephar closed this issue 3 years ago
If the status object is re-added as status: {}, then Kuryr will get all the info from Octavia and repopulate it. Maybe something like:

    if not loadbalancer_crd['status']:
        # recreate it as status: {}
So, we spent a fair bit of time looking at this today. While it would probably work if we just added something like this:
$ git diff
diff --git a/kuryr_kubernetes/controller/handlers/loadbalancer.py b/kuryr_kubernetes/controller/handlers/loadbalancer.py
index c139b4f6..8be485a0 100644
--- a/kuryr_kubernetes/controller/handlers/loadbalancer.py
+++ b/kuryr_kubernetes/controller/handlers/loadbalancer.py
@@ -62,6 +62,9 @@ class KuryrLoadBalancerHandler(k8s_base.ResourceEventHandler):
                       loadbalancer_crd['metadata']['name'])
             return
 
+        if 'status' not in loadbalancer_crd:
+            loadbalancer_crd['status'] = {}
+
         crd_lb = loadbalancer_crd['status'].get('loadbalancer')
         if crd_lb:
             lb_provider = crd_lb.get('provider')
I think this would probably work, as we would end up finding our way back here: https://github.com/openshift/kuryr-kubernetes/blob/master/kuryr_kubernetes/controller/handlers/loadbalancer.py#L711-L725, which would recreate the status field. I haven't built a new kuryr-controller container to test that yet, but it's very hacky. I think we could do this better by checking the health of SVCs: if we notice one doesn't work, we could query Octavia, and if we get a "doesn't exist" response back, we could patch status: {} and let the sync_lbaas_loadbalancer() function recreate it.
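As a rough illustration of that idea (the helper names here are hypothetical stand-ins for whatever health probe, Octavia client, and Kubernetes patch calls we'd actually use, not Kuryr's real API):

```python
# Hypothetical sketch of the proposed reconciliation, not Kuryr's actual code.
# svc_is_healthy, octavia_lb_exists and patch_crd_status are injected
# stand-ins so the flow is clear (and unit-testable with fakes).

def reconcile_services(services, svc_is_healthy, octavia_lb_exists,
                       patch_crd_status):
    for svc in services:
        if svc_is_healthy(svc):
            continue  # healthy SVCs never hit Octavia, keeping API load low
        if not octavia_lb_exists(svc):
            # Octavia says the LB is gone: reset the CRD status so
            # sync_lbaas_loadbalancer() recreates everything from scratch.
            patch_crd_status(svc, {})
```

The point of this shape is that Octavia is only queried for services that already look broken, rather than polling every LB on a timer.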
We discussed querying Octavia to validate that LBs exist. But if we try to reconcile by querying Octavia about every LB every few seconds, we're going to crash Octavia, OVN, or both on large clusters, particularly because we'd be querying load balancers, listeners, pools, and members, and some of those queries in turn ask Neutron for details, etc. That's a lot of overhead for this fix.
Happy for some additional feedback on this from the team. Would be great to know if you guys envision something different here?
If, for whatever reason, the status has been removed from the CRD, then you can just patch it back into existence and things are happy from there:
oc patch -n openshift-monitoring klb grafana -p='{"status": {}}' --type=merge
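For clarity, that merge patch only needs to set .status back to an empty object. In Python terms, the top-level merge behaves roughly like this (a minimal illustration of the merge-patch payload, not the oc implementation):

```python
import json

# CRD as fetched from the API server after someone deleted .status
crd = {"metadata": {"name": "grafana", "namespace": "openshift-monitoring"}}

# The -p payload passed to `oc patch --type=merge` above
patch = json.loads('{"status": {}}')

# A JSON merge patch adds/replaces the top-level keys from the patch body
crd.update(patch)
assert crd["status"] == {}  # Kuryr can now repopulate it from Octavia
```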
Issues go stale after 90d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle stale

Stale issues rot after 30d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle rotten /remove-lifecycle stale

Rotten issues close after 30d of inactivity.
Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.
/close
@openshift-bot: Closing this issue.
Description of problem: In some situations it is necessary to forcefully delete Octavia LBs from OpenStack. In that case, the way we prompt Kuryr to recreate them is by removing the information from the status object in the KuryrLoadBalancer CRD:
Which needs to be changed to {}:

$ oc get kuryrloadbalancer -n openshift-monitoring grafana -o jsonpath='{.status}' | jq .
{}
If the user inadvertently deletes the status object entirely, though, kuryr-controller raises a traceback that it is unable to recover from until the status object is restored.
Version-Release number of selected component (if applicable):

bash-4.4$ rpm -qa | grep kuryr
python3-kuryr-lib-1.1.1-0.20190923160834.41e6964.el8ost.noarch
python3-kuryr-kubernetes-4.7.0-202101262230.p0.git.2494.cd95ce5.el8.noarch
openshift-kuryr-controller-4.7.0-202101262230.p0.git.2494.cd95ce5.el8.noarch
openshift-kuryr-common-4.7.0-202101262230.p0.git.2494.cd95ce5.el8.noarch
How reproducible: 100%
Steps to Reproduce:
eg: [...]
To this: [...]
Actual results: kuryr-controller crashes without the status object
Expected results: If the status object is required, it shouldn't be something that can be removed.
Additional info: I only tested this on OCP 4.7, but I suspect it would be the same on 4.6. I don't think it's possible to make the status object required, since it's the default status object, eg: https://kubernetes.io/docs/concepts/overview/working-with-objects/kubernetes-objects/
Maybe we can just handle it better by creating it if we can't find it?
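A defensive default along the lines of the diff above could be as simple as the following (a sketch mirroring the proposed change, not the final patch):

```python
def ensure_status(loadbalancer_crd):
    # Mirror of the proposed diff: if .status was deleted from the CRD,
    # recreate it as an empty dict so the later
    # loadbalancer_crd['status'].get('loadbalancer') lookup can't KeyError,
    # and the sync code then repopulates it from Octavia.
    loadbalancer_crd.setdefault('status', {})
    return loadbalancer_crd

crd = {'metadata': {'name': 'grafana'}}  # status object missing
assert ensure_status(crd)['status'] == {}
```

Using setdefault also leaves an already-populated status untouched, so the check is safe to run on every event.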