Removal of the status object from the CRD results in Kuryr Controller crashing

bshephar commented 3 years ago

Description of problem: In some situations, it is necessary to forcefully delete Octavia LB's from OpenStack. In this situation, the way we prompt Kuryr to recreate them, is by removing the information from the status object in the kuryrloadbalancer CRD:

Which needs be changed to: {} $ oc get kuryrloadbalancer -n openshift-monitoring grafana -o jsonpath='{.status}' | jq . {}

If the user inadvertently deletes the status object though, this will force kuryr-controller to return a traceback that ultimately it is unable to recover from until the status object is returned.

Version-Release number of selected component (if applicable): bash-4.4$ rpm -qa | grep kuryr python3-kuryr-lib-1.1.1-0.20190923160834.41e6964.el8ost.noarch python3-kuryr-kubernetes-4.7.0-202101262230.p0.git.2494.cd95ce5.el8.noarch openshift-kuryr-controller-4.7.0-202101262230.p0.git.2494.cd95ce5.el8.noarch openshift-kuryr-common-4.7.0-202101262230.p0.git.2494.cd95ce5.el8.noarch

How reproducible: 100%

Steps to Reproduce:

Edit the kuryrloadbalancer CRD for one of the LB's: oc edit kuryrloadbalancer -n openshift-monitoring grafana Remove everything from status: down. Including the status: line

eg: [...]

name: https port: 3000 protocol: TCP targetPort: https project_id: e75466bcb2eb4cf590026be2d94d95ef provider: ovn security_groups_ids:
e9d30328-ea13-4434-9ed2-fe8f4ddb3173 subnet_id: 0b048882-9b6c-4a5d-97eb-e613645c90fd type: ClusterIP status: listeners:
id: ea42c50c-b86f-40d7-a98a-310b46f16b70 loadbalancer_id: 88648171-6441-4e16-8bd8-7959b9a52fae name: openshift-monitoring/grafana:TCP:3000 [...]

To this: [...]

name: https port: 3000 protocol: TCP targetPort: https project_id: e75466bcb2eb4cf590026be2d94d95ef provider: ovn security_groups_ids:
e9d30328-ea13-4434-9ed2-fe8f4ddb3173 subnet_id: 0b048882-9b6c-4a5d-97eb-e613645c90fd type: ClusterIP [...]

Observe kuryr-controller starts failing with the following traceback: 2021-03-01 22:35:00.876 1 ERROR kuryr_kubernetes.handlers.logging [-] Failed to handle event {'type': 'MODIFIED', 'object': {'apiVersion': 'openstack.org/v1', 'kind': 'KuryrLoadBalancer', 'metadata': {'creationTimestamp': '2021-03-01T06:08:28Z', 'finalizers': ['kuryr.openstack.org/kuryrloadbalancer-finalizers'], 'generation': 34, 'managedFields': [{'apiVersion': 'openstack.org/v1', 'fieldsType': 'FieldsV1', 'fieldsV1': {'f:metadata': {'f:finalizers': {'.': {}, 'v:"kuryr.openstack.org/kuryrloadbalancer-finalizers"': {}}}, 'f:spec': {'.': {}, 'f:endpointSlices': {}, 'f:ip': {}, 'f:ports': {}, 'f:project_id': {}, 'f:provider': {}, 'f:security_groups_ids': {}, 'f:subnet_id': {}, 'f:type': {}}}, 'manager': 'python-requests', 'operation': 'Update', 'time': '2021-03-01T22:30:36Z'}], 'name': 'grafana', 'namespace': 'openshift-monitoring', 'resourceVersion': '2140553', 'selfLink': '/apis/openstack.org/v1/namespaces/openshift-monitoring/kuryrloadbalancers/grafana', 'uid': '1e8a70c2-350d-418c-b876-152cbb7d2f4b'}, 'spec': {'endpointSlices': [{'endpoints': [{'addresses': ['10.128.57.183'], 'conditions': {'ready': True}, 'targetRef': {'kind': 'Pod', 'name': 'grafana-6f4d96d7fd-vm8sv', 'namespace': 'openshift-monitoring', 'resourceVersion': '63165', 'uid': '04630764-2c7e-4e86-a4e8-f986f26931cd'}}], 'ports': [{'name': 'https', 'port': 3000, 'protocol': 'TCP'}]}], 'ip': '172.30.88.169', 'ports': [{'name': 'https', 'port': 3000, 'protocol': 'TCP', 'targetPort': 'https'}], 'project_id': 'e75466bcb2eb4cf590026be2d94d95ef', 'provider': 'ovn', 'security_groups_ids': ['e9d30328-ea13-4434-9ed2-fe8f4ddb3173'], 'subnet_id': '0b048882-9b6c-4a5d-97eb-e613645c90fd', 'type': 'ClusterIP'}}}: KeyError: 'status' 2021-03-01 22:35:00.876 1 ERROR kuryr_kubernetes.handlers.logging Traceback (most recent call last): 2021-03-01 22:35:00.876 1 ERROR kuryr_kubernetes.handlers.logging File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/handlers/logging.py", line 37, in call 2021-03-01 22:35:00.876 1 ERROR kuryr_kubernetes.handlers.logging self._handler(event, *args, *kwargs) 2021-03-01 22:35:00.876 1 ERROR kuryr_kubernetes.handlers.logging File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/handlers/retry.py", line 80, in call 2021-03-01 22:35:00.876 1 ERROR kuryr_kubernetes.handlers.logging self._handler(event, args, **kwargs) 2021-03-01 22:35:00.876 1 ERROR kuryr_kubernetes.handlers.logging File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/handlers/k8s_base.py", line 84, in call 2021-03-01 22:35:00.876 1 ERROR kuryr_kubernetes.handlers.logging self.on_present(obj) 2021-03-01 22:35:00.876 1 ERROR kuryr_kubernetes.handlers.logging File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/controller/handlers/loadbalancer.py", line 65, in on_present 2021-03-01 22:35:00.876 1 ERROR kuryr_kubernetes.handlers.logging crd_lb = loadbalancer_crd['status'].get('loadbalancer') 2021-03-01 22:35:00.876 1 ERROR kuryr_kubernetes.handlers.logging KeyError: 'status' 2021-03-01 22:35:00.876 1 ERROR kuryr_kubernetes.handlers.logging 2021-03-01 22:35:01.243 1 ERROR kuryr_kubernetes.controller.managers.health [-] Component KuryrLoadBalancerHandler is dead.

Actual results: kuryr-controller crashes without the status object

Expected results: If the status object is required, it shouldn't be something that can be removed.

Additional info: I only tested this on OCP4.7. But I suspect it would be the same on 4.6. I don't think it's possible to make the status object required, since it's the default status object, eg: https://kubernetes.io/docs/concepts/overview/working-with-objects/kubernetes-objects/

Maybe we can just handle it better by creating it if we can't find it?

bshephar commented 3 years ago

If the status object is re-added as status: {}. Then Kuryr will get all the info from Octavia and repopulate it.

Maybe

if not loadbalancer_crd['status']:
    # recreate it as status: {}

bshephar commented 3 years ago

So, we spent a fair bit of time looking at this today. While it would probably work if we just added something like this:

❯ git diff                                                                                                                                                                                ─╯
diff --git a/kuryr_kubernetes/controller/handlers/loadbalancer.py b/kuryr_kubernetes/controller/handlers/loadbalancer.py
index c139b4f6..8be485a0 100644
--- a/kuryr_kubernetes/controller/handlers/loadbalancer.py
+++ b/kuryr_kubernetes/controller/handlers/loadbalancer.py
@@ -62,6 +62,9 @@ class KuryrLoadBalancerHandler(k8s_base.ResourceEventHandler):
                       loadbalancer_crd['metadata']['name'])
             return

+        if 'status' not in loadbalancer_crd:
+            loadbalancer_crd['status'] = {}
+
         crd_lb = loadbalancer_crd['status'].get('loadbalancer')
         if crd_lb:
             lb_provider = crd_lb.get('provider')

I think this would probably work as we would end up finding our way back here: https://github.com/openshift/kuryr-kubernetes/blob/master/kuryr_kubernetes/controller/handlers/loadbalancer.py#L711-L725

Which would recreate the status field. I haven't created a new Kuryr Controller container and tested that yet. But, it's very hacky. I think we could probably do this better by checking the health of SVC's. If we notice one doesn't work, we could query Octavia. If we get a "Doesn't exist" response back, we could patch status: {} and let the sync_lbaas_loadbalancer() function recreate it.

We discussed querying Octavia to validate LB's exist. But I think if we try to reconcile and query Octavia about every LB every few seconds we're going to crash Octavia, OVN or both for large clusters. Particularly because we're querying loadbalancers, listeners, pools, members, and some of those queries start asking Neutron for details, etc. It's a fair bit of excess for achieving the solution.

Happy for some additional feedback on this from the team. Would be great to know if you guys envision something different here?

If for whatever reason, the status has been removed from the CRD, then you can just patch it back into existence and things are happy from there:

oc patch -n openshift-monitoring klb grafana -p='{"status": {}}' --type=merge

openshift-bot commented 3 years ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot commented 3 years ago

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten /remove-lifecycle stale

openshift-bot commented 3 years ago

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen. Mark the issue as fresh by commenting /remove-lifecycle rotten. Exclude this issue from closing again by commenting /lifecycle frozen.

/close

openshift-ci[bot] commented 3 years ago

@openshift-bot: Closing this issue.

In response to [this](https://github.com/openshift/kuryr-kubernetes/issues/468#issuecomment-889918512): >Rotten issues close after 30d of inactivity. > >Reopen the issue by commenting `/reopen`. >Mark the issue as fresh by commenting `/remove-lifecycle rotten`. >Exclude this issue from closing again by commenting `/lifecycle frozen`. > >/close Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

openshift / kuryr-kubernetes

Removal of the status object from the CRD results in Kuryr Controller crashing #468