openshift / kuryr-kubernetes

kuryr-kubernetes - CNI plugin using OpenStack Neutron and Octavia to provide networking for pods and services.
Apache License 2.0
21 stars 23 forks source link

[release-4.10] Bug OCPBUGS-2216: Cleanup KuryrPort when Pod is missing #702

Closed dulek closed 1 year ago

dulek commented 2 years ago

We can easily imagine an user frustrated by his pod not getting deleted and opting to remove the finalizer from the Pod. If the cause of the deletion delay was the kuryr-controller being down, we end up with an orphaned KuryrPort. At the moment this causes crashes, which obviously it shouldn't. Moreover we should figure out how to clean up the Neutron port if that happens. This commit does so as explained below.

  1. KuryrPort on_present() will trigger its deletion when it detects that Pod does not longer exist.
  2. Turns out security_groups parameter passed to release_vif() was never used. I removed it from drivers and got rid of get_security_groups() call from on_finalize() as it's no longer necessary.
  3. When we cannot get the Pod in KuryrPort on_finalize() we attempt to gather info required to cleanup the KuryrPort and "mock" a Pod object. A precaution is added that any error from release_vif() is ignored in that case to make sure failed cleanup is not causing the system to go down.

Change-Id: Iaf48296ff28394823f68d58362bcc87d38a2cd42

openshift-ci-robot commented 2 years ago

@dulek: This pull request references Jira Issue OCPBUGS-2216, which is invalid:

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to [this](https://github.com/openshift/kuryr-kubernetes/pull/702): >We can easily imagine an user frustrated by his pod not getting deleted and opting to remove the finalizer from the Pod. If the cause of the deletion delay was the kuryr-controller being down, we end up with an orphaned KuryrPort. At the moment this causes crashes, which obviously it shouldn't. Moreover we should figure out how to clean up the Neutron port if that happens. This commit does so as explained below. > >1. KuryrPort on_present() will trigger its deletion when it detects that Pod does not longer exist. >2. Turns out security_groups parameter passed to release_vif() was never used. I removed it from drivers and got rid of get_security_groups() call from on_finalize() as it's no longer necessary. >3. When we cannot get the Pod in KuryrPort on_finalize() we attempt to gather info required to cleanup the KuryrPort and "mock" a Pod object. A precaution is added that any error from release_vif() is ignored in that case to make sure failed cleanup is not causing the system to go down. > >Change-Id: Iaf48296ff28394823f68d58362bcc87d38a2cd42 Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
openshift-ci[bot] commented 2 years ago

@dulek: No Bugzilla bug is referenced in the title of this pull request. To reference a bug, add 'Bug XXX:' to the title of this pull request and request another bug refresh with /bugzilla refresh.

In response to [this](https://github.com/openshift/kuryr-kubernetes/pull/702): >[release-4.10] Bug OCPBUGS-2216: Cleanup KuryrPort when Pod is missing Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
dulek commented 2 years ago

/retest

Weird, quite a lot of failures.

dulek commented 2 years ago

/retest /jira refresh

openshift-ci-robot commented 2 years ago

@dulek: This pull request references Jira Issue OCPBUGS-2216, which is valid.

6 validation(s) were run on this bug * bug is open, matching expected state (open) * bug target version (4.10.z) matches configured target version for branch (4.10.z) * bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST) * dependent bug [Jira Issue OCPBUGS-1713](https://issues.redhat.com//browse/OCPBUGS-1713) is in the state Verified, which is one of the valid states (VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE)) * dependent [Jira Issue OCPBUGS-1713](https://issues.redhat.com//browse/OCPBUGS-1713) targets the "4.11.z" version, which is one of the valid target versions: 4.11.0, 4.11.z * bug has dependents

Requesting review from QA contact: /cc @eurijon

In response to [this](https://github.com/openshift/kuryr-kubernetes/pull/702#issuecomment-1285157800): >/retest >/jira refresh Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
MaysaMacedo commented 1 year ago

/lgtm

dulek commented 1 year ago

/retest

dulek commented 1 year ago

/retest must-gather is missing

openshift-ci[bot] commented 1 year ago

@dulek: The /retest command does not accept any targets. The following commands are available to trigger required jobs:

Use /test all to run all jobs.

In response to [this](https://github.com/openshift/kuryr-kubernetes/pull/702#issuecomment-1325115212): >/retest must-gather is missing Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
dulek commented 1 year ago

/retest

dulek commented 1 year ago

/retest

Let's see if these issues are still the case.

MaysaMacedo commented 1 year ago

/retest

dulek commented 1 year ago

/hold

This has a tiny bug here, security_groups should not be passed: https://github.com/openshift/kuryr-kubernetes/blob/03b98adde2ddaa1fd500833cc6a2a7cf881ba964/kuryr_kubernetes/controller/handlers/kuryrport.py#L335

openshift-bot commented 1 year ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

gryf commented 1 year ago

/hold

dulek commented 1 year ago

I think we should wait for #721. I'll directly include the fix here.

dulek commented 1 year ago

/hold cancel

Alright, this now includes #725.

dulek commented 1 year ago

/remove-lifecycle stale

dulek commented 1 year ago

/retest

MaysaMacedo commented 1 year ago

/lgtm /approve

openshift-ci[bot] commented 1 year ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dulek, gryf, MaysaMacedo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files: - ~~[OWNERS](https://github.com/openshift/kuryr-kubernetes/blob/release-4.10/OWNERS)~~ [MaysaMacedo,dulek,gryf] Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment
dulek commented 1 year ago

/retest

dulek commented 1 year ago

/label backport-risk-assessed

openshift-ci[bot] commented 1 year ago

@dulek: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository. I understand the commands that are listed [here](https://go.k8s.io/bot-commands).
itzikb-redhat commented 1 year ago

/label cherry-pick-approved

openshift-ci-robot commented 1 year ago

@dulek: Jira Issue OCPBUGS-2216: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-2216 has been moved to the MODIFIED state.

In response to [this](https://github.com/openshift/kuryr-kubernetes/pull/702): >We can easily imagine an user frustrated by his pod not getting deleted and opting to remove the finalizer from the Pod. If the cause of the deletion delay was the kuryr-controller being down, we end up with an orphaned KuryrPort. At the moment this causes crashes, which obviously it shouldn't. Moreover we should figure out how to clean up the Neutron port if that happens. This commit does so as explained below. > >1. KuryrPort on_present() will trigger its deletion when it detects that Pod does not longer exist. >2. Turns out security_groups parameter passed to release_vif() was never used. I removed it from drivers and got rid of get_security_groups() call from on_finalize() as it's no longer necessary. >3. When we cannot get the Pod in KuryrPort on_finalize() we attempt to gather info required to cleanup the KuryrPort and "mock" a Pod object. A precaution is added that any error from release_vif() is ignored in that case to make sure failed cleanup is not causing the system to go down. > >Change-Id: Iaf48296ff28394823f68d58362bcc87d38a2cd42 Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.