openshift / kuryr-kubernetes

kuryr-kubernetes - CNI plugin using OpenStack Neutron and Octavia to provide networking for pods and services.
Apache License 2.0
21 stars 23 forks source link

OCPBUGS-11828: CNI: Watch for deleted pods #712

Closed dulek closed 1 year ago

dulek commented 1 year ago

Recent versions of cri-o and containerd are passing K8S_POD_UID as a CNI argument, alongside with K8S_POD_NAMESPACE and K8S_POD_NAME. As both latter variables cannot be used to safely identify a pod in the API (StatefulSet recreates pods with the same name), we were prone to race conditions in the CNI code that we could only workaround. The end effect was mostly IP conflict.

Now that the UID argument is passed, we're able to compare the UID from the request with the one in the API to make sure we're wiring the correct pod. This commit implements that by making sure to move the check to the code actually waiting for the pod to appear in the registry. In case of K8S_POD_UID missing from the CNI request, API call to retrieve Pod is used as a fallback.

We also know that this check doesn't work for static pods, so CRD and controller needed to be updated to include information if the pod is static on the KuryrPort spec, so that we can skip the check for the static pods without the need to fetch Pod from the API.

It can happen that we get the CNI request, but pod gets deleted before kuryr-controller was able to create KuryrPort for it. If kuryr-daemon only watches for KuryrPort events it will not be able to notice that and will wait until the timeout, which in effect doesn't play well with some K8s tests.

This commit adds a separate Service that will watch on Pod events and if Pod gets deleted we'll make sure to put a sentinel value (None) into the registry so that the thread waiting for the KuryrPort to appear there will know that it has to stop and raise an error.

openshift-ci[bot] commented 1 year ago

@dulek: No Bugzilla bug is referenced in the title of this pull request. To reference a bug, add 'Bug XXX:' to the title of this pull request and request another bug refresh with /bugzilla refresh.

In response to [this](https://github.com/openshift/kuryr-kubernetes/pull/712): >CNI: Watch for deleted pods Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
dulek commented 1 year ago

/hold

Let's see if this helps at all.

dulek commented 1 year ago

/retest

This looks so much better!

MaysaMacedo commented 1 year ago

wow, indeed much better!

This is backporting the following fixes right? Good thing both are already in 4.11 https://github.com/openshift/kuryr-kubernetes/pull/657 https://github.com/openshift/kuryr-kubernetes/pull/645

MaysaMacedo commented 1 year ago

/retest

MaysaMacedo commented 1 year ago

Also, do you think they should be separate PRs? It might facilitate verification

MaysaMacedo commented 1 year ago

/retest

MaysaMacedo commented 1 year ago

/retest

dulek commented 1 year ago

/retest

openshift-ci[bot] commented 1 year ago

@dulek: This pull request references Bugzilla bug 11828, which is invalid:

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to [this](https://github.com/openshift/kuryr-kubernetes/pull/712): >Bug 11828: CNI: Watch for deleted pods Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
openshift-ci[bot] commented 1 year ago

@dulek: No Bugzilla bug is referenced in the title of this pull request. To reference a bug, add 'Bug XXX:' to the title of this pull request and request another bug refresh with /bugzilla refresh.

In response to [this](https://github.com/openshift/kuryr-kubernetes/pull/712): >OCPBUGS-11828: CNI: Watch for deleted pods Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
openshift-ci-robot commented 1 year ago

@dulek: This pull request references Jira Issue OCPBUGS-11828, which is valid.

6 validation(s) were run on this bug * bug is open, matching expected state (open) * bug target version (4.10.z) matches configured target version for branch (4.10.z) * bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST) * dependent bug [Bugzilla Bug 1983056](https://bugzilla.redhat.com/show_bug.cgi?id=1983056) is in the state CLOSED (ERRATA), which is one of the valid states (VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE)) * dependent [Bugzilla Bug 1983056](https://bugzilla.redhat.com/show_bug.cgi?id=1983056) targets the "4.11.0" version, which is one of the valid target versions: 4.11.0, 4.11.z * bug has dependents

No GitHub users were found matching the public email listed for the QA contact in Jira (itbrown@redhat.com), skipping review request.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to [this](https://github.com/openshift/kuryr-kubernetes/pull/712): >Recent versions of cri-o and containerd are passing K8S_POD_UID as a CNI >argument, alongside with K8S_POD_NAMESPACE and K8S_POD_NAME. As both >latter variables cannot be used to safely identify a pod in the API >(StatefulSet recreates pods with the same name), we were prone to race >conditions in the CNI code that we could only workaround. The end effect >was mostly IP conflict. > >Now that the UID argument is passed, we're able to compare the UID from >the request with the one in the API to make sure we're wiring the >correct pod. This commit implements that by making sure to move the >check to the code actually waiting for the pod to appear in the >registry. In case of K8S_POD_UID missing from the CNI request, API call >to retrieve Pod is used as a fallback. > >We also know that this check doesn't work for static pods, so CRD and >controller needed to be updated to include information if the pod is >static on the KuryrPort spec, so that we can skip the check for the >static pods without the need to fetch Pod from the API. > >It can happen that we get the CNI request, but pod gets deleted before >kuryr-controller was able to create KuryrPort for it. If kuryr-daemon >only watches for KuryrPort events it will not be able to notice that >and will wait until the timeout, which in effect doesn't play well with >some K8s tests. > >This commit adds a separate Service that will watch on Pod events and if >Pod gets deleted we'll make sure to put a sentinel value (None) into the >registry so that the thread waiting for the KuryrPort to appear there >will know that it has to stop and raise an error. Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
dulek commented 1 year ago

Also, do you think they should be separate PRs? It might facilitate verification

I'd rather merge this as a single PR as otherwise we need to override CI to merge this and we don't have the signal that this actually works.

dulek commented 1 year ago

So this looks fairly good besides that it needs openshift/origin#27878.

dulek commented 1 year ago

openshift/origin#27878 is merged, let's try again

/retest

dulek commented 1 year ago

/hold cancel

dulek commented 1 year ago

/lgtm /label backport-risk-assessed

openshift-ci[bot] commented 1 year ago

@dulek: you cannot LGTM your own PR.

In response to [this](https://github.com/openshift/kuryr-kubernetes/pull/712#issuecomment-1513272166): >/lgtm >/label backport-risk-assessed Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
openshift-ci[bot] commented 1 year ago

@dulek: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository. I understand the commands that are listed [here](https://go.k8s.io/bot-commands).
openshift-ci[bot] commented 1 year ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dulek, gryf

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files: - ~~[OWNERS](https://github.com/openshift/kuryr-kubernetes/blob/release-4.10/OWNERS)~~ [dulek,gryf] Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment
itzikb-redhat commented 1 year ago

/label cherry-pick-approved

openshift-ci-robot commented 1 year ago

@dulek: Jira Issue OCPBUGS-11828: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-11828 has been moved to the MODIFIED state.

In response to [this](https://github.com/openshift/kuryr-kubernetes/pull/712): >Recent versions of cri-o and containerd are passing K8S_POD_UID as a CNI >argument, alongside with K8S_POD_NAMESPACE and K8S_POD_NAME. As both >latter variables cannot be used to safely identify a pod in the API >(StatefulSet recreates pods with the same name), we were prone to race >conditions in the CNI code that we could only workaround. The end effect >was mostly IP conflict. > >Now that the UID argument is passed, we're able to compare the UID from >the request with the one in the API to make sure we're wiring the >correct pod. This commit implements that by making sure to move the >check to the code actually waiting for the pod to appear in the >registry. In case of K8S_POD_UID missing from the CNI request, API call >to retrieve Pod is used as a fallback. > >We also know that this check doesn't work for static pods, so CRD and >controller needed to be updated to include information if the pod is >static on the KuryrPort spec, so that we can skip the check for the >static pods without the need to fetch Pod from the API. > >It can happen that we get the CNI request, but pod gets deleted before >kuryr-controller was able to create KuryrPort for it. If kuryr-daemon >only watches for KuryrPort events it will not be able to notice that >and will wait until the timeout, which in effect doesn't play well with >some K8s tests. > >This commit adds a separate Service that will watch on Pod events and if >Pod gets deleted we'll make sure to put a sentinel value (None) into the >registry so that the thread waiting for the KuryrPort to appear there >will know that it has to stop and raise an error. Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
openshift-merge-robot commented 1 year ago

Fix included in accepted release 4.10.0-0.nightly-2023-04-21-212037

openshift-bot commented 1 year ago

[ART PR BUILD NOTIFIER]

This PR has been included in build kuryr-controller-container-v4.10.0-202305011254.p0.g57479a4.assembly.stream for distgit kuryr-controller. All builds following this will include this PR.

openshift-bot commented 1 year ago

[ART PR BUILD NOTIFIER]

This PR has been included in build kuryr-cni-container-v4.10.0-202305011254.p0.g57479a4.assembly.stream for distgit kuryr-cni. All builds following this will include this PR.