Closed dulek closed 1 year ago
@dulek: No Bugzilla bug is referenced in the title of this pull request.
To reference a bug, add 'Bug XXX:' to the title of this pull request and request another bug refresh with /bugzilla refresh
.
/hold
Let's see if this helps at all.
/retest
This looks so much better!
wow, indeed much better!
This is backporting the following fixes right? Good thing both are already in 4.11 https://github.com/openshift/kuryr-kubernetes/pull/657 https://github.com/openshift/kuryr-kubernetes/pull/645
/retest
Also, do you think they should be separate PRs? It might facilitate verification
/retest
/retest
/retest
@dulek: This pull request references Bugzilla bug 11828, which is invalid:
Comment /bugzilla refresh
to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.
@dulek: No Bugzilla bug is referenced in the title of this pull request.
To reference a bug, add 'Bug XXX:' to the title of this pull request and request another bug refresh with /bugzilla refresh
.
@dulek: This pull request references Jira Issue OCPBUGS-11828, which is valid.
No GitHub users were found matching the public email listed for the QA contact in Jira (itbrown@redhat.com), skipping review request.
The bug has been updated to refer to the pull request using the external bug tracker.
Also, do you think they should be separate PRs? It might facilitate verification
I'd rather merge this as a single PR as otherwise we need to override CI to merge this and we don't have the signal that this actually works.
So this looks fairly good besides that it needs openshift/origin#27878.
openshift/origin#27878 is merged, let's try again
/retest
/hold cancel
/lgtm /label backport-risk-assessed
@dulek: you cannot LGTM your own PR.
@dulek: all tests passed!
Full PR test history. Your PR dashboard.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: dulek, gryf
The full list of commands accepted by this bot can be found here.
The pull request process is described here
/label cherry-pick-approved
@dulek: Jira Issue OCPBUGS-11828: All pull requests linked via external trackers have merged:
Jira Issue OCPBUGS-11828 has been moved to the MODIFIED state.
Fix included in accepted release 4.10.0-0.nightly-2023-04-21-212037
[ART PR BUILD NOTIFIER]
This PR has been included in build kuryr-controller-container-v4.10.0-202305011254.p0.g57479a4.assembly.stream for distgit kuryr-controller. All builds following this will include this PR.
[ART PR BUILD NOTIFIER]
This PR has been included in build kuryr-cni-container-v4.10.0-202305011254.p0.g57479a4.assembly.stream for distgit kuryr-cni. All builds following this will include this PR.
Recent versions of cri-o and containerd are passing K8S_POD_UID as a CNI argument, alongside with K8S_POD_NAMESPACE and K8S_POD_NAME. As both latter variables cannot be used to safely identify a pod in the API (StatefulSet recreates pods with the same name), we were prone to race conditions in the CNI code that we could only workaround. The end effect was mostly IP conflict.
Now that the UID argument is passed, we're able to compare the UID from the request with the one in the API to make sure we're wiring the correct pod. This commit implements that by making sure to move the check to the code actually waiting for the pod to appear in the registry. In case of K8S_POD_UID missing from the CNI request, API call to retrieve Pod is used as a fallback.
We also know that this check doesn't work for static pods, so CRD and controller needed to be updated to include information if the pod is static on the KuryrPort spec, so that we can skip the check for the static pods without the need to fetch Pod from the API.
It can happen that we get the CNI request, but pod gets deleted before kuryr-controller was able to create KuryrPort for it. If kuryr-daemon only watches for KuryrPort events it will not be able to notice that and will wait until the timeout, which in effect doesn't play well with some K8s tests.
This commit adds a separate Service that will watch on Pod events and if Pod gets deleted we'll make sure to put a sentinel value (None) into the registry so that the thread waiting for the KuryrPort to appear there will know that it has to stop and raise an error.