brianmcarey opened this issue 1 year ago
/assign
It looks like there is a timeout calling the kubevirt operator webhook
10:50:10: Test Panicked
10:50:10: In [SynchronizedBeforeSuite] at: tests/util/util.go:20
10:50:10:
10:50:10: Internal error occurred: failed calling webhook "kubevirt-update-validator.kubevirt.io": Post "https://kubevirt-operator-webhook.kubevirt.svc:443/kubevirt-validate-update?timeout=10s": context deadline exceeded
I tried to reproduce this issue locally using the kind-1.23 provider but was unable to hit it - it may be due to things taking a little longer on the CI clusters.
It has also been seen on the ARM cluster which is not carrying any load at the moment. https://github.com/kubevirt/project-infra/pull/2351#issuecomment-1301764102
Increasing the webhook timeout did not resolve the issue -
Internal error occurred: failed calling webhook "kubevirt-update-validator.kubevirt.io": Post "https://kubevirt-operator-webhook.kubevirt.svc:443/kubevirt-validate-update?timeout=20s": context deadline exceeded
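For reference, the webhook timeout that was increased here is the timeoutSeconds field on the ValidatingWebhookConfiguration. A minimal sketch of where it lives - the webhook name, service name, namespace, and path are taken from the error message above, but the surrounding object layout (metadata name etc.) is an assumption for illustration:

```yaml
# Sketch: the timeout in the "?timeout=20s" query string above comes
# from this field. The metadata name below is hypothetical.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: kubevirt-update-validator   # hypothetical, for illustration
webhooks:
  - name: kubevirt-update-validator.kubevirt.io
    clientConfig:
      service:
        name: kubevirt-operator-webhook
        namespace: kubevirt
        path: /kubevirt-validate-update
        port: 443
    timeoutSeconds: 20   # Kubernetes caps this at 30 seconds
```

Since raising it from the 10s default to 20s did not help, the request is presumably not slow but blocked outright, which points at networking rather than webhook latency.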
The vgpu lane, which is also a kind-based lane, runs successfully under podman. The main difference between the setup of the vgpu lane and the sriov lane is that the vgpu lane is a single-node kind cluster.
Maybe we can run the CI container locally and check? If we manage to reproduce it, that will show it is either indeed the CI config files or the fact that it is podman running in a container (and then it might be easier to debug if it also happens locally within the container).
We might move to https://github.com/k3d-io/k3d - I wonder if it would fix the problem out of the box (no need for SR-IOV in the POC, because the issue is related to the cluster and CI configuration). If so, that would add motivation to move to it.
It seems that at the time of the mentioned periodic job the SRIOV provider used kind 0.11.1, but podman Netavark backend support was only added in kind v0.15.0: https://github.com/kubernetes-sigs/kind/releases/tag/v0.15.0.
@brianmcarey could you please try again now that the provider is bumped and uses the latest version of KIND?
Thanks for trying this.
It has something to do with the new podman Netavark network backend; when the CNI backend is set it works well - at least that is what I saw on two other setups.
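For anyone who wants to test the CNI backend theory: podman's network backend is selected in containers.conf. A minimal sketch - the system-wide path is the common location, but verify it for your distro:

```toml
# /etc/containers/containers.conf (or ~/.config/containers/containers.conf)
# Switch podman from the newer netavark backend back to CNI.
# Note: podman remembers the backend it first used, so a
# `podman system reset` may be needed after changing this.
[network]
network_backend = "cni"
```

Comparing a run with this setting against a netavark run on the same host would confirm or rule out the backend as the culprit.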
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
/lifecycle rotten
/remove-lifecycle rotten
Installing iptables-legacy in the bootstrap environment allows the test lane to get past this panic, however all of the tests then fail.
https://prow.ci.kubevirt.io/view/gs/kubevirt-prow/logs/periodic-kubevirt-e2e-kind-1.23-sriov/1664578131740069888
There looks to be an issue with using kind with an nftables backend - https://kind.sigs.k8s.io/docs/user/known-issues/#firewalld - so I tried a test with iptables-legacy installed.
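For context, a sketch of what "iptables-legacy installed" could look like as a fragment of the bootstrap image build, following the workaround described on the kind known-issues page. The package manager, package name, and binary paths below are assumptions that depend on the actual base image:

```dockerfile
# Hypothetical bootstrap-image fragment: install the legacy (non-nftables)
# iptables backend and make it the default, per the kind known-issues
# workaround linked above. Adjust package/binary names to the base image.
RUN dnf install -y iptables-legacy && \
    alternatives --set iptables /usr/sbin/iptables-legacy && \
    alternatives --set ip6tables /usr/sbin/ip6tables-legacy
```

With this in place `iptables --version` should report the legacy backend rather than `(nf_tables)`.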
What about disabling firewalld and keeping nftables? From a glance it seems the tests still fail because the nodes aren't reachable correctly, right?
BTW, I understood from @0xFelix that podman will soon support passt - maybe it would be better to use it instead of slirp4netns? It might work better with the netavark backend we have on CI (locally, podman with kind SR-IOV does pass, so netavark is one of the suspects). https://www.reddit.com/r/podman/comments/11isru0/is_it_recommended_to_chose_pasta_over_slirp4netns/
Hi,
we are using this kube-proxy mode on kind:
mode: iptables
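In kind the kube-proxy mode is set through the cluster config; a minimal sketch using kind's v1alpha4 Cluster API (iptables is also kind's default, so this only makes the choice explicit):

```yaml
# kind cluster config selecting the kube-proxy mode mentioned above.
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  kubeProxyMode: "iptables"   # the alternative is "ipvs"
```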
https://github.com/kubernetes-sigs/kind/issues/3171 might be interesting - not sure.
Maybe we should open an issue on kind and use --retain to keep the nodes around, so we can ask for their help?
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
/lifecycle stale
Do we want to remove the lifecycle label, as this is not fixed yet?
/remove-lifecycle stale
/lifecycle frozen
The majority of prowjobs run successfully using the new podman based bootstrap image.
An issue occurs when running the sriov test lanes with the podman bootstrap: a panic occurs during the BeforeSuite and AfterSuite stages, which prevents the tests from running.
https://prow.ci.kubevirt.io/view/gs/kubevirt-prow/logs/periodic-kubevirt-e2e-k8s-1.22-sriov/1571793478654889984