kubevirt / project-infra


Test suite panic when running SRIOV test lanes with podman based bootstrap image #2335

Open brianmcarey opened 1 year ago

brianmcarey commented 1 year ago

The majority of prowjobs run successfully using the new podman based bootstrap image.

An issue occurs when running the SRIOV test lanes with the podman bootstrap. A panic occurs during the BeforeSuite and AfterSuite stages, which prevents the tests from running.

Panic tests/tests_suite_test.go:95
Test Panicked
tests/util/util.go:20

https://prow.ci.kubevirt.io/view/gs/kubevirt-prow/logs/periodic-kubevirt-e2e-k8s-1.22-sriov/1571793478654889984

brianmcarey commented 1 year ago

/assign

brianmcarey commented 1 year ago

It looks like there is a timeout calling the kubevirt operator webhook:

10:50:10:   Test Panicked
10:50:10:   In [SynchronizedBeforeSuite] at: tests/util/util.go:20
10:50:10: 
10:50:10:   Internal error occurred: failed calling webhook "kubevirt-update-validator.kubevirt.io": Post "https://kubevirt-operator-webhook.kubevirt.svc:443/kubevirt-validate-update?timeout=10s": context deadline exceeded

I tried to reproduce this issue locally using the kind-1.23 provider and was unable to hit the issue - it may be due to things taking a little bit longer on the CI clusters.
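For reference, a minimal sketch of such a local reproduction, assuming the usual kubevirtci conventions (the provider name and make targets here are assumptions, not the exact commands used):

# Hypothetical reproduction: select the SRIOV kind provider and run the suite
export KUBEVIRT_PROVIDER=kind-1.23-sriov
make cluster-up      # bring up the kind cluster under podman
make cluster-sync    # deploy KubeVirt into it
make functest        # run the e2e suite that panics on CI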

It has also been seen on the ARM cluster, which is not carrying any load at the moment: https://github.com/kubevirt/project-infra/pull/2351#issuecomment-1301764102

brianmcarey commented 1 year ago

Increasing the webhook timeout did not resolve the issue -

Internal error occurred: failed calling webhook "kubevirt-update-validator.kubevirt.io": Post "https://kubevirt-operator-webhook.kubevirt.svc:443/kubevirt-validate-update?timeout=20s": context deadline exceeded

https://prow.ci.kubevirt.io/view/gs/kubevirt-prow/pr-logs/pull/kubevirt_kubevirt/8735/pull-kubevirt-e2e-kind-1.22-sriov/1590278800854224896#1:build-log.txt%3A3873
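For reference, the timeout in the request URL is driven by timeoutSeconds on the webhook configuration; a hedged sketch of inspecting and raising it (the configuration name and webhook index below are assumptions, check them first):

# Find the configuration carrying kubevirt-update-validator.kubevirt.io
kubectl get validatingwebhookconfigurations -o name
# Hypothetical patch raising the timeout on the first webhook entry (max is 30s)
kubectl patch validatingwebhookconfiguration <config-name> --type=json \
  -p='[{"op":"replace","path":"/webhooks/0/timeoutSeconds","value":30}]'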

The vgpu lane, which is also a kind-based lane, runs successfully under podman - the main difference between the setup of the vgpu lane and the sriov lane is that the vgpu lane is a single-node kind cluster.
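A minimal sketch of that difference in cluster shape, assuming a stock kind config (the real SRIOV provider also wires up SR-IOV devices, which is not shown here):

# Hypothetical multi-node config approximating the sriov lane's cluster shape;
# the vgpu lane is effectively the default single-node cluster.
cat <<EOF | kind create cluster --config=-
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
- role: worker
EOF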

oshoval commented 1 year ago

Maybe we can run the CI container locally and check? If we can reproduce it there, that would show whether it is really the CI config files or the fact that it is podman inside a container (and it might then be easier to debug, since it would also happen locally within the container).
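A sketch of that, assuming the bootstrap image can be started interactively (the image reference below is hypothetical; the real one is defined in project-infra):

# Run the CI bootstrap image locally; --privileged so nested podman/kind can work
podman run --rm -it --privileged quay.io/kubevirtci/bootstrap:latest /bin/bash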

oshoval commented 1 year ago

We might move to https://github.com/k3d-io/k3d - I wonder if it would fix the problem out of the box (SRIOV is not needed for the POC because the issue is related to the cluster and CI configuration). If so, it would strengthen the motivation to move to it.

ormergi commented 1 year ago

It seems that at the time of the mentioned periodic job the SRIOV provider used KIND 0.11.1, but podman Netavark backend support was only added in version 0.15.0: https://github.com/kubernetes-sigs/kind/releases/tag/v0.15.0
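For reference, the backend podman is actually using can be checked like this (the field is available from podman 4.x):

# Prints "netavark" or "cni"
podman info --format '{{.Host.NetworkBackend}}'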

@brianmcarey could you please try again now that the provider is bumped and uses the latest version of KIND?

oshoval commented 1 year ago

Let's try https://github.com/kubevirt/project-infra/pull/2571

oshoval commented 1 year ago

Doesn't seem to solve it: https://prow.ci.kubevirt.io/view/gs/kubevirt-prow/pr-logs/pull/kubevirt_project-infra/2571/rehearsal-check-up-kind-1.23-sriov/1618525187903328256

ormergi commented 1 year ago

Thanks for trying this.

It seems to have something to do with the new podman Netavark network backend; when the CNI backend is set it works well, at least that is what I saw on two other setups.
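A hedged sketch of pinning the CNI backend, assuming podman 4.x where the backend is selected in containers.conf (note that switching backends generally requires resetting podman's state):

# Append a [network] section selecting the legacy CNI backend
cat >> /etc/containers/containers.conf <<'EOF'
[network]
network_backend = "cni"
EOF
# A backend switch requires wiping existing podman state
podman system reset --force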

kubevirt-bot commented 1 year ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

kubevirt-bot commented 1 year ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle rotten

brianmcarey commented 1 year ago

/remove-lifecycle rotten

brianmcarey commented 1 year ago

Installing iptables-legacy in the bootstrap environment allows the test lane to get past this panic, however all of the tests fail: https://prow.ci.kubevirt.io/view/gs/kubevirt-prow/logs/periodic-kubevirt-e2e-kind-1.23-sriov/1664578131740069888

There looks to be an issue with using kind with an nftables backend (https://kind.sigs.k8s.io/docs/user/known-issues/#firewalld), so I tried a test with iptables-legacy installed.
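A sketch of that switch, assuming a distribution that uses the alternatives mechanism (paths and package names vary; Fedora-based images use the alternatives tool instead):

# Hypothetical: point the iptables commands at the legacy backend
update-alternatives --set iptables /usr/sbin/iptables-legacy
update-alternatives --set ip6tables /usr/sbin/ip6tables-legacy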

oshoval commented 1 year ago

What about disabling firewalld and keeping nftables? From a glance it seems the tests still fail because the nodes aren't reachable correctly, right?

By the way, I understood from @0xFelix that podman will soon support passt - maybe it would be better to use it instead of slirp4netns? It might work better with the Netavark backend we have on CI (locally, podman with kind SR-IOV does pass, so Netavark is one of the suspects). https://www.reddit.com/r/podman/comments/11isru0/is_it_recommended_to_chose_pasta_over_slirp4netns/
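A sketch of opting into passt/pasta once the installed podman supports it (the containers.conf key below assumes podman >= 4.4):

# Select pasta instead of slirp4netns for rootless networking
cat >> /etc/containers/containers.conf <<'EOF'
[network]
default_rootless_network_cmd = "pasta"
EOF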

oshoval commented 1 year ago

Hi, we are using mode: iptables (the kube-proxy mode) on kind.
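That setting corresponds to the kubeProxyMode field in the kind cluster config; a minimal sketch:

# kind cluster config pinning the kube-proxy mode
cat <<EOF | kind create cluster --config=-
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  kubeProxyMode: "iptables"
EOF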

https://github.com/kubernetes-sigs/kind/issues/3171 might be interesting? Not sure.

Maybe we should open an issue on kind and use --retain to ask for their help?
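For reference, --retain keeps the node containers around after a failure so logs can be collected for such a report:

# Keep the node containers even if cluster creation fails
kind create cluster --retain
# Collect logs from the retained nodes to attach to a kind issue
kind export logs ./kind-logs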

kubevirt-bot commented 10 months ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

oshoval commented 10 months ago

Do we want to remove the lifecycle label, as this is not fixed yet?

brianmcarey commented 10 months ago

/remove-lifecycle stale

brianmcarey commented 10 months ago

/lifecycle frozen