kubevirt / kubevirtci

Contains cluster definitions and client tools to quickly spin up and destroy ephemeral and scalable k8s and ocp clusters for testing
Apache License 2.0

gocli: Add retry to creating and starting dnsmasq container #1260

Open brianmcarey opened 2 weeks ago

brianmcarey commented 2 weeks ago

What this PR does / why we need it:

A flake causes `make cluster-up` to fail early while creating the dnsmasq container because of a port collision [1]. It looks like a race condition and is very difficult to reproduce.

As this happens on every lane that uses the kubevirtci virtual cluster providers, the cumulative impact is quite high.

Adding 3 retries to creating and starting the dnsmasq container should keep `make cluster-up` from failing on this flake.

[1] https://search.ci.kubevirt.io/?search=bind%3A+address+already+in+use&maxAge=336h&context=1&type=build-log&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged): Fixes #

Special notes for your reviewer:

/cc @dhiller @xpivarc

Checklist

This checklist is not enforcing, but it's a reminder of items that could be relevant to every PR. Approvers are expected to review this list.

Release note:

kubevirt-bot commented 2 weeks ago

@brianmcarey: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| check-up-kind-1.27-vgpu | fe2713069132daa4d2aa1206e1eb783ca502ee09 | link | false | `/test check-up-kind-1.27-vgpu` |

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository. I understand the commands that are listed [here](https://go.k8s.io/bot-commands).

brianmcarey commented 2 weeks ago

Questions:

* How does a port collision happen if the port is chosen randomly? It sounds to me as if the same "random" port is assigned twice.

Between creating the container and starting it, the port could be taken by an outbound connection from the test pod that is unrelated to the container.
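The window described here can be reproduced in isolation. The following is a minimal sketch, not code from this PR: it assumes a Unix-like host, and it uses a second listener as a stand-in for the unrelated outbound connection. The function name `bindAfterSquatter` is hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"syscall"
)

// bindAfterSquatter reports whether a late bind fails with EADDRINUSE after
// another socket takes the port between "choose" and "bind".
func bindAfterSquatter() (bool, error) {
	// Step 1: let the kernel pick a free port, then release it again.
	// This mimics choosing a port first and binding it only later.
	probe, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return false, err
	}
	addr := probe.Addr().String()
	if err := probe.Close(); err != nil {
		return false, err
	}

	// Step 2: something unrelated grabs the same port inside the window.
	// (Stand-in: a listener instead of an outbound connection.)
	squatter, err := net.Listen("tcp", addr)
	if err != nil {
		return false, err
	}
	defer squatter.Close()

	// Step 3: the late bind now collides with the squatter.
	late, err := net.Listen("tcp", addr)
	if err == nil {
		late.Close()
		return false, nil
	}
	return errors.Is(err, syscall.EADDRINUSE), nil
}

func main() {
	inUse, err := bindAfterSquatter()
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	fmt.Println("late bind hit EADDRINUSE:", inUse)
}
```

In the sketch the collision is forced; in CI it depends on timing, which would be consistent with the flake being hard to reproduce.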

* How do we arrive at exactly two seconds for the waiting time?

I just wanted some time between removing the failed container and recreating it, to ensure it is fully removed. 2 seconds didn't seem like too much or too little.

kubevirt-bot commented 2 weeks ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dhiller

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

- ~~[OWNERS](https://github.com/kubevirt/kubevirtci/blob/main/OWNERS)~~ [dhiller]

Approvers can indicate their approval by writing `/approve` in a comment.
Approvers can cancel approval by writing `/approve cancel` in a comment.