vmware-tanzu / sonobuoy

Sonobuoy is a diagnostic tool that makes it easier to understand the state of a Kubernetes cluster by running a set of Kubernetes conformance tests and other plugins in an accessible and non-destructive manner.
https://sonobuoy.io
Apache License 2.0

If a plugin pod's host is terminated, aggregator gets stuck #1978

Open jpdstan opened 2 months ago

jpdstan commented 2 months ago

What steps did you take and what happened: tl;dr: if the sonobuoy aggregator and a sonobuoy plugin pod are running on separate hosts, and the plugin pod's host dies, then the aggregator gets stuck logging the following message and keeps retrying until it times out:

time="2024-06-28T21:49:39Z" level=error msg="could not find pod created by plugin my-plugin-test, will retry: no pods were created by plugin my-plugin-test"

Steps to reproduce:
  1. Run sonobuoy run with a test that takes more than a few minutes to finish.
  2. Wait for the sonobuoy pod to create the plugin pod (e.g. sonobuoy-my-plugin-test).
  3. Force delete the node that sonobuoy-my-plugin-test is running on. It MUST be a different node from the one running the sonobuoy pod.
  4. Check the logs of the sonobuoy pod.

What did you expect to happen: It would be good if sonobuoy re-created the plugin pods. Perhaps we could add a timeout to this check and re-create the pods once it expires.

Alternatively, the parent caller of sonobuoy run could do the retry, but I'm wondering whether there's a better way to handle this in sonobuoy itself.

Anything else you would like to add:

Environment: