Open banjoh opened 1 year ago
Hey @banjoh,
I would like to seek your advice about this PR: https://github.com/replicatedhq/troubleshoot/pull/1194
I have taken a different approach. In my PR, I check and wait for the pod to be fully deleted, instead of deleting it twice with the grace period set to 0. With this change, your example spec passes.
However, I wait a maximum of 2 minutes for the pod to be fully deleted. If it takes longer than 2 minutes, I throw an error suggesting that the pod may have an issue terminating. Do you think it would be better to force delete instead of throwing an error?
Waiting for graceful shutdown is a better approach. My only concern here is how we will ensure we honor the collector's timeout duration. We document that it maxes out at 30s. In a scenario where a pod runs for 20s, it will be left with 10s for cleanup before the "framework" stops the collector and continues executing the next collector.
In addition to completing (whether by timing out or finishing cleanly), a collector needs to ensure it cleans up after itself before the next one runs, else we start seeing issues like this.
My comments in PR https://github.com/replicatedhq/troubleshoot/pull/1194 supersede the comments here.
@banjoh was this resolved by #1196?
https://github.com/replicatedhq/troubleshoot/pull/1196 addressed an issue in the copy_from_host collector, which launches a daemon set, runs a command to copy files from the host, and exits. This issue is for the run_pod collector, which launches a pod but has no control over what commands will be run, hence bullet one in the description.
Describe suggested improvements
Possible improvement to add
Modifying the grace period to "force" delete the pod immediately might do the trick here. I made this change here. We'd need to consider what side effects this has on the CRI, because a "force" deletion deletes the pod object and instructs the CRI to stop containers. These should be normal operations, but they are worth calling out for when this gets addressed.
Results
Additional context
This issue spawned from https://github.com/replicatedhq/troubleshoot/pull/1172#issuecomment-1559272764 conversation