replicatedhq / troubleshoot

Preflight Checks and Support Bundles Framework for Kubernetes Applications
https://troubleshoot.sh
Apache License 2.0

RunPod collector improvements #1177

Open banjoh opened 1 year ago

banjoh commented 1 year ago

Describe suggested improvements

✔ ~/repos/replicated/troubleshoot [pr/1172 L|✚ 1…2⚑ 3]
[evans] $ ./bin/support-bundle spec.yaml -v0 && kubectl get pods

 * failed to run collector: run-pod/my-counter: failed to run pod: failed to create pod: object is being deleted: pods "my-counter" already exists
 Collecting support bundle ⠏ cluster-info
support-bundle-2023-05-23T17_48_39.tar.gz

============ Collectors summary =============
Suceeded (S), eXcluded (X), Failed (F)
=============================================
run-pod/my-counter (S) : 4,373ms
cluster-resources (S)  : 2,128ms
run-pod/my-counter (F) : 3ms
cluster-info (S)       : 2ms

============ Redactors summary =============
In-cluster collectors : 598ms

============= Analyzers summary =============
Suceeded (S), eXcluded (X), Failed (F)
=============================================
No analyzers executed

Duration: 7,133ms
NAME         READY   STATUS        RESTARTS     AGE
my-counter   1/1     Terminating   1 (1s ago)   5s

Possible improvement to add

Setting the grace period to zero to "force" delete the pod immediately might do the trick here. I made this change here. We'd need to consider what side effects this has on the CRI, because a "force" deletion deletes the pod object and instructs the CRI to stop the containers. These should be normal operations, but it's worth calling out for when this gets addressed.

    defer func() {
        // Delete with a zero grace period so the pod object is removed
        // immediately and its name can be reused by the next run.
        g := int64(0)
        if err := client.CoreV1().Pods(pod.Namespace).Delete(context.Background(), pod.Name, metav1.DeleteOptions{
            GracePeriodSeconds: &g,
        }); err != nil {
            klog.Errorf("Failed to delete pod %s: %v", pod.Name, err)
        }
    }()

Results

[evans] $ ./bin/support-bundle spec.yaml -v0 && kubectl get pods

 * failed to run collector: run-pod/my-counter: timeout
 Collecting support bundle ⠴ my-counter
support-bundle-2023-05-23T17_40_09.tar.gz

============ Collectors summary =============
Suceeded (S), eXcluded (X), Failed (F)
=============================================
run-pod/my-counter (F) : 5,063ms
cluster-resources (S)  : 2,131ms
cluster-info (S)       : 2ms

============ Redactors summary =============
In-cluster collectors : 511ms

============= Analyzers summary =============
Suceeded (S), eXcluded (X), Failed (F)
=============================================
No analyzers executed

Duration: 7,735ms
No resources found in default namespace.

Additional context

This issue spawned from the conversation in https://github.com/replicatedhq/troubleshoot/pull/1172#issuecomment-1559272764

DexterYan commented 1 year ago

Hey @banjoh,

I would like to seek your advice about this PR: https://github.com/replicatedhq/troubleshoot/pull/1194

I have taken a different approach. In my PR, I check and wait for the pod to be fully deleted instead of deleting it twice with the grace period set to 0, and it passes your example spec.

However, I wait a maximum of 2 minutes for the pod to be fully deleted. If it takes longer than 2 minutes, I just throw an error and suggest that the pod may have an issue terminating. Do you think it would be better to use force delete instead of throwing an error?
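
For illustration only, a minimal sketch of such a wait loop using client-go; this is not the code from PR #1194, and the helper name waitForPodDeleted, the package name, and the 1s poll interval are assumptions:

    // Minimal illustrative sketch (not the code from PR #1194): poll until the
    // run-pod collector's pod object is fully deleted, capped at 2 minutes.
    package collect

    import (
        "context"
        "time"

        kerrors "k8s.io/apimachinery/pkg/api/errors"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/util/wait"
        "k8s.io/client-go/kubernetes"
    )

    // waitForPodDeleted is a hypothetical helper name.
    func waitForPodDeleted(ctx context.Context, client kubernetes.Interface, namespace, name string) error {
        return wait.PollImmediate(time.Second, 2*time.Minute, func() (bool, error) {
            _, err := client.CoreV1().Pods(namespace).Get(ctx, name, metav1.GetOptions{})
            if kerrors.IsNotFound(err) {
                return true, nil // pod object is gone; its name can be reused safely
            }
            if err != nil {
                return false, err // unexpected API error, stop waiting
            }
            return false, nil // pod still exists (likely Terminating), keep polling
        })
    }

If the 2 minute cap is exceeded, the caller can then either surface the "may have an issue terminating" error or fall back to a force delete with a zero grace period, which is exactly the question above.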

banjoh commented 1 year ago

Waiting for graceful shutdown is a better approach. My only concern is how we ensure we honor the collector's timeout duration. We document that it maxes out at 30s. In a scenario where a pod runs for 20s, that leaves 10s for cleanup before the "framework" stops the collector and continues executing the next one.

In addition to completing (by timing out or finishing cleanly), the collector needs to ensure it cleans up after itself before the next one runs, otherwise we start seeing issues like this one.
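
To make that budgeting concrete, a rough, hypothetical sketch; runWithCleanupBudget is not an actual troubleshoot API, just an illustration of one 30s budget covering both the pod run and its cleanup:

    // Hypothetical illustration only: a single 30s collector budget is shared
    // between running the pod and deleting it before the next collector starts.
    package collect

    import (
        "context"
        "time"
    )

    func runWithCleanupBudget(parent context.Context, run, cleanup func(context.Context) error) error {
        ctx, cancel := context.WithTimeout(parent, 30*time.Second)
        defer cancel()

        runErr := run(ctx) // a pod that runs for 20s leaves ~10s of this budget

        // Spend whatever is left of the 30s on cleanup. If the deadline has
        // already been hit, give cleanup a short fallback window so the pod is
        // still removed before the next collector runs.
        cleanupCtx := ctx
        if ctx.Err() != nil {
            var cancelCleanup context.CancelFunc
            cleanupCtx, cancelCleanup = context.WithTimeout(context.Background(), 5*time.Second)
            defer cancelCleanup()
        }
        if err := cleanup(cleanupCtx); err != nil {
            return err
        }
        return runErr
    }

Whether a fallback cleanup window like that is acceptable, or whether the collector should simply fail, is the trade-off being discussed here.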

banjoh commented 1 year ago


My comments in PR https://github.com/replicatedhq/troubleshoot/pull/1194 supersede my comments here.

xavpaice commented 1 year ago

@banjoh was this resolved by #1196?

banjoh commented 1 year ago

https://github.com/replicatedhq/troubleshoot/pull/1196 addressed an issue in the copy_from_host collector, which launches a daemonset, runs a command to copy files from the host, and exits. This issue is for the run_pod collector, which launches a pod but has no control over what commands will be run, hence bullet one in the description.