replicatedhq / troubleshoot

Preflight Checks and Support Bundles Framework for Kubernetes Applications
https://troubleshoot.sh
Apache License 2.0

RunPod collector improvements #1177

Open banjoh opened 1 year ago

banjoh commented 1 year ago

Describe suggested improvements

✔ ~/repos/replicated/troubleshoot [pr/1172 L|✚ 1…2⚑ 3]
[evans] $ ./bin/support-bundle spec.yaml -v0 && kubectl get pods

 * failed to run collector: run-pod/my-counter: failed to run pod: failed to create pod: object is being deleted: pods "my-counter" already exists
 Collecting support bundle ⠏ cluster-info
support-bundle-2023-05-23T17_48_39.tar.gz

============ Collectors summary =============
Suceeded (S), eXcluded (X), Failed (F)
=============================================
run-pod/my-counter (S) : 4,373ms
cluster-resources (S)  : 2,128ms
run-pod/my-counter (F) : 3ms
cluster-info (S)       : 2ms

============ Redactors summary =============
In-cluster collectors : 598ms

============= Analyzers summary =============
Suceeded (S), eXcluded (X), Failed (F)
=============================================
No analyzers executed

Duration: 7,133ms
NAME         READY   STATUS        RESTARTS     AGE
my-counter   1/1     Terminating   1 (1s ago)   5s

Possible improvement to add

Setting the grace period to zero to "force" delete the pod immediately might do the trick here. I made this change here. We'd need to consider what side effects this has on the CRI, because a "force" deletion deletes the pod object and instructs the CRI to stop the containers. These should be normal operations, but it's worth calling out for when this gets addressed.

    defer func() {
        // Delete with a zero grace period so the pod object is removed
        // immediately and its name can be reused by the next run.
        g := int64(0)
        if err := client.CoreV1().Pods(pod.Namespace).Delete(context.Background(), pod.Name, metav1.DeleteOptions{
            GracePeriodSeconds: &g,
        }); err != nil {
            klog.Errorf("Failed to delete pod %s: %v", pod.Name, err)
        }
    }()

Results

[evans] $ ./bin/support-bundle spec.yaml -v0 && kubectl get pods

 * failed to run collector: run-pod/my-counter: timeout
 Collecting support bundle ⠴ my-counter
support-bundle-2023-05-23T17_40_09.tar.gz

============ Collectors summary =============
Suceeded (S), eXcluded (X), Failed (F)
=============================================
run-pod/my-counter (F) : 5,063ms
cluster-resources (S)  : 2,131ms
cluster-info (S)       : 2ms

============ Redactors summary =============
In-cluster collectors : 511ms

============= Analyzers summary =============
Suceeded (S), eXcluded (X), Failed (F)
=============================================
No analyzers executed

Duration: 7,735ms
No resources found in default namespace.

Additional context

This issue spawned from the conversation in https://github.com/replicatedhq/troubleshoot/pull/1172#issuecomment-1559272764

DexterYan commented 1 year ago

Hey @banjoh,

I would like to seek your advice about this PR: https://github.com/replicatedhq/troubleshoot/pull/1194

I have taken a different approach. In my PR, I check and wait for the pod to be fully deleted instead of deleting it twice with the grace period set to 0, and it passes your example spec.

However, I wait a maximum of 2 minutes for the pod to be fully deleted. If it takes longer than 2 minutes, I just throw an error and suggest that the pod may have an issue terminating. Do you think it would be better to use force delete instead of throwing an error?
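
For illustration only, a minimal sketch of such a wait loop using client-go; this is not the code from PR #1194, and the helper name waitForPodDeleted, the package name, and the 1s poll interval are assumptions:

    // Minimal illustrative sketch (not the code from PR #1194): poll until the
    // run-pod collector's pod object is fully deleted, capped at 2 minutes.
    package collect

    import (
        "context"
        "time"

        kerrors "k8s.io/apimachinery/pkg/api/errors"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/util/wait"
        "k8s.io/client-go/kubernetes"
    )

    // waitForPodDeleted is a hypothetical helper name.
    func waitForPodDeleted(ctx context.Context, client kubernetes.Interface, namespace, name string) error {
        return wait.PollImmediate(time.Second, 2*time.Minute, func() (bool, error) {
            _, err := client.CoreV1().Pods(namespace).Get(ctx, name, metav1.GetOptions{})
            if kerrors.IsNotFound(err) {
                return true, nil // pod object is gone; its name can be reused safely
            }
            if err != nil {
                return false, err // unexpected API error, stop waiting
            }
            return false, nil // pod still exists (likely Terminating), keep polling
        })
    }

If the 2 minute cap is exceeded, the caller can then either surface the "may have an issue terminating" error or fall back to a force delete with a zero grace period, which is exactly the question above.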

banjoh commented 1 year ago

Waiting for graceful shutdown is a better approach. My only concern is how we ensure we honor the collector's timeout duration. We document that it maxes out at 30s. In a scenario where a pod runs for 20s, that leaves 10s for cleanup before the "framework" stops the collector and continues executing the next one.

In addition to completing (by timing out or finishing cleanly), the collector needs to ensure it cleans up after itself before the next one runs, otherwise we start seeing issues like this one.
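
To make that budgeting concrete, a rough, hypothetical sketch; runWithCleanupBudget is not an actual troubleshoot API, just an illustration of one 30s budget covering both the pod run and its cleanup:

    // Hypothetical illustration only: a single 30s collector budget is shared
    // between running the pod and deleting it before the next collector starts.
    package collect

    import (
        "context"
        "time"
    )

    func runWithCleanupBudget(parent context.Context, run, cleanup func(context.Context) error) error {
        ctx, cancel := context.WithTimeout(parent, 30*time.Second)
        defer cancel()

        runErr := run(ctx) // a pod that runs for 20s leaves ~10s of this budget

        // Spend whatever is left of the 30s on cleanup. If the deadline has
        // already been hit, give cleanup a short fallback window so the pod is
        // still removed before the next collector runs.
        cleanupCtx := ctx
        if ctx.Err() != nil {
            var cancelCleanup context.CancelFunc
            cleanupCtx, cancelCleanup = context.WithTimeout(context.Background(), 5*time.Second)
            defer cancelCleanup()
        }
        if err := cleanup(cleanupCtx); err != nil {
            return err
        }
        return runErr
    }

Whether a fallback cleanup window like that is acceptable, or whether the collector should simply fail, is the trade-off being discussed here.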

banjoh commented 1 year ago


My comments in PR https://github.com/replicatedhq/troubleshoot/pull/1194 supersede my comments here.

xavpaice commented 1 year ago

@banjoh was this resolved by #1196?

banjoh commented 1 year ago

https://github.com/replicatedhq/troubleshoot/pull/1196 addressed an issue in the copy_from_host collector, which launches a daemonset, runs a command to copy files from the host, and exits. This issue is for the run_pod collector, which launches a pod but has no control over what commands will be run, hence bullet one in the description.