gabemontero opened 2 months ago
Revitalizing https://github.com/tektoncd/results/pull/715 would bypass the tkn client issues noted at https://github.com/tektoncd/results/blob/c34e40d11f244d35f9de5721dcd9f5efb68f6212/pkg/watcher/reconciler/dynamic/dynamic.go#L512 and allow us to indicate in the stored logs that there were no pods to dump.
Expected Behavior
A TaskRun that never leaves Pending state, and whose underlying pod therefore never starts, should have this fact made clear in log storage.
Actual Behavior
No such information is stored.
Steps to Reproduce the Problem
Additional Info
With https://github.com/tektoncd/results/pull/699 we fixed the general case where, after a timeout/cancel occurred, we would still go on to fetch/store the underlying pod logs.
However, in systems with quotas or severe node pressure at the k8s level, TaskRuns can stay stuck in Pending and any created Pods will never get started.
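The stuck-Pending case described above can be detected from the TaskRun status itself: if the run reached a terminal condition but never recorded an associated pod, no pod was ever started. A minimal sketch of that check, using simplified stand-in types rather than the real Tekton API objects (the field and reason names here are assumptions for illustration):

```go
package main

import "fmt"

// taskRunStatus is a simplified stand-in for the handful of fields
// on the real Tekton TaskRunStatus that matter for this check.
type taskRunStatus struct {
	PodName         string // empty if no pod was ever associated with the TaskRun
	ConditionReason string // e.g. "Pending", "TaskRunTimeout", "TaskRunCancelled"
}

// neverLeftPending reports whether the TaskRun went from Pending
// straight to a terminal state without a pod ever being created,
// i.e. there are no container logs to fetch at all.
func neverLeftPending(s taskRunStatus) bool {
	terminal := s.ConditionReason == "TaskRunTimeout" ||
		s.ConditionReason == "TaskRunCancelled"
	return terminal && s.PodName == ""
}

func main() {
	stuck := taskRunStatus{ConditionReason: "TaskRunTimeout"}
	fmt.Println(neverLeftPending(stuck)) // timed out with no pod: nothing to stream
}
```

In the attached PR/TR reproducer, the annotations show exactly this shape: a terminal reason with no pod ever associated.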
The comments at https://github.com/tektoncd/results/blob/c34e40d11f244d35f9de5721dcd9f5efb68f6212/pkg/watcher/reconciler/dynamic/dynamic.go#L512 describe the prior observation that tkn makes distinguishing between error types difficult, and thus errors from tkn while getting logs are ignored.
That is proving unusable for users who may not have access to view events, pods, or etcd entities in general before the attempt to store logs occurs, after which the PipelineRun/TaskRun are potentially pruned from etcd.
Before exiting, the streamLogs code needs to confirm whether any underlying pods exist for the TaskRun, and if not, store any helpful debug info in what is sent to the GRPCUpdateLog call and/or direct S3 storage. I'll also attach a PR/TR which was timed out/cancelled where the TaskRun never left Pending state.
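The proposed behavior can be sketched as follows: before streamLogs returns, check whether any pods back the TaskRun, and if not, produce a debug message to hand to whatever log sink is configured (GRPCUpdateLog or S3). The function and types below are illustrative stand-ins, not the real Tekton Results or k8s API:

```go
package main

import "fmt"

// debugMessageForMissingPods models the check this issue proposes:
// given the pods found for a TaskRun (e.g. via a label-selector list),
// return a helpful message to store when there are none, so users
// without access to events/pods still get an explanation in log storage.
func debugMessageForMissingPods(taskRunName, conditionReason string, podNames []string) (string, bool) {
	if len(podNames) > 0 {
		// Pods exist; stream their container logs as usual.
		return "", false
	}
	msg := fmt.Sprintf(
		"no pods were ever created/started for TaskRun %q (last condition reason: %s); no container logs are available",
		taskRunName, conditionReason)
	return msg, true
}

func main() {
	msg, store := debugMessageForMissingPods("build-task", "TaskRunTimeout", nil)
	fmt.Println(store, msg)
}
```

The point of returning a message rather than silently skipping is that the stored record survives even after the PipelineRun/TaskRun are pruned from etcd.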
You'll see from the annotations that they go from Pending straight to a terminal state, meaning a pod never got associated.
pr-tr.zip
@khrm @sayan-biswas @avinal @enarha FYI / PTAL / WDYT