iainsproat opened this issue 1 year ago
The API pod logs, in case this helps:
{"level":"error","ts":1687800526.2499058,"caller":"v1alpha2/logs.go:61","msg":"no logs exist for bfeb4b5d45/results/47373980-cbcb-4b57-8746-40cc5fd75649/logs/77a71268-1422-39f4-8b37-8e5264c6f2ae","stacktrace":"github.com/tektoncd/results/pkg/api/server/v1alpha2.(*Server).GetLog\n\tgithub.com/tektoncd/results/pkg/api/server/v1alpha2/logs.go:61\ngithub.com/tektoncd/results/proto/v1alpha2/results_go_proto._Logs_GetLog_Handler\n\tgithub.com/tektoncd/results/proto/v1alpha2/results_go_proto/api_grpc.pb.go:579\ngithub.com/grpc-ecosystem/go-grpc-prometheus.(*ServerMetrics).StreamServerInterceptor.func1\n\tgithub.com/grpc-ecosystem/go-grpc-prometheus@v1.2.0/server_metrics.go:121\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainStreamServer.func1.1.1\n\tgithub.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:49\ngithub.com/grpc-ecosystem/go-grpc-middleware/auth.StreamServerInterceptor.func1\n\tgithub.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/auth/auth.go:66\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainStreamServer.func1.1.1\n\tgithub.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:49\ngithub.com/grpc-ecosystem/go-grpc-middleware/logging/zap.StreamServerInterceptor.func1\n\tgithub.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/logging/zap/server_interceptors.go:53\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainStreamServer.func1.1.1\n\tgithub.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:49\ngithub.com/grpc-ecosystem/go-grpc-middleware/tags.StreamServerInterceptor.func1\n\tgithub.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/tags/interceptors.go:39\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainStreamServer.func1.1.1\n\tgithub.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:49\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainStreamServer.func1\n\tgithub.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:58\ngoogle.golang.org/grpc.(*Server).processStreamingRPC\n\tgoogle.golang.org/grpc@v1.53.0/server.go:1627\ngoogle.golang.org/grpc.(*Server).handleStream\n\tgoogle.golang.org/grpc@v1.53.0/server.go:1708\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\tgoogle.golang.org/grpc@v1.53.0/server.go:965"}
@iainsproat Which log type did you set in the config, File or S3?
Can you share the log message from the UpdateLog method?
Also, logs are not enabled by default. Although the API can be enabled from the config, separate configuration is required: for the File type a PVC needs to be deployed, and for the S3 type the remaining S3-related parameters in the config need to be set.
A recently merged PR makes this easier through kustomize:
https://github.com/tektoncd/results/blob/main/config/README.md#deploying-with-a-kustomization
Thanks for responding @sayan-biswas
Your suggestion to look at the logs for the UpdateLog method pointed me in the right direction. There appears to be a race condition in the Tekton Results Watcher between its deletion of a completed TaskRun and its uploading of that TaskRun's logs.
I am running the Tekton Results Watcher with the flag -completed_run_grace_period -1s. The negative value immediately 'garbage collects' completed TaskRuns and PipelineRuns, as suggested by the documentation.
In the Tekton Watcher logs I can see the garbage collection occurring successfully:
{"level":"info","time":"2023-06-27T10:21:24.301Z","logger":"watcher","caller":"dynamic/dynamic.go:216","msg":"Deleting object","commit":"26df57c-dirty","knative.dev/traceid":"e55448ea-474f-449f-8f5e-6b1800c70bb5","knative.dev/key":"789b32327d/hello-task-run","results.tekton.dev/kind":"TaskRun","results.tekton.dev/result":"789b32327d/results/24f8d7c5-0caa-4999-bd4c-52ffedcf6f43","results.tekton.dev/record":"789b32327d/results/24f8d7c5-0caa-4999-bd4c-52ffedcf6f43/records/24f8d7c5-0caa-4999-bd4c-52ffedcf6f43","results.tekton.dev/gracePeriod":"-1s","results.tekton.dev/uid":"24f8d7c5-0caa-4999-bd4c-52ffedcf6f43","results.tekton.dev/time-taken-seconds":0}
However, some time later the Tekton Watcher logs the following failure:
{"level":"error","time":"2023-06-27T10:21:24.320Z","logger":"watcher","caller":"dynamic/dynamic.go:319","msg":"Error streaming log","commit":"26df57c-dirty","knative.dev/traceid":"e55448ea-474f-449f-8f5e-6b1800c70bb5","knative.dev/key":"789b32327d/hello-task-run","results.tekton.dev/kind":"TaskRun","namespace":"789b32327d","kind":"TaskRun","name":"hello-task-run","error":"error reading from tkn reader: Unable to get TaskRun: taskruns.tekton.dev \"hello-task-run\" not found","stacktrace":"github.com/tektoncd/results/pkg/watcher/reconciler/dynamic.(*Reconciler).sendLog.func1\n\tgithub.com/tektoncd/results/pkg/watcher/reconciler/dynamic/dynamic.go:319"}
As the issue is intermittent rather than consistent, I am confident that S3 is configured correctly. I can query the blob storage provider directly and see that the data for other Tekton TaskRuns and PipelineRuns (which presumably won the race!) has been stored and can be queried via the API.
My expectation is that garbage collection of completed TaskRuns/PipelineRuns would wait until all logs are confirmed as successfully stored before deleting the completed TaskRun or PipelineRun.
In the meantime, as a workaround, I'll increase the value of the -completed_run_grace_period flag to a larger positive duration to allow sufficient time for log streaming to complete prior to garbage collection.
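To illustrate why this helps, here is a minimal sketch of the general idea of a completed-run grace period; it is not the Watcher's actual code, just the timing logic it implies: deletion becomes possible once the completion time plus the grace period has passed, so a negative value makes a run deletable immediately, before the log stream has been opened.

```go
package main

import (
	"fmt"
	"time"
)

// eligibleForDeletion sketches how a completed-run grace period is
// typically evaluated: a run may be deleted once its completion time
// plus the grace period has passed. Illustration only, not the Tekton
// Results Watcher's actual implementation.
func eligibleForDeletion(completionTime time.Time, gracePeriod time.Duration, now time.Time) bool {
	return now.After(completionTime.Add(gracePeriod))
}

func main() {
	completed := time.Now()

	// -1s: the deadline is already in the past, so the run is deletable
	// immediately -- before log streaming has had a chance to finish.
	fmt.Println(eligibleForDeletion(completed, -1*time.Second, time.Now())) // true

	// 10m: deletion is deferred, leaving a window for log streaming.
	fmt.Println(eligibleForDeletion(completed, 10*time.Minute, time.Now())) // false
}
```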
I've attached the entire log output for both the API and the Watcher. During this logged period I started Tekton Results, applied a Task, and then applied a TaskRun. tekton-watcher-logs.txt tekton-api-logs.txt
@iainsproat Yes, this happens because the log streaming from the watcher to the API server only starts after the PipelineRun/TaskRun moves to completed or failed status. If the pod gets deleted by aggressive pruning, such as pruning immediately, it creates a race condition.
Not sure what can be done about fetching the logs, but the pruning conditions could be modified to wait for the streaming to complete even when the grace period is set to -1s.
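A purely hypothetical sketch of that suggestion follows; the logsDone channel, the deleteRun helper, and the timeout are invented for illustration and are not the Watcher's real API. The idea is simply that pruning of a completed run blocks until the log stream has finished (or a timeout expires), even with a -1s grace period.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// pruneAfterLogs illustrates gating deletion of a completed run on log
// streaming: deletion only proceeds once the log stream signals completion,
// or after a timeout so pruning can never be blocked forever. None of these
// names come from the tektoncd/results codebase.
func pruneAfterLogs(ctx context.Context, logsDone <-chan struct{}, deleteRun func(context.Context) error) error {
	select {
	case <-logsDone:
		// Logs were streamed and stored; safe to prune the run.
	case <-time.After(5 * time.Minute):
		fmt.Println("warning: pruning without confirmed log upload")
	case <-ctx.Done():
		return ctx.Err()
	}
	return deleteRun(ctx)
}

func main() {
	logsDone := make(chan struct{})
	go func() {
		// Pretend the log stream finishes shortly after the run completes.
		time.Sleep(100 * time.Millisecond)
		close(logsDone)
	}()

	err := pruneAfterLogs(context.Background(), logsDone, func(context.Context) error {
		fmt.Println("deleting completed TaskRun")
		return nil
	})
	fmt.Println("prune result:", err)
}
```

Where such a gate would actually live in the Watcher's reconciler is a separate question.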
/assign
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.
/lifecycle stale
Send feedback to tektoncd/plumbing.
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.
/lifecycle rotten
Send feedback to tektoncd/plumbing.
/assign
Expected Behavior
This results in the following, as expected:
I expect that using the .records[0].name would return the individual log record that can be seen in the list, e.g.:
I may be doing something incorrect here, but it seems to match the API documentation.
What I expect to have returned is the following:
Actual Behavior
Retrieving the .records[0].name and attempting to access the resource results in the log record not being found.
Steps to Reproduce the Problem
Steps detailed above
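For reference, a hedged sketch of the kind of lookups described above, assuming the Results REST proxy is served under /apis/results.tekton.dev/v1alpha2/parents/ (verify against the Results API docs); the host, token, namespace, and UIDs below are placeholders, not the values from this report.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
)

const (
	// Placeholder values -- substitute your own host, token, and IDs.
	host   = "https://tekton-results.example.com"
	token  = "REDACTED"
	prefix = "/apis/results.tekton.dev/v1alpha2/parents" // assumed REST proxy prefix
)

// get performs an authenticated GET against the Results API and returns the body.
func get(path string) (string, error) {
	req, err := http.NewRequest(http.MethodGet, host+prefix+path, nil)
	if err != nil {
		return "", err
	}
	req.Header.Set("Authorization", "Bearer "+token)
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	// 1. List the records under a result; .records[0].name comes from this response.
	records, err := get("/<namespace>/results/<result-uid>/records")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(records)

	// 2. Fetch the log record by name -- the call that intermittently
	//    returns "no logs exist for ..." in this issue.
	logBody, err := get("/<namespace>/results/<result-uid>/logs/<log-uid>")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(logBody)
}
```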
Additional Info
Kubernetes version:
Output of kubectl version:
Tekton Pipeline version: