tektoncd / results

Long term storage of execution results.

Individual log is not retrievable via the name returned in the log list #514

Open iainsproat opened 1 year ago

iainsproat commented 1 year ago

Expected Behavior

curl -k https://speckle.internal:8088/apis/results.tekton.dev/v1alpha2/parents/bfeb4b5d45/results/47373980-cbcb-4b57-8746-40cc5fd75649/logs

This results in the following as expected:

{"records":[{"name":"bfeb4b5d45/results/47373980-cbcb-4b57-8746-40cc5fd75649/logs/77a71268-1422-39f4-8b37-8e5264c6f2ae", "id":"58f21208-d056-4506-a1c8-124b3e34c830", "uid":"58f21208-d056-4506-a1c8-124b3e34c830", "data":{"type":"results.tekton.dev/v1alpha2.Log", "value":"base64encodedlogs"}, "etag":"58f21208-d056-4506-a1c8-124b3e34c830-1687799787144731088", "createdTime":"2023-06-26T17:16:27.144731Z", "createTime":"2023-06-26T17:16:27.144731Z", "updatedTime":"2023-06-26T17:16:27.144731Z", "updateTime":"2023-06-26T17:16:27.144731Z"}], "nextPageToken":""}`

I expect that using the .records[0].name would return the individual log record that can be seen in the list, e.g.:

curl -k https://speckle.internal:8088/apis/results.tekton.dev/v1alpha2/parents/bfeb4b5d45/results/47373980-cbcb-4b57-8746-40cc5fd75649/logs/77a71268-1422-39f4-8b37-8e5264c6f2ae

I may be doing something incorrect here, but it seems to match the API documentation.

What I expect to have returned is the following:

{"name":"bfeb4b5d45/results/47373980-cbcb-4b57-8746-40cc5fd75649/logs/77a71268-1422-39f4-8b37-8e5264c6f2ae", "id":"58f21208-d056-4506-a1c8-124b3e34c830", "uid":"58f21208-d056-4506-a1c8-124b3e34c830", "data":{"type":"results.tekton.dev/v1alpha2.Log", "value":"base64encodedlogs"}

Actual Behavior

Retrieving the .records[0].name and attempting to access the resource results in the log record not being found.

curl -k https://speckle.internal:8088/apis/results.tekton.dev/v1alpha2/parents/bfeb4b5d45/results/47373980-cbcb-4b57-8746-40cc5fd75649/logs/77a71268-1422-39f4-8b37-8e5264c6f2ae
{"error":{"code":5,"message":"Log doesn't exist","details":[]}}

Steps to Reproduce the Problem

Steps are detailed above; the shell sketch below condenses them.
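A minimal reproduction sketch, assuming jq is installed and using the same host and result IDs as the requests above; on an affected installation the second call returns the "Log doesn't exist" error shown above.

# List the log records for the result and capture the name of the first record
NAME=$(curl -sk https://speckle.internal:8088/apis/results.tekton.dev/v1alpha2/parents/bfeb4b5d45/results/47373980-cbcb-4b57-8746-40cc5fd75649/logs | jq -r '.records[0].name')

# Attempt to fetch that individual log record via the name returned in the list
curl -k "https://speckle.internal:8088/apis/results.tekton.dev/v1alpha2/parents/${NAME}"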

Additional Info

iainsproat commented 1 year ago

The API pod logs, in case this helps:

{"level":"error","ts":1687800526.2499058,"caller":"v1alpha2/logs.go:61","msg":"no logs exist for bfeb4b5d45/results/47373980-cbcb-4b57-8746-40cc5fd75649/logs/77a71268-1422-39f4-8b37-8e5264c6f2ae","stacktrace":"github.com/tektoncd/results/pkg/api/server/v1alpha2.(*Server).GetL
og\n\tgithub.com/tektoncd/results/pkg/api/server/v1alpha2/logs.go:61\ngithub.com/tektoncd/results/proto/v1alpha2/results_go_proto._Logs_GetLog_Handler\n\tgithub.com/tektoncd/results/proto/v1alpha2/results_go_proto/api_grpc.pb.go:579\ngithub.com/grpc-ecosystem/go-grpc-promethe
us.(*ServerMetrics).StreamServerInterceptor.func1\n\tgithub.com/grpc-ecosystem/go-grpc-prometheus@v1.2.0/server_metrics.go:121\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainStreamServer.func1.1.1\n\tgithub.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:49\ngithub.
com/grpc-ecosystem/go-grpc-middleware/auth.StreamServerInterceptor.func1\n\tgithub.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/auth/auth.go:66\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainStreamServer.func1.1.1\n\tgithub.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/cha
in.go:49\ngithub.com/grpc-ecosystem/go-grpc-middleware/logging/zap.StreamServerInterceptor.func1\n\tgithub.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/logging/zap/server_interceptors.go:53\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainStreamServer.func1.1.1\n\tgithub.co
m/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:49\ngithub.com/grpc-ecosystem/go-grpc-middleware/tags.StreamServerInterceptor.func1\n\tgithub.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/tags/interceptors.go:39\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainStreamServ
er.func1.1.1\n\tgithub.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:49\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainStreamServer.func1\n\tgithub.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:58\ngoogle.golang.org/grpc.(*Server).processStreamingRPC\n\tgo
ogle.golang.org/grpc@v1.53.0/server.go:1627\ngoogle.golang.org/grpc.(*Server).handleStream\n\tgoogle.golang.org/grpc@v1.53.0/server.go:1708\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\tgoogle.golang.org/grpc@v1.53.0/server.go:965"}
sayan-biswas commented 1 year ago

@iainsproat Which log type did you set in the config, File or S3? Can you share the log message from the UpdateLog method?
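(For anyone following along: a hedged sketch of how both of those could be checked, assuming the default tekton-pipelines namespace and the default tekton-results-api-config ConfigMap and tekton-results-api deployment names; adjust to your installation.)

# Show the log-related and S3-related entries in the API server config
kubectl -n tekton-pipelines get configmap tekton-results-api-config -o yaml | grep -iE 'logs|s3'

# Look for messages emitted by the UpdateLog handler in the API server logs
kubectl -n tekton-pipelines logs deployment/tekton-results-api | grep -i updatelog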

sayan-biswas commented 1 year ago

Also, logs are not enabled by default. Although the logs API can be enabled from the config, separate configuration is required: for the File type a PVC needs to be deployed, and for the S3 type the rest of the S3-related parameters in the config need to be set.
There is a recently merged PR which makes this easier through kustomize: https://github.com/tektoncd/results/blob/main/config/README.md#deploying-with-a-kustomization
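To illustrate what that separate configuration looks like for the S3 type, here is a sketch of env-style API config entries; the key names are assumptions based on the config README and may differ between releases, so verify them against the ConfigMap shipped with your version:

# Illustrative entries only; confirm the exact key names for your release
LOGS_API=true
LOGS_TYPE=S3
S3_BUCKET_NAME=tekton-logs
S3_ENDPOINT=https://s3.example.com
S3_REGION=us-east-1
S3_ACCESS_KEY_ID=<access-key>
S3_SECRET_ACCESS_KEY=<secret-key>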

iainsproat commented 1 year ago

Thanks for responding @sayan-biswas

Your suggestion about looking at the logs for the UpdateLog method pointed me in the right direction. I can see what appears to be a race condition in the Tekton Results Watcher between its deletion of a completed TaskRun and its uploading of that TaskRun's logs.

I am running the Tekton Results Watcher with the following flag: -completed_run_grace_period -1s. The negative value immediately 'garbage collects' completed TaskRuns & PipelineRuns, as suggested by the documentation.

In the Tekton Watcher logs I can see the garbage collection occurring successfully:

{"level":"info","time":"2023-06-27T10:21:24.301Z","logger":"watcher","caller":"dynamic/dynamic.go:216","msg":"Deleting object","commit":"26df57c-dirty","knative.dev/traceid":"e55448ea-474f-449f-8f5e-6b1800c70bb5","knative.dev/key":"789b32327d/hello-task-run","results.tekton.dev/kind":"TaskRun","results.tekton.dev/result":"789b32327d/results/24f8d7c5-0caa-4999-bd4c-52ffedcf6f43","results.tekton.dev/record":"789b32327d/results/24f8d7c5-0caa-4999-bd4c-52ffedcf6f43/records/24f8d7c5-0caa-4999-bd4c-52ffedcf6f43","results.tekton.dev/gracePeriod":"-1s","results.tekton.dev/uid":"24f8d7c5-0caa-4999-bd4c-52ffedcf6f43","results.tekton.dev/time-taken-seconds":0}

However, some time later the Tekton Watcher logs the following failure:

{"level":"error","time":"2023-06-27T10:21:24.320Z","logger":"watcher","caller":"dynamic/dynamic.go:319","msg":"Error streaming log","commit":"26df57c-dirty","knative.dev/traceid":"e55448ea-474f-449f-8f5e-6b1800c70bb5","knative.dev/key":"789b32327d/hello-task-run","results.tekton.dev/kind":"TaskRun","namespace":"789b32327d","kind":"TaskRun","name":"hello-task-run","error":"error reading from tkn reader: Unable to get TaskRun: taskruns.tekton.dev \"hello-task-run\" not found","stacktrace":"github.com/tektoncd/results/pkg/watcher/reconciler/dynamic.(*Reconciler).sendLog.func1\n\tgithub.com/tektoncd/results/pkg/watcher/reconciler/dynamic/dynamic.go:319"}

As this is an intermittent issue rather than a consistent one, I am confident that S3 is configured correctly. I can query the blob storage provider directly and see that the data for other Tekton TaskRuns and PipelineRuns (which presumably won the race condition!) has been stored and can be queried via the API.

My expectation is that garbage collection would wait until all logs are confirmed as successfully stored before deleting the completed TaskRun or PipelineRun.

In the meantime, as a workaround, I'll increase the value of the -completed_run_grace_period flag to a larger positive number to allow sufficient time for the logging to complete prior to garbage collection.
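A sketch of that workaround, assuming the watcher runs as the tekton-results-watcher deployment in the tekton-pipelines namespace and that the flag is passed via the container args; adjust the names to your install:

# Check the current watcher arguments for the grace-period flag
kubectl -n tekton-pipelines get deployment tekton-results-watcher -o jsonpath='{.spec.template.spec.containers[0].args}'

# Edit the deployment and change -completed_run_grace_period from -1s to a positive duration, e.g. 10m
kubectl -n tekton-pipelines edit deployment tekton-results-watcher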

I've attached the entire log output for both the API and the Watcher. During this logged period I started Tekton Results, applied a Task, and then applied a TaskRun. tekton-watcher-logs.txt tekton-api-logs.txt

sayan-biswas commented 1 year ago

@iainsproat Yes, this happens because the log streaming from the watcher to the API server only starts after the PipelineRun/TaskRun moves to a completed or failed status. If the pod gets deleted by aggressive pruning (such as immediate deletion), it creates a race condition. I'm not sure what can be done about fetching the logs, but the pruning conditions can be modified to wait for the streaming to complete even if the grace period is set to -1s.

sayan-biswas commented 1 year ago

/assign

tekton-robot commented 1 year ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale with a justification. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close with a justification. If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

tekton-robot commented 1 year ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten with a justification. Rotten issues close after an additional 30d of inactivity. If this issue is safe to close now please do so with /close with a justification. If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

ramessesii2 commented 9 months ago

/assign