tektoncd / results

Long term storage of execution results.
Apache License 2.0

do not lose flush error on server side update log #756

Closed · gabemontero closed this 4 months ago

gabemontero commented 4 months ago

Changes

/kind bug

This fixes an intermittent server side update log error I discussed with @sayan-biswas @avinal @khrm @enarha last month:

{"level":"error","ts":1713278748.4461074,"caller":"v1alpha2/logs.go:103","msg":"operation error S3: UploadPart, https response error StatusCode: 404, RequestID: 732R1N8N4J9RSB06, HostID: lsBFw/50Pfgee1X946YoNjrGdEfnafH1KmsVxQdqZXNGqNDuk2Vdka8vSm13Kx3h88Vbyq9HM7A=, api error NoSuchUpload: The specified upload does not exist. The upload ID may be invalid, or the upload may have been aborted or completed.","stacktrace":
"github.com/tektoncd/results/pkg/api/server/v1alpha2.(*Server).UpdateLog.func1\n
\t/opt/app-root/src/pkg/api/server/v1alpha2/logs.go:103\n
github.com/tektoncd/results/pkg/api/server/v1alpha2.(*Server).UpdateLog\n
\t/opt/app-root/src/pkg/api/server/v1alpha2/logs.go:156\n
github.com/tektoncd/results/proto/v1alpha2/results_go_proto._Logs_UpdateLog_Handler\n
\t/opt/app-root/src/proto/v1alpha2/results_go_proto/api_grpc.pb.go:686\n
github.com/grpc-ecosystem/go-grpc-middleware/v2/interceptors/recovery.StreamServerInterceptor.func1\n\t/opt/app-root/src/vendor/github.com/grpc-ecosystem/go-grpc-middleware/v2/interceptors/recovery/interceptors.go:48\n
github.com/grpc-ecosystem/go-grpc-middleware.ChainStreamServer.func1.1.1\n
\t/opt/app-root/src/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:49\n
github.com/grpc-ecosystem/go-grpc-prometheus.(*ServerMetrics).StreamServerInterceptor.func1\n
\t/opt/app-root/src/vendor/github.com/grpc-ecosystem/go-grpc-prometheus/server_metrics.go:121\n
github.com/grpc-ecosystem/go-grpc-middleware.ChainStreamServer.func1.1.1\n
\t/opt/app-root/src/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:49\n
github.com/grpc-ecosystem/go-grpc-middleware/auth.StreamServerInterceptor.func1\n
\t/opt/app-root/src/vendor/github.com/grpc-ecosystem/go-grpc-middleware/auth/auth.go:66\n
github.com/grpc-ecosystem/go-grpc-middleware.ChainStreamServer.func1.1.1\n
\t/opt/app-root/src/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:49
\ngithub.com/grpc-ecosystem/go-grpc-middleware/logging/zap.StreamServerInterceptor.func1\n
\t/opt/app-root/src/vendor/github.com/grpc-ecosystem/go-grpc-middleware/logging/zap/server_interceptors.go:53\n
github.com/grpc-ecosystem/go-grpc-middleware.ChainStreamServer.func1.1.1\n
\t/opt/app-root/src/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:49\n
github.com/grpc-ecosystem/go-grpc-middleware/tags.StreamServerInterceptor.func1\n
\t/opt/app-root/src/vendor/github.com/grpc-ecosystem/go-grpc-middleware/tags/interceptors.go:39\n
github.com/grpc-ecosystem/go-grpc-middleware.ChainStreamServer.func1.1.1\n
\t/opt/app-root/src/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:49\n
github.com/grpc-ecosystem/go-grpc-middleware.ChainStreamServer.func1\n
\t/opt/app-root/src/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:58\n
google.golang.org/grpc.(*Server).processStreamingRPC\n
\t/opt/app-root/src/vendor/google.golang.org/grpc/server.go:1673\n
google.golang.org/grpc.(*Server).handleStream\n
\t/opt/app-root/src/vendor/google.golang.org/grpc/server.go:1787\ngoogle.golang.org/grpc.(*Server).serveStreams.func2.1\n
\t/opt/app-root/src/vendor/google.golang.org/grpc/server.go:1016"}

If we get an error on the flush, return it to the client in case a retry is possible.
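
To make the intent concrete, here is a minimal Go sketch of the pattern: the flush error is surfaced to the caller instead of only being logged. The `flusher` interface and the `handleReturn` signature here are stand-ins for illustration, not the actual `UpdateLog` code in this repository.

```go
// Minimal sketch of the intended behavior; flusher and this handleReturn
// signature are hypothetical stand-ins, not the actual tektoncd/results code.
package logs

import "log"

// flusher stands in for whatever buffers the log data on the server side
// (in the real server, the stream writer backed by the S3 multipart upload).
type flusher interface {
	Flush() error
}

// handleReturn flushes the buffered log on the way out of the update.
// Previously a flush failure was only logged server side; here the error is
// also returned, so the gRPC client sees it and can retry if that is possible.
func handleReturn(f flusher) error {
	if err := f.Flush(); err != nil {
		log.Printf("failed to flush log stream: %v", err)
		return err
	}
	return nil
}
```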

Submitter Checklist

These are the criteria that every PR should meet; please check them off as you review them:

Release Notes

NONE
tekton-robot commented 4 months ago

The following is the coverage report on the affected files. Say `/test pull-tekton-results-go-coverage` to re-run this coverage report.

| File | Old Coverage | New Coverage | Delta |
| ---- | ------------ | ------------ | ----- |
| pkg/api/server/v1alpha2/logs.go | 68.4% | 57.1% | -11.3 |
enarha commented 4 months ago

It feels to me that we are masking the "real" error which got us to `handleReturn` in the first place, i.e. when we try to flush the log and fail. Maybe at least concatenate the error messages, like "got flushErr while handling otherError"?

gabemontero commented 4 months ago

> It feels to me that we are masking the "real" error which got us to `handleReturn` in the first place, i.e. when we try to flush the log and fail. Maybe at least concatenate the error messages, like "got flushErr while handling otherError"?

Yeah, I was wondering about that as well after I submitted the PR @enarha ... I'll look into doing this. Will post a comment when I've updated it.

gabemontero commented 4 months ago

> It feels to me that we are masking the "real" error which got us to `handleReturn` in the first place, i.e. when we try to flush the log and fail. Maybe at least concatenate the error messages, like "got flushErr while handling otherError"?

> Yeah, I was wondering about that as well after I submitted the PR @enarha ... I'll look into doing this. Will post a comment when I've updated it.

I've updated the PR @enarha to aggregate the errors ... PTAL
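
For illustration, one way such aggregation can look in Go is `errors.Join` from the standard library (Go 1.20+), so both the original stream error and the flush error survive in the status returned to the client. The helper name and signature below are hypothetical, not necessarily the exact code in this PR.

```go
// Hedged sketch of aggregating the original stream error with a flush
// error so neither is lost; finishUpdateLog is a hypothetical helper.
package logs

import (
	"errors"
	"fmt"
)

// finishUpdateLog combines the error that ended the UpdateLog stream with
// any error from flushing the buffered log before returning to the client.
func finishUpdateLog(streamErr, flushErr error) error {
	switch {
	case streamErr == nil:
		return flushErr
	case flushErr == nil:
		return streamErr
	default:
		// Both failed: keep both messages, roughly
		// "got <flushErr> while handling <streamErr>".
		return errors.Join(
			streamErr,
			fmt.Errorf("additionally failed to flush log: %w", flushErr),
		)
	}
}
```

Because both `errors.Join` and `%w` wrapping preserve the underlying errors, callers can still match either one with `errors.Is` or `errors.As`.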

tekton-robot commented 4 months ago

The following is the coverage report on the affected files. Say `/test pull-tekton-results-go-coverage` to re-run this coverage report.

| File | Old Coverage | New Coverage | Delta |
| ---- | ------------ | ------------ | ----- |
| pkg/api/server/v1alpha2/logs.go | 68.4% | 63.4% | -5.0 |
tekton-robot commented 4 months ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: enarha

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

- ~~[OWNERS](https://github.com/tektoncd/results/blob/main/OWNERS)~~ [enarha]

Approvers can indicate their approval by writing `/approve` in a comment. Approvers can cancel approval by writing `/approve cancel` in a comment.