tektoncd / results

Long term storage of execution results.
Apache License 2.0
77 stars 73 forks source link

move log storage off reconciler thread (again) #759

Closed gabemontero closed 3 months ago

gabemontero commented 3 months ago

Changes

There was a great disparity between performance once we applied sufficient concurrency on production scale systems with log storage on the reconciler thread.

After a couple of weeks of analysis between myself and @pmacik, the primary scaling bottleneck was at the golang http level, specifically the http2 Body.Read(), at https://github.com/tektoncd/results/blob/ca1ca870efebf9e122684134a9d38862c0be2375/vendor/google.golang.org/grpc/internal/transport/handler_server.go#L420

For the time being then, that has forced us to move log storage back off of the log storage thread.

We will circle back to retry on recoverable errors once the Log status changes are in.

Details on the changes in this PR include:

/kind bug

@khrm @enarha PTAL

@sayan-biswas @avinal fyi / ptal if you all have cycles

Submitter Checklist

These are the criteria that every PR should meet, please check them off as you review them:

Release Notes

Prior attempts to fix the watcher memory leak when log storage is enabled (originally issue #695) placed log storage on the dynamic reconciler to facilitate retry on retryable errors.  However, production level testing uncovered underlying golang/http2 performance issues which made this untenable.

This change moves log storage back off the dynamic reconciler thread, but still addresses the memory leak, as well as intermittent 'canceled context' errors during log storage through the results api server.

Two new command line arguments are enabled on the watch to facilitate timeouts on log storage:
1) dynamic_reconcile_timeout: How long the Watcher waits for the dynamic reconciler to complete before it aborts
2) update_log_timeout: How log the Watcher waits for the UpdateLog operation for storing logs to complete before it aborts.
tekton-robot commented 3 months ago

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/api/server/v1alpha2/logs.go 63.4% 63.2% -0.2
pkg/watcher/reconciler/dynamic/dynamic.go 55.3% 53.0% -2.3
tekton-robot commented 3 months ago

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/api/server/v1alpha2/logs.go 63.4% 63.2% -0.2
pkg/watcher/reconciler/dynamic/dynamic.go 55.3% 54.5% -0.8
tekton-robot commented 3 months ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: avinal

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files: - ~~[OWNERS](https://github.com/tektoncd/results/blob/main/OWNERS)~~ [avinal] Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment