tektoncd / results

Long term storage of execution results.
Apache License 2.0
77 stars 73 forks source link

add timeout to dynamic Reconcile; switch from CloseSend to CloseRecv in streamLogs; add goroutine dump on timeouts/deadlocks #725

Closed gabemontero closed 5 months ago

gabemontero commented 6 months ago

Changes

So we ran with the change from #712 in our production system, and while we confirmed the results logging memory leak was addresed, after about 12 to 13 hours, our reconciler threads systematically became deadlocked, and our watcher quit processing events.

We as of yet have not been able to get a goroutine dump with stack traces when this problem occurs, so we are unclear whether the #712 fixes have had a direct cause to the deadlock, or if another issue was encountered. Among other things our api server container restarted during the watcher deadlock, where the previous pod logs gave no clear indicationas to why.

This change pulls a couple of potentially helpful bits to help either diagnose or work around the deadlock: 1) we have added a timeout to the context used in the dynamic Reconcile method, in case a blockage in any RPC call using a context somehow was causing the problem 2) we also employ the combination of cancelling the context on method exit, to again unblock things, as well as the switch to CloseAndRecv instead of CloseSend to confirm the UpdateLog finished, so that our canceling of the streamLog context does not intermittenly cancel an UpdateLog call that would have otherwise succeeded. 3) we are analyzing how a context is released, and if it is from a timeout and not cancel, we initiate a goroutine dump with stack traces 4) using of a context with timeout that is canceled on exit from the reconcile method require no longer running 'sendLogs' on a separate goroutine, otherwise we re-introduced intermitent cancelling of 'UpdateLog' processing before it could complete. 5) we now log the dealines for UpdateLog on the api server side

rh-pre-commit.version: 2.2.0 rh-pre-commit.check-secrets: ENABLED

/kind bug

Submitter Checklist

These are the criteria that every PR should meet, please check them off as you review them:

Release Notes

Address deadlocked reconciler threads potentially hung while streaming logs
tekton-robot commented 6 months ago

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 64.2% 63.9% -0.3
gabemontero commented 6 months ago

OK https://github.com/tektoncd/results/pull/725/commits/4cc4cab884d948928125ef7021213bb2a82aa2ee passed the e2e's

Next will in a separate commit attempt a more expansive use of context with timeout around all grpc calls from the dynamic reconciler.

tekton-robot commented 6 months ago

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 64.2% 64.1% -0.1
gabemontero commented 6 months ago

ok @sayan-biswas I have 2 forms of a fix and/or work around for the results watcher hang we saw in our prod env today

the first commit just covers updating logs to the results api server

the second commit handles all grpc calls to the api server coming out of the dynamic reconciler

after some code review, it centers on the notion that we were blocked on an grpc call to the api server, where the api server may have been in a bad state, given I noticed an api restart around the time I recycled the watcher pod

So basically,

  1. by setting a timeout on the context used for all grpc calls out of the dynamic reconciler, we timeout and exit after 5 minutes if any of those calls are hung
  2. the second commit, in order to have 1 context cover both the log grpc and non-log gprc, has to move the call to streamLog back on the reconciler thread, as cancelling/cleaning up the parent context could inadvertently abort the UpdateLog call
  3. also employed the CloseAndRecv improvement on the UpdateLog call to again coordinate with cleaning up the context
tekton-robot commented 6 months ago

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 64.2% 62.4% -1.8
tekton-robot commented 6 months ago

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 64.2% 61.1% -3.1
tekton-robot commented 6 months ago

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 64.2% 60.4% -3.8
gabemontero commented 6 months ago

ok @sayan-biswas I now have a third commit that deciphers if the context or any specific GRPC calls exist because of deadline exceeded, and if so, dumps a goroutine with stack traces list to stdout for analysis.

So, with our hand on Friday, either the tkn client threading ended in a deadlock, or more likely, we had a grpc call to the api server hang (perhaps because the api server became messed up, which might have occurred since it restarted on its own).

In either case, we should get a thread dump in the log that we can diagnose.

tekton-robot commented 6 months ago

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 64.2% 58.9% -5.3
tekton-robot commented 6 months ago

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 64.2% 58.9% -5.3
tekton-robot commented 6 months ago

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 64.2% 58.5% -5.7
tekton-robot commented 6 months ago

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/api/server/v1alpha2/logs.go 68.3% 68.4% 0.1
pkg/watcher/reconciler/dynamic/dynamic.go 64.2% 58.5% -5.7
gabemontero commented 6 months ago

e2e's green with timeout full propagated, no go func for sendLogs

will update configuration of timeout now

tekton-robot commented 6 months ago

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
cmd/watcher/main.go Do not exist 0.0%
pkg/api/server/v1alpha2/logs.go 68.3% 68.4% 0.1
pkg/watcher/reconciler/dynamic/dynamic.go 64.2% 56.5% -7.7
gabemontero commented 6 months ago

@sayan-biswas @khrm @avinal @enarha this PR is ready for review, iterate, merge

tekton-robot commented 6 months ago

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/api/server/v1alpha2/logs.go 68.3% 68.4% 0.1
pkg/watcher/reconciler/dynamic/dynamic.go 64.2% 55.7% -8.5
gabemontero commented 6 months ago

@khrm @sayan-biswas this is ready for review again .... once we are good, I can squash as needed .. thanks

tekton-robot commented 6 months ago

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/api/server/v1alpha2/logs.go 68.3% 68.4% 0.1
pkg/watcher/reconciler/dynamic/dynamic.go 64.2% 55.3% -8.9
sayan-biswas commented 5 months ago

/approve

tekton-robot commented 5 months ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: khrm, sayan-biswas

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files: - ~~[OWNERS](https://github.com/tektoncd/results/blob/main/OWNERS)~~ [khrm,sayan-biswas] Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment
sayan-biswas commented 5 months ago

@khrm @sayan-biswas this is ready for review again .... once we are good, I can squash as needed .. thanks

Changes are fine. Please go ahead and squash.

gabemontero commented 5 months ago

thanks @sayan-biswas @khrm commits are squashed

tekton-robot commented 5 months ago

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/api/server/v1alpha2/logs.go 68.3% 68.4% 0.1
pkg/watcher/reconciler/dynamic/dynamic.go 64.2% 55.3% -8.9
sayan-biswas commented 5 months ago

/lgtm