Closed: gabemontero closed this 5 months ago
The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage
to re-run this coverage report
File | Old Coverage | New Coverage | Delta |
---|---|---|---|
pkg/watcher/reconciler/dynamic/dynamic.go | 64.2% | 63.9% | -0.3 |
OK, https://github.com/tektoncd/results/pull/725/commits/4cc4cab884d948928125ef7021213bb2a82aa2ee passed the e2e tests.
Next, in a separate commit, I will attempt a more expansive use of a context with timeout around all gRPC calls from the dynamic reconciler.
The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage
to re-run this coverage report
File | Old Coverage | New Coverage | Delta |
---|---|---|---|
pkg/watcher/reconciler/dynamic/dynamic.go | 64.2% | 64.1% | -0.1 |
OK @sayan-biswas, I have two forms of a fix and/or workaround for the Results watcher hang we saw in our prod env today.
The first commit just covers updating logs to the Results API server.
The second commit handles all gRPC calls to the API server coming out of the dynamic reconciler.
After some code review, it centers on the notion that we were blocked on a gRPC call to the API server, where the API server may have been in a bad state, given that I noticed an API server restart around the time I recycled the watcher pod.
So basically, this is a CloseAndRecv improvement on the UpdateLog call to again coordinate with cleaning up the context.
The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage
to re-run this coverage report
File | Old Coverage | New Coverage | Delta |
---|---|---|---|
pkg/watcher/reconciler/dynamic/dynamic.go | 64.2% | 62.4% | -1.8 |
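The CloseSend vs CloseAndRecv distinction behind that change can be sketched with a fake client stream. The type and method behavior below are illustrative stand-ins for the generated client-streaming stub, not the real Tekton Results API:

```go
package main

import "fmt"

// fakeStream mimics a generated client-streaming stub such as the one used
// for UpdateLog (names and fields are illustrative).
type fakeStream struct {
	sent int
}

func (s *fakeStream) Send(chunk []byte) error {
	s.sent += len(chunk)
	return nil
}

// CloseSend half-closes the stream and returns immediately; the server may
// still be processing, so cancelling the context right after can abort an
// UpdateLog that would otherwise have succeeded.
func (s *fakeStream) CloseSend() error { return nil }

// CloseAndRecv half-closes and then blocks for the server's summary
// response, so the caller knows the UpdateLog completed before it cancels
// the context.
func (s *fakeStream) CloseAndRecv() (string, error) {
	return fmt.Sprintf("received %d bytes", s.sent), nil
}

func main() {
	s := &fakeStream{}
	s.Send([]byte("log line\n"))
	summary, err := s.CloseAndRecv()
	fmt.Println(summary, err) // received 9 bytes <nil>
}
```

Switching the caller to CloseAndRecv makes the stream teardown synchronous, which is what lets the deferred context cancel run safely.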
The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage
to re-run this coverage report
File | Old Coverage | New Coverage | Delta |
---|---|---|---|
pkg/watcher/reconciler/dynamic/dynamic.go | 64.2% | 61.1% | -3.1 |
The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage
to re-run this coverage report
File | Old Coverage | New Coverage | Delta |
---|---|---|---|
pkg/watcher/reconciler/dynamic/dynamic.go | 64.2% | 60.4% | -3.8 |
OK @sayan-biswas, I now have a third commit that detects whether the context or any specific gRPC call exited because of a deadline exceeded error, and if so, dumps a goroutine list with stack traces to stdout for analysis.
So, with our hang on Friday, either the tkn client threading ended in a deadlock, or, more likely, we had a gRPC call to the API server hang (perhaps because the API server became messed up, which might have occurred since it restarted on its own).
In either case, we should now get a thread dump in the log that we can diagnose.
The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage
to re-run this coverage report
File | Old Coverage | New Coverage | Delta |
---|---|---|---|
pkg/watcher/reconciler/dynamic/dynamic.go | 64.2% | 58.9% | -5.3 |
The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage
to re-run this coverage report
File | Old Coverage | New Coverage | Delta |
---|---|---|---|
pkg/watcher/reconciler/dynamic/dynamic.go | 64.2% | 58.5% | -5.7 |
The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage
to re-run this coverage report
File | Old Coverage | New Coverage | Delta |
---|---|---|---|
pkg/api/server/v1alpha2/logs.go | 68.3% | 68.4% | 0.1 |
pkg/watcher/reconciler/dynamic/dynamic.go | 64.2% | 58.5% | -5.7 |
The e2e tests are green with the timeout fully propagated and no separate goroutine for sendLogs.
I will update the configuration of the timeout now.
The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage
to re-run this coverage report
File | Old Coverage | New Coverage | Delta |
---|---|---|---|
cmd/watcher/main.go | Does not exist | 0.0% | |
pkg/api/server/v1alpha2/logs.go | 68.3% | 68.4% | 0.1 |
pkg/watcher/reconciler/dynamic/dynamic.go | 64.2% | 56.5% | -7.7 |
@sayan-biswas @khrm @avinal @enarha this PR is ready for review, iterate, and merge.
The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage
to re-run this coverage report
File | Old Coverage | New Coverage | Delta |
---|---|---|---|
pkg/api/server/v1alpha2/logs.go | 68.3% | 68.4% | 0.1 |
pkg/watcher/reconciler/dynamic/dynamic.go | 64.2% | 55.7% | -8.5 |
@khrm @sayan-biswas this is ready for review again. Once we are good, I can squash as needed. Thanks.
The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage
to re-run this coverage report
File | Old Coverage | New Coverage | Delta |
---|---|---|---|
pkg/api/server/v1alpha2/logs.go | 68.3% | 68.4% | 0.1 |
pkg/watcher/reconciler/dynamic/dynamic.go | 64.2% | 55.3% | -8.9 |
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: khrm, sayan-biswas
The full list of commands accepted by this bot can be found here.
The pull request process is described here
@khrm @sayan-biswas this is ready for review again. Once we are good, I can squash as needed. Thanks.
Changes are fine. Please go ahead and squash.
Thanks @sayan-biswas @khrm, the commits are squashed.
The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage
to re-run this coverage report
File | Old Coverage | New Coverage | Delta |
---|---|---|---|
pkg/api/server/v1alpha2/logs.go | 68.3% | 68.4% | 0.1 |
pkg/watcher/reconciler/dynamic/dynamic.go | 64.2% | 55.3% | -8.9 |
/lgtm
Changes
So we ran with the change from #712 in our production system, and while we confirmed the Results logging memory leak was addressed, after about 12 to 13 hours our reconciler threads systematically became deadlocked, and our watcher quit processing events.
We have not yet been able to get a goroutine dump with stack traces when this problem occurs, so we are unclear whether the #712 fixes directly caused the deadlock, or whether another issue was encountered. Among other things, our API server container restarted during the watcher deadlock, where the previous pod logs gave no clear indication as to why.
This change pulls in a couple of potentially helpful bits to either diagnose or work around the deadlock:
1) We have added a timeout to the context used in the dynamic Reconcile method, in case a blockage in any RPC call using a context was somehow causing the problem.
2) We also cancel the context on method exit, to again unblock things, and switch to CloseAndRecv instead of CloseSend to confirm that UpdateLog finished, so that our cancelling of the streamLog context does not intermittently cancel an UpdateLog call that would otherwise have succeeded.
3) We analyze how a context is released, and if it is from a timeout and not a cancel, we initiate a goroutine dump with stack traces.
4) Using a context with timeout that is cancelled on exit from the Reconcile method requires no longer running sendLogs on a separate goroutine; otherwise we would re-introduce intermittent cancelling of UpdateLog processing before it could complete.
5) We now log the deadlines for UpdateLog on the API server side.
/kind bug
Submitter Checklist
These are the criteria that every PR should meet, please check them off as you review them:
/kind <type>. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep.
Release Notes