gabemontero closed this pull request 7 months ago.
The following is the coverage report on the affected files.
Say `/test pull-tekton-results-go-coverage` to re-run this coverage report
File | Old Coverage | New Coverage | Delta |
---|---|---|---|
pkg/watcher/reconciler/dynamic/dynamic.go | 69.3% | 65.4% | -3.9 |
The following is the coverage report on the affected files.
Say `/test pull-tekton-results-go-coverage` to re-run this coverage report
File | Old Coverage | New Coverage | Delta |
---|---|---|---|
pkg/watcher/reconciler/dynamic/dynamic.go | 69.3% | 65.0% | -4.3 |
The following is the coverage report on the affected files.
Say `/test pull-tekton-results-go-coverage` to re-run this coverage report
File | Old Coverage | New Coverage | Delta |
---|---|---|---|
pkg/watcher/reconciler/dynamic/dynamic.go | 69.3% | 62.6% | -6.8 |
The following is the coverage report on the affected files.
Say `/test pull-tekton-results-go-coverage` to re-run this coverage report
File | Old Coverage | New Coverage | Delta |
---|---|---|---|
pkg/watcher/reconciler/dynamic/dynamic.go | 69.3% | 65.0% | -4.3 |
/retest
The following is the coverage report on the affected files.
Say `/test pull-tekton-results-go-coverage` to re-run this coverage report
File | Old Coverage | New Coverage | Delta |
---|---|---|---|
pkg/watcher/reconciler/dynamic/dynamic.go | 69.3% | 63.9% | -5.5 |
hey @sayan-biswas @avinal @ramessesii2 @enarha I'm going to need your help with the error I'm seeing in the integration tests here.
Turns out the `UpdateLog` call is not completing successfully. When I dump the API server logs, I see a couple of `context canceled` messages:
```
{"level":"info","ts":1708029631.050909,"caller":"zap/options.go:212","msg":"finished streaming call with code Canceled","grpc.auth_disabled":true,"grpc.start_time":"2024-02-15T20:40:30Z","system":"grpc","span.kind":"server","grpc.service":"tekton.results.v1alpha2.Logs","grpc.method":"UpdateLog","peer.address":"10.244.1.14:34252","grpc.user":"system:serviceaccount:tekton-pipelines:tekton-results-watcher","grpc.issuer":"https://kubernetes.default.svc.cluster.local","error":"rpc error: code = Canceled desc = context canceled","grpc.code":"Canceled","grpc.time_duration_in_ms":97}
```
and with my latest commit, I also get
```
{"level":"error","ts":1708032696.9937341,"caller":"zap/options.go:212","msg":"finished streaming call with code Unknown","grpc.auth_disabled":true,"grpc.start_time":"2024-02-15T21:31:36Z","system":"grpc","span.kind":"server","grpc.service":"tekton.results.v1alpha2.Logs","grpc.method":"UpdateLog","peer.address":"10.244.1.14:54414","grpc.user":"system:serviceaccount:tekton-pipelines:tekton-results-watcher","grpc.issuer":"https://kubernetes.default.svc.cluster.local","error":"context canceled","grpc.code":"Unknown","grpc.time_duration_in_ms":48,"stacktrace":"github.com/grpc-ecosystem/go-grpc-middleware/logging/zap.DefaultMessageProducer\n\tgithub.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/logging/zap/options.go:212\ngithub.com/grpc-ecosystem/go-grpc-middleware/logging/zap.StreamServerInterceptor.func1\n\tgithub.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/logging/zap/server_interceptors.go:61\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainStreamServer.func1.1.1\n\tgithub.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:49\ngithub.com/grpc-ecosystem/go-grpc-middleware/tags.StreamServerInterceptor.func1\n\tgithub.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/tags/interceptors.go:39\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainStreamServer.func1.1.1\n\tgithub.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:49\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainStreamServer.func1\n\tgithub.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:58\ngoogle.golang.org/grpc.(*Server).processStreamingRPC\n\tgoogle.golang.org/grpc@v1.60.1/server.go:1673\ngoogle.golang.org/grpc.(*Server).handleStream\n\tgoogle.golang.org/grpc@v1.60.1/server.go:1787\ngoogle.golang.org/grpc.(*Server).serveStreams.func2.1\n\tgoogle.golang.org/grpc@v1.60.1/server.go:1016"}
```
Now, I first thought it was related to the child context we originally put in the fix, but I've made a couple of successive commits, first removing the call to the cancel function and then removing the use of the child context, and neither has had any effect. I still get this canceled-context error on `UpdateLog`.
I did add debug logging in the goroutine and confirmed that the `Flush` call returns no error, with positive counts for the data being flushed:
```
{"level":"info","time":"2024-02-15T21:14:33.290Z","logger":"watcher","caller":"dynamic/dynamic.go:432","msg":"GGMGGM1 flush ret count","commit":"d7fe8ee","knative.dev/traceid":"65b9ff5a-eed4-415b-9b87-65df76f0cabb","knative.dev/key":"default/hello-gcs-hello","results.tekton.dev/kind":"TaskRun","name":"hello-gcs-hello","flushCount":162}
{"level":"info","time":"2024-02-15T21:14:33.330Z","logger":"watcher","caller":"dynamic/dynamic.go:432","msg":"GGMGGM1 flush ret count","commit":"d7fe8ee","knative.dev/traceid":"534637e7-7558-4ef5-bcba-0ce8b864a7d3","knative.dev/key":"default/hello-gcs","results.tekton.dev/kind":"PipelineRun","name":"hello-gcs","flushCount":197}
```
WDYT?
thanks
The following is the coverage report on the affected files.
Say `/test pull-tekton-results-go-coverage` to re-run this coverage report
File | Old Coverage | New Coverage | Delta |
---|---|---|---|
pkg/watcher/reconciler/dynamic/dynamic.go | 69.3% | 63.6% | -5.7 |
good news @sayan-biswas @avinal @ramessesii2 @enarha - with https://github.com/tektoncd/results/pull/712/commits/c76523cde4b493e39b4cb1556a1dc8ac9987e1d3 I have bypassed the canceled-context error in `UpdateLog` on the API server.
Seeing that error with the mem leak fix maybe makes sense, in that we are doing better at unblocking the tknlog.NewWriter(..).Write(...) call, but I'm having @pmacik run a parallel test with release 0.9.1 so we can confirm whether the error was ever happening with that version.
I just now need to hack on this fix some more so that we sufficiently block on the context before we exit `streamLogs`.
The following is the coverage report on the affected files.
Say `/test pull-tekton-results-go-coverage` to re-run this coverage report
File | Old Coverage | New Coverage | Delta |
---|---|---|---|
pkg/watcher/reconciler/dynamic/dynamic.go | 69.3% | 62.7% | -6.6 |
The following is the coverage report on the affected files.
Say `/test pull-tekton-results-go-coverage` to re-run this coverage report
File | Old Coverage | New Coverage | Delta |
---|---|---|---|
pkg/watcher/reconciler/dynamic/dynamic.go | 69.3% | 61.6% | -7.7 |
hmmm ... so there is still a timing window @sayan-biswas @avinal @ramessesii2 @enarha, even with my blocking on channels in `streamLogs`, given all the asynchronous processing going on.
The relevant logs from the watcher and API server:
```
{"level":"info","time":"2024-02-16T19:39:52.618Z","logger":"watcher","caller":"dynamic/dynamic.go:485","msg":"GGM6 streamLogs after final select exiting","commit":"aef54ef","knative.dev/traceid":"aeab2575-5f45-4a10-97e1-0596418e2d61","knative.dev/key":"default/hello-gcs","results.tekton.dev/kind":"PipelineRun","namespace":"default","name":"hello-gcs"}
{"level":"info","ts":1708112392.5995204,"caller":"zap/options.go:212","msg":"finished streaming call with code Canceled","grpc.auth_disabled":true,"grpc.start_time":"2024-02-16T19:39:52Z","grpc.request.deadline":"2024-02-16T19:49:52Z","system":"grpc","span.kind":"server","grpc.service":"tekton.results.v1alpha2.Logs","grpc.method":"UpdateLog","peer.address":"10.244.1.14:43978","grpc.user":"system:serviceaccount:tekton-pipelines:tekton-results-watcher","grpc.issuer":"https://kubernetes.default.svc.cluster.local","error":"rpc error: code = Canceled desc = context canceled","grpc.code":"Canceled","grpc.time_duration_in_ms":85}
```
So we either:
1) filter dynamic.go from the `go vet` check so we don't get that complaint, though that is certainly not best practice
2) add a `Recv()` call using the GetLog client to confirm the `UpdateLog` is complete .... I need to try it out, but I'm seeing comments in some grpc-go issues that talk about that call blocking
3) or come up with some other idea
The following is the coverage report on the affected files.
Say `/test pull-tekton-results-go-coverage` to re-run this coverage report
File | Old Coverage | New Coverage | Delta |
---|---|---|---|
pkg/watcher/reconciler/dynamic/dynamic.go | 69.3% | 57.6% | -11.7 |
The following is the coverage report on the affected files.
Say `/test pull-tekton-results-go-coverage` to re-run this coverage report
File | Old Coverage | New Coverage | Delta |
---|---|---|---|
pkg/watcher/reconciler/dynamic/dynamic.go | 69.3% | 58.9% | -10.4 |
The following is the coverage report on the affected files.
Say `/test pull-tekton-results-go-coverage` to re-run this coverage report
File | Old Coverage | New Coverage | Delta |
---|---|---|---|
pkg/watcher/reconciler/dynamic/dynamic.go | 69.3% | 56.7% | -12.6 |
ok @sayan-biswas @avinal @ramessesii2 @enarha the commit https://github.com/tektoncd/results/pull/712/commits/8831117dde4e10f7f9a9af0fd57e9889e81b7451 proves the timing issue noted in https://github.com/tektoncd/results/pull/712#issuecomment-1949267757 in the run https://prow.tekton.dev/view/gs/tekton-prow/pr-logs/pull/tektoncd_results/712/pull-tekton-results-integration-tests/1758610901969145856. There are still timing windows where the context cleanup from cancel() (which `go vet` wants to require) kills the `UpdateLog` processing in the API server.
now, I had to employ `wait.PollImmediately...` on the `Recv()` call in this current commit to mitigate the timing window, so that we could verify the `UpdateLog` update passed in the API server but still have the context cancel function executed when we leave `streamLogs`
we can discuss voice to voice about this if need be, but use of `wait.PollImmediately...` is not a viable thing to perform on a controller reconciler thread ... I just did this temporarily to prove the timing window situation
I believe this means that we have to bypass the second goroutine launched in `streamLogs`, where tkn cli writes directly to gRPC, and instead first write to an in-memory buffer on the main thread in `streamLogs`, then call `Write()`, `Flush()`, and `CloseSend()` on a `logs.NewBufferedWriter(logsClient, logName, logs.DefaultBufferSize)` on the main thread
I'll take a shot at this before I sign off for today; otherwise I'll pick up on Monday, where we hopefully can discuss voice to voice in standup
@adambkaplan FYI
The following is the coverage report on the affected files.
Say `/test pull-tekton-results-go-coverage` to re-run this coverage report
File | Old Coverage | New Coverage | Delta |
---|---|---|---|
pkg/watcher/reconciler/dynamic/dynamic.go | 69.3% | 65.6% | -3.7 |
duh, tired and wasn't thinking .... we can still hit the cancel timing window even when single threaded
though the code is much simpler this way; thinking one goroutine off of `sendLogs` is sufficient
we either need the icky `wait.Poll...` logic before exiting `streamLogs`, or we skip the cancel and configure `go vet` to filter this module, or another idea that is not occurring to me after a full day
see y'all Monday
@sayan-biswas @avinal @ramessesii2 @enarha @adambkaplan FYI
The following is the coverage report on the affected files.
Say `/test pull-tekton-results-go-coverage` to re-run this coverage report
File | Old Coverage | New Coverage | Delta |
---|---|---|---|
pkg/watcher/reconciler/dynamic/dynamic.go | 69.3% | 65.2% | -4.1 |
@sayan-biswas @avinal @ramessesii2 @enarha @adambkaplan sorted it out .... with more simplification to boot
no new context needed in `streamLogs`; no context timeout needed; just launch a separate thread in `sendLogs` for now, which I may want to get rid of as well, though maybe in a separate PR, since we can bump the worker thread pool
we'll see after testing with Pavel
@gabemontero I have one question. Why are we passing the same `inMemWriteBuffer` to both the `out` and `err` writers? https://github.com/tektoncd/results/blob/8a81833caec8138f08f2566c68b2926920c6f0de/pkg/watcher/reconciler/dynamic/dynamic.go#L407
What happens if there's an error while buffering? In the line above, is it going to send the error message to the API server instead of the log message?
yep I see what you mean @sayan-biswas .... I will update this to create a separate err buffer on the stack that is used to generate log messages and return errors, as needed
fwiw this bit should not have any bearing on the soak test with this PR that @pmacik started ... ultimately that test is not validating the precise contents of the stored log; only that the log update has completed and that memory and GC look OK
thanks
The following is the coverage report on the affected files.
Say `/test pull-tekton-results-go-coverage` to re-run this coverage report
File | Old Coverage | New Coverage | Delta |
---|---|---|---|
pkg/watcher/reconciler/dynamic/dynamic.go | 69.3% | 63.1% | -6.2 |
initial separation of the error channel complete @sayan-biswas
for now, I'll refrain from posting here any details from the internal slack thread we have around error handling; let's see what initial results we get from the currently running soak tests
The following is the coverage report on the affected files.
Say `/test pull-tekton-results-go-coverage` to re-run this coverage report
File | Old Coverage | New Coverage | Delta |
---|---|---|---|
pkg/watcher/reconciler/dynamic/dynamic.go | 69.3% | 64.6% | -4.7 |
The following is the coverage report on the affected files.
Say `/test pull-tekton-results-go-coverage` to re-run this coverage report
File | Old Coverage | New Coverage | Delta |
---|---|---|---|
pkg/watcher/reconciler/dynamic/dynamic.go | 69.3% | 63.1% | -6.2 |
The following is the coverage report on the affected files.
Say `/test pull-tekton-results-go-coverage` to re-run this coverage report
File | Old Coverage | New Coverage | Delta |
---|---|---|---|
pkg/watcher/reconciler/dynamic/dynamic.go | 69.3% | 63.0% | -6.3 |
The following is the coverage report on the affected files.
Say `/test pull-tekton-results-go-coverage` to re-run this coverage report
File | Old Coverage | New Coverage | Delta |
---|---|---|---|
pkg/watcher/reconciler/dynamic/dynamic.go | 69.3% | 63.4% | -5.9 |
my goodness, golint even complains about misspelled words :-)
```
pkg/watcher/reconciler/dynamic/dynamic.go:374:5: `reconcilation` is a misspelling of `reconciliation` (misspell)
	// reconcilation in this controller
	^
make: *** [Makefile:88: golangci-lint-check] Error 1
```
The following is the coverage report on the affected files.
Say `/test pull-tekton-results-go-coverage` to re-run this coverage report
File | Old Coverage | New Coverage | Delta |
---|---|---|---|
pkg/watcher/reconciler/dynamic/dynamic.go | 69.3% | 64.2% | -5.1 |
@sayan-biswas @avinal @ramessesii2 @enarha I think it would be good to get this PR in first, to provide some immediate relief for our deployment; then I can rebase #715 as needed and, after our stress test for that one is done, follow up with getting it ready for merge
PTAL
bump @sayan-biswas @avinal per my comment https://github.com/tektoncd/results/pull/715#discussion_r1501847972 let's merge this before #715 to get some immediate relief for the memory leak as well as get some baselines in production environments before we entertain #715
I've squashed the commits (there were several temp/debug commits in there)
The following is the coverage report on the affected files.
Say `/test pull-tekton-results-go-coverage` to re-run this coverage report
File | Old Coverage | New Coverage | Delta |
---|---|---|---|
pkg/watcher/reconciler/dynamic/dynamic.go | 69.3% | 64.2% | -5.1 |
/hold cancel
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: avinal, sayan-biswas
The full list of commands accepted by this bot can be found here.
The pull request process is described here
/lgtm
Changes
Possibly fixes https://github.com/tektoncd/results/issues/695
My team (including @ramessesii2 @avinal @sayan-biswas @pmacik) is doing some internal testing both to zero in on the cause of the leak and to see whether our theory that the goroutines are not getting GC'ed, based on what we see in testing, proves out.
I've marked this as WIP and hold because aside from still needing to verify things, this PR currently has some debugging prints we will eventually remove.
/kind bug
/hold
Submitter Checklist
These are the criteria that every PR should meet; please check them off as you review them:
`/kind <type>`. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep
Release Notes