tektoncd / results

Long term storage of execution results.
Apache License 2.0
77 stars 73 forks source link

employ some channel best practices to avoid hanging goroutines #712

Closed gabemontero closed 7 months ago

gabemontero commented 7 months ago

Changes

Fixes https://github.com/tektoncd/results/issues/695

possibly

my team (including @ramessesii2 @avinal @sayan-biswas @pmacik) is doing some internal testing to both zero in on the cause of the leak, as well as see if our theory around the goroutines not getting GC'ed based on what we see in testing proves out.

I've marked this as WIP and hold because aside from still needing to verify things, this PR currently has some debugging prints we will eventually remove.

/kind bug /hold

Submitter Checklist

These are the criteria that every PR should meet, please check them off as you review them:

Release Notes

Fixes around the threading / goroutines employed while storing logs mitigate an observed memory leak
tekton-robot commented 7 months ago

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 65.4% -3.9
tekton-robot commented 7 months ago

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 65.0% -4.3
tekton-robot commented 7 months ago

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 62.6% -6.8
tekton-robot commented 7 months ago

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 65.0% -4.3
tekton-robot commented 7 months ago

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 65.0% -4.3
gabemontero commented 7 months ago

/retest

tekton-robot commented 7 months ago

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 65.0% -4.3
tekton-robot commented 7 months ago

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 65.0% -4.3
tekton-robot commented 7 months ago

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 65.0% -4.3
tekton-robot commented 7 months ago

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 63.9% -5.5
tekton-robot commented 7 months ago

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 63.9% -5.5
gabemontero commented 7 months ago

hey @sayan-biswas @avinal @ramessesii2 @enarha I'm going to need your help with the error I'm seeing in the integration tests here.

Turns out the UpdateLog is not completing successfully. When I dump the API server logs, I see a couple context canceled messages in the logs.

        {"level":"info","ts":1708029631.050909,"caller":"zap/options.go:212","msg":"finished streaming call with code Canceled","grpc.auth_disabled":true,"grpc.start_time":"2024-02-15T20:40:30Z","system":"grpc","span.kind":"server","grpc.service":"tekton.results.v1alpha2.Logs","grpc.method":"UpdateLog","peer.address":"10.244.1.14:34252","grpc.user":"system:serviceaccount:tekton-pipelines:tekton-results-watcher","grpc.issuer":"https://kubernetes.default.svc.cluster.local","error":"rpc error: code = Canceled desc = context canceled","grpc.code":"Canceled","grpc.time_duration_in_ms":97}

and with my latest commit, I also get

        {"level":"error","ts":1708032696.9937341,"caller":"zap/options.go:212","msg":"finished streaming call with code Unknown","grpc.auth_disabled":true,"grpc.start_time":"2024-02-15T21:31:36Z","system":"grpc","span.kind":"server","grpc.service":"tekton.results.v1alpha2.Logs","grpc.method":"UpdateLog","peer.address":"10.244.1.14:54414","grpc.user":"system:serviceaccount:tekton-pipelines:tekton-results-watcher","grpc.issuer":"https://kubernetes.default.svc.cluster.local","error":"context canceled","grpc.code":"Unknown","grpc.time_duration_in_ms":48,"stacktrace":"github.com/grpc-ecosystem/go-grpc-middleware/logging/zap.DefaultMessageProducer\n\tgithub.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/logging/zap/options.go:212\ngithub.com/grpc-ecosystem/go-grpc-middleware/logging/zap.StreamServerInterceptor.func1\n\tgithub.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/logging/zap/server_interceptors.go:61\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainStreamServer.func1.1.1\n\tgithub.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:49\ngithub.com/grpc-ecosystem/go-grpc-middleware/tags.StreamServerInterceptor.func1\n\tgithub.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/tags/interceptors.go:39\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainStreamServer.func1.1.1\n\tgithub.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:49\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainStreamServer.func1\n\tgithub.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:58\ngoogle.golang.org/grpc.(*Server).processStreamingRPC\n\tgoogle.golang.org/grpc@v1.60.1/server.go:1673\ngoogle.golang.org/grpc.(*Server).handleStream\n\tgoogle.golang.org/grpc@v1.60.1/server.go:1787\ngoogle.golang.org/grpc.(*Server).serveStreams.func2.1\n\tgoogle.golang.org/grpc@v1.60.1/server.go:1016"}

Now, I first thought it was related to the child context we originally put in the fix, but I've made a couple of successive commits to first remove the calling of the cancel function, and then remove the use of the child context, but they have not had any effect. Still get this canceled context error on the UpdateLog

I did add debug in the goroutine and confirm that Flush call returns no error, and there were positive counts wrt data being flushed:

{"level":"info","time":"2024-02-15T21:14:33.290Z","logger":"watcher","caller":"dynamic/dynamic.go:432","msg":"GGMGGM1 flush ret count","commit":"d7fe8ee","knative.dev/traceid":"65b9ff5a-eed4-415b-9b87-65df76f0cabb","knative.dev/key":"default/hello-gcs-hello","results.tekton.dev/kind":"TaskRun","name":"hello-gcs-hello","flushCount":162}
{"level":"info","time":"2024-02-15T21:14:33.330Z","logger":"watcher","caller":"dynamic/dynamic.go:432","msg":"GGMGGM1 flush ret count","commit":"d7fe8ee","knative.dev/traceid":"534637e7-7558-4ef5-bcba-0ce8b864a7d3","knative.dev/key":"default/hello-gcs","results.tekton.dev/kind":"PipelineRun","name":"hello-gcs","flushCount":197}

WDYT?

thanks

tekton-robot commented 7 months ago

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 63.6% -5.7
gabemontero commented 7 months ago

good news @sayan-biswas @avinal @ramessesii2 @enarha - with https://github.com/tektoncd/results/pull/712/commits/c76523cde4b493e39b4cb1556a1dc8ac9987e1d3 I have bypassed the canceled context error in the UpdateLogs on the API server.

Seeing that error with the mem leak fix maybe makes sense, in that we are doing better at unblocking the tknlog.NewWriter(..).Write(...) call but I'm having @pmacik run a parallel test with release 0.9.1 so we can confirm whether the error was ever happening with that version.

I just now need to hack on this fix some more so that we sufficiently block on the context before we exit streamLogs

tekton-robot commented 7 months ago

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 62.7% -6.6
tekton-robot commented 7 months ago

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 61.6% -7.7
gabemontero commented 7 months ago

hmmm ... so there is still a timing window @sayan-biswas @avinal @ramessesii2 @enarha even with my blocking on channels in streamLogs, given all the asynchronous processing going on.

The relevant logs from the watcher and api server:

{"level":"info","time":"2024-02-16T19:39:52.618Z","logger":"watcher","caller":"dynamic/dynamic.go:485","msg":"GGM6 streamLogs after final select exiting","commit":"aef54ef","knative.dev/traceid":"aeab2575-5f45-4a10-97e1-0596418e2d61","knative.dev/key":"default/hello-gcs","results.tekton.dev/kind":"PipelineRun","namespace":"default","name":"hello-gcs"}

{"level":"info","ts":1708112392.5995204,"caller":"zap/options.go:212","msg":"finished streaming call with code Canceled","grpc.auth_disabled":true,"grpc.start_time":"2024-02-16T19:39:52Z","grpc.request.deadline":"2024-02-16T19:49:52Z","system":"grpc","span.kind":"server","grpc.service":"tekton.results.v1alpha2.Logs","grpc.method":"UpdateLog","peer.address":"10.244.1.14:43978","grpc.user":"system:serviceaccount:tekton-pipelines:tekton-results-watcher","grpc.issuer":"https://kubernetes.default.svc.cluster.local","error":"rpc error: code = Canceled desc = context canceled","grpc.code":"Canceled","grpc.time_duration_in_ms":85}

So we either 1) filter dynamic.go from the go vet so we don't get that complaint, though it certainly is not best practice 2) add a Recv() call using the GetLog client to confirm the UpdateLog is complete .... need to try it out, but I'm seeing comments in some grpc-go issues which talks about that call blocking 3) or some other idea one of us comes up with

tekton-robot commented 7 months ago

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 57.6% -11.7
tekton-robot commented 7 months ago

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 58.9% -10.4
tekton-robot commented 7 months ago

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 56.7% -12.6
tekton-robot commented 7 months ago

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 56.7% -12.6
gabemontero commented 7 months ago

ok @sayan-biswas @avinal @ramessesii2 @enarha the commit https://github.com/tektoncd/results/pull/712/commits/8831117dde4e10f7f9a9af0fd57e9889e81b7451 proves the timing issue noted in https://github.com/tektoncd/results/pull/712#issuecomment-1949267757 in the run https://prow.tekton.dev/view/gs/tekton-prow/pr-logs/pull/tektoncd_results/712/pull-tekton-results-integration-tests/1758610901969145856 where there are still timing windows with the context cleanup from cancel() (which go vet want to require) killing the UpdateLog processing in the API server.

now, I had to employ wait.PollImmediately... on the Recv() call in this current commit to mitigate the timing window so that we could verify the UpdateLog update passed in the api server but still have the context cancel function executed when we leave streamLogs

we can discuss voice to voice if you like about this if need be, but use of wait.PollImmediately... is not a viable thing to perform on a controller reconciler thread ... just did this temporarily to prove the timing window situation

I believe this means that we have to bypass the second goroutine launched in streamLogs where tkncli writes directly to grpc, and instead, first write to an in memory buffer on the main thread in streamLogs, and then call logs.NewBufferedWriter(logsClient, logName, logs.DefaultBufferSize).Write() followed by logs.NewBufferedWriter(logsClient, logName, logs.DefaultBufferSize).Flush() and logs.NewBufferedWriter(logsClient, logName, logs.DefaultBufferSize).CloseSend() on the main thread

I'll take a shot at this before I sign off for today; otherwise I'll pick up on Monday, where we hopefully can discuss voice to voice in standup

@adambkaplan FYI

tekton-robot commented 7 months ago

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 65.6% -3.7
gabemontero commented 7 months ago

ok @sayan-biswas @avinal @ramessesii2 @enarha the commit 8831117 proves the timing issue noted in #712 (comment) in the run https://prow.tekton.dev/view/gs/tekton-prow/pr-logs/pull/tektoncd_results/712/pull-tekton-results-integration-tests/1758610901969145856 where there are still timing windows with the context cleanup from cancel() (which go vet want to require) killing the UpdateLog processing in the API server.

now, I had to employ wait.PollImmediately... on the Recv() call in this current commit to mitigate the timing window so that we could verify the UpdateLog update passed in the api server but still have the context cancel function executed when we leave streamLogs

we can discuss voice to voice if you like about this if need be, but use of wait.PollImmediately... is not a viable thing to perform on a controller reconciler thread ... just did this temporarily to prove the timing window situation

I believe this means that we have to bypass the second goroutine launched in streamLogs where tkncli writes directly to grpc, and instead, first write to an in memory buffer on the main thread in streamLogs, and then call logs.NewBufferedWriter(logsClient, logName, logs.DefaultBufferSize).Write() followed by logs.NewBufferedWriter(logsClient, logName, logs.DefaultBufferSize).Flush() and logs.NewBufferedWriter(logsClient, logName, logs.DefaultBufferSize).CloseSend() on the main thread

I'll take a shot at this before I sign off for today; otherwise I'll pick up on Monday, where we hopefully can discuss voice to voice in standup

@adambkaplan FYI

duh tired and wasn't thinking .... can still hit the cancel timing window even when single threaded

though the code is much simpler this way, thinking one goroutine off of sendLogs is sufficient

either need the icky wait.Poll... logic before exiting streamLogs, or don't do the cancel and get `go vet to filter this module, or another idea not occurring to me after a full day

see y'all Monday

@sayan-biswas @avinal @ramessesii2 @enarha @adambkaplan FYI

tekton-robot commented 7 months ago

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 65.2% -4.1
tekton-robot commented 7 months ago

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 65.2% -4.1
gabemontero commented 7 months ago

@sayan-biswas @avinal @ramessesii2 @enarha @adambkaplan sorted it out .... with more simplification to boot

no new context needed in streamLogs; no context timeout needed; just launch a separate thread in sendLogs for now, which I may want to get rid of as well, though maybe a separate PR, as we can bump the working thread pool

we'll see after testing with Pavel

sayan-biswas commented 7 months ago

@gabemontero I have one question. Why are we passing same inMemWriteBuffer to both out and err writer? What happens, if while buffering there's an error?

https://github.com/tektoncd/results/blob/8a81833caec8138f08f2566c68b2926920c6f0de/pkg/watcher/reconciler/dynamic/dynamic.go#L407

In the above line is it going to send the error message to the API server instead of the log message?

gabemontero commented 7 months ago

@gabemontero I have one question. Why are we passing same inMemWriteBuffer to both out and err writer? What happens, if while buffering there's an error?

https://github.com/tektoncd/results/blob/8a81833caec8138f08f2566c68b2926920c6f0de/pkg/watcher/reconciler/dynamic/dynamic.go#L407

In the above line is it going to send the error message to the API server instead of the log message?

yep I see what you mean @sayan-biswas .... I will update this to create a separate err buffer on the stack which is used to generate log messages, return errors, as needed

fwiw this bit should not have any bearing on the soak test with this PR that @pmacik started ... ultimately that test is not validating the precise contents of the stored log; only that that update log has completed and that the memory and GC looks OK

thanks

tekton-robot commented 7 months ago

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 63.1% -6.2
tekton-robot commented 7 months ago

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 63.1% -6.2
gabemontero commented 7 months ago

@gabemontero I have one question. Why are we passing same inMemWriteBuffer to both out and err writer? What happens, if while buffering there's an error? https://github.com/tektoncd/results/blob/8a81833caec8138f08f2566c68b2926920c6f0de/pkg/watcher/reconciler/dynamic/dynamic.go#L407

In the above line is it going to send the error message to the API server instead of the log message?

yep I see what you mean @sayan-biswas .... I will update this to create a separate err buffer on the stack which is used to generate log messages, return errors, as needed

fwiw this bit should not have any bearing on the soak test with this PR that @pmacik started ... ultimately that test is not validating the precise contents of the stored log; only that that update log has completed and that the memory and GC looks OK

thanks

initial separation of the error channel complete @sayan-biswas

for now, I'll refrain from posting here any details from the internal slack thread we have around error handling; let's see what initial results we get from the currently running soak tests

tekton-robot commented 7 months ago

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 64.6% -4.7
tekton-robot commented 7 months ago

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 63.1% -6.2
tekton-robot commented 7 months ago

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 63.0% -6.3
tekton-robot commented 7 months ago

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 63.4% -5.9
gabemontero commented 7 months ago

my goodness golint even complains about misspelling words :-)

 pkg/watcher/reconciler/dynamic/dynamic.go:374:5: `reconcilation` is a misspelling of `reconciliation` (misspell)
    // reconcilation in this controller
       ^
make: *** [Makefile:88: golangci-lint-check] Error 1 
tekton-robot commented 7 months ago

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 64.2% -5.1
gabemontero commented 7 months ago

@sayan-biswas @avinal @ramessesii2 @enarha I think I would be good to get this PR in first, to help provide some immediately relief with our deployment, and then I can rebase #715 as needed and after our stress test for that one is done, follow up with getting you all to get that ready for merge

PTAL

gabemontero commented 7 months ago

@sayan-biswas @avinal @ramessesii2 @enarha I think I would be good to get this PR in first, to help provide some immediately relief with our deployment, and then I can rebase #715 as needed and after our stress test for that one is done, follow up with getting you all to get that ready for merge

PTAL

bump @sayan-biswas @avinal per my comment https://github.com/tektoncd/results/pull/715#discussion_r1501847972 let's merge this before #715 to get some immediate relief for the memory leak as well as get some baselines in production environments before we entertain #715

gabemontero commented 7 months ago

I've squashed the commits (there were several temp/debug commits in there)

tekton-robot commented 7 months ago

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 64.2% -5.1
gabemontero commented 7 months ago

/hold cancel

sayan-biswas commented 7 months ago

/approve

tekton-robot commented 7 months ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: avinal, sayan-biswas

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files: - ~~[OWNERS](https://github.com/tektoncd/results/blob/main/OWNERS)~~ [avinal,sayan-biswas] Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment
sayan-biswas commented 7 months ago

/lgtm