tektoncd / results

Long term storage of execution results.
Apache License 2.0
77 stars 73 forks source link

switch from tkn client to k8s pod log retrieval; fix panic, add debug to e2e_gcs_test #715

Closed gabemontero closed 2 months ago

gabemontero commented 7 months ago

rh-pre-commit.version: 2.2.0 rh-pre-commit.check-secrets: ENABLED

Changes

Fixes https://github.com/tektoncd/results/issues/720

alternative to https://github.com/tektoncd/results/pull/712 with regard to still addressing the goroutine/memory leak, swtiching from the tkn client to k8s client, but with the reduction of threads, allows for retry when appropriate on UpdateLog errors.

/kind bug

These changes address a goroutine/memory leak that existed with the prior main branch vesrion,

Additionally

Submitter Checklist

These are the criteria that every PR should meet, please check them off as you review them:

Release Notes

addresses UpdateLog calls from the watcher failing with canceled context, and allows for proper erro retry, while still fixing goroutine leak, which leads to overall memory leak, when storing logs
tekton-robot commented 7 months ago

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 55.7% -13.7
tekton-robot commented 7 months ago

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 55.7% -13.7
tekton-robot commented 7 months ago

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 55.4% -14.0
gabemontero commented 7 months ago

@sayan-biswas @avinal @enarha @ramessesii2 @adambkaplan FYI a first pass at replacing tkn cli with generic k8s get pod logs API calls

I could see this replacing #712 after a little more refining.... but curious on opinions / feedback

also really close to getting rid of launching the separate goroutine in sendLogs, but could also see deferring that to a follow pull request that also dives into verifying UpdateLogs completed in the API server to completely remove the intermittent canceled context UpdateLog instances we are seeing in Pavel's soak test. But I took a first pass at that today and that is going to require some unit test overall, as I've discovered the unit tests are not fully wired to return fake data.

tekton-robot commented 7 months ago

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 55.7% -13.7
tekton-robot commented 7 months ago

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 61.7% -7.7
tekton-robot commented 7 months ago

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 61.5% -7.9
tekton-robot commented 7 months ago

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 61.5% -7.9
tekton-robot commented 7 months ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sayan-biswas

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files: - ~~[OWNERS](https://github.com/tektoncd/results/blob/main/OWNERS)~~ [sayan-biswas] Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment
sayan-biswas commented 7 months ago

Looks good to me. I ran it locally and the logs were stored perfectly. Although I did't check with bulk pipeline runs and larger logs.

I just have one comment on the log format.

*** Logs for pod test-pipelinerun-dd25b-task1-pod container step-print-uid ***
TaskRun UID: fd819238-26b8-46c7-9e2c-f31a41fe7bff
PipelineRun UID from params: 79e2c2d5-748b-4790-8ef1-eab31c833f10
*** Logs for pod test-pipelinerun-dd25b-task1-pod container step-print-names ***
Task name: test-pipelinerun-dd25b-task1
TaskRun name: test-pipelinerun-dd25b-task1
Pipeline name from params: test-pipelinerun-dd25b
PipelineRun name from params: test-pipelinerun-dd25b
gabemontero commented 7 months ago

Looks good to me. I ran it locally and the logs were stored perfectly. Although I did't check with bulk pipeline runs and larger logs.

I just have one comment on the log format.

*** Logs for pod test-pipelinerun-dd25b-task1-pod container step-print-uid ***
TaskRun UID: fd819238-26b8-46c7-9e2c-f31a41fe7bff
PipelineRun UID from params: 79e2c2d5-748b-4790-8ef1-eab31c833f10
*** Logs for pod test-pipelinerun-dd25b-task1-pod container step-print-names ***
Task name: test-pipelinerun-dd25b-task1
TaskRun name: test-pipelinerun-dd25b-task1
Pipeline name from params: test-pipelinerun-dd25b
PipelineRun name from params: test-pipelinerun-dd25b

@sayan-biswas is your question here ^^ separate from the one in https://github.com/tektoncd/results/pull/715#discussion_r1500558338 ?

If it is, apologies I'm missing what you are getting at, and ask that you elaborate either here or in slack/team meetings. Thanks.

sayan-biswas commented 7 months ago

@gabemontero No, the one the comment is what I was referring. Everything is working for me. Did not test memory footprint though.

gabemontero commented 7 months ago

@gabemontero No, the one the comment is what I was referring. Everything is working for me. Did not test memory footprint though.

OK thanks for confirming the comments refer to the same core notion @sayan-biswas

As to memory footprint, I am waiting to hear back from Pavel on when he can start a smoke test, so not yet.

This is a factor as to why I want to merge #712 first. We have validated the memory footprint is fixed there. Also, I want to see how this change looks in our prod env for a bit with more worker threads, particularly with the workqueue depth, so as to potentially compare with this change.

tekton-robot commented 7 months ago

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 56.1% -13.2
gabemontero commented 7 months ago

/hold

merge after #712, but let's review get it ready to merge in parallel

tekton-robot commented 7 months ago

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 56.7% -12.6
tekton-robot commented 7 months ago

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 55.0% -14.3
gabemontero commented 7 months ago

/retest

tekton-robot commented 7 months ago

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 54.7% -14.6
tekton-robot commented 7 months ago

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 53.7% -15.7
tekton-robot commented 7 months ago

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 53.7% -15.7
tekton-robot commented 7 months ago

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 53.7% -15.7
gabemontero commented 7 months ago

/retest

gabemontero commented 7 months ago

Tests green at the moment with the tkn formatting, but there does seem to be some sort of intermittent flake in the vanilla e2e-tests I have not seen before today and the change to do tkn formatting.

 === RUN   TestPipelineRun/check_annotations
    e2e_test.go:238: Error waiting for PipelineRun creation: timed out waiting for the condition
=== RUN   TestPipelineRun/check_deletion
    e2e_test.go:254: Error waiting for PipelineRun deletion: timed out waiting for the condition
=== RUN   TestPipelineRun/check_result
    e2e_test.go:260: Result name not found
=== RUN   TestPipelineRun/check_record
    e2e_test.go:270: Record name not found
=== RUN   TestPipelineRun/result_data_consistency
    e2e_test.go:283: rpc error: code = InvalidArgument desc = rpc error: code = InvalidArgument desc = name must match (^[a-z0-9_-]{1,63})/results/([a-z0-9_-]{1,63}$)
--- FAIL: TestPipelineRun (120.07s)
    --- FAIL: TestPipelineRun/check_annotations (60.02s)
    --- FAIL: TestPipelineRun/check_deletion (60.02s)
    --- SKIP: TestPipelineRun/check_result (0.00s)
    --- SKIP: TestPipelineRun/check_record (0.00s)
    --- FAIL: TestPipelineRun/result_data_consistency (0.00s) 

let's review that change all the same @sayan-biswas , but I'll add some debug to that base e2e, see if I can uncover anything.

sayan-biswas commented 7 months ago

@gabemontero This is because the polling time set is too low. So before the TaskRun/PipelineRun is finishing, it is getting timeout and hence the test is failing. Sometimes the cluster is slow, so this happens intermittently.

https://github.com/tektoncd/results/blob/main/test%2Fe2e%2Fe2e_test.go#L145

gabemontero commented 7 months ago

@gabemontero This is because the polling time set is too low. So before the TaskRun/PipelineRun is finishing, it is getting timeout and hence the test is failing. Sometimes the cluster is slow, so this happens intermittently.

https://github.com/tektoncd/results/blob/main/test%2Fe2e%2Fe2e_test.go#L145

gotcha thanks for the history/insight and detail

if you want me to bump those various PollImmediate calls in that test as part of this, let me know

tekton-robot commented 7 months ago

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 64.2% 55.8% -8.4
gabemontero commented 7 months ago

rebase basically OK, but

I think there was a hiccup on the go mod download which impacted golint with the build-tests

re-running go mod tidy and go mod vendor produced no new changes in my branch

let's see what happens with

/retest

gabemontero commented 7 months ago

rebase basically OK, but

I think there was a hiccup on the go mod download which impacted golint with the build-tests

re-running go mod tidy and go mod vendor produced no new changes in my branch

let's see what happens with

/retest

flake confirmed

e2e's green

@sayan-biswas @avinal per our video conf today, feel free to provide feedback for changes here, or lgtm, as your availability allows. In the mean time, we'll start testing #712 on our end, and gather some base lines with increased worker thread pool, as well as revalidate the memory leak fix we confirmed in our performance stress test. Along with hopefully getting a performance stresss test with these changes soon, in order to helpe validate the merging of this path we all agree is better long term.

I'll remove the hold once all ^^ is sorted out.

thanks

gabemontero commented 7 months ago

So @pmacik's initial soak testing appears to uncover a new version of the goroutine leak (grpc client goroutines).

I suspect it is not too dissimilar what was was fixed with #712, but with the removal of the tkn client thread/channel/context machinations, we've fallen back into the warnings around https://pkg.go.dev/google.golang.org/grpc#ClientConn.NewStream

@pmacik's pprof data showed a lot of parked threads in one of the new connection goroutines waiting on context's be done.

I've started working on an update to this PR that both:

However, I appear to be hitting flakiness with the grpc unit test alternatives for file storage, even when stepping through the code. I can see repeated GetLogs fail when the data is clearly on my local disk.

When I next pick this up, I'll see about ignoring these errors in the unit tests, and see how the PR e2e's using kind do with everything.

@avinal @sayan-biswas @ramessesii2 FYI

tekton-robot commented 7 months ago

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 64.2% 55.3% -8.9
gabemontero commented 7 months ago

flake on kind install of results in integration test ... trying again

/retest

gabemontero commented 7 months ago

/retest

gabemontero commented 7 months ago

OK @sayan-biswas @avinal @ramessesii2 the switch from CloseSend to CloseRecv (which calls CloseSend under the covers and then does a Recv on a LogSummary to apparently confirm the operation was successful) passed e2e's.

I am next going to attempt to re-introduce a child context with timeout and calling of its cancel to hopefully 1) fully clean up the grpc goroutines currently blocked on context that Pavel has observed in the smoke test 2) but not cancel the context prematurely such that the UpdateLog call is interrupted with 'canceled context'

tekton-robot commented 7 months ago

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 64.2% 55.7% -8.5
gabemontero commented 7 months ago

OK success @sayan-biswas @avinal @ramessesii2 @adambkaplan @pmacik @khrm with https://prow.tekton.dev/view/gs/tekton-prow/pr-logs/pull/tektoncd_results/715/pull-tekton-results-integration-tests/1763567949735530496

This PR now:

side bar - @khrm I'll be curious how your viper POC mirror / overlaps with the grpc machinations with the results api server

I'll work out with @pmacik on getting the next soak test run started, but ideally in parallel, let's get the review going here, assuming any requested changes are minor and don't alter the overall structure noted above in this comment.

Thanks

tekton-robot commented 7 months ago

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 64.2% 55.7% -8.5
gabemontero commented 7 months ago

OK success @sayan-biswas @avinal @ramessesii2 @adambkaplan @pmacik @khrm with https://prow.tekton.dev/view/gs/tekton-prow/pr-logs/pull/tektoncd_results/715/pull-tekton-results-integration-tests/1763567949735530496

This PR now:

* employs k8s pod log APIs with no goroutines launched by the watcher to allow for error inspection and intelligent retry

* but employs tkn style line by line prefixing

* it employs a child context with timeout and a call to its cancel func which _should_ cleanup all the grpc client goroutines (at least 4 by `newClientStreamWithParams` alone) that get created as part of making the `UpdateLog` call to the results API server .... we'll need to confirm this with the next (and hopefully final for now) soak test round with @pmacik

* but the switch from `CloseSend` to `CloseAndRecv` (which calls `CloseSend` under the covers, but then makes a `Recv` call on the `LogSummary` obj which at least with the integration tests seem to confirm the `UpdateLog` completes before our cancel func cleans things up;  when we only used `CloseSend` the `UpdateLog` call would get aborted with a `canceled context error`

side bar - @khrm I'll be curious how your viper POC mirror / overlaps with the grpc machinations with the results api server

I'll work out with @pmacik on getting the next soak test run started, but ideally in parallel, let's get the review going here, assuming any requested changes are minor and don't alter the overall structure noted above in this comment.

Thanks

good news @sayan-biswas @avinal @ramessesii2 @adambkaplan @pmacik @khrm the soak test completed over the weekend and both

let's please go ahead and review and get this PR ready for merge. Aside from fixing the intermittent canceled context errors, this change also allows us to better retry on retryable errrors.

thanks

gabemontero commented 7 months ago

/hold cancel

update PR description, issue fixed, release note, to address latest findings

tekton-robot commented 6 months ago

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 64.2% 55.7% -8.5
tekton-robot commented 5 months ago

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 55.3% 51.2% -4.1
tekton-robot commented 5 months ago

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 55.3% 51.2% -4.1
gabemontero commented 5 months ago

/hold

still waiting to see in our prod env if this switch is necessary based on how the current tkn based solution performs

tekton-robot commented 3 months ago

@gabemontero: PR needs rebase.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.