Closed gabemontero closed 2 months ago
The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage
to re-run this coverage report
File | Old Coverage | New Coverage | Delta |
---|---|---|---|
pkg/watcher/reconciler/dynamic/dynamic.go | 69.3% | 55.7% | -13.7 |
The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage
to re-run this coverage report
File | Old Coverage | New Coverage | Delta |
---|---|---|---|
pkg/watcher/reconciler/dynamic/dynamic.go | 69.3% | 55.7% | -13.7 |
The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage
to re-run this coverage report
File | Old Coverage | New Coverage | Delta |
---|---|---|---|
pkg/watcher/reconciler/dynamic/dynamic.go | 69.3% | 55.4% | -14.0 |
@sayan-biswas @avinal @enarha @ramessesii2 @adambkaplan FYI a first pass at replacing tkn cli with generic k8s get pod logs API calls
I could see this replacing #712 after a little more refining.... but curious on opinions / feedback
also really close to getting rid of launching the separate goroutine in sendLogs, but could also see deferring that to a follow pull request that also dives into verifying UpdateLogs completed in the API server to completely remove the intermittent canceled context UpdateLog instances we are seeing in Pavel's soak test. But I took a first pass at that today and that is going to require some unit test overall, as I've discovered the unit tests are not fully wired to return fake data.
The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage
to re-run this coverage report
File | Old Coverage | New Coverage | Delta |
---|---|---|---|
pkg/watcher/reconciler/dynamic/dynamic.go | 69.3% | 55.7% | -13.7 |
The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage
to re-run this coverage report
File | Old Coverage | New Coverage | Delta |
---|---|---|---|
pkg/watcher/reconciler/dynamic/dynamic.go | 69.3% | 61.7% | -7.7 |
The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage
to re-run this coverage report
File | Old Coverage | New Coverage | Delta |
---|---|---|---|
pkg/watcher/reconciler/dynamic/dynamic.go | 69.3% | 61.5% | -7.9 |
The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage
to re-run this coverage report
File | Old Coverage | New Coverage | Delta |
---|---|---|---|
pkg/watcher/reconciler/dynamic/dynamic.go | 69.3% | 61.5% | -7.9 |
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: sayan-biswas
The full list of commands accepted by this bot can be found here.
The pull request process is described here
Looks good to me. I ran it locally and the logs were stored perfectly. Although I did't check with bulk pipeline runs and larger logs.
I just have one comment on the log format.
*** Logs for pod test-pipelinerun-dd25b-task1-pod container step-print-uid ***
TaskRun UID: fd819238-26b8-46c7-9e2c-f31a41fe7bff
PipelineRun UID from params: 79e2c2d5-748b-4790-8ef1-eab31c833f10
*** Logs for pod test-pipelinerun-dd25b-task1-pod container step-print-names ***
Task name: test-pipelinerun-dd25b-task1
TaskRun name: test-pipelinerun-dd25b-task1
Pipeline name from params: test-pipelinerun-dd25b
PipelineRun name from params: test-pipelinerun-dd25b
Looks good to me. I ran it locally and the logs were stored perfectly. Although I did't check with bulk pipeline runs and larger logs.
I just have one comment on the log format.
*** Logs for pod test-pipelinerun-dd25b-task1-pod container step-print-uid *** TaskRun UID: fd819238-26b8-46c7-9e2c-f31a41fe7bff PipelineRun UID from params: 79e2c2d5-748b-4790-8ef1-eab31c833f10 *** Logs for pod test-pipelinerun-dd25b-task1-pod container step-print-names *** Task name: test-pipelinerun-dd25b-task1 TaskRun name: test-pipelinerun-dd25b-task1 Pipeline name from params: test-pipelinerun-dd25b PipelineRun name from params: test-pipelinerun-dd25b
@sayan-biswas is your question here ^^ separate from the one in https://github.com/tektoncd/results/pull/715#discussion_r1500558338 ?
If it is, apologies I'm missing what you are getting at, and ask that you elaborate either here or in slack/team meetings. Thanks.
@gabemontero No, the one the comment is what I was referring. Everything is working for me. Did not test memory footprint though.
@gabemontero No, the one the comment is what I was referring. Everything is working for me. Did not test memory footprint though.
OK thanks for confirming the comments refer to the same core notion @sayan-biswas
As to memory footprint, I am waiting to hear back from Pavel on when he can start a smoke test, so not yet.
This is a factor as to why I want to merge #712 first. We have validated the memory footprint is fixed there. Also, I want to see how this change looks in our prod env for a bit with more worker threads, particularly with the workqueue depth, so as to potentially compare with this change.
The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage
to re-run this coverage report
File | Old Coverage | New Coverage | Delta |
---|---|---|---|
pkg/watcher/reconciler/dynamic/dynamic.go | 69.3% | 56.1% | -13.2 |
/hold
merge after #712, but let's review get it ready to merge in parallel
The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage
to re-run this coverage report
File | Old Coverage | New Coverage | Delta |
---|---|---|---|
pkg/watcher/reconciler/dynamic/dynamic.go | 69.3% | 56.7% | -12.6 |
The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage
to re-run this coverage report
File | Old Coverage | New Coverage | Delta |
---|---|---|---|
pkg/watcher/reconciler/dynamic/dynamic.go | 69.3% | 55.0% | -14.3 |
/retest
The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage
to re-run this coverage report
File | Old Coverage | New Coverage | Delta |
---|---|---|---|
pkg/watcher/reconciler/dynamic/dynamic.go | 69.3% | 54.7% | -14.6 |
The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage
to re-run this coverage report
File | Old Coverage | New Coverage | Delta |
---|---|---|---|
pkg/watcher/reconciler/dynamic/dynamic.go | 69.3% | 53.7% | -15.7 |
The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage
to re-run this coverage report
File | Old Coverage | New Coverage | Delta |
---|---|---|---|
pkg/watcher/reconciler/dynamic/dynamic.go | 69.3% | 53.7% | -15.7 |
The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage
to re-run this coverage report
File | Old Coverage | New Coverage | Delta |
---|---|---|---|
pkg/watcher/reconciler/dynamic/dynamic.go | 69.3% | 53.7% | -15.7 |
/retest
Tests green at the moment with the tkn formatting, but there does seem to be some sort of intermittent flake in the vanilla e2e-tests I have not seen before today and the change to do tkn formatting.
=== RUN TestPipelineRun/check_annotations
e2e_test.go:238: Error waiting for PipelineRun creation: timed out waiting for the condition
=== RUN TestPipelineRun/check_deletion
e2e_test.go:254: Error waiting for PipelineRun deletion: timed out waiting for the condition
=== RUN TestPipelineRun/check_result
e2e_test.go:260: Result name not found
=== RUN TestPipelineRun/check_record
e2e_test.go:270: Record name not found
=== RUN TestPipelineRun/result_data_consistency
e2e_test.go:283: rpc error: code = InvalidArgument desc = rpc error: code = InvalidArgument desc = name must match (^[a-z0-9_-]{1,63})/results/([a-z0-9_-]{1,63}$)
--- FAIL: TestPipelineRun (120.07s)
--- FAIL: TestPipelineRun/check_annotations (60.02s)
--- FAIL: TestPipelineRun/check_deletion (60.02s)
--- SKIP: TestPipelineRun/check_result (0.00s)
--- SKIP: TestPipelineRun/check_record (0.00s)
--- FAIL: TestPipelineRun/result_data_consistency (0.00s)
let's review that change all the same @sayan-biswas , but I'll add some debug to that base e2e, see if I can uncover anything.
@gabemontero This is because the polling time set is too low. So before the TaskRun/PipelineRun is finishing, it is getting timeout and hence the test is failing. Sometimes the cluster is slow, so this happens intermittently.
https://github.com/tektoncd/results/blob/main/test%2Fe2e%2Fe2e_test.go#L145
@gabemontero This is because the polling time set is too low. So before the TaskRun/PipelineRun is finishing, it is getting timeout and hence the test is failing. Sometimes the cluster is slow, so this happens intermittently.
https://github.com/tektoncd/results/blob/main/test%2Fe2e%2Fe2e_test.go#L145
gotcha thanks for the history/insight and detail
if you want me to bump those various PollImmediate calls in that test as part of this, let me know
The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage
to re-run this coverage report
File | Old Coverage | New Coverage | Delta |
---|---|---|---|
pkg/watcher/reconciler/dynamic/dynamic.go | 64.2% | 55.8% | -8.4 |
rebase basically OK, but
I think there was a hiccup on the go mod download which impacted golint with the build-tests
re-running go mod tidy
and go mod vendor
produced no new changes in my branch
let's see what happens with
/retest
rebase basically OK, but
I think there was a hiccup on the go mod download which impacted golint with the build-tests
re-running
go mod tidy
andgo mod vendor
produced no new changes in my branchlet's see what happens with
/retest
flake confirmed
e2e's green
@sayan-biswas @avinal per our video conf today, feel free to provide feedback for changes here, or lgtm, as your availability allows. In the mean time, we'll start testing #712 on our end, and gather some base lines with increased worker thread pool, as well as revalidate the memory leak fix we confirmed in our performance stress test. Along with hopefully getting a performance stresss test with these changes soon, in order to helpe validate the merging of this path we all agree is better long term.
I'll remove the hold once all ^^ is sorted out.
thanks
So @pmacik's initial soak testing appears to uncover a new version of the goroutine leak (grpc client goroutines).
I suspect it is not too dissimilar what was was fixed with #712, but with the removal of the tkn client thread/channel/context machinations, we've fallen back into the warnings around https://pkg.go.dev/google.golang.org/grpc#ClientConn.NewStream
@pmacik's pprof data showed a lot of parked threads in one of the new connection goroutines waiting on context's be done.
I've started working on an update to this PR that both:
However, I appear to be hitting flakiness with the grpc unit test alternatives for file storage, even when stepping through the code. I can see repeated GetLogs fail when the data is clearly on my local disk.
When I next pick this up, I'll see about ignoring these errors in the unit tests, and see how the PR e2e's using kind do with everything.
@avinal @sayan-biswas @ramessesii2 FYI
The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage
to re-run this coverage report
File | Old Coverage | New Coverage | Delta |
---|---|---|---|
pkg/watcher/reconciler/dynamic/dynamic.go | 64.2% | 55.3% | -8.9 |
flake on kind install of results in integration test ... trying again
/retest
/retest
OK @sayan-biswas @avinal @ramessesii2 the switch from CloseSend
to CloseRecv
(which calls CloseSend under the covers and then does a Recv on a LogSummary to apparently confirm the operation was successful) passed e2e's.
I am next going to attempt to re-introduce a child context with timeout and calling of its cancel to hopefully 1) fully clean up the grpc goroutines currently blocked on context that Pavel has observed in the smoke test 2) but not cancel the context prematurely such that the UpdateLog call is interrupted with 'canceled context'
The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage
to re-run this coverage report
File | Old Coverage | New Coverage | Delta |
---|---|---|---|
pkg/watcher/reconciler/dynamic/dynamic.go | 64.2% | 55.7% | -8.5 |
OK success @sayan-biswas @avinal @ramessesii2 @adambkaplan @pmacik @khrm with https://prow.tekton.dev/view/gs/tekton-prow/pr-logs/pull/tektoncd_results/715/pull-tekton-results-integration-tests/1763567949735530496
This PR now:
newClientStreamWithParams
alone) that get created as part of making the UpdateLog
call to the results API server .... we'll need to confirm this with the next (and hopefully final for now) soak test round with @pmacikCloseSend
to CloseAndRecv
(which calls CloseSend
under the covers, but then makes a Recv
call on the LogSummary
obj which at least with the integration tests seem to confirm the UpdateLog
completes before our cancel func cleans things up; when we only used CloseSend
the UpdateLog
call would get aborted with a canceled context error
side bar - @khrm I'll be curious how your viper POC mirror / overlaps with the grpc machinations with the results api server
I'll work out with @pmacik on getting the next soak test run started, but ideally in parallel, let's get the review going here, assuming any requested changes are minor and don't alter the overall structure noted above in this comment.
Thanks
The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage
to re-run this coverage report
File | Old Coverage | New Coverage | Delta |
---|---|---|---|
pkg/watcher/reconciler/dynamic/dynamic.go | 64.2% | 55.7% | -8.5 |
OK success @sayan-biswas @avinal @ramessesii2 @adambkaplan @pmacik @khrm with https://prow.tekton.dev/view/gs/tekton-prow/pr-logs/pull/tektoncd_results/715/pull-tekton-results-integration-tests/1763567949735530496
This PR now:
* employs k8s pod log APIs with no goroutines launched by the watcher to allow for error inspection and intelligent retry * but employs tkn style line by line prefixing * it employs a child context with timeout and a call to its cancel func which _should_ cleanup all the grpc client goroutines (at least 4 by `newClientStreamWithParams` alone) that get created as part of making the `UpdateLog` call to the results API server .... we'll need to confirm this with the next (and hopefully final for now) soak test round with @pmacik * but the switch from `CloseSend` to `CloseAndRecv` (which calls `CloseSend` under the covers, but then makes a `Recv` call on the `LogSummary` obj which at least with the integration tests seem to confirm the `UpdateLog` completes before our cancel func cleans things up; when we only used `CloseSend` the `UpdateLog` call would get aborted with a `canceled context error`
side bar - @khrm I'll be curious how your viper POC mirror / overlaps with the grpc machinations with the results api server
I'll work out with @pmacik on getting the next soak test run started, but ideally in parallel, let's get the review going here, assuming any requested changes are minor and don't alter the overall structure noted above in this comment.
Thanks
good news @sayan-biswas @avinal @ramessesii2 @adambkaplan @pmacik @khrm the soak test completed over the weekend and both
let's please go ahead and review and get this PR ready for merge. Aside from fixing the intermittent canceled context errors, this change also allows us to better retry on retryable errrors.
thanks
/hold cancel
update PR description, issue fixed, release note, to address latest findings
The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage
to re-run this coverage report
File | Old Coverage | New Coverage | Delta |
---|---|---|---|
pkg/watcher/reconciler/dynamic/dynamic.go | 64.2% | 55.7% | -8.5 |
The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage
to re-run this coverage report
File | Old Coverage | New Coverage | Delta |
---|---|---|---|
pkg/watcher/reconciler/dynamic/dynamic.go | 55.3% | 51.2% | -4.1 |
The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage
to re-run this coverage report
File | Old Coverage | New Coverage | Delta |
---|---|---|---|
pkg/watcher/reconciler/dynamic/dynamic.go | 55.3% | 51.2% | -4.1 |
/hold
still waiting to see in our prod env if this switch is necessary based on how the current tkn based solution performs
@gabemontero: PR needs rebase.
rh-pre-commit.version: 2.2.0 rh-pre-commit.check-secrets: ENABLED
Changes
Fixes https://github.com/tektoncd/results/issues/720
alternative to https://github.com/tektoncd/results/pull/712 with regard to still addressing the goroutine/memory leak, swtiching from the tkn client to k8s client, but with the reduction of threads, allows for retry when appropriate on UpdateLog errors.
/kind bug
These changes address a goroutine/memory leak that existed with the prior main branch vesrion,
Additionally
Submitter Checklist
These are the criteria that every PR should meet, please check them off as you review them:
/kind <type>
. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tepRelease Notes