gabemontero commented 2 weeks ago

Expected Behavior

Storage of logs succeeds at a high success rate with reasonable performance

Actual Behavior

In a sufficiently sized production system, addressing

the watcher channel/thread leak by not cancelling context properly
but allowing log storage to complete before before properly cancelling the context to free up storage
yet avoid HTTP2 and GRPC throttling which can force log storage to take many MINUTES

is untenable with the current design.

2 months of performance analysis of GRPC and golang's HTTP2 have uncovered repeatedly report issues on the internet about HTTP2 Request Body access not scaling / becoming increadibly slow, as well GRPC level throttling which the current tuning options in the golang GRPC code have yet to resolve.

We have a prototype already that showed improvement wrt the HTTP2 body scaling issue, where we did pod log retrieval in the API server and then stored to S3 there, but it was still slowed by GRPC throttling on the background threads, leading to long delays before logs were actually stored.

The WG has discussed on calls and in slack, where we are close to high level agreement on doing log storage in the watcher, and only update the Log Records on success / failure.

@khrm @enarha @sayan-biswas @avinal FYI

enarha commented 1 week ago

| where we are close to high level agreement on doing log storage in the watcher @gabemontero did you mean the watcher or the api?

And on a different note, since you've already spent a lot of time testing, could you please share what's the easiest way to reproduce the issue, e.g local Kind environment, how and what pipelineruns are you running, etc?

gabemontero commented 4 days ago

| where we are close to high level agreement on doing log storage in the watcher @gabemontero did you mean the watcher or the api?

per recent exchange with @khrm @sayan-biswas I mean the watcher @enarha .... we need to avoid GRPC calls where we obtain / wait on status like we have to do with UpdateLog, and thus need to interact with s3, etc. from our current watcher process. We may have to do a vanilla GRPC get or two for metadata, but and the idea of even bypassing that to scaffold the metadata needed from what is already available in the watcher to engage with the external storage seemed feasible upon the last code review. If not, those vanilla GRPC gets will scale better than having to confirm the S3 storage succeeded on the API server side.

And on a different note, since you've already spent a lot of time testing, could you please share what's the easiest way to reproduce the issue, e.g local Kind environment, how and what pipelineruns are you running, etc?

I think deployment of tekton results is up to the implementor. Turns out deploying on openshift with our bootstrapping of minio and postgresql was easiest for me, so that is what I did. But I'm sure others might find using kind etc. easier.

Between @pmacik and I we had two layers of testing. Mine was a simple pipelinerun and bash script which I'll attach here. That definitely exhibited the http2 body ready bottleneck, which I think would be a sufficient unit test for this.

@pmacik 's test employed a much more sophisticated pipeline used at Red Hat's internal deployment of Results ("Konflux"). In addition to the http2 body bottleneck, he could also exhibit the massive scaling issues with the internal GRPC WriteStatus function which we uncovered after adding debug into the GRPC code in this repo's vendor tree. I'll post a duplication of my internal report in this issue in a separate comment.

Ultimately, once a developer has performance the unit test level validation with my test case, I would suggest the Results WG reach out to @pmacik to see about running his test with a version of these changes. If/when that point is reached, I can point you to the branches our our internal deployment we used for @pmacik's testing.

results-763.zip

gabemontero commented 4 days ago

A copy/paste of analysis I shared internally at RH (with some removal of internal links):

OK verion 3 of the mem leak fix, which still makes GRPC calls to send logs, but does it all thread, with high timeouts, has been running in stage for 10 days.

While it is tolerable performance wise vs. the on thread version, but is still 3X slower, and the success rate is only on par at best (i.e. cancelled context aborts, etc) then the current version on konflux with the mem leak which does not properly handle context.

The upstream WG, including Khurram Baig from openshift pipelines, agrees the function needs to be rewritten. I was able to do a prototype where GRPC was only used as as signal for the API server to read the logs, but while that was an improvement, still hlt GRPC / HTTP2 throttling / bottlenecks in my testing. We all agree now that GRPC needs to be bypassed, and have the watcher access S3 directly. I have opened https://github.com/tektoncd/results/issues/763 upstream to track.

On the performance problems, first, wrt the http2 body read bottleneck, the performance team lined me up with a golang performance engineer who I have been working with in https://redhat-internal.slack.com/archives/C060Z6FG7UK/p1718108397779749 . We have confirmed the http2 body performance is well documented on the web, across other languages, but agreed it investigation at the golang level was warranted. After about a week of testing and customized golang code, we still don't have any data at the golang level. Bottom line, though I think at best we would just confirm what we read on the web across various languanges wrt http2 body access, flow control, etc.

Next, with the continued efforts between myself and Pavel Macik I believe we have made some headway on the 9 plus minute delays he saw with the "GGMGGM16" logs out of https://github.com/gabemontero/results/tree/downstream-http-vs-grpc with the lates round of debug. Contact Pavel (@pmacik) if you need access to his spreadsheet.

The GGMGGM16, GGMGGM20, GGMGGM31, and GGMGGM33 logs all hit the 10 minute mark (600 seconds) in duration when the pipelineruns all complete concurently and write their logs. The basic call stack is GGMGGM16, GGMGGM20, then GGMGGM33, then GGMGGM31.

https://github.com/gabemontero/results/blob/downstream-http-vs-grpc/vendor/google.golang.org/grpc/internal/transport/handler_server.go#L470 is GGMGGM16
https://github.com/gabemontero/results/blob/downstream-http-vs-grpc/vendor/google.golang.org/grpc/internal/transport/handler_server.go#L569 is GGMGGM20

https://github.com/gabemontero/results/blob/downstream-http-vs-grpc/vendor/google.golang.org/grpc/server.go#L1735 is GGMGGM33 https://github.com/gabemontero/results/blob/downstream-http-vs-grpc/vendor/google.golang.org/grpc/internal/transport/handler_server.go#L233 is GGMGGM31

Now, there were no GGMGGM34 logs https://github.com/gabemontero/results/blob/downstream-http-vs-grpc/vendor/google.golang.org/grpc/internal/transport/handler_server.go#L244 in Pavel's test run.

And that runStream is over at https://github.com/gabemontero/results/blob/downstream-http-vs-grpc/vendor/google.golang.org/grpc/internal/transport/handler_server.go#L578-L586 where the fn() call at https://github.com/gabemontero/results/blob/downstream-http-vs-grpc/vendor/google.golang.org/grpc/internal/transport/handler_server.go#L582 maps to various Writes I have instrumented, which then happen on the thread GRPC launches, where it wait for completion.

There are GGMGGM 24 logs from https://github.com/gabemontero/results/blob/downstream-http-vs-grpc/vendor/google.golang.org/grpc/internal/transport/handler_server.go#L382 .... we are writing data there .... I see as long as 2 minutes there

iteration-0002-0075.tekton-results-api.logs:[pod/tekton-results-api-5b668c456c-s487j/api] GGMGGM24 serverHandlerTransport Write do Write data 2m0.299206724s ts 1718897216.849309923

There are GGMGGM25 logs from https://github.com/gabemontero/results/blob/downstream-http-vs-grpc/vendor/google.golang.org/grpc/internal/transport/handler_server.go#L389 but a lot of those are in nanoseconds, like

iteration-0002-0075.tekton-results-api.logs:[pod/tekton-results-api-5b668c456c-s487j/api] GGMGGM25 serverHandlerTransport Write do Write data 100ns ts 1718896109.647620007

But I now see a bug where I am printing the Write duration (duration4) vs. the Flush duration (durationa5) ... bottom line, the Flush can take time.

There are no GGMGGM26 through GGMGGM30 logs for processing of the headers. So it is just the body of the response, which gets back to our http2 body access (in this case writes vs. reads).

Also, we don't hit the transport is closed do metrics bit at https://github.com/gabemontero/results/blob/downstream-http-vs-grpc/vendor/google.golang.org/grpc/internal/transport/handler_server.go#L291-L299 because the server stays up

Lastly, back to the WriteStatus function in i.e. GGMGGM31 and look at the whole function, https://github.com/gabemontero/results/blob/downstream-http-vs-grpc/vendor/google.golang.org/grpc/internal/transport/handler_server.go#L225-L302

It is single threaded.

That can only mean we are blocked on the mutex at https://github.com/gabemontero/results/blob/downstream-http-vs-grpc/vendor/google.golang.org/grpc/internal/transport/handler_server.go#L291-L299

i.e. this server when writing status appears SINGLE THREADED !!!

Further proof:

see the comment https://github.com/gabemontero/results/blob/downstream-http-vs-grpc/vendor/google.golang.org/grpc/internal/transport/handler_server.go#L156-L159 ... the channel gets fed from https://github.com/gabemontero/results/blob/downstream-http-vs-grpc/vendor/google.golang.org/grpc/internal/transport/handler_server.go#L220 i.e.  the "do" function in https://github.com/gabemontero/results/blob/downstream-http-vs-grpc/vendor/google.golang.org/grpc/internal/transport/handler_server.go#L238 .... so do function may be short, but they run serialized from the channel
only 1 server per https://github.com/gabemontero/results/blob/downstream-http-vs-grpc/cmd/api/main.go#L225 (that is the GGMGGM31 print per https://github.com/gabemontero/results/blob/downstream-http-vs-grpc/vendor/google.golang.org/grpc/server.go#L1727-L1736
and only 1 server transport per https://github.com/gabemontero/results/blob/downstream-http-vs-grpc/vendor/google.golang.org/grpc/server.go#L1070 for the Server from ^^^
this is why the GRPC thread pooling and thread launching does not help in out testing ... that is outside this single threaded write status

gabemontero commented 4 days ago

Also for reference, the performance issues with GRPC's ServerHTTP appear to be old news. For example, see https://github.com/grpc/grpc-go/issues/586

Results has to use ServerHTTP to provide vanilla HTTP(2) Rest access

Also, see the query https://github.com/grpc/grpc-go/pulls?q=is%3Apr+is%3Aclosed+performance+in%3Atitle for other forays into this topic over the years.

As of this comment, there a no more open issues related to performance.

Lastly, none of the tuning options I found like https://github.com/gabemontero/results/blob/downstream-http-vs-grpc/vendor/google.golang.org/grpc/server.go#L163-L178 (some of which are mentioned in those closed performance issues) provided any benefit our testing.

tektoncd / results

log storage feature needs to be rewritten #763

Expected Behavior

Actual Behavior