grpc / grpc-go

The Go language implementation of gRPC. HTTP/2 based RPC
https://grpc.io
Apache License 2.0
20.95k stars 4.35k forks source link

Infinite loop in bufWriter.Write() #7389

Closed veshij closed 1 month ago

veshij commented 3 months ago

What version of gRPC are you using?

v1.63.2

What version of Go are you using (go version)?

1.19.9

What operating system (Linux, Windows, …) and version?

Linux

What did you do?

We observed instances of an application with higher-than-expected CPU usage. Upon investigation following profile was collected:

Screenshot 2024-07-03 at 4 45 02 PM

Corresponding stack:

2 @ 0xaae73a 0xaae452 0xa597e2 0xa5ad6a 0xa8eee7 0xa8ed32 0xa8ccb9 0xaa43b4 0x5b5dc1
#       0xaae739        google.golang.org/grpc/internal/transport.(*bufWriter).flushKeepBuffer+0x139    external/org_golang_google_grpc/internal/transport/http_util.go:355
#       0xaae451        google.golang.org/grpc/internal/transport.(*bufWriter).Write+0x251              external/org_golang_google_grpc/internal/transport/http_util.go:338
#       0xa597e1        golang.org/x/net/http2.(*Framer).endWrite+0xc1                                  external/org_golang_x_net/http2/frame.go:366
#       0xa5ad69        golang.org/x/net/http2.(*Framer).WriteDataPadded+0x389                          external/org_golang_x_net/http2/frame.go:685
#       0xa8eee6        golang.org/x/net/http2.(*Framer).WriteData+0x586                                external/org_golang_x_net/http2/frame.go:643
#       0xa8ed31        google.golang.org/grpc/internal/transport.(*loopyWriter).processData+0x3d1      external/org_golang_google_grpc/internal/transport/controlbuf.go:973
#       0xa8ccb8        google.golang.org/grpc/internal/transport.(*loopyWriter).run+0xb8               external/org_golang_google_grpc/internal/transport/controlbuf.go:558
#       0xaa43b3        google.golang.org/grpc/internal/transport.NewServerTransport.func2+0xf3         external/org_golang_google_grpc/internal/transport/http2_server.go:335

We suspect that under certain conditions the following code might end up in infinite loop:

    for len(b) > 0 {
        nn := copy(w.buf[w.offset:], b)
        b = b[nn:]
        w.offset += nn
        n += nn
        if w.offset >= w.batchSize {
            err = w.flushKeepBuffer()
        }
    }

Conditions to get into infinite loop state:

It seems to be safe to terminate the for len(b) > 0 cycle on first received error from flushKeepBuffer().

diff --git a/internal/transport/http_util.go b/internal/transport/http_util.go
index 39cef3bd..5258d386 100644
--- a/internal/transport/http_util.go
+++ b/internal/transport/http_util.go
@@ -335,7 +335,9 @@ func (w *bufWriter) Write(b []byte) (n int, err error) {
        w.offset += nn
        n += nn
        if w.offset >= w.batchSize {
-           err = w.flushKeepBuffer()
+           if err = w.flushKeepBuffer(); err != nil {
+               return n, err
+           }
        }
    }
    return n, err
arjan-bal commented 3 months ago

@veshij thank you for reporting this issue! It does seem like the situation you mentioned will always arise when w.conn.Write(w.buf[:w.offset]) returns an error and there is more data to write than the buffer capacity.

Are you interested in contributing a PR with the fix that also contains a unit test? If not, I'll pick it up.

veshij commented 3 months ago

I can take a look, not sure how to implement test yet.

Oleg Guba

Чт, 4 июля 2024 г. в 09:33, Arjan Singh Bal @.***>:

Assigned #7389 https://github.com/grpc/grpc-go/issues/7389 to @veshij https://github.com/veshij.

— Reply to this email directly, view it on GitHub https://github.com/grpc/grpc-go/issues/7389#event-13397033814, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABOURUMKSH5U4DIJDJDSYYTZKV2NPAVCNFSM6AAAAABKKPZOGOVHI2DSMVQWIX3LMV45UABCJFZXG5LFIV3GK3TUJZXXI2LGNFRWC5DJN5XDWMJTGM4TOMBTGM4DCNA . You are receiving this because you were assigned.Message ID: @.***>

arjan-bal commented 3 months ago

You can unit test only the bufWriter struct by passing it a conn that returns an error. Then call buffWriter.Write() with a sufficiently large buffer. If the test gets stuck in an infinite loop as we suspect in the current state, and returns error with the fix made, I think it should be good enough.

arjan-bal commented 3 months ago

Possible root cause is this change from 6 years ago: https://github.com/grpc/grpc-go/pull/2092

veshij commented 3 months ago

You can unit test only the bufWriter struct by passing it a conn that returns an error. Then call buffWriter.Write() with a sufficiently large buffer. If the test gets stuck in an infinite loop as we suspect in the current state, and returns error with the fix made, I think it should be good enough.

thank you, working on the PR

veshij commented 3 months ago

Created https://github.com/grpc/grpc-go/pull/7394/files Noticed that bytes written (n) returned in case of an error is incorrect in existing implementation, but probably worth fixing in a separate PR.

github-actions[bot] commented 2 months ago

This issue is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.