grpc / grpc-java

The Java gRPC implementation. HTTP/2 based RPC
https://grpc.io/docs/languages/java/
Apache License 2.0

Slowness with large payloads #2151

Open carl-mastrangelo opened 8 years ago

carl-mastrangelo commented 8 years ago

When reading a proto off the wire, Netty passes the chunks of data read up to the message deframer, which stores them in a composite byte buffer. The deframer passes this composite as an InputStream to CodedInputStream for decoding, which itself makes many copies.

It seems like it would be more efficient to pre-allocate a buffer of the appropriate size (since we know the message length) and append each chunk Netty passes up to it. This would still be one copy, but it would spare CodedInputStream from copying in its inefficient manner.

This would also free up the buffers Netty uses to read chunks of data off of the wire more quickly.
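The idea above can be sketched roughly as follows. This is an illustrative class, not a grpc-java API; the name and methods are hypothetical, and it uses a plain NIO ByteBuffer to stand in for the pre-allocated destination:

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: instead of wrapping many chunks in a composite
// buffer, pre-allocate one buffer of the known message length and copy
// each chunk into it as Netty delivers it. The source chunk can then be
// released immediately, and the decoder sees one contiguous buffer.
public class PreallocatingDeframer {
    private final ByteBuffer message;

    PreallocatingDeframer(int messageLength) {
        // One allocation up front, sized from the length-prefixed frame header.
        this.message = ByteBuffer.allocate(messageLength);
    }

    // Called once per chunk read off the wire.
    void append(byte[] chunk, int offset, int len) {
        message.put(chunk, offset, len);
    }

    // A contiguous buffer lets the decoder parse without further copies.
    ByteBuffer complete() {
        message.flip();
        return message;
    }
}
```

The single up-front copy replaces the many internal copies CodedInputStream would otherwise make when reading across chunk boundaries of a composite buffer.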

Discovered when trying to max out a very high speed network link.

pgrosu commented 8 years ago

Are the copies necessary, or are they just there to process the message? If the message only needs to be processed once and not in multiple ways, why not pass by reference and avoid the copy? This would require some rewriting, but it would eliminate the redundant copying and additional memory allocation. Any transformation would then happen on the message object itself.

carl-mastrangelo commented 8 years ago

@pgrosu It isn't that simple, there are tradeoffs, and this isn't really an ideal place to discuss it. This github issue is for tracking it.

If you would like to discuss the approaches available, the best way would be in the user forum here: https://groups.google.com/forum/#!forum/grpc-io

louiscryan commented 8 years ago

Paul,

Knowing if you can pass-by-reference would require knowledge of the application behavior. If the application can guarantee that the message is (a) handled quickly and (b) explicitly released after use then using a zero-copy technique would be a boon. This is something we intend to work on but it will take some time.

In protobuf itself there are several things we can do without introducing zero-copy and we should tackle those first.
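The explicit-release contract described above can be modeled with a simple reference count. This is an illustrative sketch, not a grpc-java or Netty API (though Netty's ReferenceCounted interface follows the same pattern); the class name is hypothetical:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the zero-copy contract: the transport hands the application a
// reference to its buffer instead of a copy, and the application must
// release it explicitly and promptly so the buffer can be reused.
public class RefCountedMessage {
    private final byte[] data;          // shared with the transport, never copied
    private final AtomicInteger refs = new AtomicInteger(1);

    RefCountedMessage(byte[] data) { this.data = data; }

    byte[] data() {
        if (refs.get() <= 0) {
            throw new IllegalStateException("already released");
        }
        return data;
    }

    // Returns true once the last reference is gone and the transport may
    // reclaim the buffer. Forgetting this call leaks the buffer, which is
    // exactly why the application-level guarantee is needed.
    boolean release() {
        return refs.decrementAndGet() == 0;
    }
}
```

The cost of this design is that a handler which holds the message too long, or never releases it, stalls or leaks transport buffers, which is why it cannot be enabled without the application opting in.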


carl-mastrangelo commented 8 years ago

@ejona86: running with a larger message quantum did not affect throughput.

pgrosu commented 8 years ago

@carl-mastrangelo I see. Issue https://github.com/grpc/grpc-java/issues/2139 turned into a nice discussion, and you propose a fantastic idea: we can involve the community in figuring out ways to optimize Netty and the protobuf processing, including any related layer-specific components.

Louis, I agree, but why rely on a potential guarantee when we can gather more minds to explore improvements to the implementations? That's why functional programming was invented: let the function pass over the data if that proves faster.

Again, you guys did an amazing job here, but there is a huge collection of individuals in our community, and by putting our heads together I'm sure we can come up with beneficial possibilities for the protobuf and Netty portions, as well as the layer-specific components.

buchgr commented 8 years ago

@carl-mastrangelo I think what you are saying makes a lot of sense.

Regarding the message quantum, I assume you are talking about the StreamByteDistributor (or whatever it's called in the HTTP/2 codec). Did you also increase the send and receive buffers and the maximum buffer size in AdaptiveRecvByteBufAllocator? The idea is to size those a bit larger than the message, so that on a good network or over loopback Netty allocates fewer buffers per read (ideally one per message). That's obviously not a fix, but it might be interesting to see how throughput is affected when only a single ByteBuf is allocated per message, so we know what kind of improvement to expect at most from a fix.
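The experiment suggested here might look something like the following fragment, assuming a roughly 1 MiB payload; the address, port, and chosen sizes are illustrative, and this would sit inside whatever method builds the channel:

```java
import io.grpc.netty.NettyChannelBuilder;
import io.netty.channel.AdaptiveRecvByteBufAllocator;
import io.netty.channel.ChannelOption;

// Sizing the socket buffers and the allocator's maximum above the message
// size lets Netty hand the deframer fewer, larger chunks per read.
int messageSize = 1024 * 1024;            // assumed payload size
int bufferSize = messageSize + 4096;      // a bit larger than the message

NettyChannelBuilder builder = NettyChannelBuilder.forAddress("localhost", 50051)
    .withOption(ChannelOption.SO_RCVBUF, bufferSize)
    .withOption(ChannelOption.SO_SNDBUF, bufferSize)
    .withOption(ChannelOption.RCVBUF_ALLOCATOR,
        // min 64 KiB, start at the message size, cap a bit above it
        new AdaptiveRecvByteBufAllocator(64 * 1024, bufferSize, bufferSize));
```

As noted, this only reduces how many ByteBufs feed the composite buffer per message; it does not remove the downstream copies.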

xfxyjwf commented 7 years ago

@pherl

carl-mastrangelo commented 7 years ago

cc: @jayantkolhe