icerpc / icerpc-csharp

A C# RPC framework built for QUIC, with bidirectional streaming, first-class async/await, and Protobuf support.
https://docs.icerpc.dev
Apache License 2.0
92 stars 14 forks source link

Improve stream decoding memory allocations #1828

Open bentoi opened 1 year ago

bentoi commented 1 year ago
    In theory, you could decode elements one by one from the payload pipe reader. Right? Afaict, the only reason why we don't do it is to avoid creating a Slice decoder for each element to decode. 

So the idea here would be to find a compromise between too many array/list allocations and too many Slice decoder allocations.

The compromise could be to allocate a 1024 element array or list which is re-used to decode the elements. This way, we have a single array/list allocation and few Slice decoder allocations (which shouldn't be a big issue compared to many allocations for decoding the data).

It would be something like this:

while (true)
{
     ReadResult result = readFunc(payload);
     ReadOnlySequence<byte> buffer = result.Buffer;
     while(!buffer.IsEmpty)
     {
          IAsyncEnumerable<T> items;
          // decodeFunc decodes 1024 elements (or less) and returns a buffer that contains the unconsumed payload
          (items, buffer) = decodeFunc(buffer);
          foreach (T item in items)
          { 
                yield return item;
          }
     } 
     payload.AdvanceTo(result.Buffer.End);  
}

The decodeFunc method uses a fixed size array/list (created once) to decode up to 1024 elements.

_Originally posted by @bentoi in https://github.com/zeroc-ice/icerpc-csharp/pull/1811#discussion_r985474270_

bernardnormier commented 1 year ago

I find the proposed optimization too complicated.

I am also not a fan of baking-in values such as "1024 elements". If the elements are large, 1024 elements could be quite large (and much larger than the max of the underlying transport buffer, by default 64K bytes with Slic) and if the elements are small, this results in tiny arrays.

bentoi commented 1 year ago

If the elements are large, 1024 elements could be quite large (and much larger than the max of the underlying transport buffer, by default 64K bytes with Slic)

I don't understand, a ReadAsync will return a given number of encoded elements, just like today. We will just fill up the array with no more than 1024 elements but it will be less most of the time if elements are large. We don't wait to receive 1024 elements... If we receive 3 elements, we fill up the array with 3 elements and yield 3 elements from the array before reading more.

bernardnormier commented 1 year ago

The size of this array is 1024 * sizeof(element), where sizeof(element) depends on the element. Does not matter if you 1 or 1024 "live" elements in the array.

For example, if size(element) = 100, your array will use 100KB in memory, even if the max number of elements you ever decode together is 3.

bentoi commented 1 year ago

We could use a much smaller value such as 16. Or if it's still to large simply create a Slice decoder for decoding each element. We've optimized the Slice decoder allocation... it's not supposed to be expensive.