microsoft / Microsoft.IO.RecyclableMemoryStream

A library to provide pooling for .NET MemoryStream objects to improve application performance.
MIT License

For many small stream writes, RecyclableMemoryStream is slower than other implementations #339

Open mregen opened 7 months ago

mregen commented 7 months ago

For a project that implements its own buffered memory stream, I added the int Read(Span<byte> buffer) and void Write(ReadOnlySpan<byte> buffer) overloads and compared the performance of MemoryStream, ArraySegmentStream (a custom implementation based on ArrayPool), and RecyclableMemoryStream.

To my surprise, RecyclableMemoryStream is consistently outperformed by MemoryStream and ArraySegmentStream by a factor of two, as if Write(ReadOnlyMemory<byte>) were not properly implemented, or were disabled, in RecyclableMemoryStream?

See the benchmark results here: https://github.com/OPCFoundation/UA-.NETStandard/pull/2556

Compare the results of BinaryEncoderRecyclableMemoryStream against BinaryEncoderArraySegmentStream and BinaryEncoderMemoryStream.

Any idea what could be wrong with the benchmark, or why RecyclableMemoryStream.Write is always slower?
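
For reference, the shape of the scenario being measured looks roughly like this (an illustrative BenchmarkDotNet sketch, not the actual benchmark code from the PR; class and method names are hypothetical):

```csharp
using System.IO;
using BenchmarkDotNet.Attributes;
using Microsoft.IO;

// Illustrative small-write scenario: many 4-byte writes via BinaryWriter.
public class SmallWriteBenchmark
{
    private static readonly RecyclableMemoryStreamManager Manager = new();

    [Benchmark(Baseline = true)]
    public long MemoryStreamSmallWrites()
    {
        using var stream = new MemoryStream();
        using var writer = new BinaryWriter(stream);
        for (int i = 0; i < 10_000; i++) writer.Write(i);
        return stream.Length;
    }

    [Benchmark]
    public long RecyclableStreamSmallWrites()
    {
        using var stream = Manager.GetStream();
        using var writer = new BinaryWriter(stream);
        for (int i = 0; i < 10_000; i++) writer.Write(i);
        return stream.Length;
    }
}
```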

sgorozco commented 7 months ago

Ben surely has the definitive answer here.

I can share some of my findings. A few months ago I profiled a scenario where many little chunks were written to the stream by a BinaryWriter instance. The profiler flagged as the hottest path the logic inside RecyclableMemoryStream that determines which reusable memory block(s) should receive the write, which in essence involves a division to find the block, followed by a modulus operation to find the offset within the block (both operations are relatively expensive, even on modern CPUs). A later revision changed this to take advantage of Math.DivRem, which returns the quotient and the remainder from a single call; unfortunately, this is not as fast as it could be. Behind the scenes, some versions of the runtime still perform a division followed by a modulus operation, while others are a little faster, computing the remainder with a multiplication and a subtraction. There is an open request for an optimization (on x86, for example, the IDIV instruction returns the quotient and the remainder in a single operation), but it seems the design of the IL jitter makes capturing both results difficult to implement: https://github.com/dotnet/runtime/issues/4155.
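
In essence, the per-write lookup amounts to something like this (a simplified sketch assuming a fixed block size; the library's actual implementation differs):

```csharp
using System;

const int BlockSize = 128 * 1024; // assumed fixed block size

// Map an absolute stream position to (block index, offset within block).
// Math.DivRem yields quotient and remainder from one call, but on some
// runtimes it still JITs to a division plus a modulus operation.
static (int Block, int Offset) GetBlockAndRelativeOffset(long position)
{
    long block = Math.DivRem(position, BlockSize, out long offset);
    return ((int)block, (int)offset);
}
```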

Perhaps this could be optimized for the case of sequential writes (by tracking the current write position and the remaining available bytes?), but in my use cases the relative "slowness" of the write operations is more than compensated for by the massive drop in GC pressure.

Our internal RPC framework became virtually allocation-free on both ends (as well as copy-free, since packet interleaving for channel multiplexing writes the recyclable blocks directly to the socket), thanks to RecyclableMemoryStream.


mregen commented 7 months ago

Thanks @sgorozco, this is in fact what's happening here: a BinaryWriter writes many little chunks (e.g. int, byte, etc.) for an OPC UA binary encoder. The Write implementation should only have to copy the small chunks into the buffer and keep two indexes (one for the current buffer, one for the write position within that buffer). An overflow may split the write and allocate a new buffer, or move to the next one. That's it. I would be interested to see why a division is necessary.
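
In other words, something like this sketch (hypothetical names, not the library's internals):

```csharp
using System;
using System.Collections.Generic;

// Hypothetical division-free sequential write: cache the current block and
// the write offset within it; only advance when a block fills up.
class BlockWriter
{
    private const int BlockSize = 128 * 1024;       // assumed block size
    private readonly List<byte[]> blocks = new();
    private byte[] currentBlock = new byte[BlockSize];
    private int blockOffset;                        // write cursor in block

    public void Write(ReadOnlySpan<byte> buffer)
    {
        while (!buffer.IsEmpty)
        {
            int remaining = currentBlock.Length - blockOffset;
            if (remaining == 0)
            {
                blocks.Add(currentBlock);           // block full: start a new one
                currentBlock = new byte[BlockSize]; // (a pool would rent here)
                blockOffset = 0;
                continue;
            }

            int toCopy = Math.Min(remaining, buffer.Length);
            buffer.Slice(0, toCopy).CopyTo(currentBlock.AsSpan(blockOffset));
            blockOffset += toCopy;
            buffer = buffer.Slice(toCopy);
        }
    }
}
```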

Update -- I checked:

--> GetBlockAndRelativeOffset uses a division.

I guess it's a design decision whether to make Write fast or the position calculation fast, i.e. what to use as the source of truth. My expectation was that Write would be almost as fast as the Write in the MemoryStream base class. Wishful thinking :-)

mregen commented 7 months ago

As an additional side note, it turns out that in my JSONEncoder tests, where writing goes through a StreamWriter instead of the BinaryWriter above, RecyclableMemoryStream outperforms all the other stream implementations... this is currently our main use case, so technically I'm a happy camper with the current implementation.
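
A plausible reason (my assumption, not verified): StreamWriter accumulates characters in its own internal buffer (~1 KB by default) and only issues larger Write calls to the underlying stream, so the per-write block lookup is amortized over far fewer, bigger writes:

```csharp
using System.IO;
using System.Text;
using Microsoft.IO;

var manager = new RecyclableMemoryStreamManager();

using var stream = manager.GetStream();
using var writer = new StreamWriter(stream, Encoding.UTF8, bufferSize: 1024);

writer.Write("{\"value\":42}"); // lands in StreamWriter's internal buffer
writer.Flush();                 // one larger Write reaches the stream
```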

benmwatson commented 7 months ago

Yeah, the issue with lots of small writes has come up a few times before. Unfortunately, the library just wasn't optimized for that scenario. The workaround is to operate directly on the buffer, but not all APIs support that, and it does somewhat defeat the purpose of using a stream.
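
For example, something along these lines (a sketch; note that GetBuffer() on a multi-block stream may coalesce the blocks into a single large pooled buffer):

```csharp
using Microsoft.IO;

var manager = new RecyclableMemoryStreamManager();

// Workaround sketch: reserve capacity, then write into the backing buffer
// directly instead of paying the block lookup on every small Write call.
using var stream = manager.GetStream();
stream.SetLength(1024);              // reserve space up front
byte[] buffer = stream.GetBuffer();  // contiguous view over the data
buffer[0] = 0x42;                    // direct writes, no per-call overhead
```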

A way to fix this would be to change the source of truth for the current position. Currently, we store a long position that we then convert to block/offset on every necessary operation, which involves, as you noted, two divisions. We could change this to track the block/offset instead, and then, when we need the position, do the multiply+add, which should be much faster. Alternatively, we could track both values and keep them in sync, but I suspect this doesn't offer any additional benefit (we'd be doing the multiply+add anyway, or a lot of individual adds).
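
Roughly like this (a sketch with hypothetical names, not the current implementation):

```csharp
using System;

// Sketch: track (blockIndex, blockOffset) as the source of truth and
// derive the long position on demand with a multiply and an add.
class BlockCursor
{
    private readonly int blockSize;
    private int blockIndex;   // which block the cursor is in
    private int blockOffset;  // offset within that block

    public BlockCursor(int blockSize) => this.blockSize = blockSize;

    public long Position
    {
        get => (long)blockIndex * blockSize + blockOffset;
        set
        {
            // A Seek pays the DivRem once, instead of every Read/Write.
            blockIndex = (int)Math.DivRem(value, blockSize, out long offset);
            blockOffset = (int)offset;
        }
    }
}
```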

I'd have to think about any negative performance ramifications (e.g., calling Position becomes much more expensive), but perhaps it's not so bad. I'll ponder it.