nats-io / nats.net

Full Async C# / .NET client for NATS
https://nats-io.github.io/nats.net
Apache License 2.0
252 stars 51 forks source link

Reconsider serialization approach #137

Closed jasper-d closed 11 months ago

jasper-d commented 1 year ago

What motivated this proposal?

Currently, publish and subscribe API is rather opaque in terms of serialization.

When publishing some already serialized data (e.g. byte[], Memory<byte>, ReadOnlyMemory<byte>), clients have the following options:

  1. Pass byte[] as data to PublishAsync<T>()
  2. Wrap binary data in ReadOnlySequence<byte> and pass it to PublishAsync()
  3. Pass binary data to PublisAsync<T>() and provide a custom serializer in NatsPubOpts that can handle the type.

The problem here is that code such as follows compiles, but fails at runtime:

public async Task Foo(NatsConnection conn){
    Memory<byte> bytes = new byte[32];
    await conn.PublishAsync("some.subject", bytes);
}

That is because the generic overload of PublishAsync is chosen which then calls into NatsJsonSerializer which doesn't handle [ReadOnly]Memory<byte>. byte[] works because the JSON serializer apparently handles it, but it is at least weird to call into a JSON serializer to copy some byte array.

See https://github.com/nats-io/nats.net.v2/pull/136 for illustration of the problem.

What is the proposed change?

I think there are two options to improve the situation here:

  1. Provide overloads for commonly used types that would be preferred during overload resolution, i.e.
    public interface INatsConnection {
    +    ValueTask PublishAsync(string subject, byte[] payload, NatsHeaders? headers = default, string? replyTo = default, NatsPubOpts? opts = default, CancellationToken cancellationToken = default);
    +    ValueTask PublishAsync(string subject, Memory<byte> payload, NatsHeaders? headers = default, string? replyTo = default, NatsPubOpts? opts = default, CancellationToken cancellationToken = default);
    +    ValueTask PublishAsync(string subject, ReadOnlyMemory<byte> payload, NatsHeaders? headers = default, string? replyTo = default, NatsPubOpts? opts = default, CancellationToken cancellationToken = default);
    }
  2. Distinguish the generic PublishAsync overload by forcing clients to provide an ISerializer<T> explicitly, remove NatsPubOpts.Serializer, and have a single overload which handles byte[] and [ReadOnly]Memory<byte>, i.e.

public record NatsPubOpts {

public interface INatsConnection {

Personally, I'd prefer 2., since it avoids overload resolution issues in the future and makes it explicit that a serializer is required. It is also more AOT friendly. NB: I have never really understood the motivation for having a connection wide serializer in NATS clients.

Subscriptions could use a similar approach.

Who benefits from this change?

Users wouldn't observe runtime exceptions for code that compiles. Possibly better performance when not invoking JSON serializer for byte[].

What alternatives have you evaluated?

No response

mtmk commented 1 year ago

This is excellent @jasper-d we were discussing serialization of bytes types in slack recently. (If you're not already on we'd love you to have you on https://slack.nats.io #dotnet channel btw)

We've been also discussing with @caleblloyd as well and I think we're pretty much on the same page with you with your option 2. Some questions if I may add to the discussion:

It's be great if we can have a good discussion as a community and reach a consensus.

caleblloyd commented 1 year ago

Was thinking something along the lines of this-

Get rid of:

Make a BytesSerialzier that handles ReadOnlySequence<byte>, ReadOnlyMemory<byte>, byte[]

This way any of the Byte types will work without having to add an overload for each specific byte type.

For chainable serializers, was thinking something like this:

Data Format Serailzier
Bytes and JSON (default) new BytesSerialzier(new JSONSerializer())
Bytes and Strings new BytesSerialzier(new StringSerializer())
Just Bytes new BytesSerialzier()
mnmr commented 1 year ago

I think there are two main design goals here:

  1. Doing what users expect (that is, only serialize things that need to be)
  2. Performance (avoid work/allocations that doesn't need doing)

[1] can be solved with overloads (targeting more types and/or using distinct method names). I'm not sure I like the idea of having bytes passing through a serializer to produce bytes, even if it is a no-op. It'll also be non-obvious (from the method signature) that it'll actually work.

I'd vote to keep the Serializer option on NatsOps, as that has better ergonomics and discoverability than having to set or pass it separately. I'm also not convinced that there is a use case where someone would want to change their serializer for a connection after creating it (but if it does they can easily create a new connection instead).

[2] As for performance, I think it is important to try to guide folks in the right direction. If the underlying code consuming the data has a preference (span, sequence, array, stream) that will be faster or less allocatey to use, then it should be easy for people to discover that this exists.

That may not be easy if all the overloads have the same method name, but it can be mitigated with docs and code comments (for intellisense). However, if different names are acceptable it becomes possible to group them by behavior:

This makes it obvious that the methods have different behaviors (names can likely be improved though).

Another option entirely, is to use a factory or builder to create the NatsConnection. Then if people specify .WithSerializer(...) they get a SerializingNatsConnection with appropriate methods and otherwise just get the raw connection without the generic overloads for PublishAsync. This is similar to how Pulsar does it (where you can specify a "schema" for your data).

caleblloyd commented 1 year ago

Overloading works on the Publish methods because a param types can be overloaded. It does not work for Subscribe or derivatives like Consume though because the only difference is the return type.

So if we want to support the same byte types consistently on Publish and Subscribe in a similar manner, it will either need to be done with Separate Method names or via Generics for all of these methods:

caleblloyd commented 1 year ago

Some good reading for potential byte return types in addition to byte[] that folks may want to use from Subscribe/Consume:

https://learn.microsoft.com/en-us/dotnet/standard/memory-and-spans/memory-t-usage-guidelines

mnmr commented 1 year ago

So if we want to support the same byte types consistently on Publish and Subscribe in a similar manner, it will either need to be done with Separate Method names or via Generics for all of these methods:

I don't really have a preference here, as long as the end result is performant and does the expected thing :)

mtmk commented 1 year ago

see also https://github.com/nats-io/nats.net.v2/pull/140#issuecomment-1747307102

jasper-d commented 1 year ago

Sorry, I'm slow to respond. And apologies, this is going to become a wall of text.

Do we want to remove the default serializer from the connection? Would that cause inconvenience for applications where there is only one serializer used especially greenfield apps.

Given that I never used the encoded connection in nats.net v1, I would vote for that. I find the global serializer is a extremely hard to discover API (it's not apparent from IntelliSense that it even exists). It fails at runtime instead of compile time because it's not type safe, and it does not compose when one needs to support more than one message type or add support for more types later on.

In addition, I don't think that the currently used default serializer can be made trim safe (c.f. #92).

Should we introduce a method on the connection to swap the serializer when needed e.g. INatsConnection.WithSerializer(INatsSerializer)?

I see great potential for concurrency issues and hard to debug bugs.

new BytesSerialzier(new StringSerializer())

What encoding would StringSerializer use?


I made some changes (incomplete, but core and js compile), just to get a better feeling for possible API changes.

Serialization:

Serialization is done using Action<T, IBufferWriter<byte> because I believe that has the best ergonomics. It doesn't compile if no serializer is given and IntelliSense/API docs will immediately tell a user to pass a serializer. It also works quite well with a number of serialization implementations and avoids boilerplate.

E.g. when serializing Protobuf (using Google's implementation) one would just need to pass (obj, bw) => obj.WriteTo(bw).

String serialization would be just as simple: (str, bw) => Encoding.ASCII.GetBytes(str, bw).

S.T.J is slightly more involved because of the disposable Utf8JsonWriter and the three parameter it needs (https://github.com/nats-io/nats.net.v2/compare/main...jasper-d:jasper-d/serialization#diff-d22e1e76d04ff13d307301f8a762b8e01e37b8171d22af8c9ac2eb174c685338R9-R29).

Returning void from the serializer (instead of the number of written bytes) makes it easier for clients to implement serialization delegates (neither Google Protobuf nor STJ serializers return the length). Instead, the length can be determined at the call-site of the serialization delegate. For byte[], Memory<byte> and ReadOnlyMemory<byte> I added an overload for PublishAsync as an example here: https://github.com/nats-io/nats.net.v2/commit/61864dbe4d4d3091b020a401475a9e690c627004

ByRef-like types such as [ReadOnly]Span<byte> don't work with this design because they can't be used as generic arguments. I don't think that's too bad though, because their contents would need to be copied to the heap as soon as we hit an async path anyway.

If something like https://github.com/nats-io/nats.net.v2/pull/140#issuecomment-1747307102 is implemented, special handling for ReadOnlySpan<byte> could be added.

Deserialization

Deserialization looks slightly different, mainly because I opted to put the serializer into (now generic) NatsSubOpt<T>. I wanted to try it but I don't like the asymmetry, find it hard to discover and the parameter reordering is more than just unfortunate.

I don't see problems with return types, because the return type is T in NatsSubOpt<T>.

The deserializer itself is Func<IMemoryOwner<byte>, T>. As an alternative, we could pass ReadOnlyMemory<byte> to it, which would be nicer for consumers. Either is different from the current implementation that takes a ROS<byte>. The relational here is that copying the payload to a temporary buffer is (almost) guaranteed to increase receive throughput because after copying the buffer serialization can happen outside the receive loop, instead of waiting here:

https://github.com/nats-io/nats.net.v2/blob/caada52729ec0ab1e73f06c1c63e9130c01f789b/src/NATS.Client.Core/Internal/NatsReadProtocolProcessor.cs#L209C1-L209C1

Using IMemoryOwner instead of ROM enables buffer pooling. The idea is essentially the same as outlined in https://github.com/nats-io/nats.net.v2/pull/140#issuecomment-1747307102 for transmit.

However, it puts the burden of disposing the owner on the (client provided) deserializer, which I don't like. On the other hand, it's not too bad as long as the underlying pool implementation does not leak when failing to return buffers (IIRC that's true for Array/MemoryPool).

Other stuff:

jeffw-wherethebitsroam commented 1 year ago

I realize I'm quite late here, but I had also been bitten by the default Json serializer.

Personally, I would prefer the connection and serialization to be divided into 2 parts:

  1. A low level connection that just deals with bytes.
  2. A serializing connection that wraps the low level connection.

This would make it more obvious what is happening and allow a cleaner API for each. I would think that most users are going to be working with either serialized object or bytes directly, not both. It could also allow for the serialization of large objects over multiple NATS messages, which is one of my use cases.

cliffchapmanrbx commented 1 year ago

Just watching this great discussion from afar and wanted to jump in with a very specific nitpick:

However, it puts the burden of disposing the owner on the (client provided) deserializer, which I don't like. On the other hand, it's not too bad as long as the underlying pool implementation does not leak when failing to return buffers (IIRC that's true for Array/MemoryPool).

There's an excellent blog post here that walks through some of the differences between ArrayPool and MemoryPool. The notable one for this discussion is MemoryPools hand you an IMemoryOwner, which is disposable while an ArrayPool straight up hands you an array you are expected to Return(). The blog post talks about how this is a feature, as IMemoryOwner ends up being a heap allocation every time you need a new one. Allocation-less tight loops of small buffers will be impacted by the overhead of IMemoryOwner, so will drift towards ArrayPools.

The MSDN link for ArrayPool.Rent makes this tradeoff clear:

This buffer is loaned to the caller and should be returned to the same pool using the Return method, so that it can be reused in subsequent calls to the Rent method. Failure to return a rented buffer is not a fatal error. However, it may lead to decreased application performance, as the pool may need to create a new buffer to replace the lost one.

You're correct that it will not leak memory, but it may leak performance when the intention of the API surface is to be performant.

This comment isn't setting up an expectation of an allocation-free NATS client or anything. Just wanted to make sure the decisions made here are respecting the performance intent of these APIs 🙂

mtmk commented 1 year ago

I have a few questions. (partly because of my lack of understanding -need to deep dive at some point, partly to shape the design)

(edit) I think I got it the wrong way round. When receiving messages (i.e. subscriptions) IMemoryOwner works fine. It's when publishing you need something like IBufferWriter (or some kind of sequence builder?)

(edit2) An IBufferWriter implementation: https://github.com/CommunityToolkit/dotnet/blob/v8.2.1/src/CommunityToolkit.HighPerformance/Buffers/MemoryBufferWriter%7BT%7D.cs ...actually this is more interesting. Implements IMemoryOwner too: https://github.com/CommunityToolkit/dotnet/blob/v8.2.1/src/CommunityToolkit.HighPerformance/Buffers/ArrayPoolBufferWriter%7BT%7D.cs

jasper-d commented 12 months ago

Is there an implementation of IBufferWriter we can use, or do we need to implement one?

I think there are quite a lot, many of them private though. A public one I know is Sequence from Nerdbank.Streams. The general pattern is to have IBufferWriter<T> create a linked list of ReadOnlySequenceSegment<T> segments which can than be used to construct-system-int32-system-buffers-readonlysequencesegment((-0))-system-int32)) a ReadOnlySequence<T>.

When using ArrayPool are we suggesting GC to take care of the Return? If not how/when do we ensure the buffer is returned?

I think for publishing it's straight forward. Much like now, serializers can write to an IBufferWriter implementation (that would rent memory and construct sequence segments) and once the buffers are copied to the socket (or an intermediate buffer for that matter) they can be "freed" (e.g. returned to pool) by us.

GC is only the stop-gap. It won't return the buffer, but eventually collect them just like any other ordinary memory that has no live references pointing to it.

When receiving messages (i.e. subscriptions) IMemoryOwner works fine.

I agree, it works perfectly fine. The only concern I have is that it works best only as long as clients consuming the memory (e.g. deserializing the buffer) are properly disposing it afterwards. If they don't, no pooling will happen and performance will be somewhat worse than directly allocating buffers (e.g. new byte[]) without any pooling.

On the other hand, clients which don't dispose an IMemoryOwner are breaking the contract, so one could argue it's their fault. If that is an API that is easy enough to use is the maintainers call to make ;)

Have a look at the new NatsMemoryOwner implementation https://github.com/nats-io/nats.net.v2/blob/main/src/NATS.Client.Core/NatsMemoryOwner.cs Does that alleviate GC concerns for you?

Any implementation of IMemoryOwner will cause an allocation. I don't think it's that bad though.

For publishing, we can avoid it entirely and safely pool our buffers (i.e. ReadOnlySequenceSegments). It would only fail if a client holds on to a buffer after calling IBufferWriter.Advance which is just very wrong.

For subscriptions we would need to instantiate a IMemoryOwner, but right now there are a lot of allocations happening anyway, so one allocation more won't hurt that much (at the same time we most likely avoid one allocation by pooling the underlying buffer). The non-generic NatsMsg could actually double as an IMemoryOwner, avoiding any extra allocations in that case.

mtmk commented 11 months ago

Please have a quick look at #171 for the proposed solution.

jasper-d commented 11 months ago

cc: @renkman You might be interested in this too.