rsocket / rsocket-java

Java implementation of RSocket
http://rsocket.io
Apache License 2.0
2.36k stars 354 forks

RSocket vs HTTP performance #720

Open bogdansolga opened 4 years ago

bogdansolga commented 4 years ago

I have done some research on the performance of RSocket vs HTTP for service-to-service communication of FHIR resources. According to the initial results, RSocket's performance seems to be lower than (or, at best, equal to) that of HTTP (using REST).

The full details of the research and the performance issues are thoroughly detailed in this StackOverflow post.

I am aware that this discussion would be more appropriate for the RSocket community space. As that page is not working, and the StackOverflow question may not be read by the right people, I have also posted it here. My apologies if this is considered an inappropriate place.

OlegDokuka commented 4 years ago

Hello, @bogdansolga!

Briefly looking at the configuration, I can immediately say that the benchmark is incorrectly set up and will inevitably give misleading results.

What I can say directly is that you are calling the readBundle method, which subscribes to the requesterMono on every remote call, and that opens a new TCP connection on every subscription.

This means the behavior you get is essentially identical to plain HTTP 1.0 (a new connection per request), which is terribly slow.

First of all, I would appreciate it if you closed the StackOverflow question (since this is not really a question) and continued the conversation here; I'm more than happy to help with getting the setup right.

As a first step towards a correct setup, I would recommend caching your Mono<RSocketRequester> so that the same connection is reused for all calls:

@Bean
public Mono<RSocketRequester> requester(BundleDecoder bundleDecoder, IntegerEncoder integerEncoder) {
    final RSocketStrategies.Builder builder = RSocketStrategies.builder()
            .decoder(bundleDecoder)
            .encoder(integerEncoder);

    return RSocketRequester.builder()
            .rsocketFactory(factory -> factory.dataMimeType(MediaType.APPLICATION_CBOR_VALUE)
                                              .frameDecoder(PayloadDecoder.ZERO_COPY))
            .rsocketStrategies(builder.build())
            .connectTcp(responderHost, responderPort)
            .retry()
            .cache(); // cache() lets every subscriber share the same connection
}

Apart from that, I can recommend looking at LoadBalancedRSocket, which lets you efficiently reuse a small pool of connections for a large number of calls -> https://github.com/OlegDokuka/rsocket-issue-717

bogdansolga commented 4 years ago

Thank you very much for your comments and help, @OlegDokuka!

Sure, I will close the StackOverflow post and, as you suggested, continue the discussion here. I will add the .cache() to the config and see whether there are noticeable results. I will also study the LoadBalancedRSocket implementation ASAP.

OlegDokuka commented 4 years ago

@bogdansolga No problem. Let me know when you have any updates.

Regards, Oleh

bogdansolga commented 4 years ago

@OlegDokuka - I have added the caching of the Mono<RSocketRequester> and the performance has improved a little, as you rightly indicated. However, with the current setup, the RSocket performance still seems to be marginally lower than the HTTP performance.

Here are some numbers - the averages of 20 service-to-service calls, each one performed for various payload sizes (the stringSizeInBytes field):

RSocket:

[ {
  "stringSizeInBytes" : 127561,
  "totalTime" : 43,
  "commTimePercentage" : "18.6%",
  "deserializingTimePercentage" : "81.4%"
}, {
  "stringSizeInBytes" : 254461,
  "totalTime" : 54,
  "commTimePercentage" : "16.67%",
  "deserializingTimePercentage" : "83.33%"
}, {
  "stringSizeInBytes" : 508261,
  "totalTime" : 114,
  "commTimePercentage" : "15.79%",
  "deserializingTimePercentage" : "84.21%"
}, {
  "stringSizeInBytes" : 1016433,
  "totalTime" : 238,
  "commTimePercentage" : "14.71%",
  "deserializingTimePercentage" : "85.29%"
} ]

HTTP:

[ {
  "stringSizeInBytes" : 127561,
  "totalTime" : 43,
  "commTimePercentage" : "16.28%",
  "deserializingTimePercentage" : "83.72%"
}, {
  "stringSizeInBytes" : 254461,
  "totalTime" : 69,
  "commTimePercentage" : "15.94%",
  "deserializingTimePercentage" : "84.06%"
}, {
  "stringSizeInBytes" : 508261,
  "totalTime" : 120,
  "commTimePercentage" : "14.17%",
  "deserializingTimePercentage" : "85.83%"
}, {
  "stringSizeInBytes" : 1016433,
  "totalTime" : 217,
  "commTimePercentage" : "12.9%",
  "deserializingTimePercentage" : "87.1%"
} ]

The key performance indicator is the commTimePercentage field, which represents the percentage of the total time spent in the (RSocket | HTTP) communication. If my understanding of RSocket is correct, this percentage should be much lower for RSocket than for HTTP.

As far as I understand the overall communication flow, further improvements can be obtained by improving the BundleEncoder and BundleDecoder classes, as they are the ones serializing and deserializing the transferred object (the FHIR resource). Maybe the communication will be more efficient if the serialization and deserialization are done to/from a binary format rather than a String.
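To illustrate the binary-format idea, here is a minimal stdlib sketch of round-tripping an object through Java's built-in binary serialization (roughly what a byte-based encoder/decoder pair would do); the bundle list is a hypothetical stand-in for the FHIR resource:

```java
import java.io.*;
import java.util.List;

public class BinaryRoundTrip {

    // Serialize any Serializable object straight to a byte[]
    static byte[] toBytes(Serializable obj) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(obj);
        }
        return bos.toByteArray();
    }

    // Deserialize the byte[] back into the original object
    static Object fromBytes(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical stand-in for a FHIR bundle
        var bundle = new java.util.ArrayList<>(List.of("Patient/1", "Observation/2"));
        byte[] wire = toBytes(bundle);
        Object back = fromBytes(wire);
        System.out.println(bundle.equals(back)); // true: lossless round trip
    }
}
```

This avoids the intermediate String representation entirely; the payload goes bytes -> wire -> bytes with no charset encoding/decoding step in between.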

Any further comments and recommendations are welcome, @OlegDokuka. I will investigate the LoadBalancedRSocket project further, to see if/how I can reuse parts of it.

Thanks a lot, once again 👍

OlegDokuka commented 4 years ago

Alright, let me check out the code and play with it a little more!

I will be back to you later today or tomorrow.

Apart from that, it does not seem that you are doing that much I/O; most of the time is spent on serialization/deserialization. So, what are you trying to measure?

Regards, Oleh

bogdansolga commented 4 years ago

@OlegDokuka - here's an update and some good news: after having a look at your project, I replaced the String serialization and deserialization with the byte-based SerializationUtils.serialize() and SerializationUtils.deserialize(), and the numbers have slightly improved:

[ {
  "stringSizeInBytes" : 127561,
  "totalTime" : 30,
  "commTimePercentage" : "20%",
  "deserializingTimePercentage" : "80%"
}, {
  "stringSizeInBytes" : 254461,
  "totalTime" : 54,
  "commTimePercentage" : "12.96%",
  "deserializingTimePercentage" : "87.04%"
}, {
  "stringSizeInBytes" : 508261,
  "totalTime" : 101,
  "commTimePercentage" : "14.85%",
  "deserializingTimePercentage" : "85.15%"
}, {
  "stringSizeInBytes" : 1016433,
  "totalTime" : 214,
  "commTimePercentage" : "14.02%",
  "deserializingTimePercentage" : "85.98%"
} ]

I will research the code further and tweak the OutputStream sizes; hopefully I can improve the numbers some more. Please let me know if you see any further possible improvements.

Thanks a (very) lot, once again :)

OlegDokuka commented 4 years ago

Looking at the results, I still doubt they are meaningful, since most of the time is spent on serialization/deserialization. I will play with your code to make sure we measure the performance of the communication and not the performance of other things.

bogdansolga commented 4 years ago

> Apart from that, it does not seem that you are doing that much I/O; most of the time is spent on serialization/deserialization. So, what are you trying to measure?

I am trying to measure the time spent communicating a big payload in service-to-service calls, so that we can decide whether RSocket is better suited as the communication protocol for a large distributed system, which entails a lot of service-to-service exchange of FHIR resources.

The overall intent is to find out whether RSocket provides significant performance benefits compared to HTTP communication, so that we can replace the current REST-over-HTTP communication with RSocket. Please let me know if you want more details.

rstoyanchev commented 4 years ago

@bogdansolga,

It's not clear what kind of hardware/infrastructure you are running these benchmarks on, but that is a very important aspect of any benchmark. Even if all other things are equal, running the client and server and/or multiple processes on a single machine can give false results.

You're configured for "zero copy" (direct memory), but your Decoder implementations do not release data buffers. That means you keep claiming more and more pooled buffers without returning them. You could at least use the built-in StringDecoder, which correctly releases buffers, and then convert from String to whatever else you want. Moreover, you are configured for https://cbor.io/ but are not actually using it. By default Spring Boot is configured for CBOR because it is a binary format, and you should investigate using it, especially if serialization is a big part of what you're trying to measure.
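For the CBOR part, a hedged sketch of one way to actually engage a binary codec (assuming spring-web's Jackson-based CBOR codecs and the jackson-dataformat-cbor dependency are on the classpath; not verified against the benchmark project) is to register them in the strategies instead of the custom String-based ones:

```java
import org.springframework.http.codec.cbor.Jackson2CborDecoder;
import org.springframework.http.codec.cbor.Jackson2CborEncoder;
import org.springframework.messaging.rsocket.RSocketStrategies;

public class CborStrategiesConfig {

    // Use Jackson's CBOR codecs so payloads are actually CBOR-encoded,
    // matching the APPLICATION_CBOR data MIME type declared when the
    // requester connects. These codecs also release buffers correctly.
    static RSocketStrategies cborStrategies() {
        return RSocketStrategies.builder()
                .decoder(new Jackson2CborDecoder())
                .encoder(new Jackson2CborEncoder())
                .build();
    }
}
```

The resulting RSocketStrategies would then be passed to the RSocketRequester builder in place of the builder holding BundleDecoder/IntegerEncoder.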

Both the HTTP and RSocket clients are blocking and executing requests sequentially, which, pardon the analogy, is like driving a sports car in low gear. For the HTTP side you could be using the reactive WebClient, which allows executing requests concurrently with a degree of parallelism you can choose. For the RSocket client, you have a TODO with a question to which the answer is: yes, there is a better way. In a reactive chain you don't want to block on each individual operation. Instead, return the Mono<String> and let the caller compose further, i.e. you never want to unwrap (just like you don't want to terminate a java.util.Stream until you're done). In this case the caller is a benchmark trying to get through X requests. You could execute N at a time, flatMap the results, and wait for all to complete. That's 1 block at the end instead of blocking X times.
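The "one block at the end instead of X blocks" shape is independent of Reactor; the same idea as a plain java.util.concurrent sketch (remoteCall is a hypothetical stand-in for a non-blocking requester call):

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.IntStream;

public class ConcurrentCalls {

    // Hypothetical stand-in for one non-blocking remote call
    static CompletableFuture<Integer> remoteCall(int i) {
        return CompletableFuture.supplyAsync(() -> i * 2);
    }

    public static void main(String[] args) {
        // Launch all calls first, without waiting on any individual one...
        List<CompletableFuture<Integer>> calls =
                IntStream.range(0, 10).mapToObj(ConcurrentCalls::remoteCall).toList();

        // ...then wait once, at the end, for all of them together
        CompletableFuture.allOf(calls.toArray(new CompletableFuture[0])).join();

        int sum = calls.stream().mapToInt(CompletableFuture::join).sum();
        System.out.println(sum); // 90 = 2 * (0 + 1 + ... + 9)
    }
}
```

In Reactor the equivalent is a Flux of requests composed with flatMap (which controls the concurrency level) followed by a single block at the very end of the chain.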

Taking a further step back, while I don't claim to understand the domain model, this is sending a large number of entries (up to 800) in one go, which results in a very large ~1MB payload that is aggregated in memory before being passed on or parsed. The strength of RSocket is that it has streaming built in. It would be much better to return a stream of those entries and process them as they come, which would give the benefit of back pressure. Again, I don't know the domain model, but the granularity of the data is an important issue to consider.
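A sketch of what that streaming could look like with Spring's RSocket support (the route name, Entry type and lookup method are hypothetical, and this has not been run against the benchmark code):

```java
import org.springframework.messaging.handler.annotation.MessageMapping;
import org.springframework.stereotype.Controller;
import reactor.core.publisher.Flux;

// Hypothetical type standing in for one FHIR bundle entry
record Entry(String id) {}

@Controller
class BundleStreamController {

    // Responder side: stream entries one by one instead of aggregating
    // them into a single ~1MB bundle payload; demand flows per element,
    // so back pressure applies per entry.
    @MessageMapping("bundle.entries")
    Flux<Entry> entries(Integer bundleId) {
        return findEntries(bundleId);
    }

    // Lookup elided; in practice this would come from a reactive repository
    private Flux<Entry> findEntries(Integer bundleId) {
        return Flux.empty();
    }
}
```

On the requester side, requester.route("bundle.entries").data(bundleId).retrieveFlux(Entry.class) would then yield a Flux<Entry> whose elements can be processed as they arrive, instead of one aggregated payload.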

Along the lines of what @OlegDokuka has been pointing out, you're largely measuring the speed of serialization and deserialization. The vast majority of the time is spent in serialization, where you have some inefficiencies, as I pointed out. Even when you address those, you likely won't find a big difference in a scenario with a relatively small number of requests, each with a sizable payload, as opposed to a large number of requests in parallel and/or a server making further remote calls, as is common in microservice scenarios, which adds extra latency and so on.

I realize I leave a lot of gaps to be filled here, but my goal is to give you some pointers. I would suggest learning a little more about composing application logic in a reactive, declarative style, which is not unlike the java.util.Stream API you already use extensively in the benchmark, but for streams of data. It may be bad form to leave a link to a talk of my own, but I think this talk may give you a good intro that you can then complement with other learning resources.

bogdansolga commented 4 years ago

@rstoyanchev - thank you very much for your advice and pointers. I was aware of some of them, but not of the others.

A few comments and further questions from my side:

I wasn't aware that I am not actually using CBOR, although I tried to configure the apps to use it. I certainly want to use it, as serialization is the biggest part of what I'm trying to measure. If there is a place where I can see more details on how to actually use CBOR, I would greatly appreciate a link.

You and @OlegDokuka are right: I am currently measuring mostly the serialization and deserialization speed/overhead, as they matter the most in our usage scenario. Regarding the number of requests, and the remote calls (with their inherent latency) entailed by a microservices architecture - that is exactly the context in which I am trying to measure RSocket's efficiency. My intent is to replace the communication in a distributed architecture of several (quasi-)microservices: the current communication is done using REST (over HTTP), and I want to replace it with RSocket, especially because a lot of the business logic involves service-to-service calls with multiple round trips between services. I am therefore well aware of the latency added by multiple service-to-service calls, and I want to minimize it as much as possible. Please let me know if my understanding of what you said is correct.

Last but not least, thank you very much for the link to your presentation; I appreciate it and don't consider it bad form at all. I saw the presentation approximately a year ago, and I will re-watch it now to refresh my knowledge of reactive processing. I fully admit that my development focus has been more on the RSocket communication and less on the reactive composition of the code. Now that I have assembled a big part of the RSocket communication, encoding and decoding, I will focus on the reactive composition of the benchmarking code.

Once again - thank you very much for all the provided information, hints and recommendations. Any further recommendations are extremely welcome.

nikitsenka commented 4 years ago

It would be great to see an official WebClient vs RSocket performance comparison report, or example tests that could serve as a good reference for how to use RSocket properly and get real benefits.