rsocket / rsocket-java

Java implementation of RSocket
http://rsocket.io
Apache License 2.0
2.36k stars 354 forks

RSocket vs HTTP performance #720

Open bogdansolga opened 4 years ago

bogdansolga commented 4 years ago

I have done some research on the performance of RSocket vs HTTP for service-to-service communication of FHIR resources. According to the initial results, RSocket's performance seems to be lower than (or, at best, equal to) that of HTTP (using REST).

The full details of the research and the performance issues are thoroughly detailed in this StackOverflow post.

I am aware that this discussion would be more appropriate for the RSocket community space. As that page is not working, and the StackOverflow question may not be read by the right people, I have also posted it here. My apologies if this is considered an inappropriate place.

OlegDokuka commented 4 years ago

Hello, @bogdansolga!

Briefly looking at the configuration, I can immediately say that the benchmark is incorrectly set up and will inevitably give misleading results.

What I can say directly is that you are calling the readBundle method, which subscribes to the requesterMono on every remote call, and that opens a new TCP connection on every subscription.

This means the behavior you get is essentially identical to plain HTTP 1.0 (a new connection per request), which is terribly slow.

First of all, I would appreciate it if you closed the StackOverflow question (since this is not really a question) and continued the conversation here; I'm more than happy to help with getting the setup right.

As a first step towards a correct setup, I would recommend caching your Mono<RSocketRequester> so that the same connection is reused for all calls:

@Bean
public Mono<RSocketRequester> requester(BundleDecoder bundleDecoder, IntegerEncoder integerEncoder) {
    final RSocketStrategies.Builder builder = RSocketStrategies.builder()
            .decoder(bundleDecoder)
            .encoder(integerEncoder);

    return RSocketRequester.builder()
            .rsocketFactory(factory -> factory.dataMimeType(MediaType.APPLICATION_CBOR_VALUE)
                                              .frameDecoder(PayloadDecoder.ZERO_COPY))
            .rsocketStrategies(builder.build())
            .connectTcp(responderHost, responderPort)
            .retry()
            .cache(); // cache() lets every subscriber share the same connection
}

Apart from that, I can recommend looking at LoadBalancedRSocket, which lets you efficiently reuse a small pool of connections for a large number of calls -> https://github.com/OlegDokuka/rsocket-issue-717

bogdansolga commented 4 years ago

Thank you very much for your comments and help, @OlegDokuka!

Sure, I will close the StackOverflow post and, as you suggested, continue the discussion here. I will add the .cache() to the config and see whether there are noticeable results. I will also study the LoadBalancedRSocket implementation ASAP.

OlegDokuka commented 4 years ago

@bogdansolga No problem. Let me know when you have any updates.

Regards, Oleh

bogdansolga commented 4 years ago

@OlegDokuka - I have added the caching of the Mono<RSocketRequester> and the performance has improved a little, as you rightly indicated. However, with the current setup, the RSocket performance still seems to be marginally lower than the HTTP performance.

Here are some numbers - the averages of 20 service-to-service calls, each one performed for various payload sizes (the stringSizeInBytes field):

RSocket:

[ {
  "stringSizeInBytes" : 127561,
  "totalTime" : 43,
  "commTimePercentage" : "18.6%",
  "deserializingTimePercentage" : "81.4%"
}, {
  "stringSizeInBytes" : 254461,
  "totalTime" : 54,
  "commTimePercentage" : "16.67%",
  "deserializingTimePercentage" : "83.33%"
}, {
  "stringSizeInBytes" : 508261,
  "totalTime" : 114,
  "commTimePercentage" : "15.79%",
  "deserializingTimePercentage" : "84.21%"
}, {
  "stringSizeInBytes" : 1016433,
  "totalTime" : 238,
  "commTimePercentage" : "14.71%",
  "deserializingTimePercentage" : "85.29%"
} ]

HTTP:

[ {
  "stringSizeInBytes" : 127561,
  "totalTime" : 43,
  "commTimePercentage" : "16.28%",
  "deserializingTimePercentage" : "83.72%"
}, {
  "stringSizeInBytes" : 254461,
  "totalTime" : 69,
  "commTimePercentage" : "15.94%",
  "deserializingTimePercentage" : "84.06%"
}, {
  "stringSizeInBytes" : 508261,
  "totalTime" : 120,
  "commTimePercentage" : "14.17%",
  "deserializingTimePercentage" : "85.83%"
}, {
  "stringSizeInBytes" : 1016433,
  "totalTime" : 217,
  "commTimePercentage" : "12.9%",
  "deserializingTimePercentage" : "87.1%"
} ]

The key performance indicator is the commTimePercentage field, which represents the percentage of the total time spent in the (RSocket | HTTP) communication. If my understanding of RSocket is correct, this percentage should be much lower for RSocket than for HTTP.

As far as I understand the overall communication flow, further improvements can be obtained by improving the BundleEncoder and BundleDecoder classes, as they are the ones serializing and deserializing the transferred object (the FHIR resource). Maybe the communication will be more efficient if the serialization and deserialization are done to/from a binary format rather than a String.
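To illustrate the binary-format idea, here is a minimal stdlib sketch of round-tripping an object through Java's built-in binary serialization (roughly what a byte-based encoder/decoder pair would do); the bundle list is a hypothetical stand-in for the FHIR resource:

```java
import java.io.*;
import java.util.List;

public class BinaryRoundTrip {

    // Serialize any Serializable object straight to a byte[]
    static byte[] toBytes(Serializable obj) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(obj);
        }
        return bos.toByteArray();
    }

    // Deserialize the byte[] back into the original object
    static Object fromBytes(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical stand-in for a FHIR bundle
        var bundle = new java.util.ArrayList<>(List.of("Patient/1", "Observation/2"));
        byte[] wire = toBytes(bundle);
        Object back = fromBytes(wire);
        System.out.println(bundle.equals(back)); // true: lossless round trip
    }
}
```

This avoids the intermediate String representation entirely; the payload goes bytes -> wire -> bytes with no charset encoding/decoding step in between.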

Any further comments and recommendations are welcome, @OlegDokuka. I will investigate the LoadBalancedRSocket project further, to see if/how I can reuse parts of it.

Thanks a lot, once again 👍

OlegDokuka commented 4 years ago

Alright, let me check out the code and play with it a little more!

I will be back to you later today or tomorrow.

Apart from that, it does not seem that you are doing that much I/O; most of the time is spent on serialization/deserialization. So, what are you trying to measure?

Regards, Oleh

bogdansolga commented 4 years ago

@OlegDokuka - here's an update and some good news: after having a look at your project, I replaced the String serialization and deserialization with the byte-based SerializationUtils.serialize() and SerializationUtils.deserialize(), and the numbers have slightly improved:

[ {
  "stringSizeInBytes" : 127561,
  "totalTime" : 30,
  "commTimePercentage" : "20%",
  "deserializingTimePercentage" : "80%"
}, {
  "stringSizeInBytes" : 254461,
  "totalTime" : 54,
  "commTimePercentage" : "12.96%",
  "deserializingTimePercentage" : "87.04%"
}, {
  "stringSizeInBytes" : 508261,
  "totalTime" : 101,
  "commTimePercentage" : "14.85%",
  "deserializingTimePercentage" : "85.15%"
}, {
  "stringSizeInBytes" : 1016433,
  "totalTime" : 214,
  "commTimePercentage" : "14.02%",
  "deserializingTimePercentage" : "85.98%"
} ]

I will research the code further and tweak the OutputStream sizes; hopefully I can improve the numbers some more. Please let me know if you see any further possible improvements.

Thanks a (very) lot, once again :)

OlegDokuka commented 4 years ago

Looking at the results, I still doubt they are meaningful, since most of the time is spent on serialization/deserialization. I will play with your code to make sure we measure the performance of the communication and not the performance of other things.

bogdansolga commented 4 years ago

> Apart from that, it does not seem that you are doing that much I/O; most of the time is spent on serialization/deserialization. So, what are you trying to measure?

I am trying to measure the time spent communicating a big payload in service-to-service calls, so that we can decide whether RSocket is better suited as the communication protocol for a large distributed system, which entails a lot of service-to-service exchange of FHIR resources.

The overall intent is to find out whether RSocket provides significant performance benefits compared to HTTP communication, so that we can replace the current REST-over-HTTP communication with RSocket. Please let me know if you want more details.

rstoyanchev commented 4 years ago

@bogdansolga,

It's not clear what kind of hardware/infrastructure you are running these benchmarks on, but that is a very important aspect of any benchmark. Even if all other things are equal, running the client and server and/or multiple processes on a single machine can give false results.

You're configured for "zero copy" (direct memory), but your Decoder implementations do not release data buffers. That means you keep claiming more and more pooled buffers without returning them. You could at least use the built-in StringDecoder, which correctly releases buffers, and then convert from String to whatever else you want. Moreover, you are configured for https://cbor.io/ but are not actually using it. By default Spring Boot is configured for CBOR because it is a binary format, and you should investigate using it, especially if serialization is a big part of what you're trying to measure.
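For the CBOR part, a hedged sketch of one way to actually engage a binary codec (assuming spring-web's Jackson-based CBOR codecs and the jackson-dataformat-cbor dependency are on the classpath; not verified against the benchmark project) is to register them in the strategies instead of the custom String-based ones:

```java
import org.springframework.http.codec.cbor.Jackson2CborDecoder;
import org.springframework.http.codec.cbor.Jackson2CborEncoder;
import org.springframework.messaging.rsocket.RSocketStrategies;

public class CborStrategiesConfig {

    // Use Jackson's CBOR codecs so payloads are actually CBOR-encoded,
    // matching the APPLICATION_CBOR data MIME type declared when the
    // requester connects. These codecs also release buffers correctly.
    static RSocketStrategies cborStrategies() {
        return RSocketStrategies.builder()
                .decoder(new Jackson2CborDecoder())
                .encoder(new Jackson2CborEncoder())
                .build();
    }
}
```

The resulting RSocketStrategies would then be passed to the RSocketRequester builder in place of the builder holding BundleDecoder/IntegerEncoder.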

Both the HTTP and RSocket clients are blocking and executing requests sequentially, which, pardon the analogy, is like driving a sports car in low gear. For the HTTP side you could be using the reactive WebClient, which allows executing requests concurrently with a degree of parallelism you can choose. For the RSocket client, you have a TODO with a question to which the answer is: yes, there is a better way. In a reactive chain you don't want to block on each individual operation. Instead, return the Mono<String> and let the caller compose further, i.e. you never want to unwrap (just like you don't want to terminate a java.util.Stream until you're done). In this case the caller is a benchmark trying to get through X requests. You could execute N at a time, flatMap the results, and wait for all to complete. That's 1 block at the end instead of blocking X times.
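The "one block at the end instead of X blocks" shape is independent of Reactor; the same idea as a plain java.util.concurrent sketch (remoteCall is a hypothetical stand-in for a non-blocking requester call):

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.IntStream;

public class ConcurrentCalls {

    // Hypothetical stand-in for one non-blocking remote call
    static CompletableFuture<Integer> remoteCall(int i) {
        return CompletableFuture.supplyAsync(() -> i * 2);
    }

    public static void main(String[] args) {
        // Launch all calls first, without waiting on any individual one...
        List<CompletableFuture<Integer>> calls =
                IntStream.range(0, 10).mapToObj(ConcurrentCalls::remoteCall).toList();

        // ...then wait once, at the end, for all of them together
        CompletableFuture.allOf(calls.toArray(new CompletableFuture[0])).join();

        int sum = calls.stream().mapToInt(CompletableFuture::join).sum();
        System.out.println(sum); // 90 = 2 * (0 + 1 + ... + 9)
    }
}
```

In Reactor the equivalent is a Flux of requests composed with flatMap (which controls the concurrency level) followed by a single block at the very end of the chain.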

Taking a further step back, while I don't claim to understand the domain model, this is sending a large number of entries (up to 800) in one go, which results in a very large ~1MB payload that is aggregated in memory before being passed on or parsed. The strength of RSocket is that it has streaming built in. It would be much better to return a stream of those entries and process them as they come, which would give the benefit of back pressure. Again, I don't know the domain model, but the granularity of the data is an important issue to consider.
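A sketch of what that streaming could look like with Spring's RSocket support (the route name, Entry type and lookup method are hypothetical, and this has not been run against the benchmark code):

```java
import org.springframework.messaging.handler.annotation.MessageMapping;
import org.springframework.stereotype.Controller;
import reactor.core.publisher.Flux;

// Hypothetical type standing in for one FHIR bundle entry
record Entry(String id) {}

@Controller
class BundleStreamController {

    // Responder side: stream entries one by one instead of aggregating
    // them into a single ~1MB bundle payload; demand flows per element,
    // so back pressure applies per entry.
    @MessageMapping("bundle.entries")
    Flux<Entry> entries(Integer bundleId) {
        return findEntries(bundleId);
    }

    // Lookup elided; in practice this would come from a reactive repository
    private Flux<Entry> findEntries(Integer bundleId) {
        return Flux.empty();
    }
}
```

On the requester side, requester.route("bundle.entries").data(bundleId).retrieveFlux(Entry.class) would then yield a Flux<Entry> whose elements can be processed as they arrive, instead of one aggregated payload.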

Along the lines of what @OlegDokuka has been pointing out, you're largely measuring the speed of serialization and deserialization. The vast majority of the time is spent in serialization, where you have some inefficiencies, as I pointed out. Even when you address those, you likely won't find a big difference in a scenario with a relatively small number of requests, each with a sizable payload, as opposed to a large number of requests in parallel and/or a server making further remote calls, as is common in microservice scenarios, which adds extra latency and so on.

I realize I leave a lot of gaps to be filled here, but my goal is to give you some pointers. I would suggest learning a little more about composing application logic in a reactive, declarative style, which is not unlike the java.util.Stream API you already use extensively in the benchmark, but for streams of data. It may be bad form to leave a link to a talk of my own, but I think this talk may give you a good intro that you can then complement with other learning resources.

bogdansolga commented 4 years ago

@rstoyanchev - thank you very much for your advice and pointers. I was aware of some of them, but not of the others.

A few comments and further questions from my side:

I wasn't aware that I am not actually using CBOR, although I tried to configure the apps to use it. I certainly want to use it, as serialization is the biggest part of what I'm trying to measure. If there is a place where I can see more details on how to actually use CBOR, I would greatly appreciate a link.

You and @OlegDokuka are right: I am currently measuring mostly the serialization and deserialization speed/overhead, as they matter the most in our usage scenario. Regarding the number of requests, and the remote calls (with their inherent latency) entailed by a microservices architecture - that is exactly the context in which I am trying to measure RSocket's efficiency. My intent is to replace the communication in a distributed architecture of several (quasi-)microservices: the current communication is done using REST (over HTTP), and I want to replace it with RSocket, especially because a lot of the business logic involves service-to-service calls with multiple round trips between services. I am therefore well aware of the latency added by multiple service-to-service calls, and I want to minimize it as much as possible. Please let me know if my understanding of what you said is correct.

Last but not least, thank you very much for the link to your presentation; I appreciate it and don't consider it bad form at all. I saw the presentation approximately a year ago, and I will re-watch it now to refresh my knowledge of reactive processing. I fully admit that my development focus has been more on the RSocket communication and less on the reactive composition of the code. Now that I have assembled a big part of the RSocket communication, encoding and decoding, I will focus on the reactive composition of the benchmarking code.

Once again - thank you very much for all the provided information, hints and recommendations. Any further recommendations are extremely welcome.

nikitsenka commented 4 years ago

It would be great to see an official WebClient vs RSocket performance comparison report, or example tests that could serve as a good reference for how to use RSocket properly and get real benefits.