redis / lettuce

Advanced Java Redis client for thread-safe sync, async, and reactive usage. Supports Cluster, Sentinel, Pipelining, and codecs.
https://lettuce.io

Allow usage of publishOn scheduler for reactive signal emission #905

Closed: mp911de closed this issue 5 years ago

mp911de commented 6 years ago

Motivated by https://stackoverflow.com/questions/53001283/reactive-redis-lettuce-always-publishing-to-single-thread, we should investigate (benchmark) thread switching for response processing.

In a reactive/async scenario, where responses are processed on an I/O thread, congestion may occur. We should create various scenarios to measure the impact and determine how to address it.

jmnght commented 6 years ago

@mp911de I'm seeing a similar issue with Reactor Netty + WebFlux + Reactor + Lettuce. I was stress-testing my application: if I send 100 requests to my web server, the four Netty threads are busy processing (essentially calling subscribe on) all of the requests. The single Lettuce thread responsible for fetching data from Redis doesn't really get its turn until the four Netty threads have subscribed to almost all of the 100 requests. The Lettuce thread gets to process a few times in between, but only once the flood of requests is fully subscribed does Lettuce pick up speed. It seems unfair in the sense that if requests keep coming, there is a significant impact on request processing times. In a real production environment, this is going to create a huge delay in processing.
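A quick way to observe this (a hypothetical probe, not code from my application) is to log which thread emits the Redis response signals:

// imports: io.lettuce.core.RedisClient, io.lettuce.core.api.reactive.RedisReactiveCommands
RedisReactiveCommands<String, String> redis =
        RedisClient.create("redis://localhost").connect().reactive();

redis.get("some-key")
     .doOnNext(v -> System.out.println("emitted on " + Thread.currentThread().getName()))
     .block();
// With default options this prints the same lettuce-* I/O thread for every
// request, i.e. the single-threaded emission described above.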

mp911de commented 6 years ago

First test results:

Synchronous/Spring Boot/Tomcat

$ wrk -c 4 -d 10 -t 4 http://localhost:8080  
Running 10s test @ http://localhost:8080
  4 threads and 4 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   706.04us  410.71us  10.01ms   96.65%
    Req/Sec     1.47k   221.33     1.61k    94.06%
  59167 requests in 10.10s, 10.05MB read
Requests/sec:   5856.68
Transfer/sec:      1.00MB

$ wrk -c 400 -d 10 -t 4 http://localhost:8080
Running 10s test @ http://localhost:8080
  4 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    42.26ms   12.08ms  68.05ms   71.66%
    Req/Sec     2.01k   219.85     2.69k    90.00%
  80186 requests in 10.02s, 13.62MB read
  Socket errors: connect 0, read 371, write 0, timeout 0
Requests/sec:   8005.37
Transfer/sec:      1.36MB

Reactive/Spring Boot/WebFlux

$ wrk -c 4 -d 10 -t 4 http://localhost:8080  
Running 10s test @ http://localhost:8080
  4 threads and 4 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   772.73us  408.56us  12.88ms   95.66%
    Req/Sec     1.34k   225.01     1.60k    86.88%
  53713 requests in 10.10s, 6.35MB read
Requests/sec:   5317.69
Transfer/sec:    643.96KB

$ wrk -c 400 -d 10 -t 4 http://localhost:8080
Running 10s test @ http://localhost:8080
  4 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    53.91ms   15.57ms 125.04ms   69.35%
    Req/Sec     1.59k   300.71     2.02k    63.00%
  63762 requests in 10.07s, 7.54MB read
  Socket errors: connect 0, read 384, write 0, timeout 0
Requests/sec:   6332.87
Transfer/sec:    766.87KB

Application under test: https://gist.github.com/mp911de/ba9c46fe8e8bbe4558ca445e996f108a

Conclusion: I cannot reproduce a significant difference between reactive and synchronous processing times. Higher load (400 concurrent connections) results in a difference of about 10 ms.

mp911de commented 6 years ago

Please provide a reproducer; otherwise we cannot trace the issue.

jmnght commented 5 years ago

As you asked, I have created a sample project based on your demo application that shows the Lettuce performance issue: https://github.com/jmnght/lettucetest. It uses embedded Redis and exposes two endpoints: the first just returns a list of items, the second fetches the list of items from Redis and returns it. When I run JMeter, the application handles around 4000 requests/sec when not using the Redis-backed endpoint; when getting from Redis, it drops to ~100 requests/sec. @mp911de Can you take a look and provide some feedback?
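For orientation, the two endpoints look roughly like this sketch (names, wiring, and payloads are illustrative; the real code is in the linked repo):

@RestController
class TestController {

    private final RedisReactiveCommands<String, String> redis;

    TestController(RedisReactiveCommands<String, String> redis) {
        this.redis = redis;
    }

    // Baseline: returns a static list, no Redis involved (~4000 req/sec above).
    @GetMapping("/")
    Flux<String> items() {
        return Flux.just("item1", "item2", "item3");
    }

    // Redis-backed endpoint (~100 req/sec above).
    @GetMapping("/fromRedis")
    Mono<String> fromRedis() {
        return redis.get("items");
    }
}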

jmnght commented 5 years ago

$ wrk -c 4 -d 10 -t 4 http://localhost:8080/fromRedis
Running 10s test @ http://localhost:8080/fromRedis
  4 threads and 4 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    20.11ms   84.63ms 615.90ms   95.13%
    Req/Sec   271.16    247.91     0.90k    56.36%
  1952 requests in 10.10s, 352.75KB read
  Socket errors: connect 0, read 0, write 0, timeout 12
Requests/sec:    193.21
Transfer/sec:     34.92KB

$ wrk -c 4 -d 10 -t 4 http://localhost:8080
Running 10s test @ http://localhost:8080
  4 threads and 4 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   529.69us    1.03ms  32.24ms   96.46%
    Req/Sec     2.44k   382.23     3.19k    75.00%
  98242 requests in 10.10s, 17.33MB read
Requests/sec:   9723.83
Transfer/sec:      1.72MB

mp911de commented 5 years ago

Here are my test results with the code repo you provided:

$ wrk -c 4 -d 10 -t 4 http://localhost:8080
Running 10s test @ http://localhost:8080
  4 threads and 4 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.11ms    8.28ms 130.24ms   98.83%
    Req/Sec     3.51k   527.91     3.89k    95.00%
  139963 requests in 10.10s, 33.77MB read
Requests/sec:  13857.60
Transfer/sec:      3.34MB

$ wrk -c 4 -d 10 -t 4 http://localhost:8080/fromRedis
Running 10s test @ http://localhost:8080/fromRedis
  4 threads and 4 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.39ms    2.18ms  44.80ms   98.14%
    Req/Sec   848.70    131.12     1.13k    71.50%
  33810 requests in 10.01s, 8.16MB read
Requests/sec:   3377.87
Transfer/sec:    834.60KB

After removing the System.out calls:

$ wrk -c 4 -d 10 -t 4 http://localhost:8080/fromRedis
Running 10s test @ http://localhost:8080/fromRedis
  4 threads and 4 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.47ms    8.46ms 131.72ms   98.59%
    Req/Sec     1.78k   317.14     2.05k    92.68%
  70206 requests in 10.01s, 16.94MB read
Requests/sec:   7016.71
Transfer/sec:      1.69MB

These numbers seem pretty reasonable to me: 7000 operations per second on a single machine. Not sure why you see a slowdown to 190 operations per second. After I removed the System.out calls, throughput almost doubled. It's not healthy to use System.out for per-request logging, as all output goes through PrintStream, whose methods are synchronized.
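For reference, this is (slightly condensed) what java.io.PrintStream does on every println, so all request threads contend for the same lock:

// From java.io.PrintStream (JDK 8): every caller synchronizes on the
// shared stream before writing, serializing concurrent threads.
public void println(String x) {
    synchronized (this) {
        print(x);
        newLine();
    }
}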

jmnght commented 5 years ago

@mp911de What machine are you running it on? The most I'm getting is:

➜  bin wrk -c 4 -d 10 -t 4 http://localhost:8080/fromRedis
Running 10s test @ http://localhost:8080/fromRedis
  4 threads and 4 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    11.09ms   17.88ms 214.67ms   97.28%
    Req/Sec   115.18     30.45   170.00     63.38%
  4584 requests in 10.06s, 1.11MB read
Requests/sec:    455.53
Transfer/sec:    112.60KB
➜  bin wrk -c 4 -d 10 -t 4 http://localhost:8080/
Running 10s test @ http://localhost:8080/
  4 threads and 4 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.03ms    2.82ms  56.02ms   95.63%
    Req/Sec     1.93k   474.56     2.73k    76.75%
  76931 requests in 10.01s, 18.56MB read
Requests/sec:   7686.60
Transfer/sec:      1.85MB

mp911de commented 5 years ago

Just a regular MacBook (2.8 GHz Intel Core i7), nothing crazy. Do you have a profiler to measure what's causing the slowdown?

jmnght commented 5 years ago

@mp911de I'm running on a 3.1 GHz i7 MacBook, dual core. Not sure what was causing the slowdown, but redoing the test after restarting the laptop gave me ~4900 requests/sec:

➜  ~ wrk -c 4 -d 10 -t 4 http://localhost:8080/fromRedis
Running 10s test @ http://localhost:8080/fromRedis
  4 threads and 4 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   816.76us  376.35us  15.16ms   97.18%
    Req/Sec     1.25k   106.36     1.51k    79.70%
  50095 requests in 10.10s, 12.09MB read
Requests/sec:   4959.46
Transfer/sec:      1.20MB

But to follow up on the actual issue: I updated the response size to ~300KB and the requests/sec dropped to ~100, which is what I'm seeing on our app servers. Is that expected? I have updated the sample code here: https://github.com/jmnght/lettucetest

Results:

➜  ~ wrk -c 4 -d 10 -t 4 http://localhost:8080/fromRedis
Running 10s test @ http://localhost:8080/fromRedis
  4 threads and 4 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    35.57ms   19.86ms 139.73ms   92.87%
    Req/Sec    30.30      7.86    50.00     57.88%
  1214 requests in 10.07s, 448.51MB read
Requests/sec:    120.53
Transfer/sec:     44.53MB

Thanks!

mp911de commented 5 years ago

Thanks a lot for the additional details, @jmnght. Increasing the payload size really makes a difference and helps to reveal code that does not scale well.

I spotted various things here and tried different approaches:

  1. Jackson requires a lot of time to decode binary data into JSON objects.

     [profiler screenshot: read as json]

  2. I tried to simplify things by replacing Jackson with a plain String. Things improve, and in doing so I found a performance issue in Spring WebFlux. The throughput isn't perfect, but it improved.

     [profiler screenshot: write as string]

  3. Another simplification is passing objects through as ByteBuffer/byte[]. This yields by far the highest throughput (see the sketch after this list).
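A minimal sketch of that byte[] pass-through shape (endpoint, key, and wiring are assumptions, not the exact test code):

// Sketch only: serve raw Redis bytes, skipping deserialization entirely.
// ByteArrayCodec keeps Lettuce from converting the payload on the I/O thread.
private final RedisReactiveCommands<byte[], byte[]> redis =
        RedisClient.create("redis://localhost")
                   .connect(ByteArrayCodec.INSTANCE)
                   .reactive();

@GetMapping(value = "/fromRedis", produces = MediaType.APPLICATION_OCTET_STREAM_VALUE)
Mono<byte[]> fromRedis() {
    return redis.get("master-key".getBytes(StandardCharsets.UTF_8));
}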

I attached my wrk results below.

Okay, so what's the conclusion? I think the combination of an expensive serialization mechanism paired with the single-threaded nature of a single connection has a measurable impact. I will do further experiments with multi-threaded scenarios to see how we can improve here.

Using Jackson (original code at https://github.com/jmnght/lettucetest/commit/6f2c221d85f91722106f83b013ecca74217ce0ec)

 ~  wrk -c 4 -d 5 -t 1 http://localhost:8080/fromRedis
Running 5s test @ http://localhost:8080/fromRedis
  1 threads and 4 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    21.12ms    1.14ms  28.53ms   93.99%
    Req/Sec   189.62      6.54   202.00     72.00%
  948 requests in 5.02s, 350.21MB read
Requests/sec:    188.94
Transfer/sec:     69.80MB

Write HGETALL content as String:

~  wrk -c 4 -d 5 -t 1 http://localhost:8080/fromRedis
Running 5s test @ http://localhost:8080/fromRedis
  1 threads and 4 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     7.24ms    0.88ms  13.20ms   85.06%
    Req/Sec   553.86     33.64   610.00     74.51%
  2812 requests in 5.10s, 1.01GB read
Requests/sec:    551.37
Transfer/sec:    203.75MB

Write HGETALL content as byte[]:

~  wrk -c 4 -d 5 -t 1 http://localhost:8080/fromRedis
Running 5s test @ http://localhost:8080/fromRedis
  1 threads and 4 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.19ms  225.60us   4.82ms   83.02%
    Req/Sec     3.25k   190.54     3.44k    88.24%
  16511 requests in 5.10s, 5.96GB read
Requests/sec:   3237.87
Transfer/sec:      1.17GB

mp911de commented 5 years ago

I introduced a Scheduler option to Lettuce so data signals are dispatched on ClientResources#eventExecutorGroup() for now. I will likely make this configurable via ClientOptions, so a specific client instance can either emit items directly (the default) or emit them on a different scheduler. So basically, we're re-introducing multi-threading with this change, whereas a single connection otherwise leads to effectively single-threaded behavior.

I deployed a snapshot build that emits data signals on a multi-threaded scheduler. @jmnght Can you give version 5.1.4.gh905-SNAPSHOT a try and report back whether you see a similar positive effect?

<dependency>
  <groupId>io.lettuce</groupId>
  <artifactId>lettuce-core</artifactId>
  <version>5.1.4.gh905-SNAPSHOT</version>
</dependency>

<repositories>
  <repository>
    <id>sonatype-snapshots</id>
    <name>Sonatype Snapshot Repository</name>
    <url>https://oss.sonatype.org/content/repositories/snapshots/</url>
    <snapshots>
      <enabled>true</enabled>
    </snapshots>
  </repository>
</repositories>

Emitting data signals directly:

 ~  wrk -c 4 -d 5 -t 1 http://localhost:8080/fromRedis
Running 5s test @ http://localhost:8080/fromRedis
  1 threads and 4 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    21.80ms    2.06ms  32.30ms   93.14%
    Req/Sec   183.64     14.14   202.00     76.00%
  918 requests in 5.02s, 339.13MB read
Requests/sec:    183.00
Transfer/sec:     67.60MB

Emitting data signals on a scheduler:

 ~  wrk -c 4 -d 5 -t 1 http://localhost:8080/fromRedis
Running 5s test @ http://localhost:8080/fromRedis
  1 threads and 4 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     6.09ms    3.42ms  22.06ms   64.09%
    Req/Sec   659.58    124.00     0.96k    72.00%
  3287 requests in 5.01s, 0.87GB read
Requests/sec:    656.40
Transfer/sec:    177.22MB

jmnght commented 5 years ago

@mp911de Thank you for analyzing it. I will try the SNAPSHOT version and report back. I have a few questions though:

  1. What profiler are you using? It looks like VisualVM, but the flame icons look neat.
  2. You mentioned that passing through ByteBuffer/byte[] had much better performance. What's your thought on the Lettuce thread just returning byte[] without really deserializing, and then doing the deserialization on a separate pool via a publishOn() in the controller, so that the Lettuce thread is not impacted?
  3. I'm using a dual-core MacBook Pro with a 3.1 GHz i7, but my requests/sec are mostly 50% of what you are seeing. Are you using a quad core? Any thoughts on how to debug that?

mp911de commented 5 years ago

Regarding your questions:

  1. It's YourKit.
  2. That's what happens (or will happen) in Spring Data Redis: Lettuce is used with a binary codec, and all serialization happens in Spring Data Redis itself. From all the reports I see, addressing the issue in Lettuce is the best option. This allows users to decide whether they want to offload deserialization/data signals to a separate thread or keep things as-is.
  3. See https://github.com/lettuce-io/lettuce-core/issues/905#issuecomment-440171126. If you have two cores, then you can effectively run only two threads; any thread count higher than 2 increases overhead and decreases throughput.

jmnght commented 5 years ago

@mp911de I tried the snapshot and the performance improvements I'm seeing are similar to yours. The number of requests/second handled is a little more than double what it is currently. Until this fix gets released, I think I will have to have the deserializer wrap the byte[] in a ByteBuffer and return it (without actually deserializing) so that the Lettuce thread is not blocked, unless you have another suggestion. :)

mp911de commented 5 years ago

Thanks for reporting back. There's no sensible workaround beyond consuming raw bytes, handling deserialization in your own code, and applying publishOn(Schedulers.parallel()) before deserialization.
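In plain Lettuce terms, that workaround looks roughly like this sketch (assuming an existing RedisClient client and a ByteArrayCodec connection; the String conversion stands in for real deserialization):

// Raw bytes arrive on the I/O thread; hop threads before decoding.
RedisReactiveCommands<byte[], byte[]> redis =
        client.connect(ByteArrayCodec.INSTANCE).reactive();

Mono<String> value = redis.get("key".getBytes(StandardCharsets.UTF_8))
        .publishOn(Schedulers.parallel())                         // leave the event loop first
        .map(bytes -> new String(bytes, StandardCharsets.UTF_8)); // stand-in for deserialization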

jmnght commented 5 years ago

@mp911de Thank you for adding this to the 5.1.4 milestone. For now, I've changed the usage to something like:

redisTemplate.execute(conn -> conn.hashCommands()
        .hGetAll(ByteBuffer.wrap("master-key-proto".getBytes())))
    .publishOn(Schedulers.parallel())
    .map(Map.Entry::getValue)
    .map(val -> mapper.readValue(val, Parent.class));

Throughput is close to twice as high (~400 req/sec) on version 5.1.2, but nowhere near the 3237 req/sec that you were seeing with HGETALL content as byte[].

jmnght commented 5 years ago

@mp911de, I integrated this change in the actual application, and the improvement was much lower than what I saw in the test application. Could you share the code snippet that gave you 3237 req/sec with byte[]? (Or feel free to commit it to the jmnght/lettucetest repo.) I will also try this in the actual production environment and report back on what I find.

mp911de commented 5 years ago

@jmnght I pushed the change to master and 5.1.x; you can consume the new artifacts using versions 5.2.0.BUILD-SNAPSHOT and 5.1.4.BUILD-SNAPSHOT, respectively.

When using Spring Boot, you can customize Lettuce via:

@Configuration
class LettuceCustomizer implements LettuceClientConfigurationBuilderCustomizer {

  @Override
  public void customize(LettuceClientConfigurationBuilder clientConfigurationBuilder) {
    clientConfigurationBuilder.clientOptions(ClientOptions.builder().publishOnScheduler(true).build());
  }
}
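For plain Lettuce without Spring Boot, the same option is set on the client directly:

RedisClient client = RedisClient.create("redis://localhost");
client.setOptions(ClientOptions.builder().publishOnScheduler(true).build());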

jmnght commented 5 years ago

@mp911de Any metrics on performance with the above changes?

adrianotr commented 5 years ago

I will not be able to share my code here, but I can confirm that my use case is running 4x faster with the above change. I'm using Spring Data Redis heavily in my project.

mp911de commented 5 years ago

We've already seen metrics measured from an external system perspective. Throughput scales with the number of utilized cores instead of being capped at single-thread (single-core) capacity.

From a JMH perspective, there's a notable penalty; however, users have the choice to enable or disable publishing on a scheduler:

Direct Publish:
Benchmark                               Mode  Cnt      Score       Error  Units
RedisClientBenchmark.reactiveSetBatch  thrpt    5  72636,061 ± 14977,276  ops/s

Use Publish Scheduler:
Benchmark                               Mode  Cnt      Score       Error  Units
RedisClientBenchmark.reactiveSetBatch  thrpt    5  56656,378 ± 18703,055  ops/s

I'm closing this ticket as it is resolved now.

trueinsider commented 5 years ago

@mp911de This change introduces some kind of bug where the result is lost. I was able to reproduce it with script execution:

// Requires: import static io.lettuce.core.ScriptOutputType.INTEGER;
RedisClient client1 = RedisClient.create("redis://localhost");
RedisClient client2 = RedisClient.create("redis://localhost");
client1.setOptions(ClientOptions.builder().publishOnScheduler(false).build());
client2.setOptions(ClientOptions.builder().publishOnScheduler(true).build());
RedisReactiveCommands<String, String> redis1 = client1.connect().reactive();
RedisReactiveCommands<String, String> redis2 = client2.connect().reactive();
int counter1 = 0;
int counter2 = 0;
for (int i = 0; i < 1000; i++) {
    // eval returns a Flux; next().block() waits for the single result
    if (redis1.eval("return 1", INTEGER).next().block() == null) counter1++;
    if (redis2.eval("return 1", INTEGER).next().block() == null) counter2++;
}
System.out.println(counter1);
System.out.println(counter2);

Basically, counter1 == counter2 == 0 should hold, but counter2 ends up with a value between 1 and 100 instead of 0.

Also, it would be great to be able to specify our own executor instead of Lettuce's.

mp911de commented 5 years ago

@trueinsider Can you please file a ticket for the bug report?

Re the Executor: you can do so by providing an EventExecutorGroup object.
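A minimal sketch of wiring that up (group size and names are illustrative):

// Supply a custom computation pool via ClientResources; with
// publishOnScheduler(true), data signals are emitted on this group.
// The caller remains responsible for shutting the group down.
DefaultEventExecutorGroup group = new DefaultEventExecutorGroup(4);

ClientResources resources = DefaultClientResources.builder()
        .eventExecutorGroup(group)
        .build();

RedisClient client = RedisClient.create(resources, "redis://localhost");
client.setOptions(ClientOptions.builder().publishOnScheduler(true).build());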