@mp911de I'm seeing a similar issue with Reactor Netty + WebFlux + Reactor + Lettuce. I was stress-testing my application, and if I send 100 requests to my web server, the 4 Netty threads are busy processing (essentially calling subscribe on) all of the requests. The single Lettuce thread responsible for fetching data from Redis doesn't really get its turn until the 4 Netty threads have subscribed to almost all of the 100 requests. The Lettuce thread gets to run a few times during this period, but once the flood of requests has been subscribed, Lettuce picks up speed. It seems unfair in the sense that if requests keep coming in, there is a significant impact on request processing times. In a real production environment this would create a huge delay in processing.
First test results:
Synchronous/Spring Boot/Tomcat
$ wrk -c 4 -d 10 -t 4 http://localhost:8080
Running 10s test @ http://localhost:8080
4 threads and 4 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 706.04us 410.71us 10.01ms 96.65%
Req/Sec 1.47k 221.33 1.61k 94.06%
59167 requests in 10.10s, 10.05MB read
Requests/sec: 5856.68
Transfer/sec: 1.00MB
$ wrk -c 400 -d 10 -t 4 http://localhost:8080
Running 10s test @ http://localhost:8080
4 threads and 400 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 42.26ms 12.08ms 68.05ms 71.66%
Req/Sec 2.01k 219.85 2.69k 90.00%
80186 requests in 10.02s, 13.62MB read
Socket errors: connect 0, read 371, write 0, timeout 0
Requests/sec: 8005.37
Transfer/sec: 1.36MB
Reactive/Spring Boot/WebFlux
$ wrk -c 4 -d 10 -t 4 http://localhost:8080
Running 10s test @ http://localhost:8080
4 threads and 4 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 772.73us 408.56us 12.88ms 95.66%
Req/Sec 1.34k 225.01 1.60k 86.88%
53713 requests in 10.10s, 6.35MB read
Requests/sec: 5317.69
Transfer/sec: 643.96KB
$ wrk -c 400 -d 10 -t 4 http://localhost:8080
Running 10s test @ http://localhost:8080
4 threads and 400 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 53.91ms 15.57ms 125.04ms 69.35%
Req/Sec 1.59k 300.71 2.02k 63.00%
63762 requests in 10.07s, 7.54MB read
Socket errors: connect 0, read 384, write 0, timeout 0
Requests/sec: 6332.87
Transfer/sec: 766.87KB
Application under test: https://gist.github.com/mp911de/ba9c46fe8e8bbe4558ca445e996f108a
Conclusion: Cannot reproduce significant difference between reactive and synchronous processing times. Higher load (400 concurrent requests) results in a difference of 10ms.
Please provide a reproducer, otherwise we cannot trace the issue.
As you asked, I have created a sample project based on your demo application to show the Lettuce performance issue: https://github.com/jmnght/lettucetest It uses embedded Redis and exposes two endpoints: the first just returns a list of items, the second gets the list of items from Redis and returns it. When I run JMeter, the application handles around 4000 requests/sec when not using the /getFromRedis endpoint; when getting from Redis, it drops to ~100 req/sec. @mp911de Can you take a look and provide some feedback?
$ wrk -c 4 -d 10 -t 4 http://localhost:8080/fromRedis
Running 10s test @ http://localhost:8080/fromRedis
4 threads and 4 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 20.11ms 84.63ms 615.90ms 95.13%
Req/Sec 271.16 247.91 0.90k 56.36%
1952 requests in 10.10s, 352.75KB read
Socket errors: connect 0, read 0, write 0, timeout 12
Requests/sec: 193.21
Transfer/sec: 34.92KB
$ wrk -c 4 -d 10 -t 4 http://localhost:8080
Running 10s test @ http://localhost:8080
4 threads and 4 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 529.69us 1.03ms 32.24ms 96.46%
Req/Sec 2.44k 382.23 3.19k 75.00%
98242 requests in 10.10s, 17.33MB read
Requests/sec: 9723.83
Transfer/sec: 1.72MB
Here are my test results with the code repo you provided:
$ wrk -c 4 -d 10 -t 4 http://localhost:8080
Running 10s test @ http://localhost:8080
4 threads and 4 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 1.11ms 8.28ms 130.24ms 98.83%
Req/Sec 3.51k 527.91 3.89k 95.00%
139963 requests in 10.10s, 33.77MB read
Requests/sec: 13857.60
Transfer/sec: 3.34MB
$ wrk -c 4 -d 10 -t 4 http://localhost:8080/fromRedis
Running 10s test @ http://localhost:8080/fromRedis
4 threads and 4 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 1.39ms 2.18ms 44.80ms 98.14%
Req/Sec 848.70 131.12 1.13k 71.50%
33810 requests in 10.01s, 8.16MB read
Requests/sec: 3377.87
Transfer/sec: 834.60KB
After removing the System.out calls:
$ wrk -c 4 -d 10 -t 4 http://localhost:8080/fromRedis
Running 10s test @ http://localhost:8080/fromRedis
4 threads and 4 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 1.47ms 8.46ms 131.72ms 98.59%
Req/Sec 1.78k 317.14 2.05k 92.68%
70206 requests in 10.01s, 16.94MB read
Requests/sec: 7016.71
Transfer/sec: 1.69MB
These numbers seem pretty reasonable to me: 7000 operations per second on a single machine. Not sure why you see a slowdown to 190 operations per second. After I removed the System.out calls, throughput almost doubled. It's not healthy to use System.out because all output goes through PrintStream, which is synchronized.
@mp911de What machine are you running it on? The most I'm getting is:
➜ bin wrk -c 4 -d 10 -t 4 http://localhost:8080/fromRedis
Running 10s test @ http://localhost:8080/fromRedis
4 threads and 4 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 11.09ms 17.88ms 214.67ms 97.28%
Req/Sec 115.18 30.45 170.00 63.38%
4584 requests in 10.06s, 1.11MB read
Requests/sec: 455.53
Transfer/sec: 112.60KB
➜ bin wrk -c 4 -d 10 -t 4 http://localhost:8080/
Running 10s test @ http://localhost:8080/
4 threads and 4 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 1.03ms 2.82ms 56.02ms 95.63%
Req/Sec 1.93k 474.56 2.73k 76.75%
76931 requests in 10.01s, 18.56MB read
Requests/sec: 7686.60
Transfer/sec: 1.85MB
Just a regular MacBook (2,8 GHz Intel Core i7), nothing crazy. Do you have a profiler to measure what's causing the slowdown?
@mp911de I'm running on a 3.1 GHz i7 MacBook, dual core. Not sure what was causing the slowdown, but redoing the test after restarting the laptop gave me ~4900 requests/sec.
➜ ~ wrk -c 4 -d 10 -t 4 http://localhost:8080/fromRedis
Running 10s test @ http://localhost:8080/fromRedis
4 threads and 4 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 816.76us 376.35us 15.16ms 97.18%
Req/Sec 1.25k 106.36 1.51k 79.70%
50095 requests in 10.10s, 12.09MB read
Requests/sec: 4959.46
Transfer/sec: 1.20MB
But to follow up on the actual issue: I updated the response size to ~300KB and the req/sec dropped to ~100, which is what I'm seeing on our app servers. Is that expected? I have updated the sample code here: https://github.com/jmnght/lettucetest
Results:
➜ ~ wrk -c 4 -d 10 -t 4 http://localhost:8080/fromRedis
Running 10s test @ http://localhost:8080/fromRedis
4 threads and 4 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 35.57ms 19.86ms 139.73ms 92.87%
Req/Sec 30.30 7.86 50.00 57.88%
1214 requests in 10.07s, 448.51MB read
Requests/sec: 120.53
Transfer/sec: 44.53MB
Thanks!
Thanks a lot for the additional details, @jmnght. Increasing the payload size really makes a difference and helps to reveal code that does not scale well.
I spotted several things here and tried different approaches:
Jackson requires a lot of time to decode binary data into JSON objects.
I tried to simplify things by replacing Jackson with plain String. Things improve, but by doing so I found a performance issue in Spring WebFlux. The throughput isn't perfect, but it improved.
Another simplification is passing objects through as ByteBuffer/byte[]. This is by far the fastest variant.
I attached my wrk results below.
Okay, so what's the conclusion? I think the combination of an expensive serialization mechanism paired with the single-threaded nature of a single connection has a measurable impact. I will do further experiments with multi-threaded scenarios to see how we can improve here.
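For illustration, here is a minimal sketch of what the byte[] pass-through variant could look like (my own assumption, not the exact code from the gist or the test repository; it uses a plain GET as a stand-in for the HGETALL in the test project):

import io.lettuce.core.RedisClient;
import io.lettuce.core.api.reactive.RedisReactiveCommands;
import io.lettuce.core.codec.ByteArrayCodec;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;
import reactor.core.publisher.Mono;

// Returns the raw Redis value as byte[], so neither the Lettuce I/O thread nor WebFlux spends time in Jackson.
@RestController
class RawRedisController {

    private final RedisReactiveCommands<byte[], byte[]> redis = RedisClient
            .create("redis://localhost")
            .connect(ByteArrayCodec.INSTANCE)
            .reactive();

    @GetMapping("/fromRedis")
    Mono<byte[]> fromRedis() {
        // Hypothetical key name, reused from the snippet later in this thread.
        return redis.get("master-key-proto".getBytes());
    }
}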
Using Jackson (original code at https://github.com/jmnght/lettucetest/commit/6f2c221d85f91722106f83b013ecca74217ce0ec)
~ wrk -c 4 -d 5 -t 1 http://localhost:8080/fromRedis
Running 5s test @ http://localhost:8080/fromRedis
1 threads and 4 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 21.12ms 1.14ms 28.53ms 93.99%
Req/Sec 189.62 6.54 202.00 72.00%
948 requests in 5.02s, 350.21MB read
Requests/sec: 188.94
Transfer/sec: 69.80MB
Write HGETALL content as String:
~ wrk -c 4 -d 5 -t 1 http://localhost:8080/fromRedis
Running 5s test @ http://localhost:8080/fromRedis
1 threads and 4 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 7.24ms 0.88ms 13.20ms 85.06%
Req/Sec 553.86 33.64 610.00 74.51%
2812 requests in 5.10s, 1.01GB read
Requests/sec: 551.37
Transfer/sec: 203.75MB
Write HGETALL content as byte[]:
~ wrk -c 4 -d 5 -t 1 http://localhost:8080/fromRedis
Running 5s test @ http://localhost:8080/fromRedis
1 threads and 4 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 1.19ms 225.60us 4.82ms 83.02%
Req/Sec 3.25k 190.54 3.44k 88.24%
16511 requests in 5.10s, 5.96GB read
Requests/sec: 3237.87
Transfer/sec: 1.17GB
I introduced a Scheduler option to Lettuce so that data signals are dispatched on ClientResources#eventExecutorGroup() for now. I think I will make this option configurable via ClientOptions, so a specific client instance can either emit items directly (default) or use a different scheduler. So basically, we're re-introducing multi-threading with this change, whereas a single connection effectively leads to single-threaded-like behavior.
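A minimal sketch of how this option is meant to be used on a plain RedisClient (the option name follows the snippet shared later in this thread; treat it as illustrative):

import io.lettuce.core.ClientOptions;
import io.lettuce.core.RedisClient;
import io.lettuce.core.api.reactive.RedisReactiveCommands;

public class PublishOnSchedulerExample {

    public static void main(String[] args) {
        RedisClient client = RedisClient.create("redis://localhost");
        // Emit data signals on ClientResources#eventExecutorGroup() instead of on the I/O thread.
        client.setOptions(ClientOptions.builder().publishOnScheduler(true).build());

        RedisReactiveCommands<String, String> reactive = client.connect().reactive();
        System.out.println(reactive.get("key").block());

        client.shutdown();
    }
}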
I deployed a snapshot build that emits data signals on a multi-threaded scheduler.
@jmnght Can you give version 5.1.4.gh905-SNAPSHOT a try and report back whether you see a similar positive effect?
<dependency>
  <groupId>io.lettuce</groupId>
  <artifactId>lettuce-core</artifactId>
  <version>5.1.4.gh905-SNAPSHOT</version>
</dependency>

<repositories>
  <repository>
    <id>sonatype-snapshots</id>
    <name>Sonatype Snapshot Repository</name>
    <url>https://oss.sonatype.org/content/repositories/snapshots/</url>
    <snapshots>
      <enabled>true</enabled>
    </snapshots>
  </repository>
</repositories>
Emitting data signals directly:
~ wrk -c 4 -d 5 -t 1 http://localhost:8080/fromRedis
Running 5s test @ http://localhost:8080/fromRedis
1 threads and 4 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 21.80ms 2.06ms 32.30ms 93.14%
Req/Sec 183.64 14.14 202.00 76.00%
918 requests in 5.02s, 339.13MB read
Requests/sec: 183.00
Transfer/sec: 67.60MB
Emitting data signals on a scheduler:
~ wrk -c 4 -d 5 -t 1 http://localhost:8080/fromRedis
Running 5s test @ http://localhost:8080/fromRedis
1 threads and 4 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 6.09ms 3.42ms 22.06ms 64.09%
Req/Sec 659.58 124.00 0.96k 72.00%
3287 requests in 5.01s, 0.87GB read
Requests/sec: 656.40
Transfer/sec: 177.22MB
@mp911de Thank you for analyzing it. I will try the SNAPSHOT version and report back. I have a few questions though:
Regarding your questions:
@mp911de I tried the snapshot and the performance improvements I'm seeing are similar to yours. The number of requests per second handled is a little more than double what it is currently. Until this fix gets released, I think I will have to have the deserializer wrap the byte[] in a ByteBuffer and return it (without actually deserializing) so that the Lettuce thread is not blocked, unless you have another suggestion. :)
Thanks for reporting back. There's no sensible workaround beyond raw consumption of bytes, handling deserialization in your own code, and applying publishOn(Schedulers.parallel()) before deserialization.
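A sketch of that workaround under my assumptions (ByteArrayCodec, Jackson deserialization moved onto Reactor's parallel scheduler; Parent is the Jackson-mappable domain type from the test repository):

import com.fasterxml.jackson.databind.ObjectMapper;
import io.lettuce.core.RedisClient;
import io.lettuce.core.api.reactive.RedisReactiveCommands;
import io.lettuce.core.codec.ByteArrayCodec;
import reactor.core.publisher.Mono;
import reactor.core.scheduler.Schedulers;

// Fetch raw bytes, hop off the Lettuce I/O thread, then deserialize with Jackson.
class RawBytesWorkaround {

    private final ObjectMapper mapper = new ObjectMapper();
    private final RedisReactiveCommands<byte[], byte[]> redis = RedisClient
            .create("redis://localhost")
            .connect(ByteArrayCodec.INSTANCE)
            .reactive();

    Mono<Parent> read(String key) {
        return redis.get(key.getBytes())
                .publishOn(Schedulers.parallel()) // deserialization no longer runs on the I/O thread
                .map(this::deserialize);
    }

    private Parent deserialize(byte[] value) {
        try {
            return mapper.readValue(value, Parent.class); // Parent: domain type from the test repo
        } catch (java.io.IOException e) {
            throw new IllegalStateException(e);
        }
    }
}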
@mp911de Thank you for adding this to 5.1.4 milestone. For now, I've changed usage to something like:
redisTemplate.execute(conn -> conn.hashCommands().hGetAll(ByteBuffer.wrap("master-key-proto".getBytes())))
        .publishOn(Schedulers.parallel())
        .map(Map.Entry::getValue)
        .map(val -> mapper.readValue(val, Parent.class));
Throughput on version 5.1.2 is now close to twice as high (~400 req/sec), but nowhere near the 3237 req/sec you were seeing with HGETALL content as byte[].
@mp911de I integrated this change into the actual application and the improvement was much lower than what I saw in the test application. Could you share the code snippet that gave you 3237 req/sec with byte[]? (Or feel free to commit it to the jmnght/lettucetest repo.) I will also try this in the actual production environment and report back on what I find.
@jmnght I pushed the change to master and 5.1.x, and you can consume the new artifacts using versions 5.2.0.BUILD-SNAPSHOT and 5.1.4.BUILD-SNAPSHOT, respectively.
When using Spring Boot, you can customize Lettuce via:
import io.lettuce.core.ClientOptions;
import org.springframework.boot.autoconfigure.data.redis.LettuceClientConfigurationBuilderCustomizer;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.redis.connection.lettuce.LettuceClientConfiguration.LettuceClientConfigurationBuilder;

@Configuration
class LettuceCustomizer implements LettuceClientConfigurationBuilderCustomizer {

    @Override
    public void customize(LettuceClientConfigurationBuilder clientConfigurationBuilder) {
        clientConfigurationBuilder.clientOptions(ClientOptions.builder().publishOnScheduler(true).build());
    }
}
@mp911de Any metrics on performance with the above changes?
I will not be able to share my code here, but I can confirm that my use case is running 4x faster with the above change. I'm heavily using Spring Data Redis in my project.
We've already seen metrics measured from an external-system perspective: throughput scales with the number of utilized cores instead of being capped at single-thread (single-core) capacity.
From a JMH perspective there's a notable penalty; however, users have the choice to enable or disable publish-on-scheduler (a sketch of such a benchmark follows the numbers below):
Direct Publish:
Benchmark Mode Cnt Score Error Units
RedisClientBenchmark.reactiveSetBatch thrpt 5 72636,061 ± 14977,276 ops/s
Use Publish Scheduler:
Benchmark Mode Cnt Score Error Units
RedisClientBenchmark.reactiveSetBatch thrpt 5 56656,378 ± 18703,055 ops/s
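For context, a hypothetical JMH harness in the spirit of that benchmark could look like the following; the actual RedisClientBenchmark lives in the Lettuce repository and may differ:

import io.lettuce.core.RedisClient;
import io.lettuce.core.api.reactive.RedisReactiveCommands;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.TearDown;
import reactor.core.publisher.Flux;

@State(Scope.Benchmark)
public class ReactiveSetBatchBenchmark {

    private RedisClient client;
    private RedisReactiveCommands<String, String> reactive;

    @Setup
    public void setUp() {
        client = RedisClient.create("redis://localhost");
        reactive = client.connect().reactive();
    }

    @TearDown
    public void tearDown() {
        client.shutdown();
    }

    @Benchmark
    public void reactiveSetBatch() {
        // Issue a small batch of SETs through the reactive API and wait for all of them to complete.
        Flux.range(0, 20).flatMap(i -> reactive.set("key-" + i, "value")).blockLast();
    }
}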
I'm closing this ticket as it is resolved now.
@mp911de This change introduces some kind of bug where the result is lost. I was able to reproduce this with script execution:
import static io.lettuce.core.ScriptOutputType.INTEGER;
import static java.lang.System.out;

import io.lettuce.core.ClientOptions;
import io.lettuce.core.RedisClient;
import io.lettuce.core.api.reactive.RedisReactiveCommands;

RedisClient client1 = RedisClient.create("redis://localhost");
RedisClient client2 = RedisClient.create("redis://localhost");
client1.setOptions(ClientOptions.builder().publishOnScheduler(false).build());
client2.setOptions(ClientOptions.builder().publishOnScheduler(true).build());

RedisReactiveCommands<String, String> redis1 = client1.connect().reactive();
RedisReactiveCommands<String, String> redis2 = client2.connect().reactive();

int counter1 = 0;
int counter2 = 0;
for (int i = 0; i < 1000; i++) {
    if (redis1.eval("return 1", INTEGER).next().block() == null) counter1++;
    if (redis2.eval("return 1", INTEGER).next().block() == null) counter2++;
}

out.println(counter1);
out.println(counter2);
Basically, counter1 == counter2 == 0 should be true, but counter2 ends up somewhere between 1 and 100 instead of 0.
Also, it would be great to be able to specify our own executor, instead of Lettuce's one.
@trueinsider Can you please file a ticket for the bug report?
Re Executor: you can do so by providing an EventExecutorGroup object.
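A sketch of what that could look like, under the assumption that DefaultClientResources.Builder#eventExecutorGroup(...) is the intended hook (verify against your Lettuce version):

import io.lettuce.core.RedisClient;
import io.lettuce.core.resource.ClientResources;
import io.lettuce.core.resource.DefaultClientResources;
import io.netty.util.concurrent.DefaultEventExecutorGroup;

public class CustomExecutorExample {

    public static void main(String[] args) {
        // Supply our own executor group; data signals emitted on a scheduler use
        // ClientResources#eventExecutorGroup(), so this executor group handles them.
        ClientResources resources = DefaultClientResources.builder()
                .eventExecutorGroup(new DefaultEventExecutorGroup(8))
                .build();

        RedisClient client = RedisClient.create(resources, "redis://localhost");
        // ... use the client ...
        client.shutdown();
    }
}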
Motivated by https://stackoverflow.com/questions/53001283/reactive-redis-lettuce-always-publishing-to-single-thread, we should investigate (benchmark) thread switching for response processing.
In a reactive/async scenario where responses are processed on an I/O thread, congestion may occur, and we should create various scenarios to measure the impact and how to address it.