Open ajzach opened 3 weeks ago
@ajzach Please elaborate on the following:
Hi @ikolomi
@ajzach I am trying to reproduce, so far without success. How do you trigger the failover? aws elasticache test-failover?
@ajzach It seems we will need more info on your case, since I am not able to reproduce the problem. What I did:
./gradlew run --args="--resultsFile=output --dataSize 16 --concurrentTasks 8 --clients glide --host $CLUSTER_ENDPOINT --clientCount 400 --clusterModeEnabled --minimal 20000000 --tls"
aws elasticache test-failover --replication-group-id $CLUSTER_ID --node-group-id "0001"
Next steps: we will need some more details on your workload.
Also, can you share your code (at least the portions that deal with error handling and reconnects)? It would be very helpful for reproducing the issue.
GlideClusterClientConfiguration.GlideClusterClientConfigurationBuilder<?, ?> configBuilder =
    GlideClusterClientConfiguration.builder()
        .address(NodeAddress.builder().host(host).port(conf.getPort()).build())
        .useTLS(true)
        .readFrom(ReadFrom.PRIMARY)
        .requestTimeout(2000);
ServerCredentials credentials =
    ServerCredentials.builder().username("user").password("pass").build();
configBuilder.credentials(credentials);
return GlideClusterClient.createClient(configBuilder.build()).get();
Let me see if I understand: your application acts as a proxy and has 200 threads that serve client requests, which are tunneled over a single Glide connection? At 400K TPS across 4 instances, each instance serves 100K TPS?
Exactly: a single Glide client instantiated per node, with the default Spring Boot configuration (200 threads by default, according to the documentation); 4 instances in total, 400k rpm in total, 100k rpm per instance.
No, we catch the exceptions but do not recreate the connections; we always use the same client.
Is it possible that the client is closed during the failover, and all you need to do is re-create it? You can follow the examples to see how to handle exceptions and re-create the client if it loses its connection; this could happen during a failover, and all you need to do then is re-establish the client connection.
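For what it's worth, here is a minimal, self-contained sketch of that re-create-on-failure pattern. The `Client` interface, `getWithRecreate`, and the fake factory are placeholders of mine (not Glide APIs); in real code the factory would call `GlideClusterClient.createClient(configBuilder.build()).get()` as in the configuration snippet above.

```java
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

public class ReconnectSketch {
    // Hypothetical stand-in for a Glide client; a real factory would build a
    // GlideClusterClient via GlideClusterClient.createClient(config).get().
    interface Client {
        String get(String key) throws Exception;
        void close();
    }

    // Run the operation; on failure, close the broken client, re-create it
    // from the factory, and retry up to maxAttempts times.
    static String getWithRecreate(Supplier<Client> factory, Client[] holder,
                                  String key, int maxAttempts) throws Exception {
        for (int attempt = 1; ; attempt++) {
            try {
                return holder[0].get(key);
            } catch (Exception e) {
                if (attempt >= maxAttempts) throw e;
                holder[0].close();         // discard the broken client
                holder[0] = factory.get(); // re-establish the connection
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Fake factory: the first client always fails (simulating a failover);
        // the re-created one succeeds.
        int[] creations = {0};
        Supplier<Client> factory = () -> {
            int generation = ++creations[0];
            return new Client() {
                public String get(String key) throws Exception {
                    if (generation == 1) throw new TimeoutException("simulated failover");
                    return "value";
                }
                public void close() {}
            };
        };
        Client[] holder = {factory.get()};
        System.out.println(getWithRecreate(factory, holder, "k", 3)); // prints "value"
    }
}
```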
Could you check what application is causing the CPU spike? My concern is that your 400K requests (across 4 separate nodes) are being pushed asynchronously to the client, each with a request timeout of 2 seconds. This could cause a CPU spike on the client side while it waits for the failover to complete.
100k rpm per instance
Can you describe briefly how you are generating the 100k command requests? What does rpm mean? Requests per minute?
Is it possible that the client is closed during the failover, and all you need to do is re-create it? You can follow the examples to see how to handle exceptions and re-create the client if it loses its connection; this could happen during a failover, and all you need to do then is re-establish the client connection.
@acarbonetto The client should not close during failovers; if it does, this likely indicates a bug. Glide is designed to manage connection errors internally. If the client closes, it suggests a significant underlying issue. Have you seen that the client is being closed in your reproduction?
Can you describe briefly how you are generating the 100k command requests? What does rpm mean? Requests per minute?
We have a tool that allows us to generate traffic, rpm = requests per minute.
Could you check what application is causing the CPU spike? My concern is that your 400K requests (across 4 separate nodes) are being pushed asynchronously to the client, each with a request timeout of 2 seconds. This could cause a CPU spike on the client side while it waits for the failover to complete.
That is correct: there is an increase in CPU usage due to the failover, but it does not reach 50% and it lasted only a few seconds. We use instances with 8 cores and 30 GB of RAM. The problem is that the same application using Lettuce under the same conditions did not have issues.
With regard to creating the client again, I agree with @barshaul that it's something the client should handle internally.
I'm available to run any test you need.
the instances experience a CPU spike and stop responding
Are you observing TimeoutExceptions like https://github.com/valkey-io/valkey-glide/blob/main/examples/java/src/main/java/glide/examples/ClusterExample.java#L124-L128?
@ajzach ,
We have not yet tested with Spring Boot. Can you please share more code samples so we can recreate the issue? Maybe the fact that there are 200 threads causes some thread starvation. If you can share the code, or some more details on your Spring application, we will try to recreate it.
I don't think that the load is the issue as @ikolomi tested 9,000,000 RPM.
Hello @asafpamzn, I am going to run tests with a limited number of threads to see if that is the issue. As I mentioned above, under the same conditions, Lettuce isn't causing any problems.
I ran some tests, and it only worked correctly with 32 threads. With 40 threads, the application crashed, even when lowering the timeout to 500 ms.
Thanks a lot @ajzach for helping us to improve GLIDE,
The problem might be that the GLIDE core (Rust) thread is starved. @acarbonetto, can you please check it out? Try to recreate with Spring and many threads.
@ajzach, is there a way for you to monitor the Rust core threads and maybe give the thread real-time priority (if you are using Linux)?
The best way for us to reproduce is to get some code samples and better understanding of the env.
@acarbonetto can you please guide @ajzach
@ajzach we configured a custom thread executor for performance testing that re-queues rejected tasks by blocking the submitter. That way you don't need to limit the number of threads.
Would you please be able to try this and let me know if it works?
import java.util.concurrent.ExecutorService;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

ExecutorService executor =
    new ThreadPoolExecutor(
        0,                 // core pool size
        Integer.MAX_VALUE, // maximum pool size: spawn threads on demand
        60L,
        TimeUnit.SECONDS,  // idle threads are reclaimed after 60 seconds
        new SynchronousQueue<Runnable>(),
        (r, poolExecutor) -> {
            // Rejected-execution handler: instead of dropping the task,
            // block the submitting thread until the task can be handed off.
            if (!poolExecutor.isShutdown()) {
                try {
                    poolExecutor.getQueue().put(r);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // restore interrupt status
                    throw new RuntimeException("interrupted", e);
                }
            }
        });
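As a self-contained illustration of that caller-blocks pattern, here is a runnable demo. The class name, and the deliberately tiny pool and queue sizes, are mine, chosen so the rejection handler actually fires (with `Integer.MAX_VALUE` max threads, as above, rejection is rare):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class BackpressureExecutorDemo {
    public static void main(String[] args) throws Exception {
        // Tiny pool and queue: once both are full, the rejection handler
        // blocks the submitting thread in put() instead of dropping the task.
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
                2, 2, 60L, TimeUnit.SECONDS,
                new ArrayBlockingQueue<>(2),
                (r, pool) -> {
                    if (!pool.isShutdown()) {
                        try {
                            pool.getQueue().put(r); // block until space frees up
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                            throw new RejectedExecutionException("interrupted", e);
                        }
                    }
                });

        CountDownLatch done = new CountDownLatch(20);
        for (int i = 0; i < 20; i++) {
            executor.execute(() -> {
                try {
                    Thread.sleep(10); // simulate a slow request
                } catch (InterruptedException ignored) {
                    Thread.currentThread().interrupt();
                }
                done.countDown();
            });
        }
        boolean finished = done.await(5, TimeUnit.SECONDS);
        executor.shutdown();
        System.out.println(finished ? "all 20 tasks completed" : "timed out");
    }
}
```

Despite submitting 20 tasks against a capacity of 4 (2 workers plus a queue of 2), nothing is dropped; submitters simply stall until the pool drains, which is the backpressure behavior you want instead of limiting the thread count.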
@ajzach, in order to speed up the process we would be happy to meet and schedule a debug session. Unfortunately, we cannot reproduce the issue.
Describe the bug
There is a problem when performing a failover in ElastiCache: the instances experience a CPU spike and stop responding. I have an application that is essentially a proxy, running on 4 m6g.2xlarge instances. In my first test, at 400k RPM, all instances stopped responding when the failover was executed. At 100k RPM, 3 out of 5 instances stopped responding. At 400k RPM, the CPU of the instances (before the failover) was at 30%; at 100k RPM, at 10%. None of the instances had memory issues.
I performed the same test using Lettuce (https://github.com/redis/lettuce), and although there is also a spike in CPU usage, all instances continued to function correctly.
Expected Behavior
The instances should be able to continue handling requests.
Current Behavior
The instances experience a CPU spike and stop responding to the health check; therefore, they are replaced.
Reproduction Steps
Run an application that is basically a proxy at 400k RPM and perform a failover in ElastiCache.
Possible Solution
No response
Additional Information/Context
No response
Client version used
Java 1.0.1
Engine type and version
Redis 6.2.6
OS
Linux
Language
Python
Language Version
Java 17
Cluster information
Cluster mode: one node with replica
Logs
No response
Other information
In "Language," select Python, but the application is in Java (there is no option to select Java).