valkey-io / valkey-glide

An open source Valkey client library that supports Valkey and Redis OSS 6.2, 7.0, and 7.2. Valkey GLIDE is designed for reliability, optimized performance, and high availability for Valkey- and Redis-OSS-based applications. GLIDE is a multi-language client library, written in Rust with programming-language bindings such as Java and Python.
Apache License 2.0

Issues with the client when performing a failover in ElastiCache #2154

Open ajzach opened 3 weeks ago

ajzach commented 3 weeks ago

Describe the bug

There is a problem when performing a failover in ElastiCache: the instances experience a CPU spike and stop responding. I have an application that is essentially a proxy, running on 4 m6g.2xlarge instances. In my first test at 400k RPM, all instances stopped responding when the failover was executed. At 100k RPM, 3 out of 5 instances stopped responding. At 400k RPM, the instances' CPU (before the failover) was at 30%; at 100k RPM, at 10%. None of the instances had memory issues.

I performed the same test using Lettuce (https://github.com/redis/lettuce), and although there is also a spike in CPU usage, all instances continued to function correctly.

Expected Behavior

The instances must be able to continue handling the requests.

Current Behavior

The instances experience a CPU spike and stop responding to the health check; therefore, they are replaced.

Reproduction Steps


Run an application that is basically a proxy, at 400k RPM, and perform a failover in ElastiCache.

Possible Solution

No response

Additional Information/Context

No response

Client version used

Java 1.0.1

Engine type and version

Redis 6.2.6

OS

Linux

Language

Python

Language Version

Java 17

Cluster information

Cluster mode: one node with replica

Logs

No response

Other information

Under "Language" I selected Python, but the application is in Java (there is no option to select Java).

ikolomi commented 3 weeks ago

@ajzach Please elaborate on the following:

  1. ElastiCache cluster details - Number of shards, replicas, type of instances, multiAZ, TLS, etc?
  2. Workload details - types of commands, key space, data sizes, etc?
ajzach commented 3 weeks ago

Hi @ikolomi

  1. 1 shard with 1 replica, cluster mode, cache.m6g.large and TLS
  2. We always executed the same command, with a 60s TTL: SET key1 xxxxxxxx
ikolomi commented 3 weeks ago

@ajzach I am trying to reproduce, so far without success. How do you trigger the failover? aws elasticache test-failover ?

ikolomi commented 3 weeks ago

@ajzach It seems that we will need more info on your case, since I am not able to reproduce the problem. What I did:

  1. Created an EC cluster with one shard and one replica. Version: 6.2.6, TLS, type: cache.m6g.large
  2. Used a c5.xlarge as the loader instance
  3. Modified the Java benchmark app (java/benchmarks/src/main/java/glide/benchmarks/BenchmarkingApp.java) to do only SETs with the params you described above and to accept the number of commands via a --minimal param
  4. Ran the benchmark with 400 connections over 8 threads, achieving ~150K TPS:
    ./gradlew run --args="--resultsFile=output --dataSize 16 --concurrentTasks 8 --clients glide --host $CLUSTER_ENDPOINT --clientCount 400 --clusterModeEnabled --minimal 20000000 --tls"
  5. Triggered the failover using:
    aws elasticache test-failover --replication-group-id $CLUSTER_ID --node-group-id "0001"
  6. Observed that the benchmark app completed without an increase in CPU
  7. Reran the benchmark during the failover a number of times: no CPU hogging, instance responsive
  8. Reran the benchmark after the failover completed: no CPU hogging, instance responsive. Got the same TPS as in step (4)
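For reference, the SET-only load in steps 3-4 can be sketched with plain JDK concurrency. This is a hypothetical stand-in, not the benchmark code: `AsyncSetClient` below substitutes for the real GLIDE client, and the stub resolves every SET immediately.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicLong;

public class SetOnlyLoad {
    // Stand-in for an async client; the real benchmark drives GLIDE's set().
    interface AsyncSetClient {
        CompletableFuture<String> set(String key, String value);
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(8); // 8 worker threads, as in step 4
        AtomicLong completed = new AtomicLong();
        AsyncSetClient client = (k, v) ->
                CompletableFuture.supplyAsync(() -> "OK", pool); // stubbed server response

        int totalCommands = 10_000; // the real run used --minimal 20000000
        String payload = "x".repeat(16); // --dataSize 16
        CompletableFuture<?>[] inFlight = new CompletableFuture<?>[totalCommands];
        for (int i = 0; i < totalCommands; i++) {
            // Fire-and-track: every request is pushed asynchronously, as in the benchmark.
            inFlight[i] = client.set("key1", payload)
                    .thenRun(completed::incrementAndGet);
        }
        CompletableFuture.allOf(inFlight).join(); // wait for all SETs to resolve
        pool.shutdown();
        System.out.println("completed=" + completed.get());
    }
}
```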

Next steps: we will need some more details on your workload:

  1. Per instance, how many connections do you create?
  2. Per instance, how many processes/threads use the connections?
  3. What are the settings for the connections (GlideClientConfiguration)?
  4. How does the code handle exceptions? Are the connections recreated?

Also, can you share your code (at least the portions that deal with error handling and reconnects)? It could be really helpful for reproducing the issue.

ajzach commented 3 weeks ago

  1. One connection/client per instance.
  2. It is a Spring application; I use the default configuration, so I assume there are 200 threads.
  3. The client is configured as follows:

      GlideClusterClientConfiguration.GlideClusterClientConfigurationBuilder<?, ?> configBuilder =
              GlideClusterClientConfiguration.builder()
                  .address(NodeAddress.builder().host(host).port(conf.getPort()).build())
                  .useTLS(true)
                  .readFrom(ReadFrom.PRIMARY)
                  .requestTimeout(2000);

          ServerCredentials credentials =
              ServerCredentials.builder().username("user").password("pass").build();
          configBuilder.credentials(credentials);

          return GlideClusterClient.createClient(configBuilder.build()).get();

  4. No, we catch the exceptions but do not recreate the connections; we always use the same client.
ikolomi commented 3 weeks ago

Let me see if I understand: your application acts as a proxy, and it has 200 threads that serve client requests, which are tunneled over a single Glide connection? At 400K TPS and 4 instances, each instance serves 100K TPS?

ajzach commented 3 weeks ago

Exactly: a single Glide client instantiated per node, with the default configuration in Spring Boot (200 threads by default, according to the documentation); 4 instances in total, 400k RPM in total, 100k RPM per instance.

acarbonetto commented 2 weeks ago

No, we catch the exceptions but do not recreate the connections; we always use the same client.

Is it possible that the client is closed during the failover, and all you need to do is re-create it? You can follow the examples to see how to handle exceptions and re-create the client if it loses its connection.
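The suggested pattern (catch the error that indicates the client has been closed, then rebuild the client from its factory and retry) can be sketched as follows. This is a JDK-only illustration: `Client` and `ClientClosedException` are hypothetical stand-ins, not GLIDE types, and a real implementation would also bound the retries.

```java
import java.util.function.Supplier;

public class ReconnectingWrapper {
    // Hypothetical stand-ins: in a real app these would be the GLIDE client
    // and the exception it throws once it has been closed.
    static class ClientClosedException extends RuntimeException {}
    interface Client { String set(String key, String value); }

    private final Supplier<Client> factory;
    private volatile Client client;

    ReconnectingWrapper(Supplier<Client> factory) {
        this.factory = factory;
        this.client = factory.get();
    }

    // Execute a command; if the client reports it is closed, rebuild it once and retry.
    String setWithRetry(String key, String value) {
        try {
            return client.set(key, value);
        } catch (ClientClosedException e) {
            client = factory.get(); // re-create the client, as suggested above
            return client.set(key, value);
        }
    }

    public static void main(String[] args) {
        // Demo: the first call throws as if the client had closed; the retry succeeds.
        int[] calls = {0};
        Supplier<Client> factory = () -> (k, v) -> {
            if (calls[0]++ == 0) throw new ClientClosedException();
            return "OK";
        };
        ReconnectingWrapper w = new ReconnectingWrapper(factory);
        System.out.println(w.setWithRetry("key1", "value"));
    }
}
```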

acarbonetto commented 2 weeks ago

Could you check which application is causing the CPU spike? My concern is that your 400K requests (from 4 separate nodes) are being pushed asynchronously to the client, and each has a request timeout of 2 seconds. This could cause a CPU spike on the client side while it waits for the failover to complete.

acarbonetto commented 2 weeks ago

100k rpm per instance

Can you describe a little how you are setting up 100k command requests? What does rpm mean? Requests per minute?

barshaul commented 2 weeks ago

Is it possible that the client is closed during failover and all you need to do is re-create the client? You can follow examples to see how to handle exception handling and re-create the client if is loses connection. This could happen during failover, and all you need to do is re-establish the client connection.

@acarbonetto The client should not close during failovers; if it does, this likely indicates a bug. Glide is designed to manage connection errors internally. If the client closes, it suggests a significant underlying issue. Have you seen that the client is being closed in your reproduction?

ajzach commented 2 weeks ago

Can you describe a little how you are setting up 100k command requests? What does rpm mean? Requests per minute?

We have a tool that allows us to generate traffic; rpm = requests per minute.
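For context (my arithmetic, derived only from the 400k-total / 4-instance figures stated in the thread), the RPM numbers translate to a much lower per-second rate than the "400K TPS" reading above:

```java
public class RpmToRps {
    public static void main(String[] args) {
        long totalRpm = 400_000;   // total load across the fleet, per the report
        int instances = 4;         // proxy instances
        long perInstanceRpm = totalRpm / instances;
        // Integer division is fine here; we only need the order of magnitude.
        System.out.println("total rps ~ " + totalRpm / 60);
        System.out.println("per-instance rps ~ " + perInstanceRpm / 60);
    }
}
```

So each proxy instance handles on the order of 1.7K requests per second, not 100K.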

Could you check what application is causing the CPU spike? My concern is that your 400K requests (by 4 separate nodes) are being pushed asynchronously to the client and each has a request timeout of 2 seconds. This could cause a CPU spike on the client-side, while it waits for the failover to complete.

That is correct: there is an increase in CPU usage due to the failover, but it does not reach 50% and it lasted only a few seconds. We use instances with 8 cores and 30GB of RAM. The problem is that the same application using Lettuce under the same conditions did not have issues.

With regard to creating the client again, I agree with @barshaul that it's something the client should handle internally.

I'm available to run any test you need.

acarbonetto commented 2 weeks ago

the instances experience a CPU spike and stop responding

Are you observing TimeoutExceptions like https://github.com/valkey-io/valkey-glide/blob/main/examples/java/src/main/java/glide/examples/ClusterExample.java#L124-L128?

asafpamzn commented 1 week ago

@ajzach,

We have not yet tested with Spring Boot. Can you please share more code samples so we can reproduce? Maybe the fact that there are 200 threads causes some thread starvation. If you can share the code or some more details on your Spring application, we will try to reproduce.

I don't think the load is the issue, as @ikolomi tested at 9,000,000 RPM.

ajzach commented 1 week ago

Hello @asafpamzn, I am going to run tests with a limited number of threads to see if that is the issue. As I mentioned above, under the same conditions, Lettuce isn't causing any problems.

ajzach commented 1 week ago

I ran some tests, and it only worked correctly with 32 threads. With 40 threads the application crashes, even when lowering the timeout to 500 ms.

asafpamzn commented 6 days ago

Thanks a lot @ajzach for helping us to improve GLIDE,

The problem might be that the GLIDE core (Rust) thread is starved. @acarbonetto, can you please check it out? Try to reproduce with Spring and many threads.

@ajzach, is there a way for you to monitor the Rust core threads and maybe give their thread IDs real-time priority (if you are using Linux)?
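One JDK-only way to start looking for those threads from the Java side is to enumerate the JVM-visible threads, sketched below. Note the caveats: this is my own suggestion, natively attached (Rust/Tokio) threads may not appear in this list at all, and on Linux /proc/&lt;pid&gt;/task is the authoritative list of the process's OS threads for use with priority tools.

```java
public class ListThreads {
    public static void main(String[] args) {
        // Print every JVM-visible thread's id and name. If the GLIDE core
        // threads are visible here, they would show up by name; inspect the
        // output rather than assuming any particular naming scheme.
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            System.out.println(t.getId() + "\t" + t.getName());
        }
    }
}
```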

The best way for us to reproduce is to get some code samples and a better understanding of the environment.

@acarbonetto, can you please guide @ajzach?

acarbonetto commented 5 days ago

@ajzach, for performance testing we configured a custom thread executor that re-queues rejected tasks. That way you don't need to limit the number of threads.

Would you please be able to try this and let me know if it works?

see: https://github.com/valkey-io/valkey-glide/blob/main/java/benchmarks/src/main/java/glide/benchmarks/utils/Benchmarking.java#L116-L131

            import java.util.concurrent.ExecutorService;
            import java.util.concurrent.SynchronousQueue;
            import java.util.concurrent.ThreadPoolExecutor;
            import java.util.concurrent.TimeUnit;

            ExecutorService executor =
                    new ThreadPoolExecutor(
                            0,                 // corePoolSize: no idle core threads
                            Integer.MAX_VALUE, // maximumPoolSize: effectively unbounded
                            60L,
                            TimeUnit.SECONDS,  // idle threads are reclaimed after 60s
                            new SynchronousQueue<Runnable>(),
                            (r, poolExecutor) -> {
                                // Rejection handler: instead of dropping the task,
                                // block the submitter until a worker takes it.
                                if (!poolExecutor.isShutdown()) {
                                    try {
                                        poolExecutor.getQueue().put(r);
                                    } catch (InterruptedException e) {
                                        throw new RuntimeException("interrupted");
                                    }
                                }
                            });
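To see why this gives back-pressure rather than task loss, here is a self-contained demo (my own sketch, not benchmark code). The `maximumPoolSize` is capped at 2 so the rejection handler actually fires; the blocking `put()` then stalls the submitting thread until a worker frees up, and every task still completes.

```java
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class BackPressureDemo {
    public static void main(String[] args) throws Exception {
        // Same shape as the benchmark executor, but capped at 2 worker threads
        // so that submissions beyond the cap get rejected and re-queued.
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
                0, 2, 60L, TimeUnit.SECONDS,
                new SynchronousQueue<Runnable>(),
                (r, pool) -> {
                    if (!pool.isShutdown()) {
                        try {
                            pool.getQueue().put(r); // block submitter until a worker is free
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                            throw new RuntimeException("interrupted");
                        }
                    }
                });

        AtomicInteger done = new AtomicInteger();
        for (int i = 0; i < 10; i++) {
            executor.execute(() -> {
                try { Thread.sleep(50); } catch (InterruptedException ignored) {}
                done.incrementAndGet();
            });
        }
        executor.shutdown();
        executor.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println("completed=" + done.get()); // all 10 ran despite the cap
    }
}
```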
asafpamzn commented 2 days ago

@ajzach, in order to speed up the process, we would be happy to meet and schedule a debug session. Unfortunately, we cannot reproduce the issue.