ClusterConnectionProvider might cause TCP connection leak

970655147 commented 1 year ago

this is a project using spring-data-redis, spring-boot-actuator there are 2 case, might reoccur that

config address that isn't redis server, follow try a mysql services, then visit "http://localhost:8080/actuator/health", there leak 6 tcp connection each visit spring.redis.cluster.nodes: 192.168.220.133:3306,192.168.220.133:3307,192.168.220.133:3308,192.168.220.133:3309,192.168.220.133:3310,192.168.220.133:3311
config redis server with password, but didn't config "spring.redis.password" then visit "http://localhost:8080/actuator/health", there leak 6 tcp connection each visit spring.redis.cluster.nodes: 192.168.220.133:7001,192.168.220.133:7002,192.168.220.133:7003,192.168.220.133:7004,192.168.220.133:7005,192.168.220.133:7006

use "ll /proc/$pid/fd | grep socket" to watch connections

jxblum commented 1 year ago

Actuator is part of Spring Boot, not Spring Data Redis.

970655147 commented 1 year ago

but core code is in spring-data-redis, main context in LettuceConnectionFactory$SharedConnection.getConnection if ClusterConnectionProvider.getConnection throw Exception, there maybe tcp connection leak spring-boot-acurator is just a trigger in this context

970655147 commented 1 year ago

but core code is in spring-data-redis, main context in LettuceConnectionFactory$SharedConnection.getConnection if ClusterConnectionProvider.getConnection throw Exception, there maybe tcp connection leak spring-boot-acurator is just a trigger in this context

970655147 commented 1 year ago

Actuator is part of Spring Boot, not Spring Data Redis.

sorry, my presentation is not exactly but core code is in spring-data-redis, main context in LettuceConnectionFactory$SharedConnection.getConnection if ClusterConnectionProvider.getConnection throw Exception, there maybe tcp connection leak spring-boot-acurator is just a trigger in this context

jxblum commented 1 year ago

Let me review and evaluate this further.

jxblum commented 1 year ago

Closed by mistake.

jxblum commented 1 year ago

After further analysis, I am not convinced that this is necessarily a problem with Spring Data Redis.

To state this case slightly differently, what would you expect to happen in this case (regarding this statement):

"if ClusterConnectionProvider.getConnection throw Exception, there maybe tcp connection leak spring-boot-acurator is just a trigger in this context"

And specifically, when an Exception is thrown from ClusterConnectionProvider.getConnection()? Close the connection? But, by who/what? How? And, when?

I agree that the RedisConnection should not leak TCP socket resources. And, you could argue that Spring Data Redis should be responsible for closing and cleaning up any resources acquired by the RedisConnection. But, this depends on the "type" of Exception in the first place, in addition to the context (e.g. Spring Boot Actuator), as well as the (application) use case/requirements, and whether any useful information could be derived from the Exception or captured from the state of the RedisConnection both before and after the Exception occurs, which would possibly dictate "how" it is handled, or whether it can or should be handled at all (application dependent).

Typically, a connection (e.g. RedisConnection) to some resource (e.g. Redis server) is established for some purpose, it is handed off to the user (consumer) and the user is left to be responsible for "releasing" that connection, where the "how" and "when" is application and use case dependent. This is not unlike other data access frameworks, such as JDBC. With JDBC, it would be "driver" (vendor) dependent.

You could also argue that 1) Since a RedisConnection is AutoCloseable (Javadoc) and 2) the Spring Boot Actuator HealthIndicator for Redis did not safely acquire a RedisConnection, that it is actually Spring Boot's responsibility.

The logic inside Spring Boot Actuator's RedisHealthIndicator used to acquire connection should technically be:

try (RedisConnection connection = RedisConnectionUtils.getConnection(this.redisConnectionFactory))  {
    doHealthCheck(builder, connection);
}
finally {
    RedisConnectionUtils.releaseConnection(connection, this.redisConnectionFactory);
}

By establishing a RedisConnection outside the try-finally block, when an Exception occurs, the logic inside the finally block (i.e. "release" the RedisConnection) will NOT get executed.

However, unlike JDBC, and even unlike Jedis (though maybe more similar to Reactive), is that opening RedisConnections using the Lettuce driver is "async". This certainly complicates the "when" the connection gets established bit.

Although, in this particular case/context (RedisHealthIndicator for Spring Boot Actuator), we see that the SD Redis RedisConnectUtils.getConnection(:RedisConnectionFactory) method was invoked (here). If you trace this all the way through, we eventually arrive here.

You can see that Lettuce RedisConnections are established (opened) "async", and that SD Redis "joins" on the CompletableFuture to turn an otherwise asynchronous operation into a synchronous, blocking call.

Given the Lettuce driver is returning a "CompletableFuture" (extending CompletionStage) as the wrapper for the RedisConnection (technically, a Lettuce StatefulRedisClusterConnection instance in the "clustered" case and a Lettuce StatefulConnection in the non-clustered case), then it should be possible to maybe pass in an Exception handler (Function) that releases connection resources immediately when a connection fails on open. This, I could possible agree to. However, it does not excuse Spring Boot Actuator from handling the connection properly.

Something like the following, possibly(??):

interface LettuceConnectionProvider {

    default <T extends StatefulConnection<?, ?>> T getConnection(Class<T> connectionType) {

        BiFunction<T, Throwable, T> handleExceptionOnConnectionOpen = (connection cause) -> {
            try {
                if (cause instanceof <WhatTypeOfRuntimeExceptionHere>) {
                    RedisConnectionUtils.releaseConnection(connection);
                    throw cause;
                }

                return connection;
            }
            catch (Throwable ignore) {
            }
        } ;

        return LettuceFutureUtils.join(getConnectionAsync(connectionType)
            .handleAsync(handleExceptionOnConnectionOpen));
    }
}

jxblum commented 1 year ago

Other factors to consider include, but are not limited to:

1) Whether a "Pool" of connections has been configured, and... 2) Whether it is possible, or whether any SocketOptions have been supplied (indirectly) to the LettuceClientConfiguration (from here to here).

Typically, a connection "Pool" is responsible for acquiring and releasing connections along with their associated resources (TCP sockets, etc), especially for "failed" connections, even on "open" no less.

I'd almost argue that your Spring Boot (Data Redis) application should be failing-fast, particularly if the authentication criteria has not been properly specified and configured. Why should the application even be allowed to start much less allow users to access the Spring Boot Actuator HealthIndicator for Redis to reveal anything? Of course, if Redis is not used as your application's primary data store (you do mention MySQL) and only as a cache, then this would make sense.

Spring Data Redis does offer and allow (disabled by default) validation of RedisConnections on startup.

Finally, and arguably, this could be something the (Lettuce) Redis driver(s) should be handling specifically, especially as it pertains to low-level resource (like TCP sockets), given drivers might do different things depending on the underlying OS, no less.

jxblum commented 1 year ago

In any case, I want to get @mp911de or @christophstrobl's thoughts and opinions on this before we proceed with any changes to the framework.

mp911de commented 1 year ago

if ClusterConnectionProvider.getConnection throw Exception, there maybe tcp connection leak

Do you have an example exception @970655147 that causes connection leaks for you?

970655147 commented 1 year ago

leaks

Hello, reoccur like that new simple spring-boot project as follow

pom.xml include spring-boot-starter-data-redis `
org.springframework.boot spring-boot-starter-data-redis
`
application.yml config redis info, following config is mysql & kafka[just for test] spring: redis: cluster: nodes: "192.168.0.105:3306,192.168.0.105:9092" timeout: "PT3S"
add case code in Application of spring-boot ` package com.hx.boot;

import com.hx.boot.context.SpringContext; import org.mybatis.spring.annotation.MapperScan; import org.springframework.boot.CommandLineRunner; import org.springframework.boot.SpringApplication; import org.springframework.boot.autoconfigure.SpringBootApplication; import org.springframework.data.redis.connection.ReactiveRedisConnection; import org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory; import org.springframework.scheduling.annotation.EnableAsync;

@EnableAsync @MapperScan("com.hx.boot.mapper") @SpringBootApplication public class HelloSpringBootApplication implements CommandLineRunner {

public static void main(String[] args) {
    SpringApplication.run(HelloSpringBootApplication.class, args);
}

@Override
public void run(String... args) throws Exception {
    LettuceConnectionFactory factory = SpringContext.getBean(LettuceConnectionFactory.class);
    for (int i = 0; i < 100; i++) {
        try {
            ReactiveRedisConnection connection = factory.getReactiveConnection();
        } catch (Exception e) {
            // ignore
        }
    }
    int x = 0;
}

} `

watch leaks info root@ubuntu:~# ll /proc/5795/fd | grep 41 lrwx------ 1 root root 64 Jun 29 06:41 103 -> socket:[328474] lrwx------ 1 root root 64 Jun 29 06:41 104 -> socket:[328475] lrwx------ 1 root root 64 Jun 29 06:41 105 -> socket:[320448] // ignore some file descriptor lrwx------ 1 root root 64 Jun 29 06:41 322 -> socket:[328586] lrwx------ 1 root root 64 Jun 29 06:41 323 -> socket:[323163] lrwx------ 1 root root 64 Jun 29 06:41 324 -> socket:[328587] lrwx------ 1 root root 64 Jun 29 06:41 325 -> socket:[320477]

if ClusterConnectionProvider.getConnection throw Exception, there maybe tcp connection leak

Do you have an example exception @970655147 that causes connection leaks for you?

jxblum commented 1 year ago

@970655147 - First of all, you "ignored" the very thing (i.e. resulting "Exception") we asked you to provide (in detail) in the first place:

  // get connection
} catch (Exception e) {
            // ignore
}

Regarding:

"if ClusterConnectionProvider.getConnection throw Exception, there maybe tcp connection leak"

What Exception?

By ignoring the Exception, you are implying any Exception, and it does not matter. But, it does! It does because it matters what happens during the "process" of opening the connection.

An Exception resulting from a username/password auth error is different from an Exception occurring because an incorrect IP address or port to the server was configured/resolved is different from an Exception thrown because of a failed TLS handshake (verification) during SSL/TLS identification & authentication (as described by IBM, for instance) is different from other TCP errors, and so on.

In the later case, as with username/password-based auth, a TCP connection is still established and therefore system resources will be allocated and used to complete the steps of the TLS handshake/verification (over SSL/TLS). After all, Secure Sockets, or a "secure" Socket, like SSLSocket, is nothing more than encrypted data over an otherwise "plaintext" Socket (i.e. data must be "encrypted" before it is sent over a Socket/network connection to be "secure"; there is no such thing as a hardware "secured" Socket; it is all "software").

This is also OS/system dependent.

As such, this why (TCP) Sockets have configuration (such as SO_LINGER, among other Socket configuration options, timeouts, reuse, etc).

Secondarily, this code:

    LettuceConnectionFactory factory = SpringContext.getBean(LettuceConnectionFactory.class);
    for (int i = 0; i < 100; i++) {
        try {
            ReactiveRedisConnection connection = factory.getReactiveConnection();
...
..
.

Does not necessarily constitute a connection leak simply because an Exception might be thrown during an attempt to acquire a connection. You are simply iterating and opening connections without proper connection "handling". And, if it were unbounded, your problem would only exasberate the "perceived" problem.

Most file-based system resources (TCP Socket connections are no exception; see here), must be properly handled inside a block, such as the following (using your example):

    for (int i = 0; i < 100; i++) {
        try  (ReactiveRedisConnection connection = factory.getReactiveConnection()) {
        }
        catch (Exception cause) {
            // handle the exceptional condition appropriately; such as releasing/closing related resources and 
        }

Just because you attempt to establish a connection and 1) an Exception is thrown or 2) the block of code opening/using the connection falls out of scope, does not mean the framework, or any other facility for that matter, will automatically (or automagically) "close" the connection. And, even if "close" is called, it does not mean system resources will be immediately released (again, it depends on the Socket configuration using a factory when the Socket is created/opened).

It is always the user's responsibility to properly handle the connection, whether you are opening a Redis connection (regardless of type, e.g. Reactive), a (SQL) database connection (e.g. with JDBC), or simply opening a file.

For example:

try (FileReader fileReader = new FileReader(file)) {
    // do something with the "open" file
}

If not properly handled, there is no guarantee that systems resources (even in the file case) will be properly released. Again, mileage varies based on OS/system, among other factors.

Also see this SO post as further evidence to this case.

jxblum commented 1 year ago

1 last thing...

This is also why it is important to use some connection management facility, such as "pooling", which Spring Data Redis supports for both the Jedis and Lettuce drivers (see here).

In addition, the closer you get to system controlled resources, like TCP connections, the more it matters what type of driver you are using, some of which can be system dependent, even. It is also why certain drivers perform better or behave slightly differently on different platforms, especially given no 2 things (like OSes) are equal in this regard.

This could also arguably be a driver responsibility and concern more than a framework-level concern. Frameworks like Spring Data Redis are not in the business of managing connections, rather, only acquiring and using them for some higher purpose, like the data access patterns for querying, etc.

And, in this case, it is allowing your application code to acquire a connection "directly" (I might add) for some low-level purpose, and as such, your application code becomes responsible for that resource. It would be different if the framework were acquiring/opening a connection on your behalf, such as would be the case in Repository.save(entity) or a Repository.findByXyz(..) call.

970655147 commented 1 year ago

@970655147 - First of all, you "ignored" the very thing (i.e. resulting "Exception") we asked you to provide (in detail) in the first place:
  // get connection
} catch (Exception e) {
            // ignore
}
Regarding:

"if ClusterConnectionProvider.getConnection throw Exception, there maybe tcp connection leak"

What Exception?

By ignoring the Exception, you are implying any Exception, and it does not matter. But, it does! It does because it matters what happens during the "process" of opening the connection.

An Exception resulting from a username/password auth error is different from an Exception occurring because an incorrect IP address or port to the server was configured/resolved is different from an Exception thrown because of a failed TLS handshake (verification) during SSL/TLS identification & authentication (as described by IBM, for instance) is different from other TCP errors, and so on.

In the later case, as with username/password-based auth, a TCP connection is still established and therefore system resources will be allocated and used to complete the steps of the TLS handshake/verification (over SSL/TLS). After all, Secure Sockets, or a "secure" Socket, like SSLSocket, is nothing more than encrypted data over an otherwise "plaintext" Socket (i.e. data must be "encrypted" before it is sent over a Socket/network connection to be "secure"; there is no such thing as a hardware "secured" Socket; it is all "software").

This is also OS/system dependent.

As such, this why (TCP) Sockets have configuration (such as SO_LINGER, among other Socket configuration options, timeouts, reuse, etc).

Secondarily, this code:
    LettuceConnectionFactory factory = SpringContext.getBean(LettuceConnectionFactory.class);
    for (int i = 0; i < 100; i++) {
        try {
            ReactiveRedisConnection connection = factory.getReactiveConnection();
...
..
.
Does not necessarily constitute a connection leak simply because an Exception might be thrown during an attempt to acquire a connection. You are simply iterating and opening connections without proper connection "handling". And, if it were unbounded, your problem would only exasberate the "perceived" problem.

Most file-based system resources (TCP Socket connections are no exception; see here), must be properly handled inside a block, such as the following (using your example):
    for (int i = 0; i < 100; i++) {
        try  (ReactiveRedisConnection connection = factory.getReactiveConnection()) {
        }
        catch (Exception cause) {
            // handle the exceptional condition appropriately; such as releasing/closing related resources and 
        }
Just because you attempt to establish a connection and 1) an Exception is thrown or 2) the block of code opening/using the connection falls out of scope, does not mean the framework, or any other facility for that matter, will automatically (or automagically) "close" the connection. And, even if "close" is called, it does not mean system resources will be immediately released (again, it depends on the Socket configuration using a factory when the Socket is created/opened).

It is always the user's responsibility to properly handle the connection, whether you are opening a Redis connection (regardless of type, e.g. Reactive), a (SQL) database connection (e.g. with JDBC), or simply opening a file.

For example:
try (FileReader fileReader = new FileReader(file)) {
    // do something with the "open" file
}
If not properly handled, there is no guarantee that systems resources (even in the file case) will be properly released. Again, mileage varies based on OS/system, among other factors.

Also see this SO post as further evidence to this case.

following code based on spring-data-redis 2.0.5-RELEASE

1. What Exception?

"Do you have an example exception", I have a wrong reading, I make sense of "example case" exception as following, throw ex2 from lettuce, then spring-data-redis wrap ex1

ex1. getNativeConnection:964, LettuceConnectionFactory$SharedConnection (org.springframework.data.redis.connection.lettuce): "Unable to connect to Redis", and hold cause as follow
ex2. loadPartitions:790, RedisClusterClient (io.lettuce.core.cluster) - io.lettuce.core.RedisException: Cannot retrieve initial cluster partitions from initial URIs [RedisURI [host='192.168.0.106', port=3306], RedisURI [host='192.168.0.106', port=9092]]

2. with out exception handling on example case

my presentation is not imprecisely "factory.getReactiveConnection();" throw ex, so I just ignore that for casualness

3. my first idea to proposal to spring-data-redis

the call stack trace in this case is like following

getConnection:55, ClusterConnectionProvider (org.springframework.data.redis.connection.lettuce)
getNativeConnection:957, LettuceConnectionFactory$SharedConnection (org.springframework.data.redis.connection.lettuce)
getConnection:932, LettuceConnectionFactory$SharedConnection (org.springframework.data.redis.connection.lettuce)
getSharedReactiveConnection:795, LettuceConnectionFactory (org.springframework.data.redis.connection.lettuce)
getReactiveConnection:329, LettuceConnectionFactory (org.springframework.data.redis.connection.lettuce)

if ex "Unable to connect to Redis" thrown from LettuceConnectionFactory$SharedConnection.getNativeConnection, so this.connection is always null an in next, nextnext, .. call, it will connect to redis-server over and over again

LettuceConnectionFactory$SharedConnection.getConnection
StatefulConnection<E, E> getConnection() {
    synchronized (this.connectionMonitor) {
        if (this.connection == null) {
            this.connection = getNativeConnection();
        }
        if (getValidateConnection()) {
            validateConnection();
        }
        return this.connection;
    }
}

4. lettuce's RedisClusterClient.loadPartitions processing

I was wondering following processing handle "getConnections(seed)" established connection, but "connections.close()" doesn't release that

loadPartitions:787, RedisClusterClient (io.lettuce.core.cluster)

public Map<RedisURI, Partitions> loadViews(Iterable<RedisURI> seed, boolean discovery) {
    long commandTimeoutNs = getCommandTimeoutNs(seed);
    Connections connections = null;
    try {
        connections = getConnections(seed).get(commandTimeoutNs, TimeUnit.NANOSECONDS);
        // ignore some code
        return nodeSpecificViews.toMap();
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new RedisCommandInterruptedException(e);
    } finally {
        if (connections != null) {
            connections.close();
        }
    }
}

but I'm not sure, I'm not experienced, the "NioSocketChannel" is only increase without decrease until "too many open files"

root@ubuntu:~# jmap -histo:live 6970 | grep NioSocketChannel
  38:           558          58032  io.netty.channel.socket.nio.NioSocketChannel
  66:           558          35712  io.netty.channel.socket.nio.NioSocketChannel$NioSocketChannelConfig
  99:           558          22320  io.netty.channel.socket.nio.NioSocketChannel$NioSocketChannelUnsafe

jxblum commented 1 year ago

Regarding #3 above, and specifically, this statement:

"an in next, nextnext, .. call, it will connect to redis-server over and over again"

"What" will connect to redis-server over and over again?

There is nothing in the call stack to which your refer that repeatedly (i.e. loops, and) attempts to establish a connection until (implied) successful (e.g. non-null).

Using Spring Data Redis 2.7.13, you can navigate the call stack as follows (let's assume shared, clustered connections since you often refer to a "cluster" and is the subject of this ticket), then:

NOTE: Your line numbers do not line up. And, you somehow mixed clustered with non-clustered calls in your stack. This does not make sense. Never-the-less, it ultimately leads to a similar call stack since the difference between clustered and non-clustered depends on the Lettuce RedisClient configured and created when the LettuceConnectionFactory bean is "initialized".

caused by: ... 108: ClusterConnectionProvider.getConnectionAsync(:Class) 53: LettuceConnectionProvider.getConnection(:Class) 1383: LettuceConnectionFactory.SharedConnection.getNativeConnection() 1366: LettuceConnectionFactory.SharedConnection.getConnection() 1117: LettuceConnectionFactory.getSharedReactiveConnection() 529: LettuceConnectionFactory.getReactiveClusterConnection() 505: LettuceConnectionFactory.getReactiveConnection(..) ... .. .

If you are using "pooling" (highly recommended in this case), then the only difference would be an additional call between 1383 and 53, on line 97 in the LettucePoolingConnectionProvider.getConnection(:Class).

NOTE: A LettucePoolingConnectonProvider simply "wraps" the underlying, actual LettuceConnectionProvider implementation, such as the ClusterConnectionProvider, to "pool" connections. The pool configuration can also "limit" the number of connections so they do not grow unbounded, and your run out of system resources (like file descriptors).

In any case, this is very much dependent on the caller.

In your specific case, that would be the Spring Boot RedisHealthIndicator, which, TMK, is only triggered when you trip the URL endpoint exposed by Spring Boot Actuator to access the "health" of your Redis cluster (Redis servers in the cluster). There would need to be some "active" monitoring for the RedisHealthIndicator to repeatedly attempt to open connections and exhaust your "unbounded" (non-pooled) resources.

Spring Data Redis cannot possibly know all the ways in which callers will attempt to acquire and use connections. Again, outside of RedisTemplate operations or Spring Data Redis Repository infrastructure, which might acquire connections on the user's (or caller's) behalf, then if the caller or your application (client) code acquires a connection, it is the responsibility of the caller or your application logic to properly manage the connection.

You have 1 of 3 options, here:

1) Increase the hard and soft limits of your File Descriptors (FDs) in your Linus-based system. For example.

2) Configure and use connection pooling (recommended).

3) Fix your underlying Exception: Cannot retrieve initial cluster partitions from initial URIs [RedisURI [host='192.168.0.106', port=3306], RedisURI [host='192.168.0.106', port=9092]].

I strongly encourage you to consider #2 and investigate #3. You clearly have misconfigured your environment in some way or have some other (network) environment or Redis problem. Once control is down inside line 108, you are at the mercy of the Redis driver (Lettuce).

jxblum commented 1 year ago

Regarding #4:

I gather, the assessment of the Redis partitions distributed across the clutter is triggered by Spring Data Redis, here, and specifically.

Again, this is very much Redis driver based (Lettuce, in this instance), and largely outside the control of Spring Data Redis.

Also, the Lettuce driver code does not strike me as incorrect. It properly wraps the connection in a try-catch-finally block, attempting to close the connection if any kind of Throwable is thrown. What ultimately happens to the underlying system resources (for example, like Sockets) is runtime and system dependent.

spring-projects / spring-data-redis