reactor / reactor-netty

TCP/HTTP/UDP/QUIC client/server with Reactor over Netty
https://projectreactor.io
Apache License 2.0

The EpollSocketChannel object is too large to be reclaimed by the jvm #3416

Closed: userlaojie closed this issue 3 weeks ago

userlaojie commented 2 months ago

Hello, we are rebuilding our system on spring-webflux. After the service was started in a Linux environment, we found that memory kept increasing and was never reclaimed by the JVM. After pulling a heap dump from the service, we suspect that cross-references involving the WebClient connection pool are preventing EpollSocketChannel objects from being reclaimed. Please help check whether there is a problem with the WebClient configuration, or investigate from other angles. Thank you.

This is the MAT analysis, showing a single object retaining over 80 MB: (screenshots attached)

This is the JVM memory usage from our monitoring: (screenshot attached)

Steps to Reproduce

The WebClient configuration parameters are as follows:

import com.fasterxml.jackson.annotation.JsonInclude;
import io.netty.channel.ChannelOption;
import io.netty.handler.ssl.SslContext;
import io.netty.handler.ssl.SslContextBuilder;
import io.netty.handler.ssl.util.InsecureTrustManagerFactory;
import io.netty.handler.timeout.WriteTimeoutHandler;
import lombok.Data;
import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.http.MediaType;
import org.springframework.http.client.reactive.ReactorClientHttpConnector;
import org.springframework.http.codec.json.Jackson2JsonEncoder;
import org.springframework.web.reactive.function.client.WebClient;
import org.springframework.web.util.DefaultUriBuilderFactory;
import reactor.netty.http.client.HttpClient;
import reactor.netty.resources.ConnectionProvider;

import javax.net.ssl.SSLException;
import java.time.Duration;
import java.util.concurrent.TimeUnit;

// JsonUtils is our project-internal helper that returns a shared Jackson ObjectMapper.

@Slf4j
@Configuration
public class WebClientConfig {
    @Data
    @ConfigurationProperties(prefix = "business.webclient")
    @Configuration
    static class WebClientConnectionConfig {
        private int pendingAcquireTimeout = 50;
        private int maxConnections = 32;
        private int pendingAcquireMaxCount = 1000;
        private long maxIdleTime = 10000;
        private long maxLifeTime = -1;
        private int connectTimeout = 2000;
        private long responseTimeout = 3000;
        private long writeTimeout = 10000;
        private long evictionIntervalTime = 120000;

        @Override
        public String toString() {
            return "WebClientConnectionConfig{" +
                    "pendingAcquireTimeout=" + pendingAcquireTimeout +
                    ", maxConnections=" + maxConnections +
                    ", pendingAcquireMaxCount=" + pendingAcquireMaxCount +
                    ", maxIdleTime=" + maxIdleTime +
                    ", maxLifeTime=" + maxLifeTime +
                    ", connectTimeout=" + connectTimeout +
                    ", responseTimeout=" + responseTimeout +
                    ", writeTimeout=" + writeTimeout +
                    ", evictionIntervalTime=" + evictionIntervalTime +
                    '}';
        }
    }

    @Bean
    public HttpClient httpClient(@Qualifier("webClientConfig.WebClientConnectionConfig") final WebClientConnectionConfig config) throws SSLException {
        log.info("webClientConfig.WebClientConnectionConfig:{}", config);
        ConnectionProvider provider = ConnectionProvider.builder("biz-http-client")
                .pendingAcquireTimeout(Duration.ofMillis(config.getPendingAcquireTimeout()))
                .maxConnections(config.getMaxConnections())
                .maxIdleTime(Duration.ofMillis(config.getMaxIdleTime()))
                .maxLifeTime(Duration.ofMillis(config.getMaxLifeTime()))
                .pendingAcquireMaxCount(config.getPendingAcquireMaxCount())
                .evictInBackground(Duration.ofMillis(config.getEvictionIntervalTime()))
                .build();

        SslContext context = SslContextBuilder.forClient().trustManager(InsecureTrustManagerFactory.INSTANCE).build();

        return HttpClient.create(provider)
                .option(ChannelOption.CONNECT_TIMEOUT_MILLIS, config.getConnectTimeout())
                .responseTimeout(Duration.ofMillis(config.getResponseTimeout()))
                .doOnConnected(conn -> conn.addHandlerLast(new WriteTimeoutHandler(config.getWriteTimeout(), TimeUnit.MILLISECONDS)))
                .secure(t -> t.sslContext(context));
    }

    @Bean
    public WebClient webClient(@Qualifier("httpClient") final HttpClient httpClient) {
        WebClient.Builder builder = WebClient.builder();
        DefaultUriBuilderFactory factory = new DefaultUriBuilderFactory();
        factory.setEncodingMode(DefaultUriBuilderFactory.EncodingMode.NONE);

        return builder
                .clientConnector(new ReactorClientHttpConnector(httpClient))
                .codecs(configurer -> configurer.defaultCodecs()
                        .jackson2JsonEncoder(new Jackson2JsonEncoder(
                                JsonUtils.getMapper().setSerializationInclusion(JsonInclude.Include.NON_NULL),
                                MediaType.APPLICATION_JSON)))
                .defaultHeader("Content-Type", "application/json; charset=UTF-8")
                .uriBuilderFactory(factory)
                .build();
    }
}

Possible Solution

I have two hypotheses. The first is that ByteBuf references are not released under the epoll transport; the second is that the WebClient connection pool is misconfigured.
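
For the first hypothesis, we can enable Netty's built-in leak detector, which reports ByteBuf instances that are garbage-collected without release() having been called. A minimal sketch, assuming a standard Spring Boot entry point (the class name is hypothetical, and PARANOID samples every buffer, so it is only suitable for a test environment):

import io.netty.util.ResourceLeakDetector;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@SpringBootApplication
public class LeakHuntingApplication {
    public static void main(String[] args) {
        // Equivalent to -Dio.netty.leakDetection.level=paranoid on the JVM command line;
        // must run before Netty allocates its first buffer.
        ResourceLeakDetector.setLevel(ResourceLeakDetector.Level.PARANOID);
        SpringApplication.run(LeakHuntingApplication.class, args);
    }
}

If buffers really are leaking, Netty logs "LEAK: ByteBuf.release() was not called before it's garbage-collected", which would confirm or rule out this hypothesis.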

Your Environment

userlaojie commented 2 months ago

These are the jstat GC statistics; the concurrent GC (CGC) and young GC (YGC) counts are almost the same: (screenshot attached)

kzander91 commented 2 months ago

I believe we have the same issue.

reactor-netty: 1.1.22
Netty: 4.1.112
Spring Boot: 3.3.3

uname -a: Linux batch-service-794ddfb76-bqnlb 6.5.0-45-generic #45~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon Jul 15 16:40:02 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

java -version:

openjdk version "21.0.4" 2024-07-16 LTS
OpenJDK Runtime Environment Temurin-21.0.4+7 (build 21.0.4+7-LTS)
OpenJDK 64-Bit Server VM Temurin-21.0.4+7 (build 21.0.4+7-LTS, mixed mode, sharing)

EpollSocketChannel objects are not being garbage-collected: (screenshot attached)

I looked into some of these instances and they all seem to be referenced by invalidated pooled connections: (screenshot attached)

Have recent releases changed anything w.r.t. pool entry invalidation? Note that we have not configured the connection pool in any way (reactor.netty.pool.maxIdleTime and friends), so all the defaults should apply.
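
One way to watch this without heap dumps: reactor-netty can publish the pool state as Micrometer gauges. A minimal sketch, assuming Micrometer is on the classpath (the pool name and wrapper method are illustrative, not our production setup):

import reactor.netty.http.client.HttpClient;
import reactor.netty.resources.ConnectionProvider;

public class PoolMetricsSketch {
    public static HttpClient instrumentedClient() {
        ConnectionProvider provider = ConnectionProvider.builder("diagnostic-pool")
                .metrics(true) // publishes reactor.netty.connection.provider.* gauges
                .build();
        return HttpClient.create(provider)
                // constant URI tag function keeps the tag cardinality bounded
                .metrics(true, uri -> "default");
    }
}

A total-connections gauge that keeps climbing while active connections stay flat would point at entries that are invalidated but never destroyed.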

lfs1985 commented 2 months ago

We have the same issue.

Reactor version(s) used: 1.1.20
Spring Boot: 3.2.7
JVM version (java -version): OpenJDK 17

EpollSocketChannel instances retain more than 1.27 GB but are not being garbage-collected.

violetagg commented 1 month ago

All, please try to provide a reproducible example.

userlaojie commented 1 month ago

OK, we will try to reproduce the scenario locally with a JMeter load test against the interface; it will take about a day.

kzander91 commented 1 month ago

@userlaojie any luck so far? I myself have been unable to reliably reproduce it. The tricky thing is that even in my production application the leak doesn't always happen: sometimes it leaks until a crash, and then, after the restart, everything is fine for many days.

Considering that in my heap dump the pool refs are all in STATE_INVALIDATED, maybe it's related to connections being closed abnormally?
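
To check that, lifecycle hooks on the client could correlate the invalidated pool entries with abnormal closes. A sketch (the wrapper method and log wording are illustrative):

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import reactor.netty.http.client.HttpClient;

public class LifecycleLoggingSketch {
    private static final Logger log = LoggerFactory.getLogger(LifecycleLoggingSketch.class);

    public static HttpClient withLifecycleLogging(HttpClient client) {
        return client
                .doOnConnected(conn -> log.debug("connected: {}", conn.channel()))
                .doOnDisconnected(conn -> log.debug("disconnected: {} active={}",
                        conn.channel(), conn.channel().isActive()));
    }
}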

vitjouda commented 1 month ago

Hi @userlaojie, I strongly believe you came across the same problem as I did, please check my issue if you observe the same behavior. I spent a lot of time trying to simulate it locally, but never managed to produce a reliable reproducer.

userlaojie commented 1 month ago

Sorry, again we cannot replicate this locally. Our latest progress is removing as many factors as possible that could cause HTTP connections not to be released, such as eliminating Micrometer usage and not using a custom MeterRegistry. The following is the latest monitoring data; memory in some pods is still too high: channel-qrcode-pay-7686b6d777-5pjgj and channel-qrcode-pay-6959cf9bb4-b5zst (screenshots attached)

vitjouda commented 1 month ago

Hi, I managed to replicate part of the problem and am currently discussing it on Gitter. If you have the same problem, there are two ways to mitigate it at the moment: either replace reactor-netty with a different WebClient-supported library (we used Apache HttpClient 5, which works well), or, if you can handle it, disable connection keep-alive. In our case, both options eliminate the leak. Of course, disabling keep-alive is not a great long-term solution, but you can at least verify whether it is the same problem; the performance hit will depend on your use case.
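
For anyone who wants to try either route, here is a sketch of both mitigations (the Apache option assumes httpclient5 and httpcore5-reactive are on the classpath, which Spring's HttpComponentsClientHttpConnector builds on):

import org.springframework.http.client.reactive.HttpComponentsClientHttpConnector;
import org.springframework.http.client.reactive.ReactorClientHttpConnector;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.netty.http.client.HttpClient;

public class MitigationSketch {
    // Option 1: swap the connector to Apache HttpClient 5
    public static WebClient apacheBacked() {
        return WebClient.builder()
                .clientConnector(new HttpComponentsClientHttpConnector())
                .build();
    }

    // Option 2: keep reactor-netty but disable keep-alive, so each connection
    // is closed after its exchange instead of being returned to the pool
    public static WebClient withoutKeepAlive() {
        return WebClient.builder()
                .clientConnector(new ReactorClientHttpConnector(HttpClient.create().keepAlive(false)))
                .build();
    }
}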

violetagg commented 1 month ago

I'm working on a fix for an issue that I found with the reproducer that @vitjouda provided on Gitter

violetagg commented 3 weeks ago

@userlaojie @vitjouda #3459 should address this issue. If you are able to test the snapshot version, that would be great!

vitjouda commented 2 weeks ago

Hi, I am going to deploy the snapshot, let it sit for a day or two, and report back. Thank you for the fix.

vitjouda commented 1 week ago

Hi again, I tested the snapshot and it looks good! Thank you for the fix :)