reactor / reactor-netty

TCP/HTTP/UDP/QUIC client/server with Reactor over Netty
https://projectreactor.io
Apache License 2.0

Connection reset by peer exception #1774

Closed RitikaDangal closed 3 years ago

RitikaDangal commented 3 years ago

We have a microservice-based Spring Boot architecture in which we use Spring WebClient (which internally uses Reactor Netty) for internal communication between services. In production we were getting random "connection reset by peer" exceptions in our services, and no logs for the failing request could be found in the called service. This is how we were initialising our WebClient earlier:

webClient = WebClient.builder().build();

To fix this, we disabled connection pooling and initialised our WebClient as below; after that, the exception no longer occurred.

webClient = WebClient.builder().clientConnector(new ReactorClientHttpConnector(HttpClient.newConnection())).build();

But how can we fix this with connection pooling enabled, since disabling connection pooling comes with its own disadvantages?

Reactor Netty version: 1.0.9
Spring Boot version: 2.5.3

Exception:

2021-08-16 12:20:24,095 WARN [reactor-http-epoll-1] reactor.util.Loggers$Slf4JLogger: [id:04a24430-45, L:/10.0.8.88:33848 - R:172.20.0.20/172.20.0.20:3148] The connection observed an error, the request cannot be retried as the headers/body were sent
io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
2021-08-16 12:20:24,100 ERROR [reactor-http-epoll-1] reactor.util.Loggers$Slf4JLogger: Operator called default onErrorDropped
reactor.core.Exceptions$ErrorCallbackNotImplemented: org.springframework.web.reactive.function.client.WebClientRequestException: readAddress(..) failed: Connection reset by peer; nested exception is io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
Caused by: org.springframework.web.reactive.function.client.WebClientRequestException: readAddress(..) failed: Connection reset by peer; nested exception is io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
    at org.springframework.web.reactive.function.client.ExchangeFunctions$DefaultExchangeFunction.lambda$wrapException$9(ExchangeFunctions.java:141)
    Suppressed: reactor.core.publisher.FluxOnAssembly$OnAssemblyException:
Error has been observed at the following site(s):
    |_ checkpoint ⇢ Request to GET http://172.20.0.20:3148/v1/users/referral/ec148ff3-5dd9-473f-a7f0-cb180a5e21f0 [DefaultWebClient]
Stack trace:
        at org.springframework.web.reactive.function.client.ExchangeFunctions$DefaultExchangeFunction.lambda$wrapException$9(ExchangeFunctions.java:141)
        at reactor.core.publisher.MonoErrorSupplied.subscribe(MonoErrorSupplied.java:55)
        at reactor.core.publisher.Mono.subscribe(Mono.java:4338)
        at reactor.core.publisher.FluxOnErrorResume$ResumeSubscriber.onError(FluxOnErrorResume.java:103)
        at reactor.core.publisher.FluxPeek$PeekSubscriber.onError(FluxPeek.java:222)
        at reactor.core.publisher.FluxPeek$PeekSubscriber.onError(FluxPeek.java:222)
        at reactor.core.publisher.FluxPeek$PeekSubscriber.onError(FluxPeek.java:222)
        at reactor.core.publisher.MonoNext$NextSubscriber.onError(MonoNext.java:93)
        at reactor.core.publisher.MonoFlatMapMany$FlatMapManyMain.onError(MonoFlatMapMany.java:204)
        at reactor.core.publisher.SerializedSubscriber.onError(SerializedSubscriber.java:124)
        at reactor.core.publisher.FluxRetryWhen$RetryWhenMainSubscriber.whenError(FluxRetryWhen.java:225)
        at reactor.core.publisher.FluxRetryWhen$RetryWhenOtherSubscriber.onError(FluxRetryWhen.java:274)
        at reactor.core.publisher.FluxConcatMap$ConcatMapImmediate.drain(FluxConcatMap.java:414)
        at reactor.core.publisher.FluxConcatMap$ConcatMapImmediate.onNext(FluxConcatMap.java:251)
        at reactor.core.publisher.EmitterProcessor.drain(EmitterProcessor.java:491)
        at reactor.core.publisher.EmitterProcessor.tryEmitNext(EmitterProcessor.java:299)
        at reactor.core.publisher.SinkManySerialized.tryEmitNext(SinkManySerialized.java:100)
        at reactor.core.publisher.InternalManySink.emitNext(InternalManySink.java:27)
        at reactor.core.publisher.FluxRetryWhen$RetryWhenMainSubscriber.onError(FluxRetryWhen.java:190)
        at reactor.core.publisher.MonoCreate$DefaultMonoSink.error(MonoCreate.java:189)
        at reactor.netty.http.client.HttpClientConnect$HttpObserver.onUncaughtException(HttpClientConnect.java:384)
        at reactor.netty.ReactorNetty$CompositeConnectionObserver.onUncaughtException(ReactorNetty.java:647)
        at reactor.netty.resources.DefaultPooledConnectionProvider$DisposableAcquire.onUncaughtException(DefaultPooledConnectionProvider.java:219)
        at reactor.netty.resources.DefaultPooledConnectionProvider$PooledConnection.onUncaughtException(DefaultPooledConnectionProvider.java:467

violetagg commented 3 years ago

@RitikaDangal Please capture the traffic with Wireshark and share it. Is it possible that some network component (e.g. firewall etc.) closes the connection because of inactivity? If you configure maxIdleTime for the connection pool, do you see the issue? (https://projectreactor.io/docs/netty/release/reference/index.html#connection-pool-timeout)

RitikaDangal commented 3 years ago

@violetagg I have tried capturing the traffic using Wireshark but did not see anything there. All was at network layer. Will configure maxIdleTime and monitor for a day or two. Thanks

violetagg commented 3 years ago

@RitikaDangal Were you able to verify the maxIdleTime configuration?

RitikaDangal commented 3 years ago

@violetagg Will get back to you with an update by the end of this week.

supr015 commented 3 years ago

Hi @RitikaDangal , @violetagg ,

We were also facing a very similar issue with communication between Spring Boot based microservices deployed in Kubernetes.
Reactor Netty version: 1.0.10
Spring Boot version: 2.5.4

We were also using webClient = WebClient.builder().build(); and we observed that once a request had completed, any subsequent request sent after about 20 minutes threw the connection reset by peer error, the same error you mentioned. However, the next request would go through, because a new channel was created after the earlier disconnection. Most likely Kubernetes was internally closing the connections on its end after 20 minutes.

We tried setting maxIdleTime with environment variables through reactor.netty.pool.maxIdleTime: 600000. It didn't seem to update maxIdleTime, though. We use spring-boot-starter-webflux.

So we added a custom connector to the WebClient, like below.

    var provider = ConnectionProvider.builder("custom-name")
            .maxConnections(500)
            .pendingAcquireTimeout(Duration.ofSeconds(45))
            .maxIdleTime(Duration.ofSeconds(600))
            .build();
    HttpClient client = HttpClient.create(provider).compress(true);
    WebClient.builder().clientConnector(new ReactorClientHttpConnector(client));

After this, the connection reset by peer exceptions were fixed. Any subsequent request after 10 minutes of idle time now always causes the existing channel to be disconnected and a new channel to be created.

RitikaDangal commented 3 years ago

@violetagg We used the following connection provider and the issue is now resolved.

    ConnectionProvider provider = ConnectionProvider.builder("fixed")
            .maxConnections(500)
            .maxIdleTime(Duration.ofSeconds(20))
            .maxLifeTime(Duration.ofSeconds(60))
            .pendingAcquireTimeout(Duration.ofSeconds(60))
            .evictInBackground(Duration.ofSeconds(120)).build();

    this.webClient = WebClient.builder()
            .clientConnector(new ReactorClientHttpConnector(HttpClient.create(provider)))
            .build();

Thanks

TDtianzhenjiu commented 2 years ago

Hello @violetagg, after configuring maxIdleTime it works. But why?

Does it mean that a connection in the connection pool has been closed by the remote peer, but is still in the pool and still marked as available, so that acquiring it to read from or write to the remote peer causes this exception?

violetagg commented 2 years ago

@TDtianzhenjiu Take a look here https://projectreactor.io/docs/netty/release/reference/index.html#faq.connection-closed it is similar to what you are asking.
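
The mechanism asked about above can be illustrated with a toy model. This is a minimal sketch in plain Java, not Reactor Netty's implementation: the class names (IdleEvictionSketch, Pool, Conn) and the 20-minute peer timeout are assumptions made for illustration, the latter taken from the roughly 20 minutes reported earlier in this thread. The pool has no way of knowing that the peer has dropped an idle connection, so without maxIdleTime it hands out the stale one; with a maxIdleTime shorter than the peer's timeout, the stale connection is discarded on acquire and a fresh one is opened.

```java
import java.time.Duration;
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy model of an idle-evicting pool; NOT Reactor Netty's implementation. */
final class IdleEvictionSketch {

    /** A pooled connection that remembers when it was last used. */
    static final class Conn {
        long lastUsedMillis;
        Conn(long now) { this.lastUsedMillis = now; }
    }

    static final class Pool {
        private final Deque<Conn> idle = new ArrayDeque<>();
        private final Duration maxIdleTime; // null models "no idle limit"

        Pool(Duration maxIdleTime) { this.maxIdleTime = maxIdleTime; }

        /** Reuses an idle connection unless it exceeded maxIdleTime; otherwise opens a new one. */
        Conn acquire(long now) {
            while (!idle.isEmpty()) {
                Conn c = idle.pollFirst();
                boolean tooOld = maxIdleTime != null
                        && now - c.lastUsedMillis > maxIdleTime.toMillis();
                if (!tooOld) {
                    return c;
                }
                // Too old: drop it and keep looking.
            }
            return new Conn(now);
        }

        void release(Conn c, long now) {
            c.lastUsedMillis = now;
            idle.addLast(c);
        }
    }

    /** Models a peer (or proxy) that silently drops connections idle longer than its own timeout. */
    static boolean peerStillOpen(Conn c, long now, Duration peerIdleTimeout) {
        return now - c.lastUsedMillis <= peerIdleTimeout.toMillis();
    }

    /** Returns true if a request at time 'now' succeeds, false for "connection reset by peer". */
    static boolean request(Pool pool, long now, Duration peerIdleTimeout) {
        Conn c = pool.acquire(now);
        boolean ok = peerStillOpen(c, now, peerIdleTimeout);
        if (ok) {
            pool.release(c, now); // a failed connection is not returned to the pool
        }
        return ok;
    }

    public static void main(String[] args) {
        Duration peerIdleTimeout = Duration.ofMinutes(20); // assumed value
        long t25 = Duration.ofMinutes(25).toMillis();

        Pool noLimit = new Pool(null);
        System.out.println("no limit, t=0:       " + request(noLimit, 0, peerIdleTimeout));
        System.out.println("no limit, t=25min:   " + request(noLimit, t25, peerIdleTimeout));

        Pool evicting = new Pool(Duration.ofMinutes(10));
        System.out.println("maxIdle 10m, t=0:     " + request(evicting, 0, peerIdleTimeout));
        System.out.println("maxIdle 10m, t=25min: " + request(evicting, t25, peerIdleTimeout));
    }
}
```

In this model the second request on the pool without a limit fails, and the one after it succeeds on a new connection, which matches the behaviour described earlier in the thread. The takeaway is only that maxIdleTime should be shorter than whatever idle timeout the remote side or an intermediate network component applies.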

TDtianzhenjiu commented 1 year ago

Thanks @violetagg 🙏 In this case, however, we could also retry on WebClientRequestException, and that would resolve this issue as well. Am I right?

violetagg commented 1 year ago

@TDtianzhenjiu You have to be careful with request retries (for example, if the requests are not idempotent: https://www.rfc-editor.org/rfc/rfc9110.html#section-9.2.2)
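
To make the idempotency caveat concrete, here is a minimal sketch in plain Java. It does not use the WebClient or Reactor retry APIs: IdempotentRetrySketch, IoCall and call are hypothetical names, and a java.io.IOException stands in for the connection reset. The set of methods follows RFC 9110 section 9.2.2, which defines PUT, DELETE and the safe methods (GET, HEAD, OPTIONS, TRACE) as idempotent.

```java
import java.io.IOException;
import java.util.Set;

/** Sketch: retry a failed call only when the HTTP method is idempotent. */
final class IdempotentRetrySketch {

    /** Hypothetical stand-in for "send the request"; may fail with an I/O error. */
    interface IoCall<T> {
        T run() throws IOException;
    }

    // Idempotent methods per RFC 9110 section 9.2.2 (method names are case-sensitive).
    private static final Set<String> IDEMPOTENT =
            Set.of("GET", "HEAD", "OPTIONS", "TRACE", "PUT", "DELETE");

    static boolean isIdempotent(String method) {
        return IDEMPOTENT.contains(method);
    }

    /** Runs the request once, plus up to maxRetries more times if the method is idempotent. */
    static <T> T call(String method, int maxRetries, IoCall<T> request) throws IOException {
        int attempts = isIdempotent(method) ? Math.max(0, maxRetries) + 1 : 1;
        IOException last = null;
        for (int i = 0; i < attempts; i++) {
            try {
                return request.run();
            } catch (IOException e) {
                last = e; // remember the failure and try again if attempts remain
            }
        }
        throw last; // attempts >= 1, so 'last' is always set here
    }

    public static void main(String[] args) throws IOException {
        int[] calls = {0};
        String body = call("GET", 1, () -> {
            if (calls[0]++ == 0) {
                throw new IOException("Connection reset by peer");
            }
            return "ok";
        });
        System.out.println(body + " after " + calls[0] + " attempts");
    }
}
```

A POST passed to the same helper is attempted exactly once and the error is propagated, since repeating it could apply the same change twice on the server.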

dalvan-bevilaqua commented 1 year ago

The connection provider configuration that @RitikaDangal posted above (maxIdleTime of 20 seconds, maxLifeTime of 60 seconds, evictInBackground of 120 seconds) solved it for me too.

jzpeepz commented 8 months ago

The connection provider configuration that @RitikaDangal posted above worked for me, BUT my requests now seem to take a LOT longer: up to 20+ seconds, compared with <= 5 seconds before. Has anyone else experienced this?

YevhenPalamarchuk commented 4 months ago

@supr015 wrote above that setting maxIdleTime with environment variables through reactor.netty.pool.maxIdleTime: 600000 did not seem to update maxIdleTime, using spring-boot-starter-webflux.

Hi, the environment parameter has no effect because the System.getProperty() method is used in the following code: https://github.com/reactor/reactor-netty/blob/c33825e6d8cb9408642f089e1c61f4d6e086563a/reactor-netty-core/src/main/java/reactor/netty/resources/ConnectionProvider.java#L70-L72

This method reads system properties, not environment variables.

You can pass the system property to the JVM with the following command:

    java -Dreactor.netty.pool.maxIdleTime=30000 -jar /app/your-application.jar
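
The difference between the two lookups can be sketched in plain Java. This is an illustration under stated assumptions, not Reactor Netty code: PoolPropertySketch and resolveMaxIdleTime are hypothetical names, the environment variable name is whatever the caller chooses, and the value is assumed to be in milliseconds, as the numbers used in this thread suggest. The helper reads the system property first and falls back to an environment variable only when the application does so explicitly.

```java
import java.time.Duration;

/** Sketch: system properties and environment variables are separate lookups. */
final class PoolPropertySketch {

    // Property name quoted in this thread.
    static final String MAX_IDLE_PROPERTY = "reactor.netty.pool.maxIdleTime";

    /**
     * Resolves the idle time from the system property, then from the given
     * environment variable (hypothetical name supplied by the caller), then
     * from the fallback. The raw value is assumed to be in milliseconds.
     */
    static Duration resolveMaxIdleTime(String envVarName, Duration fallback) {
        String raw = System.getProperty(MAX_IDLE_PROPERTY); // set with -D or System.setProperty
        if (raw == null) {
            raw = System.getenv(envVarName);                // only visible through getenv
        }
        if (raw == null || raw.trim().isEmpty()) {
            return fallback;
        }
        return Duration.ofMillis(Long.parseLong(raw.trim()));
    }

    public static void main(String[] args) {
        // "MY_POOL_MAX_IDLE_MS" is a made-up variable name for this example.
        Duration idle = resolveMaxIdleTime("MY_POOL_MAX_IDLE_MS", Duration.ofSeconds(20));
        System.out.println("maxIdleTime = " + idle);
    }
}
```

The resolved Duration could then be passed to ConnectionProvider.builder(...).maxIdleTime(...), as in the configurations posted earlier in this thread.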

mehedihasan03 commented 3 days ago

This is how I solved my problem.

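import io.netty.channel.ChannelOption;
import io.netty.handler.logging.LogLevel;
import io.netty.handler.timeout.ReadTimeoutHandler;
import io.netty.handler.timeout.WriteTimeoutHandler;
import java.time.Duration;
import java.util.concurrent.TimeUnit;
import org.springframework.context.annotation.Bean;
import reactor.netty.http.client.HttpClient;
import reactor.netty.resources.ConnectionProvider;
import reactor.netty.transport.logging.AdvancedByteBufFormat;

@Bean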
public HttpClient httpClientWithTimeout() {
    // Create the connection provider with desired settings
    ConnectionProvider provider = ConnectionProvider.builder("fixed")
            .maxConnections(500)                           // Max number of connections in the pool
            .maxIdleTime(Duration.ofSeconds(20))           // Time after which idle connections are closed
            .maxLifeTime(Duration.ofSeconds(60))           // Max lifetime for a connection in the pool
            .pendingAcquireTimeout(Duration.ofSeconds(60)) // Max wait time to acquire a connection from the pool
            .evictInBackground(Duration.ofSeconds(120))    // Frequency of eviction of idle connections
            .build();

    // Create HttpClient with connection provider and timeouts
    return HttpClient.create(provider)
            .option(ChannelOption.CONNECT_TIMEOUT_MILLIS, 60000)
            .responseTimeout(Duration.ofMillis(TIMEOUT)) // TIMEOUT: a constant in milliseconds, not shown in this snippet
            .doOnConnected(connection -> {
                connection.addHandlerLast(new ReadTimeoutHandler(60000, TimeUnit.MILLISECONDS));
                connection.addHandlerLast(new WriteTimeoutHandler(60000, TimeUnit.MILLISECONDS));
            })
            // Enable detailed logging
            .wiretap("reactor.netty.http.client.HttpClient", LogLevel.DEBUG, AdvancedByteBufFormat.TEXTUAL);
}