spring-projects / spring-framework

Spring Framework
https://spring.io/projects/spring-framework
Apache License 2.0
56.26k stars 37.98k forks source link

Reactive connection handling on Tomcat #32249

Closed mschweyen closed 6 months ago

mschweyen commented 7 months ago

We observed a problem with the connection handling when using Webflux on Tomcat. Connections remain open in CLOSE_WAIT status on the server site when the client closes the connection. This may lead to a situation where the server is not reachable if there are so many client connections that the server reaches its limit.

There is no problem when using Netty. Here is a small example (webflux-problem.zip) demonstrating that the same tests succeed on Netty and fails on Tomcat.

The problem may have a similar origin as Issue #30393.

rstoyanchev commented 7 months ago

@mschweyen you mention using WebFlux on Tomcat, but tomcat-demo in the example has spring-boot-starter-web, i.e. Spring MVC and Tomcat. Can you clarify on that please? Also, is the behavior related to a specific version or upgrade?

mschweyen commented 7 months ago

@rstoyanchev thank you for your feedback and sorry for my unclear description. You are right the problem I described was Reactive connection handling on Spring MVC and Tomcat. I have done further tests. It seems the same issue occurs when using Webflux on Tomcat. Here the updated demo project webflux-problem.zip.

We observed the problem in springboot version 3.2.2.

mschweyen commented 7 months ago

Is there anything I can do to support you with the analysis?

mschweyen commented 6 months ago

Could this message indicate that the connections are not cleaned up correctly: o.apache.tomcat.util.threads.LimitLatch : Counting up[http-nio-8080-Acceptor] latch=8192

The latch counter increments to 8192 and the the server stops to respond to new connections. Here a thread dump of this process: threads_report.txt

2024-02-29T23:20:29.322+01:00 DEBUG 18036 --- [     parallel-7] o.s.w.r.r.m.a.ResponseBodyResultHandler  : [4627979] Using 'text/event-stream' given [text/event-stream] and supported [text/event-stream]
2024-02-29T23:20:29.322+01:00 DEBUG 18036 --- [     parallel-7] o.s.w.r.r.m.a.ResponseBodyResultHandler  : [4627979] 0..N [org.springframework.http.codec.ServerSentEvent<java.lang.String>]
2024-02-29T23:20:29.322+01:00 DEBUG 18036 --- [o-8080-Acceptor] o.apache.tomcat.util.threads.LimitLatch  : Counting up[http-nio-8080-Acceptor] latch=8192
2024-02-29T23:20:29.322+01:00 DEBUG 18036 --- [nio-8080-exec-2] o.a.coyote.http11.Http11InputBuffer      : Before fill(): parsingHeader: [true], parsingRequestLine: [true], parsingRequestLinePhase: [0], parsingRequestLineStart: [0], byteBuffer.position(): [0], byteBuffer.limit(): [0], end: [0]
2024-02-29T23:20:29.322+01:00 DEBUG 18036 --- [nio-8080-exec-2] o.a.tomcat.util.net.SocketWrapperBase    : Socket: [org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper@3898940d:org.apache.tomcat.util.net.NioChannel@29629ce7:java.nio.channels.SocketChannel[connected local=/192.168.178.25:8080 remote=/192.168.178.25:30010]], Read from buffer: [0]
2024-02-29T23:20:29.322+01:00 DEBUG 18036 --- [nio-8080-exec-2] org.apache.tomcat.util.net.NioEndpoint   : Socket: [org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper@3898940d:org.apache.tomcat.util.net.NioChannel@29629ce7:java.nio.channels.SocketChannel[connected local=/192.168.178.25:8080 remote=/192.168.178.25:30010]], Read direct from socket: [173]
2024-02-29T23:20:29.322+01:00 DEBUG 18036 --- [nio-8080-exec-2] o.a.coyote.http11.Http11InputBuffer      : Received [POST /api/pdp/decide HTTP/1.1
user-agent: ReactorNetty/1.1.15
host: 192.168.178.25:8080
Accept: text/event-stream
Content-Type: application/json
content-length: 2

{}]
2024-02-29T23:20:29.322+01:00 DEBUG 18036 --- [nio-8080-exec-2] org.apache.tomcat.util.http.Parameters   : Set query string encoding to UTF-8
2024-02-29T23:20:29.322+01:00 DEBUG 18036 --- [nio-8080-exec-2] o.a.c.authenticator.AuthenticatorBase    : Security checking request POST /api/pdp/decide
2024-02-29T23:20:29.322+01:00 DEBUG 18036 --- [nio-8080-exec-2] org.apache.catalina.realm.RealmBase      :   No applicable constraints defined
2024-02-29T23:20:29.322+01:00 DEBUG 18036 --- [nio-8080-exec-2] o.a.c.authenticator.AuthenticatorBase    : Not subject to any constraint
2024-02-29T23:20:29.322+01:00 DEBUG 18036 --- [nio-8080-exec-2] o.apache.catalina.core.AsyncContextImpl  : Req:     null  CReq:     null  RP:     null  Stage: -  Thread: http-nio-8080-exec-2  State:                  N/A  Method: Constructor  URI: N/A
2024-02-29T23:20:29.322+01:00 DEBUG 18036 --- [nio-8080-exec-2] org.apache.coyote.AsyncStateMachine      : Changing async state from [DISPATCHED] to [STARTING]
2024-02-29T23:20:29.322+01:00 DEBUG 18036 --- [nio-8080-exec-2] o.apache.catalina.core.AsyncContextImpl  : Firing onStartAsync() event for any AsyncListeners
2024-02-29T23:20:29.322+01:00 DEBUG 18036 --- [nio-8080-exec-2] o.s.w.s.adapter.HttpWebHandlerAdapter    : [2d1c48ef] HTTP POST "/api/pdp/decide"
2024-02-29T23:20:29.322+01:00 DEBUG 18036 --- [nio-8080-exec-2] o.apache.coyote.http11.Http11Processor   : Socket: [org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper@3898940d:org.apache.tomcat.util.net.NioChannel@29629ce7:java.nio.channels.SocketChannel[connected local=/192.168.178.25:8080 remote=/192.168.178.25:30010]], Status in: [OPEN_READ], State out: [LONG]
2024-02-29T23:20:29.322+01:00 DEBUG 18036 --- [nio-8080-exec-2] org.apache.coyote.AsyncStateMachine      : Changing async state from [STARTING] to [STARTED]
2024-02-29T23:20:29.322+01:00 DEBUG 18036 --- [nio-8080-exec-2] o.apache.coyote.http11.Http11Processor   : Socket: [org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper@3898940d:org.apache.tomcat.util.net.NioChannel@29629ce7:java.nio.channels.SocketChannel[connected local=/192.168.178.25:8080 remote=/192.168.178.25:30010]], State after async post processing: [LONG]
2024-02-29T23:20:29.322+01:00 DEBUG 18036 --- [nio-8080-exec-2] o.apache.coyote.http11.Http11Processor   : Processing dispatch type: [NON_BLOCKING_READ]
2024-02-29T23:20:29.322+01:00 DEBUG 18036 --- [     parallel-8] o.s.w.s.s.DefaultWebSessionManager       : Created new WebSession.
2024-02-29T23:20:29.322+01:00 DEBUG 18036 --- [nio-8080-exec-2] org.apache.coyote.AsyncStateMachine      : Changing async state from [STARTED] to [READ_WRITE_OP]
2024-02-29T23:20:29.322+01:00 DEBUG 18036 --- [nio-8080-exec-2] o.apache.coyote.http11.Http11Processor   : Socket: [org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper@3898940d:org.apache.tomcat.util.net.NioChannel@29629ce7:java.nio.channels.SocketChannel[connected local=/192.168.178.25:8080 remote=/192.168.178.25:30010]], Status in: [OPEN_READ], State out: [LONG]
2024-02-29T23:20:29.322+01:00 DEBUG 18036 --- [nio-8080-exec-2] o.apache.coyote.http11.Http11Processor   : Socket: [org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper@3898940d:org.apache.tomcat.util.net.NioChannel@29629ce7:java.nio.channels.SocketChannel[connected local=/192.168.178.25:8080 remote=/192.168.178.25:30010]], Status in: [OPEN_READ], State out: [LONG]
2024-02-29T23:20:29.322+01:00 DEBUG 18036 --- [     parallel-8] o.s.s.w.s.u.m.OrServerWebExchangeMatcher : Trying to match using PathMatcherServerWebExchangeMatcher{pattern='/logout', method=POST}
2024-02-29T23:20:29.322+01:00 DEBUG 18036 --- [nio-8080-exec-2] org.apache.coyote.AsyncStateMachine      : Changing async state from [READ_WRITE_OP] to [STARTED]
2024-02-29T23:20:29.322+01:00 DEBUG 18036 --- [     parallel-8] athPatternParserServerWebExchangeMatcher : Request 'POST /api/pdp/decide' doesn't match 'POST /logout'
2024-02-29T23:20:29.322+01:00 DEBUG 18036 --- [nio-8080-exec-2] o.apache.coyote.http11.Http11Processor   : Socket: [org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper@3898940d:org.apache.tomcat.util.net.NioChannel@29629ce7:java.nio.channels.SocketChannel[connected local=/192.168.178.25:8080 remote=/192.168.178.25:30010]], State after async post processing: [LONG]
2024-02-29T23:20:29.322+01:00 DEBUG 18036 --- [     parallel-8] o.s.s.w.s.u.m.OrServerWebExchangeMatcher : No matches found
2024-02-29T23:20:29.322+01:00 DEBUG 18036 --- [nio-8080-exec-2] o.apache.coyote.http11.Http11Processor   : Processing dispatch type: [NON_BLOCKING_WRITE]
2024-02-29T23:20:29.322+01:00 DEBUG 18036 --- [nio-8080-exec-2] org.apache.coyote.AsyncStateMachine      : Changing async state from [STARTED] to [READ_WRITE_OP]
2024-02-29T23:20:29.322+01:00 DEBUG 18036 --- [nio-8080-exec-2] o.apache.coyote.http11.Http11Processor   : Socket: [org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper@3898940d:org.apache.tomcat.util.net.NioChannel@29629ce7:java.nio.channels.SocketChannel[connected local=/192.168.178.25:8080 remote=/192.168.178.25:30010]], Status in: [OPEN_WRITE], State out: [LONG]
2024-02-29T23:20:29.322+01:00 DEBUG 18036 --- [nio-8080-exec-2] o.apache.coyote.http11.Http11Processor   : Socket: [org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper@3898940d:org.apache.tomcat.util.net.NioChannel@29629ce7:java.nio.channels.SocketChannel[connected local=/192.168.178.25:8080 remote=/192.168.178.25:30010]], Status in: [OPEN_READ], State out: [LONG]
2024-02-29T23:20:29.322+01:00 DEBUG 18036 --- [     parallel-8] o.s.s.w.s.u.m.OrServerWebExchangeMatcher : Trying to match using PathMatcherServerWebExchangeMatcher{pattern='/**', method=null}
2024-02-29T23:20:29.322+01:00 DEBUG 18036 --- [nio-8080-exec-2] org.apache.coyote.AsyncStateMachine      : Changing async state from [READ_WRITE_OP] to [STARTED]
2024-02-29T23:20:29.322+01:00 DEBUG 18036 --- [     parallel-8] athPatternParserServerWebExchangeMatcher : Checking match of request : '/api/pdp/decide'; against '/**'
2024-02-29T23:20:29.322+01:00 DEBUG 18036 --- [nio-8080-exec-2] o.apache.coyote.http11.Http11Processor   : Socket: [org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper@3898940d:org.apache.tomcat.util.net.NioChannel@29629ce7:java.nio.channels.SocketChannel[connected local=/192.168.178.25:8080 remote=/192.168.178.25:30010]], State after async post processing: [LONG]
2024-02-29T23:20:29.322+01:00 DEBUG 18036 --- [     parallel-8] o.s.s.w.s.u.m.OrServerWebExchangeMatcher : matched
2024-02-29T23:20:29.322+01:00 DEBUG 18036 --- [     parallel-8] a.DelegatingReactiveAuthorizationManager : Checking authorization on '/api/pdp/decide' using org.springframework.security.config.web.server.ServerHttpSecurity$AuthorizeExchangeSpec$Access$$Lambda$763/0x0000021b813c60b8@752004c4
2024-02-29T23:20:29.322+01:00 DEBUG 18036 --- [     parallel-8] o.s.s.w.s.a.AuthorizationWebFilter       : Authorization successful
2024-02-29T23:20:29.322+01:00 DEBUG 18036 --- [     parallel-8] s.w.r.r.m.a.RequestMappingHandlerMapping : [2d1c48ef] Mapped to io.sapl.webfluxtomcat.demo.DummyController#decide(Object)
2024-02-29T23:20:29.322+01:00 DEBUG 18036 --- [     parallel-8] .r.m.a.RequestBodyMethodArgumentResolver : [2d1c48ef] Content-Type:application/json
2024-02-29T23:20:29.322+01:00 DEBUG 18036 --- [     parallel-8] .r.m.a.RequestBodyMethodArgumentResolver : [2d1c48ef] 0..1 [java.lang.Object]
2024-02-29T23:20:29.322+01:00 DEBUG 18036 --- [     parallel-8] org.apache.coyote.AsyncStateMachine      : Changing async state from [STARTED] to [READ_WRITE_OP]
2024-02-29T23:20:29.322+01:00 DEBUG 18036 --- [     parallel-8] o.apache.coyote.http11.Http11Processor   : Socket: [org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper@3898940d:org.apache.tomcat.util.net.NioChannel@29629ce7:java.nio.channels.SocketChannel[connected local=/192.168.178.25:8080 remote=/192.168.178.25:30010]], Status in: [OPEN_READ], State out: [LONG]
2024-02-29T23:20:29.322+01:00 DEBUG 18036 --- [     parallel-8] o.apache.coyote.http11.Http11Processor   : Socket: [org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper@3898940d:org.apache.tomcat.util.net.NioChannel@29629ce7:java.nio.channels.SocketChannel[connected local=/192.168.178.25:8080 remote=/192.168.178.25:30010]], Status in: [OPEN_READ], State out: [LONG]
2024-02-29T23:20:29.322+01:00 DEBUG 18036 --- [     parallel-8] org.apache.coyote.AsyncStateMachine      : Changing async state from [READ_WRITE_OP] to [STARTED]
2024-02-29T23:20:29.322+01:00 DEBUG 18036 --- [     parallel-8] o.apache.coyote.http11.Http11Processor   : Socket: [org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper@3898940d:org.apache.tomcat.util.net.NioChannel@29629ce7:java.nio.channels.SocketChannel[connected local=/192.168.178.25:8080 remote=/192.168.178.25:30010]], State after async post processing: [LONG]
2024-02-29T23:20:29.322+01:00 DEBUG 18036 --- [     parallel-8] o.s.http.codec.json.Jackson2JsonDecoder  : [2d1c48ef] Decoded [{}]
2024-02-29T23:20:29.322+01:00 DEBUG 18036 --- [     parallel-8] o.s.w.r.r.m.a.ResponseBodyResultHandler  : [2d1c48ef] Using 'text/event-stream' given [text/event-stream] and supported [text/event-stream]
2024-02-29T23:20:29.322+01:00 DEBUG 18036 --- [     parallel-8] o.s.w.r.r.m.a.ResponseBodyResultHandler  : [2d1c48ef] 0..N [org.springframework.http.codec.ServerSentEvent<java.lang.String>]
2024-02-29T23:21:06.346+01:00 DEBUG 18036 --- [alina-utility-2] o.apache.catalina.session.ManagerBase    : Start expire sessions StandardManager at 1709245266346 sessioncount 0
2024-02-29T23:21:06.346+01:00 DEBUG 18036 --- [alina-utility-2] o.apache.catalina.session.ManagerBase    : End expire sessions StandardManager processingTime 0 expired sessions: 0
2024-02-29T23:22:06.417+01:00 DEBUG 18036 --- [alina-utility-1] o.apache.catalina.session.ManagerBase    : Start expire sessions StandardManager at 1709245326417 sessioncount 0
2024-02-29T23:22:06.417+01:00 DEBUG 18036 --- [alina-utility-1] o.apache.catalina.session.ManagerBase    : End expire sessions StandardManager processingTime 0 expired sessions: 0
2024-02-29T23:23:06.457+01:00 DEBUG 18036 --- [alina-utility-2] o.apache.catalina.session.ManagerBase    : Start expire sessions StandardManager at 1709245386457 sessioncount 0
2024-02-29T23:23:06.457+01:00 DEBUG 18036 --- [alina-utility-2] o.apache.catalina.session.ManagerBase    : End expire sessions StandardManager processingTime 0 expired sessions: 0
2024-02-29T23:24:06.503+01:00 DEBUG 18036 --- [alina-utility-2] o.apache.catalina.session.ManagerBase    : Start expire sessions StandardManager at 1709245446503 sessioncount 0
2024-02-29T23:24:06.503+01:00 DEBUG 18036 --- [alina-utility-2] o.apache.catalina.session.ManagerBase    : End expire sessions StandardManager processingTime 0 expired sessions: 0
2024-02-29T23:25:06.546+01:00 DEBUG 18036 --- [alina-utility-1] o.apache.catalina.session.ManagerBase    : Start expire sessions StandardManager at 1709245506546 sessioncount 0
2024-02-29T23:25:06.546+01:00 DEBUG 18036 --- [alina-utility-1] o.apache.catalina.session.ManagerBase    : End expire sessions StandardManager processingTime 0 expired sessions: 0
2024-02-29T23:26:06.594+01:00 DEBUG 18036 --- [alina-utility-1] o.apache.catalina.session.ManagerBase    : Start expire sessions StandardManager at 1709245566594 sessioncount 0
2024-02-29T23:26:06.594+01:00 DEBUG 18036 --- [alina-utility-1] o.apache.catalina.session.ManagerBase    : End expire sessions StandardManager processingTime 0 expired sessions: 0
2024-02-29T23:27:06.653+01:00 DEBUG 18036 --- [alina-utility-2] o.apache.catalina.session.ManagerBase    : Start expire sessions StandardManager at 1709245626653 sessioncount 0
2024-02-29T23:27:06.653+01:00 DEBUG 18036 --- [alina-utility-2] o.apache.catalina.session.ManagerBase    : End expire sessions StandardManager processingTime 0 expired sessions: 0
2024-02-29T23:28:06.715+01:00 DEBUG 18036 --- [alina-utility-2] o.apache.catalina.session.ManagerBase    : Start expire sessions StandardManager at 1709245686715 sessioncount 0
2024-02-29T23:28:06.715+01:00 DEBUG 18036 --- [alina-utility-2] o.apache.catalina.session.ManagerBase    : End expire sessions StandardManager processingTime 0 expired sessions: 0
2024-02-29T23:29:06.773+01:00 DEBUG 18036 --- [alina-utility-2] o.apache.catalina.session.ManagerBase    : Start expire sessions StandardManager at 1709245746773 sessioncount 0
2024-02-29T23:29:06.773+01:00 DEBUG 18036 --- [alina-utility-2] o.apache.catalina.session.ManagerBase    : End expire sessions StandardManager processingTime 0 expired sessions: 0
2024-02-29T23:30:06.816+01:00 DEBUG 18036 --- [alina-utility-2] o.apache.catalina.session.ManagerBase    : Start expire sessions StandardManager at 1709245806816 sessioncount 0
2024-02-29T23:30:06.816+01:00 DEBUG 18036 --- [alina-utility-2] o.apache.catalina.session.ManagerBase    : End expire sessions StandardManager processingTime 0 expired sessions: 0
rstoyanchev commented 6 months ago

@mschweyen I've had a look at the sample.

I'm not sure what option(ChannelOption.SO_LINGER, 0) achieves. I see no difference with and without it. Basically, blockFirst() requests 1 and cancels the inbound stream, and Reactor Netty closes the connection. If I switch to blockLast(), it hangs for 10 seconds until the configured client response timeout, so again the option doesn't seem to achieve anything that I can see.

For a single run, with the spring-mvc-demo, I see one streamed event after which Tomcat triggers a timeout notification, which we handle with an ASYNC dispatch, and the request completes. I don't know why a timeout and not a closed connection error notification, but we handle both similarly. For the webflux-tomcat-demo, we use the non-blocking I/O API and the handling is a bit different (no ASYNC dispatch), but the sequence of events is the same.

For the long test with 30000 requests, each request streams its one event after which the client closes and performs the next request. On the server side, however, there are no timeout notifications, and as the controller stream is infinite, nothing further happens. This is both with Spring MVC and WebFlux. After the configured 10 second client response timeout, I see a flood of timeout notifications on the server side and those are handled as expected.

This is probably expected behavior due to Servlet containers not providing notifications for closed connections, see https://github.com/jakartaee/servlet/issues/44, so the only way to find out if there is a network issue is to keep writing (e.g. heartbeats) which eventually fails as the socket is not usable. For example, changing to:

@PostMapping(value = "/api/pdp/decide", produces = MediaType.TEXT_EVENT_STREAM_VALUE, consumes = MediaType.APPLICATION_JSON_VALUE)
public Flux<ServerSentEvent<String>> decide(
        @Valid @RequestBody Object input) {

    return Flux.merge(
            Flux.interval(Duration.ofSeconds(5))
                    .map(aLong -> ServerSentEvent.<String>builder().comment("heartbeat").build()),
            Flux.create((FluxSink<String> fluxSink) -> fluxSink.next(PERMIT))
                    .map(decision -> ServerSentEvent.<String>builder().data(decision).build()));
}

I start to see "Broken pipe" IOExceptions and the test runs to completion.

It is a little surprising that the Servlet non-blocking API doesn't do better, but that might be expected. It is perhaps worth finding out why Tomcat is able to notify us immediately for a single request, but not for many requests where it doesn't notice until the server is closed. It might be possible to improve something, /cc @markt-asf.

mschweyen commented 6 months ago

@rstoyanchev the option SO_LINGER=0 is set to avoid that the socket is held in TIME_WAIT state on the client site. Depending on the host config it can runs out of available sockets during load tests. The same can be reproduced with one server and multiple client hosts.


Here some additional information about our usecase: In our usecase the client is sending a request (subscription) to the server and the server answers with an indefinite flow of answers to the client. The Answers are sent over the Flux implemented as server-sent-events connection. Now it depends on the client how long the subsription persists. In this demo the client takes one single response (blockFirst).

The usecase is about a solution for stream based access control where a client sends an AuthroizationSubscription and the server is answering with a flow of AuthrizationDecisions. Now it's on the client to apply the decision. The Decision changes over time depending on the date, time, rule changes or any other exteral infromation. Actually we can not use netty because we are using Vaadin as Web UI.


Geat, thanks a lot for this solution. I understand that the heartbeat may be a way to ensure that tomcat detects a dead client. I will have a try now with our application and I'm confident that it solves the problem.

May be a good idea to find a gerneric solution for tomcat as this can easily lead to problems or even be used for DOS attacs.

rstoyanchev commented 6 months ago

Thanks for the extra context. I can't speak for Tomcat, but the above referenced Servlet issue is probably where something needs to change first, but I could be wrong.