WebSocket hangs in blockingWrite

dreis2211 commented 6 years ago

Hi,

it was suggested to me by @joakime to open a new issue for #272 as it still occurs on 9.4.8 - with a slight tendency to occur more often now (which might be just bad luck on our end).

Unfortunately, I can't say anything new about the issue. It still appears to be random (regardless of load for example) that threads end up in WAITING state and only a server restart helps to solve the issue.

"clientOutboundChannel-23" #144 prio=5 os_prio=0 tid=0x00007fceb0001000 nid=0x28f7 waiting on condition [0x00007fce59acc000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000006ca773c88> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
        at org.eclipse.jetty.util.SharedBlockingCallback$Blocker.block(SharedBlockingCallback.java:210)
        at org.eclipse.jetty.websocket.common.BlockingWriteCallback$WriteBlocker.block(BlockingWriteCallback.java:90)
        at org.eclipse.jetty.websocket.common.WebSocketRemoteEndpoint.blockingWrite(WebSocketRemoteEndpoint.java:107)
        at org.eclipse.jetty.websocket.common.WebSocketRemoteEndpoint.sendString(WebSocketRemoteEndpoint.java:393)
        at org.springframework.web.socket.adapter.jetty.JettyWebSocketSession.sendTextMessage(JettyWebSocketSession.java:273)
        at org.springframework.web.socket.adapter.AbstractWebSocketSession.sendMessage(AbstractWebSocketSession.java:101)
        at org.springframework.web.socket.handler.ConcurrentWebSocketSessionDecorator.tryFlushMessageBuffer(ConcurrentWebSocketSessionDecorator.java:132)
        at org.springframework.web.socket.handler.ConcurrentWebSocketSessionDecorator.sendMessage(ConcurrentWebSocketSessionDecorator.java:104)
        at org.springframework.web.socket.messaging.StompSubProtocolHandler.sendToClient(StompSubProtocolHandler.java:470)
        at org.springframework.web.socket.messaging.StompSubProtocolHandler.handleMessageToClient(StompSubProtocolHandler.java:457)
        at org.springframework.web.socket.messaging.SubProtocolWebSocketHandler.handleMessage(SubProtocolWebSocketHandler.java:338)
        at org.springframework.messaging.support.ExecutorSubscribableChannel$SendTask.run(ExecutorSubscribableChannel.java:135)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

As this is affecting our production servers, I'd appreciate if this is investigated again. I'd also appreciate any workaround that doesn't suggest a server restart.

Cheers, Christoph

joakime commented 6 years ago

What are the various idle timeouts you have configured? Can you tell us what JVM version? and on what OS you are experiencing these problems?

joakime commented 6 years ago

OP is using stomp + spring. What version of stomp/spring are you using?

No reports of similar problems from cometd community, or google app engine community. (yet?)

dreis2211 commented 6 years ago

I'm running JDK 8u121 on Debian Jessie with Spring-Boot 1.5.9 (Spring Framework 4.3.13). As far as the idle-timeout concerns (and I understood SharedBlockingCallback correctly) it indefinetily blocks due to -1 as a timeout for WebSockets, hence the WAITING instead of TIMED_WAITING status of the threads. The idle timeout for the QueuedThreadPool is on 5000ms seconds, though.

joakime commented 6 years ago

The QueuedThreadPool idleTimeout will not take effect on the actively being used thread. The QueuedThreadPool idleTimeout is for idle timeout of threads returned to the pool and have not been allocated/requested for another process. These QTP threads are then removed from the pool entirely on their idle timeout.

The -1 idle-timeout on websocket is an indefinite idle timeout and is operating properly. Perhaps you really want an idle timeout on your websockets, if so, set the idle timeout on the websocket session (if you want it to apply to a specific connection), or the websocket container (if you want it to apply to all new connections).

dreis2211 commented 6 years ago

Hi @joakime. I was just answering your question for the various idle timeouts...

As the previous thread/issue #272 already mentions, many argue the -1 as an idle timeout in SharedBlockingCallback (or BlockingWriteCallback to be more specific) is somewhat dangerous as it leads to indefinitely blocking the threads which one cannot recover from. Hence the many workarounds to recompile Jetty with an overridden getIdleTimeout() on BlockingWriteCallback.

Correct me if I'm wrong, but your suggestion would not solve this problem. And even if it does, it would be just a different workaround instead of actually solving the root-cause (which I feel we're not getting closer to find unfortunately).

Speaking of the root-cause: Though we have no clear evidence, the behavior described in this issue and #272 seems to be caused by slow clients for us (we have many mobile users that might have a slow connection). Maybe this helps you to come up with a test scenario for the root-cause.

Cheers

joakime commented 6 years ago

I'm curious what the OS thinks of the connection. Is it still live? (tcpdump information should be tell us)

Having an idle timeout of -1 isn't normal. On the distribution its defaulted to 60,000 ms, and embedded usage seems to default to 30,000 ms. In fact, the only way I've been able to have a -1 idle timeout is to manually set it to -1.

The process to determine idle timeout seems to be ...

If the @WebSocket annotation has no idle timeout specified, then the WebSocket session's idle timeout is used.
If the WebSocket session has no idle timeout specified, then the WebSocket container's version is used.
If the WebSocket container has no idle timeout specified, then the Connector's idle timeout is used.
The standard Connector has an idle timeout of 30,000 ms

But even at -1 idle timeout, the active websocket connection with an active message write, sits in SharedBlockingCallback$Blocker.block() if attempting to write to a congested connection (one where the remote side isn't reading). It will sit like this indefinitely until the connection is closed or the congestion clears up. This is a sane setup for many websocket user. Especially connections used between servers in a data center.

If I set a websocket container max idle timeout then even that scenario idle times out.

I can sort-of reproduce the SharedBlockingCallback concern, but its only under a specific (working as intended) scenario ...

websocket idle timeout is -1 (unlimited)
the connection is active
the remote is not reading
the local is attempting write
the connection is congested

This is normal behavior for idle timeouts of -1.

elekktrisch commented 6 years ago

Here is a way to reproduce it consistently (even though I am not sure it actually is the exact same issue):

Register a few Transparent Proxy Beans in the Spring-Boot Configuration (the more of them, the more likely the issue occurs)

@Bean
public ServletRegistrationBean transparentProxyServlet() throws MalformedURLException {
ServletRegistrationBean servletRegistrationBean = new ServletRegistrationBean(new ProxyServlet.Transparent(), contextPath);
servletRegistrationBean.setName("testProxy");
servletRegistrationBean.addInitParameter("proxyTo", new URL("http", serverHost, serverPort).toString());

servletRegistrationBean.addInitParameter("prefix", "/foo");

return servletRegistrationBean;
}

Start on a Server with a lot of Hardware Resources (I encountered the issue after I added 5 transparent proxies. But only on the Server with 80 CPU Cores. It did not occur on the Server which only had 24 CPU Cores with the exact same Source Code)

My workaround for the time being was to allow for more Threads for Jetty:

 @Bean
  public JettyEmbeddedServletContainerFactory jettyEmbeddedServletContainerFactory(
      @Value("${server.port}") final String port,
      @Value("${jetty.threadPool.maxThreads:1000}") final String maxThreads,
      @Value("${jetty.threadPool.minThreads:8}") final String minThreads,
      @Value("${jetty.threadPool.idleTimeout:60000}") final String idleTimeout
  ) {
    final JettyEmbeddedServletContainerFactory factory = new JettyEmbeddedServletContainerFactory(Integer.valueOf(port));
    factory.addServerCustomizers((JettyServerCustomizer) server -> {
      final QueuedThreadPool threadPool = server.getBean(QueuedThreadPool.class);
      threadPool.setMaxThreads(Integer.valueOf(maxThreads));
      threadPool.setMinThreads(Integer.valueOf(minThreads));
      threadPool.setIdleTimeout(Integer.valueOf(idleTimeout));
    });

    return factory;
  }

At runtime, Jetty actually allocated over 300 Threads even before any traffic was on the server. Since 200 is the default max, Jetty probably tried to wait for more Threads available before serving any responses.

Might have to do with this: https://github.com/spring-projects/spring-boot/issues/8917

joakime commented 6 years ago

Ah! a proxy. How do we know that the problem lies in the websocket side? or the proxy side? A standard HTTP proxy is a common fault with websocket. Only WebSocket/Upgrade aware proxies are supported (they recognize the upgrade and no longer cache/buffer the resulting upgraded connection). Prior to WebSocket there were not many HTTP upgraded connection types to worry about. If you are using the Spring proxy, then that is a known to have issues with WebSocket (Netflix even wrote their own spring-proxy for project zuul to solve for this). spring-cloud and spring-cloud-gateway projects also report many issues with WebSocket using spring's proxy.

Regarding threading configuration. Keep in mind that allocated threads vs used threads is different. The threadpool will pre-allocate a number of threads for its pool, as the allocation step is costly one. But once allocated, the threads exist, are idle/waiting, and are being tracked by the pool.

joakime commented 6 years ago

@elekktrisch the spring-projects/spring-boot#8917 is about excessively low/insufficient thread pool configurations. The configurations you are talking about in your recent comment are healthy and not in the realm of that specific spring-boot issue.

If you are using Jetty 9.4.8.v20171121 then you will see warning/error logging events telling you of insufficient threads configurations (which was recently added as a response to spring-projects/spring-boot#8917

See: Issue #1851 and https://github.com/eclipse/jetty.project/blob/jetty-9.4.8.v20171121/jetty-util/src/main/java/org/eclipse/jetty/util/thread/ThreadPoolBudget.java#L137-L157

dreis2211 commented 6 years ago

We don't use Spring proxy facilities and don't have an invalid thread-configuration (at least we're not affected by the mentioned Spring-Boot issue).

What we do use though is nginx as a reverse proxy. Which to my knowledge is WebSocket/Upgrade aware.

joakime commented 6 years ago

nginx added websocket support mid 2014 with I believe nginx version 1.3.

Just make sure you turn off proxy buffering (a known problem with nginx's implementation)

proxy_buffering off;

Also make sure you set a valid proxy_read_timeout for your nginx configuration (do not use infinite values here).

joakime commented 6 years ago

Looking at the various reports of websocket support in nginx it seems that version 1.3.13 is the first version that the userbase reports as stable for websocket.

dreis2211 commented 6 years ago

Thanks. We already use 1.12.2 with proxy_buffering off;.

joakime commented 6 years ago

@dreis2211 what's your nginx proxy settings for ...

proxy_read_timeout 120s;
proxy_write_timeout 120s;

joakime commented 6 years ago

Also, depending on your version of nginx it might be proxy_send_timeout instead of proxy_write_timeout

dreis2211 commented 6 years ago

@joakime Both are not specified in our config, so they should use the default value of 60 seconds

TheWizz commented 6 years ago

We've had this problem for some time. I've found a 100% sure way to reproduce it which may be of interest. First some background. We're currently running 3.9.19 under Spring Boot 1.4.7. Our application often has hundreds of mobile clients over websockets, based on Stomp and the simple broker built into spring. We had occasional problems under heavy load on our servers where the websocket traffic just stopped (while other "regular" requests continue to work, and the server stayed up). When this "zombie mode" occurred, the only remedy was to restart the server. We've scrutinized our code, and found nothing here causing it, so I began suspecting something else in the stack. Therefore I was very happy when I found this thread and the previous one #272, which seem to describe exactly the problem we're seeing. We had an older version of jetty before (9.2.16), and there were some hints along the way that this may have been "fixed" in 9.2.x, so we started by upgrading Spring Boot along with jetty to the versions mentioned above. That did NOT fix the problem. I then proceeded applying the work-around described in #272 ("overriding" the socket write timeout), and that seem to have fixed it for us. The sockets still hang on write, and if a lot of websockets "die" on us at the same time, this may still stave the thread pool for some time. But once those writes time out, the server comes back to normal again, instead of requiring a server restart. This was a huge improvement for us.

Now to the way to reproduce it. Unfortunately, all of this is in company code, so I can't share the code. But I can share the general idea.

We have tests written to simulate large numbers of clients, doing both regular requests as well as the websocket communication. A single nodejs instance can easily simulate several hundred users (with the exception of cellular network peculiarities). Since this is nodejs based, it's also easy to fire up a couple of Amazon EC2 instances to run really massive tests for an hour or so, and then shut them down.

Now, as long as everyone plays nice, and close their websocket connections in an orderly manner, everything works and is happy. It appears that the lock-up happens ONLY when sockets aren't closed but merely cease to communicate, as would happen if the phone in toe other end suddenly goes off grid, is turned off, battery dies, or similar.

The way I found I could simulate this situation is to run a test with, say, 100 simulated users on a separate machine on our network. I have the system in a state where it keeps sending data over the websocket to those 100 clients. I then just yank the Ethernet wire out of the machine running the test clients. This quickly causes any send buffers to fill up, and the sending threads to (presumably) block on some low level write call, very soon causing all threads in the pool to block. Our app uses the standard thread pool provided by Spring/Jetty here, which has a pool with twice the number of threads as the processor has cores, so we typically see 8 or 16 threads in the pool. That means that as long as we have more than this number of clients all "die" at about the same time, while data is being sent to those websockets, the server will almost instantly go into "zombie mode". Having a reproducible test like this quickly allowed us to try out various scenarios, and let us come to the conclusion that the timeout is a viable work-around for now.

Hopefully this will get fixed in jetty in some more permanent and elegant way, and perhapsmy ramblings here can allow you to at least get a reproducible test case to track it down. I don't feel sufficiently familiar with Jetty's innards to venture a fix myself.

-JM

joakime commented 6 years ago

We've tested Jetty 9.3.x and 9.4.x at 10k+ active websocket connections, with some connections going away (no traffic, just dead). If we have a reasonable Idle Timeout things work as expected.

Even in the infinite timeout scenario (Idle Timeout == -1) the dead network connections will eventually be cleaned up by the OS, dependent on network configuration, and the connection to Jetty will die and be cleaned up as well.

The only scenario we've been able to use to reproduce this reported issue is having a simple proxy with its own infinite timeout, and jetty configured for infinite timeout. Then the connection remains "live" between Jetty and the proxy, with the proxy itself not cleaning up the connection when it goes away. This behavior seems to be exacerbated with various proxy connection buffering configurations as well.

In summary so far (Jan 2018):

So far, only some stomp/spring users are the reporting this behavior with mobile clients. (many other heavy stomp + mobile users have no such issue, and other libraries that use Jetty WebSockets have not reported these behaviors on mobile clients as well)
The reported reproduction cases have not been provided to the Jetty developers yet.
- @elekktrisch provided example is incomplete and cannot be run as is.
- Running the https://github.com/spring-guides/gs-messaging-stomp-websocket example produces no issues (tested as server on Ubuntu 16.04.3, Fedora Core 25, Windows 10, OSX 10.12.6. Using Java 8u144 on each machine)
- Modifying the https://github.com/spring-guides/gs-messaging-stomp-websocket in the way @elekktrisch stated makes no difference.
- We have obtained a 32 core machine recently ourselves to attempt to reproduce this locally. (we have been in contact with a known user with a high load, single jetty instance, on a 48+ core machine with Jetty WebSockets and hundreds of thousands of live mobile clients, they have been monitoring for over a year now, and have reported no issues similar to #272 and #2061 yet. They are not using spring/stomp but cometd)
If you have unreliable connections, use a sane idle timeout, everywhere. (your websockets, your connections, your OS networking configuration, your proxies, etc..)
If you have a high load scenario, the typical general advice is:
- don't use a executor (threadpool) implementation/configuration with a fixed upper limit (use one that dynamically grows, applies low resource backpressure, and falls back to a more normal steady state when load subsides. Such as the QueuedThreadPool from Jetty)
- don't use blocking send/writes. (this holds a thread)
- use async send/writes. (only uses thread when read/write is possible)
- don't use java.io streaming reads in websocket (each message causes a new thread)
- don't use java.io streaming writes in websocket (each message causes a new thread)

If you want to help, fork the example project ... https://github.com/spring-guides/gs-messaging-stomp-websocket Modify it to cause the issue being reported, and report back here with your forked version for us to test against.

TheWizz commented 6 years ago

I can try putting together a reproducible case based on the "gs-messaging-stomp-websocket" example you suggest, combined with a nodejs-based multi-client simulator. That's what we use to run our own load testing.

However, before pursuing this, I'd like to make sure we indeed "use a sane idle timeout everywhere", as you suggest. Perhaps you could provide some pointers as to how/where timeouts are specified in Jetty and/or Spring for "your websockets" and "your connections". Any pointers related to timeouts in the "OS networking configuration" (assuming Linux OS) would be most appreciated too. After making sure we have those set to cope with unreliable clients (i.e., lots of mobile phones on cellular networks), I'll re-run our tests to see if I can still repro the lockup. If so, I'll do my best to put together a reproducible test case I can contribute.

-JM

dreis2211 commented 6 years ago

@TheWizz any news from your end about this ticket?

TheWizz commented 6 years ago

No. We're using the work-around suggested by jonhill1977 in https://github.com/eclipse/jetty.project/issues/272. This is really a "hack", since it involves replacing the org.eclipse.jetty.websocket.common.BlockingWriteCallback class with our own (as there seem to be no way to override this behavior from the "outside"). Doing so solved the problem for us, and we just moved on. A cleaner solution would of course be preferred.

dreis2211 commented 6 years ago

So no other "solution" than ours. But thanks for getting back.

dreis2211 commented 6 years ago

@joakime Triggered by #2491 I was thinking if the clients connecting to our servers use the fragments extension and if the two bugs might be connected. While they (unfortunately) don't use it they make use of the per-message-deflate extension. I don't know if that helps for this bug or if it's the same for the other reporters, but I thought this is some new information that I need to share for completeness reasons.

joakime commented 6 years ago

The Fragment Extension is very rarely used.
It's not even a formal spec extension. Only the Autobahn lib and Jetty support it. So you'll never see it from any other client implementation (such as a web browser).

If you suspect the permessage-deflate, disable it on the server side. (either by removing it form the websocket extension registry, or by not negotiating it during the handshake on the server side).

Option 1: If using WebSocketservlet (to unregister the permessage-deflate extension)

public static class MyWebSocketServlet extends WebSocketServlet
{
    @Override
    public void configure(WebSocketServletFactory factory)
    {
        factory.getExtensionFactory().unregister("permessage-deflate");

        // The rest of your processing here ...
    }
}

Option 2: If using WebSocketCreator (to remove just permessage-deflate)

public static class MyWebSocketCreator implements WebSocketCreator
{
    @Override
    public Object createWebSocket(ServletUpgradeRequest servletUpgradeRequest, ServletUpgradeResponse servletUpgradeResponse)
    {
        // Strip permessage-deflate negotiation
        servletUpgradeRequest.setExtensions(
                servletUpgradeRequest.getExtensions()
                        .stream()
                        .filter((extConfig) -> !extConfig.getName().equals("permessage-deflate"))
                        .collect(Collectors.toList())
        );

        // The rest of your processing here ...
    }
}

Option 3: If using JSR's ServerEndpointConfig.Configurator (to remove all extensions offered)

public static class MyServer_NoExtensionsConfigurator extends ServerEndpointConfig.Configurator
{
    @Override
    public List<Extension> getNegotiatedExtensions(List<Extension> installed, List<Extension> requested)
    {
        // Strip all offered extensions
        return Collections.emptyList();
    }
}

Option 4: If using JSR's ServerEndpointConfig.Configurator (to to filter permessage-deflate offered extension)

public static class MyServer_NoPerMessageConfigurator extends ServerEndpointConfig.Configurator
{
    @Override
    public List<Extension> getNegotiatedExtensions(List<Extension> installed, List<Extension> requested)
    {
        List<Extension> negotiated = new ArrayList<>();

        for (Extension offered : requested)
        {
            if (offered.getName().equalsIgnoreCase("permessage-deflate"))
            {
                // skip
                continue;
            }

            if (installed.stream().anyMatch((available) -> available.getName().equals(offered.getName())))
            {
                negotiated.add(offered);
            }
        }

        return negotiated;
    }
}

TheWizz commented 6 years ago

We're definitely not using any uncommon options for our websockets. A way I found to repro it is to have a significant number (50 or so) of websockets running, and then "unplug" those clients rather than closing them gracefully. I do this by having a computer (runnig some NodeJS-based test code to excercise the websockets), and then merely pull the Ethernet cable from the test machine. I then terminate the Node test code (with Ethernet still unplugged). I then plug it back in, and restart the test code. This leaves numerous websocket connections in a "zombie" state. Any attempts at writing to those sockets will eventully block somewhere. It is possible that things will eventually time out. But from the clients point of view, the server is "hung" as it stops responding to calls.

Note that id I instead of unplugging the Ethernet cable close down the test clients orderly, no hang occurs. So it seems related to this "unexpected" termination of the socket. Surely, this is "wrong" on the websocket clients part. But I suspect that's essentially what happens with real websocket clients, which are typically phones. For instance, the phone leaves cellular coverage while being connected, it runs out of battery. Or perhaps it's just put to sleep and back in the pocket. If a few connections "misbehave" in this way, it seems to cause no harm. But if enough of them do so quickly enough, all websocket threads likely end up blocking somewhere.

Increasing the pool size seems to push this problem further a bit. But given enough misbehaving clients, it will still happen at some point.

It seems that applying some reasonable write timeout, causing the write to be aborted unless completed within a (relatively short - say seconds) timeframe, fixes the problem, as it indicates the websocket has gone bad, and will therefore be disscarded.

Some of the above is empirical tests (with out nodejs based test code, and pulling the Ethernet jack), while the last paragraph is speculation on my part. But in either case, hacking the jetty code, as suggested by jonhill1977 in #272, makes the "hang" go away, and clients happy.

-JM

joakime commented 6 years ago

@TheWizz doing various "unplug" tests show that the idle timeouts do kick in and terminate the connection and free up the resources associated with it.

Using this test project - https://github.com/joakime/jetty-websocket-browser-tool

It will compile down to a uber-jar if you want to move it around to other systems to try with.

The test setup used ...

Laptop A (Wired Networking only. Wifi turned off) - this is the client machine. Laptop A is wired connection to switch X Switch X is wired to Switch Y Server Foo is wired to Switch Y Server Foo is a Fedora 25 machine with Java 8u171 and Jetty 9.4.9. Server Bar is wired to Switch Y Server Bar is a Windows 10 machine with Java 8u171 and Jetty 9.4.9.

Server is started.
WebSocket client is connected (to one of 4 endpoints: JSR Basic Remote, JSR Async Remote, Jetty Blocking Remote, Jetty Async Remote)
WebSocket client requests a flood of messages (text and/or binary, different sizes, etc)
While WebSocket client is receiving messages, the network cable on Laptop A is pulled.
Configured Idle Timeout later, the server shows the following behavior ...

java -jar .\target\jetty-websocket-browser-tool-1.0-SNAPSHOT.jar --websocket.timeout.idle=10000
2018-05-02 14:37:00.009:INFO::main: Logging initialized @197ms to org.eclipse.jetty.util.log.StdErrLog
2018-05-02 14:37:00.056:INFO:oejd.WSServer:main: Connector Idle Timeout (ms): 300000
2018-05-02 14:37:00.081:INFO:oejd.WSServer:main: WebSocket Idle Timeout (ms): 10000
2018-05-02 14:37:00.203:INFO:oejs.Server:main: jetty-9.4.z-SNAPSHOT; built: 2018-03-20T07:21:10-05:00; git: 1f8159b1e4a42d3f79997021ea1609f2fbac6de5; jvm 10+46
2018-05-02 14:37:00.264:INFO:oejsh.ContextHandler:main: Started o.e.j.s.ServletContextHandler@ed7f8b4{/,jar:file:/C:/code/jetty/github/jetty-websocket-browser-tool/target/jetty-websocket-browser-tool-1.0-SNAPSHOT.jar!/debug-base/,AVAILABLE}
2018-05-02 14:37:00.376:INFO:oejs.AbstractConnector:main: Started ServerConnector@7c06669{HTTP/1.1,[http/1.1]}{0.0.0.0:8080}
2018-05-02 14:37:00.377:INFO:oejs.Server:main: Started @572ms
2018-05-02 14:37:00.382:INFO:oejd.WSServer:main: Server started on http://192.168.0.119:8080/
2018-05-02 14:37:13.938:INFO:oejdj.JsrBasicWebSocket:qtp1401132667-14: onTextMessage(floodtext:20,10000)
2018-05-02 14:37:13.940:INFO:oejd.FloodTextTask:qtp1401132667-14: Creating min:20 max10000
2018-05-02 14:37:13.943:INFO:oejd.FloodTextTask:qtp1401132667-18: active:true isConnected:true
2018-05-02 14:37:36.493:WARN:oejdj.JsrBasicWebSocket:Scheduler-1897115967: onBaseError(java.util.concurrent.TimeoutException)
java.util.concurrent.TimeoutException: Idle timeout expired: 10001/10000 ms
        at org.eclipse.jetty.io.IdleTimeout.checkIdleTimeout(IdleTimeout.java:166)
        at org.eclipse.jetty.io.IdleTimeout$1.run(IdleTimeout.java:50)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
        at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
        at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.base/java.lang.Thread.run(Unknown Source)
2018-05-02 14:37:36.497:WARN:oejwcec.CompressExtension:Scheduler-1897115967:
java.util.concurrent.TimeoutException: Idle timeout expired: 10001/10000 ms
        at org.eclipse.jetty.io.IdleTimeout.checkIdleTimeout(IdleTimeout.java:166)
        at org.eclipse.jetty.io.IdleTimeout$1.run(IdleTimeout.java:50)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
        at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
        at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.base/java.lang.Thread.run(Unknown Source)
2018-05-02 14:37:36.498:WARN:oejut.QueuedThreadPool:qtp1401132667-18:
org.eclipse.jetty.websocket.api.WebSocketException: Unable to send basic text message: java.util.concurrent.TimeoutException: Idle timeout expired: 10001/10000 ms
        at org.eclipse.jetty.demo.jsr.JsrBasicWebSocket.sendTextMessage(JsrBasicWebSocket.java:46)
        at org.eclipse.jetty.demo.FloodTextTask.run(FloodTextTask.java:61)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:754)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:672)
        at java.base/java.lang.Thread.run(Unknown Source)
Caused by:
java.io.IOException: java.util.concurrent.TimeoutException: Idle timeout expired: 10001/10000 ms
        at org.eclipse.jetty.util.SharedBlockingCallback$Blocker.block(SharedBlockingCallback.java:226)
        at org.eclipse.jetty.websocket.common.BlockingWriteCallback$WriteBlocker.block(BlockingWriteCallback.java:90)
        at org.eclipse.jetty.websocket.common.WebSocketRemoteEndpoint.blockingWrite(WebSocketRemoteEndpoint.java:107)
        at org.eclipse.jetty.websocket.common.WebSocketRemoteEndpoint.sendString(WebSocketRemoteEndpoint.java:394)
        at org.eclipse.jetty.websocket.jsr356.JsrBasicRemote.sendText(JsrBasicRemote.java:108)
        at org.eclipse.jetty.demo.jsr.JsrBasicWebSocket.sendTextMessage(JsrBasicWebSocket.java:42)
        at org.eclipse.jetty.demo.FloodTextTask.run(FloodTextTask.java:61)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:754)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:672)
        at java.base/java.lang.Thread.run(Unknown Source)
Caused by:
java.util.concurrent.TimeoutException: Idle timeout expired: 10001/10000 ms
        at org.eclipse.jetty.io.IdleTimeout.checkIdleTimeout(IdleTimeout.java:166)
        at org.eclipse.jetty.io.IdleTimeout$1.run(IdleTimeout.java:50)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
        at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
        at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.base/java.lang.Thread.run(Unknown Source)
2018-05-02 14:37:36.499:INFO:oejdj.JsrBasicWebSocket:Scheduler-1897115967: onBaseClose(1001, Idle Timeout)
2018-05-02 14:37:36.501:WARN:oejut.QueuedThreadPool:qtp1401132667-18: Unexpected thread death: org.eclipse.jetty.util.thread.QueuedThreadPool$2@2ab4ca4b in QueuedThreadPool[qtp1401132667]@5383967b{STARTED,8<=12<=200,i=4,q=0}
2018-05-02 14:37:36.502:WARN:oejdj.JsrBasicWebSocket:Scheduler-1897115967: onBaseError(java.util.concurrent.TimeoutException)
java.util.concurrent.TimeoutException: Idle timeout expired: 10001/10000 ms
        at org.eclipse.jetty.io.IdleTimeout.checkIdleTimeout(IdleTimeout.java:166)
        at org.eclipse.jetty.io.IdleTimeout$1.run(IdleTimeout.java:50)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
        at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
        at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.base/java.lang.Thread.run(Unknown Source)

The behavior and results are the same for Blocking writes and Async writes ...

Blocking Writes

javax.websocket.RemoteEndpoint.Basic .sendText(String)
javax.websocket.RemoteEndpoint.Basic .sendBinary(ByteBuffer)
org.eclipse.jetty.websocket.api.RemoteEndpoint .sendString(String)
org.eclipse.jetty.websocket.api.RemoteEndpoint .sendBytes(String)

Async Writes

javax.websocket.RemoteEndpoint.Async .sendText(String, SendHandler)
javax.websocket.RemoteEndpoint.Async .sendBinary(ByteBuffer, SendHandler)
org.eclipse.jetty.websocket.api.RemoteEndpoint .sendStringByFuture(String)
org.eclipse.jetty.websocket.api.RemoteEndpoint .sendBytesByFuture(String)

A variety of clients have been used (so far) to test this behavior with no change in results (the blocked write is exited when the idle timeout fires)

Google Chrome 66 (on desktop)
Firefox 59 (on desktop)
Jetty WebSocketClient
Jetty's JSR356 ClientContainer
Tyrus JSR356 Client
Autobahn WebSocket Client
Google pyWebSocket
wscat (found on npm)

I will continue to improve this simple tool to attempt to get to the bottom of this reported issue.

dreis2211 commented 6 years ago

@joakime Your efforts are highly appreciated!

May I ask what happens when you remove the timeout configuration at https://github.com/joakime/jetty-websocket-browser-tool/blob/master/src/main/java/org/eclipse/jetty/demo/WSServer.java#L125

Is the behaviour reproducible in that case?

Cheers, Christoph

joakime commented 6 years ago

@dreis2211 there are 3 timeout types in that codebase.

Connector based as

// Jetty HTTP or HTTPS or UnixSocket or HTTP/2 Connector
ServerConnector.setIdleTimeout(30000); // in milliseconds

Seen at WSServer.java Line 101

JSR356 (javax.websocket) based as

// Server Container
javax.websocket.ServerContainer.setDefaultMaxSessionIdleTimeout(10000); // in milliseconds
// or Client Container
javax.websocket.WebSocketContainer.setDefautlMaxSessionIdleTimeout(10000); // in milliseconds
// or Individual Session
javax.websocket.Session.setMaxIdleTimeout(10000); // in milliseconds

Seen at WSServer.java Line 119

Jetty WebSockets (org.eclipse.jetty.websocket) based as

// On the Policy
org.eclipse.jetty.websocket.api.WebSocketPolicy.setIdleTimeout(10000); // in milliseconds
// Policy on the Server Container
org.eclipse.jetty.websocket.servlet.WebSocketServletFactory.getPolicy().setIdleTimeout(10000); // in milliseconds
// Policy on the Client Container
org.eclipse.jetty.websocket.client.WebSocketClient.getPolicy().setIdleTimeout(10000); // in milliseconds
// Policy on an individual Session
org.eclipse.jetty.websocket.api.Session.getPolicy().setIdleTimeout(10000); // in milliseconds

Seen at WSServer.java Line 125

Test with no idle timeouts

If you remove all 3 timeouts, by setting them to -1 (this is done to avoid default values), then yes. you can replicate the SharedBlockingCallback$Blocker.block behavior reported. Which is the expected behavior in this configuration. (can't emphasize this enough)

On java, with this blocked, and no idle timeout, there are no events that clean up that connection. Either an idle timeout kicks in, or something has to read or write on that connection for it to detect the stale connection. Even if the OS recognizes the connection as stale and cleans it up at the OS level, that event is never communicated to the Java JVM (this is normal and expected java JVM behavior). The Java JVM has to attempt to read or write (and likely get an exception) for the Java based resources to be cleaned up.

Note: theses are just 3 of the typical idle timeouts you would have available to configure. There are many more, but these 3 (well, 2 if you just use one of the two websocket apis and a single connector) are typical and usually the easiest to configure. The websocket ones, for example, can be set from within your own websocket implementations, either at the websocket container at startup, or within your specific websocket session.

There are also other timeouts scattered around the codebase for other purposes that don't apply to this specific issue:

HttpClient's Request Idle Timeout
HttpClient's Response Idle Timeout
Endpoint Idle Timeout (this one is related to this issue, but is a pain in the butt to set from within websocket code)
HTTP/2 Per Connection Idle Timeout
HTTP/2 Per Session Idle Timeout
HTTP/2 Per Stream Idle Timeout
HttpClient Connect Timeout
HttpClient Proxy Timeout
HttpClient DNS Timeout
Servlet HttpSession Idle Timeout
Servlet AsyncContext Timeout
etc ...

The techniques that @jonhill1977 uses in issue #272 is actually invalid, as it cannot clean up the resources held by websocket and perform a proper close handshake on the websocket connection. This is because that technique has no knowledge of the state of the websocket stream, or the state of the endpoint read / write, etc. The APIs available to "close" the connection perform a websocket close handshake, they don't terminate/disconnect/abort the connection. So the minute you "timeout" of the SharedBlockingCallback with the @jonhill1977 hack you are right back into the SharedBlockingCallback of the close handshake again. The correct way to handle this is to set any of the idle timeouts to a sane value.

In conclusion (for server behavior)

For a server, set a sane Idle Timeout, don't run with infinite (or disabled) idle timeout.

Set the Idle Timeout at :

The websocket session level
or The websocket container level
or The jetty server connector level

More to come

Since this issue and #272 cover both behavior on server side and client side, the tests now need to move to cover the Jetty WebSocket client behavior in this regard as well (I've been using the 2 websocket clients in Jetty 9.4.9 already for the server tests, but they need more scrutiny on the "unplug" tests that @TheWizz suggested).

joakime commented 6 years ago

jetty-websocket-browser-tool update

project: https://github.com/joakime/jetty-websocket-browser-tool commit: joakime/jetty-websocket-browser-tool@39878c78e2efeae9fe1625eaf746036c3c4329d0

The test tooling now has --jmx command line option to make the server information available on JMX (makes inspection of the server dump, thread dump, heap dump, etc easier). New information from JMX still does not reveal any ill behavior with server + "unplug" + with idle timeout.

dreis2211 commented 6 years ago

If you remove all 3 timeouts, by setting them to -1 (this is done to avoid default values), then yes. you can replicate the SharedBlockingCallback$Blocker.block behavior reported. Which is the expected behavior in this configuration. (can't emphasize this enough)

I asked because I was searching on the Spring-Framework side for this sort of "misconfiguration". While doing so I found out that the timeouts on the websocket session level are not set specifically in the Spring-Framework like in your code. At least I can't find code that sets it. I could only find the use of WebSocketPolicy.newServerPolicy() in JettyRequestUpgradeStrategy, which would indicate a idle timeout of 5 minutes. See https://github.com/spring-projects/spring-framework/blob/d5b0df8c3f229f8725e5959e8182fec6e37726c5/spring-websocket/src/main/java/org/springframework/web/socket/server/jetty/JettyRequestUpgradeStrategy.java#L83. Yet, I have no clue if that is actually sufficient enough or if the use of WebSocketServerFactory inside that upgrade strategy actually propagates that to the appropriate place.

On connector level I'd argue that also the default timeout is taken for us: e.g. 30 seconds. So I'm wondering why we still run into that issue.

Cheers and many thanks for the detailed work so far.

joakime commented 6 years ago

@dreis2211 can you verify the idle timeouts on a running server via debug?

Just to make sure that no other component is resetting those values?

dreis2211 commented 6 years ago

Debug being a specific log or via JMX?

joakime commented 6 years ago

As in an IDE, with a debugger, navigating to the components and verifying those settings in a live environment/configuration that show this behavior for you.

dreis2211 commented 6 years ago

Any specific line of code coming to your mind where I should set the breakpoints? Checking on AbstractConnector.getIdleTimeout() and AbstractConnector.setIdleTimeout() and WebSocketPolicy.getIdleTimeout() the WebSocketPolicy.setIdleTimeout() seems sane so far.

dreis2211 commented 6 years ago

Dumping some info via JMX shows the following information.

 +- WebSocketServerFactory@6ce24203[defaultPolicy=WebSocketPolicy@4c4e5fd8[behavior=SERVER,maxTextMessageSize=65536,maxTextMessageBufferSize=32768,maxBinaryMessageSize=65536,maxBinaryMessageBufferSize=32768,asyncWriteTimeout=60000,idleTimeout=300000,inputBufferSize=4096],creator=org.eclipse.jetty.websocket.server.WebSocketServerFactory]=org.eclipse.jetty.websocket.server:context=ROOT,type=websocketserverfactory,id=0

Unfortunately, I can't find the ServerConnector properties.

joakime commented 6 years ago

If you are setting breakpoints then ...

The ServerConnector / AbstractNetworkConnector / AbstractConnector idleTimeout is good. yes. The WebSocketPolicy is useful if you are using the Jetty Native WebSocket behaviors.

Add also ...

AbstractWebSocketConnection.setMaxIdleTimeout()

If you are using JSR356 (javax.websocket) look at ...

WebSocketSession.setIdleTimeout()

dreis2211 commented 6 years ago

AbstractWebSocketConnection.setMaxIdleTimeout() passes 300.000ms

dreis2211 commented 6 years ago

Is WebSocketSession.getIdleTimeout() called anywhere? I can't find any usages of it. Also, it doesn't seem to hit my breakpoint yet it hits one in the constructor in my tests. Just for completeness: WebSocketSession.setIdleTimeout() seems to be not called either.

kylejmcintyre commented 6 years ago

My organization is also impacted by this issue in various environments including production. We're running Jetty 9.4.8v20171121 as part of DropWizard - no sprint/boot. We are behind an nginx proxy. It seems like it became more prevalent when we upgraded from 9.4.6 to 9.4.8. We have a live production test harness that is remarkably good at reproducing it. The harness is using the python websocket client and regularly creates a condition where the websocket is being torn down at the same time the server is attempting to send a message across it. Sometimes when this occurs, we hit this bug and the thread that was attempting to send the message out is gone forever.

EDIT - I have now managed to reproduce it without the proxy

kylejmcintyre commented 6 years ago

At this point I'm going to switch our app to use sendStringByFuture instead of sendString since I don't really need the synchronous results or care much about the deliverability. In fact, I'm hoping it will squash a bunch of noisy logs that occur from exceptions coming out of the synchronous version of the API relating to closed/disconnected connections. Given that the sync methods generally use the same code as the async methods with the extra blocking callback stuff, it seems likely that a leak may still occur but I'd much rather leak a reference to a future than an entire thread.

gregw commented 6 years ago

@kylejmcintyre we may have a reproduction of this issue. Are you still suffering from it and do you have any more information?

dreis2211 commented 6 years ago

Awesome news @gregw . Could you elaborate?

gregw commented 6 years ago

@dreis2211 Not really. We are seeing something similar in some tests some times, but can't get a hard reproduction, so have not been able to analyse. We are continuing to try to capture an instance of it, so any additional info you can give might help us get a reliable reproduction.

bennAH commented 6 years ago

Seeing this issue with a vanilla jetty set up. No spring or other libraries, just the embedded jetty.

Using version 9.4.6 running on Red Hat 7.5

This is what is set in the policy:

factory.getPolicy().setIdleTimeout(600_000);

This is the thread dump:

java.lang.Thread.State: WAITING at sun.misc.Unsafe.park(Native Method)

parking to wait for <6c1f7450> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039) at org.eclipse.jetty.util.SharedBlockingCallback$Blocker.block(SharedBlockingCallback.java:210) at org.eclipse.jetty.websocket.common.BlockingWriteCallback$WriteBlocker.block(BlockingWriteCallback.java:90) at org.eclipse.jetty.websocket.common.message.MessageWriter.flush(MessageWriter.java:155) at org.eclipse.jetty.websocket.common.message.MessageWriter.close(MessageWriter.java:122)

joakime commented 6 years ago

WebSocketServletFactory is the container (which impacts all new websocket sessions) Then you have the WebSocket Session (which can be configured with different timeout values for a specific active session)

Unrelated note: Jetty 9.4.6 is vulnerable - see announcement https://dev.eclipse.org/mhonarc/lists/jetty-announce/msg00123.html (consider upgrading)

bennAH commented 6 years ago

Thanks. I will look to upgrade. However reading through this thread and others, it does not seem this issue has been fixed yet? (please correct me if I am wrong).

It is a shame it is hard to reproduce. I've not found there to be any correlation with load or number of active connections, however I've only recently encountered this.

bennAH commented 6 years ago

Hey,

The application is running on Linux where I have seen this happening. Not seen this happen when running locally on windows yet.

On 14 August 2018 at 00:57, Joakim Erdfelt notifications@github.com wrote:

@bennAH https://github.com/bennAH out of curiosity, are you using MS Windows?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/eclipse/jetty.project/issues/2061#issuecomment-412705604, or mute the thread https://github.com/notifications/unsubscribe-auth/AA7mSUzQSsCEto-L0VhN-5pvJYwrtRh-ks5uQhJfgaJpZM4RCYSo .

joakime commented 6 years ago

Heh, sorry @bennAH I deleted that question after I wrote it, being that you already said what OS you hit the issue on.

bennAH commented 6 years ago

hey @joakime any updates on this?

joakime commented 6 years ago

@bennAH (to borrow @gregw response)

Not really. We are seeing something similar in some tests some times, but can't get a hard reproduction, so have not been able to analyse. We are continuing to try to capture an instance of it, so any additional info you can give might help us get a reliable reproduction.

We have constant load testing of websocket on various platforms. We also have many many users of it, with only the folks on this issue reporting this particular issue with it.

So far, the only things we know ...

Using blocking writes (those using async writes have no issue)
Using blocking writes with infinite (or huge) idle timeouts - this is easy to replicate. the solution is to set the endpoint idle timeout to something low as a test (like 2 minutes) and try again, if you idle timeout, great! if you don't then your configuration isn't set in the correct place for that specific endpoint instance.
Possibly occurs from non-core websocket threads. - various testing scenarios of this has not replicated.
Possibly occurs when mulitiple threads are attempting to use blocking writes. - various testing scenarios of this has not replicated this issue (it causes a different issue)
Possibly occurs if a write is attempted after the WebSocket is closed (this means: write is enqueued, close handshake starts, write to network is attempted). - various testing scenarios of this has not replicated.
Possibly occurs on JVM's with NIO bug (easy to trigger on Windows, but not unique to Windows) - most common forms of this has been fixed in 9.4.9+ (but still occurs, see Issue #2205)
Possibly occurs on environments with custom (non-QueuedThreadPool) executors or ThreadPools with too restricted thread pool limits - some testing of this has been attempted, and only insufficient thread scenarios can cause the issue (where the idle timeout never triggers due to insufficient threads to handle the idle timeout)
Possibly occurs on environments that have proxies in front of them - nginx has a few bugs with websocket connections, mostly fixed in new versions of nginx. (be sure to set the recommended connection idle timeouts on nginx lower then your jetty server)

and that's just the scenarios we've worked on this issue directly.

jetty / jetty.project