javaee / grizzly

Writing scalable server applications in the Java™ programming language has always been difficult. Before the advent of the Java New I/O API (NIO), thread management issues made it impossible for a server to scale to thousands of users. The Grizzly NIO framework has been designed to help developers to take advantage of the Java™ NIO API.
https://javaee.github.io/grizzly/
Other
222 stars 60 forks source link

IdleTimeoutFilter memory leak when FilterChainContext completionListeners are not called #1824

Closed glassfishrobot closed 7 years ago

glassfishrobot commented 8 years ago

I have been making a HTTP service using HttpServer and loading testing it to check how it behaves under very high load. This service will sit behind a reverse proxy, so I expect all the TCP connections to come from the same client host. My setup code looks like this (I have omitted lots of stuff I do not think is relevant to this issue):

final NetworkListener listener = new NetworkListener("listener", addr, port);
listener.getKeepAlive().setIdleTimeoutInSeconds(5);
listener.getKeepAlive().setMaxRequestsCount(10000);
final HttpServer server = new HttpServer();
server.addListener(listener);
server.getServerConfiguration().addHttpHandler(...);
server.start();

Under high load from Gatling many of the requests do not finish gracefully, e.g. Gatling state is:

================================================================================
2016-02-24 09:37:35         405s elapsed
---- Same request --------------------------------------------------------------
[#-        ]  2%
          waiting: 1025231 / active: 1021   / done:23401
---- Requests ------------------------------------------------------------------
> Global   (OK=19264  KO=4137  )
> Select /path/to/my/api/p...              (OK=19264  KO=4137  )
---- Errors --------------------------------------------------------------------
> status.find.not(500), but actually unexpectedly found 500        2337 (56.49%)
> java.io.IOException: Remotely closed             1800 (43.51%)
================================================================================

This is totally expected, but I do not expect it to break the service such that the JVM crashes or it is unusable after the load is removed. However I found that under this load the JVM would run out of memory and start throwing OutOfMemoryError and killing random threads until everything ground to a halt. In a heap dump from the dead (but JVM still running) service I found that org.glassfish.grizzly.utils.DelayedExecutor.DelayQueue.queue was responsible for 80+% of the heap usage.

After a lot of investigation, I have a couple of ideas on what is happening:

1) It might be possible for a connection to die in some way that does not cause events added using org.glassfish.grizzly.filterchain.FilterChainContext.addCompletionListener() to be called. IdleTimeoutFilter depends on that callback being reliable so that org.glassfish.grizzly.utils.IdleTimeoutFilter.ContextCompletionListener.onComplete() is called. If it is not called any connection that does not close cleanly is left in the DelayQueue.queue Map with timeout set to FOREVER} (Long.MAX_VALUE). Thus DelayedExecutor will never remove it from DelayQueue.queue and process it.

2) I noticed that many of the leaked connection objects have timeout set to 9223372036854775806, Long.MAX_VALUE - 1, same as IdleTimeoutFilter.FOREVER_SPECIAL, and counter is still set to 1. But the connection is closed. So I am wondering if it is possible that org.glassfish.grizzly.utils.IdleTimeoutFilter.ContextCompletionListener.onComplete() does not always to the right thing. For example, IdleRecord.timeoutMillisUpdater.compareAndSet(idleRecord, FOREVER_SPECIAL, timeoutToSet) does not check the return value.

I am afraid I have not delved deep enough to figure it out exactly. What I have managed though is to create a temporary work around that cleans up closed connections that still have a timeout set. It resets the timeout to be in the past for connections that are closed so that DelayedExecutor will then process them as expected. Code will be attached.

Environment

Java 1.8.0_71 on CentOS 6.7.

Affected Versions

[2.3.24]

glassfishrobot commented 8 years ago

Reported by fae

glassfishrobot commented 8 years ago

fae said: Work around code here: https://gist.github.com/haku/4c5447c33850947c4862

glassfishrobot commented 8 years ago

oleksiys said: Could you pls. explain how we can reproduce the problem? I mean more step-by-step details.

Thank you.

glassfishrobot commented 7 years ago

@rlubke said: Could you please give this JAR [1] a shot in your environment?

[1] https://www.dropbox.com/s/pmw8vy3ovnbfyrx/grizzly-framework-2.3.29-SNAPSHOT.jar?dl=0

glassfishrobot commented 7 years ago

@rlubke said: Changes applied:

2.3.x: 3c30dc05944f88184123f57d4956a14dfc5815eb master: 8f783a3c8bd7357c77cdd45e7f07d836ada927eb

glassfishrobot commented 7 years ago

This issue was imported from java.net JIRA GRIZZLY-1824

glassfishrobot commented 7 years ago

Marked as fixed on Friday, January 13th 2017, 2:03:22 pm