pingidentity / ldapsdk

UnboundID LDAP SDK for Java
Other
334 stars 81 forks source link

Started thread but failed to stop it #39

Closed HarlemSquirrel closed 6 years ago

HarlemSquirrel commented 6 years ago

I have a JRuby on Rails app running on tomcat which is recording a number of these warnings in catalina.out.

I recently took over ownership of this application from a developer who left our organization so I am still getting familiar with the SDK and the custom gem he built around it that we use with this app. Last week I updated the SDK version in use from 3.1.0 to 4.0.4 in our development app but these log messages are still occurring.

We generally have to restart Tomcat/Apache in production every two weeks or so and I think this may the the reason why.

09-Feb-2018 22:14:50.857 WARNING [localhost-startStop-2] org.apache.catalina.loader.WebappClassLoaderBase.clearReferencesThreads The web application [ROOT] appears to have started a thread named [Health Check Thread for LDAPConnectionPool(serverSet=RoundRobinServerSet(servers={<SERVER:PORT>}, includesAuthentication=false, includesPostConnectProcessing=false), maxConnections=10)] but has failed to stop it. This is very likely to create a memory leak. Stack trace of thread:
- sun.misc.Unsafe.park(Native Method)
- java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
- java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
- java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
- com.unboundid.ldap.sdk.LDAPConnectionPoolHealthCheckThread.run(LDAPConnectionPoolHealthCheckThread.java:109)
09-Feb-2018 22:14:50.858 WARNING [localhost-startStop-2] org.apache.catalina.loader.WebappClassLoaderBase.clearReferencesThreads The web application [ROOT] appears to have started a thread named [Connection reader for connection 40 to <SERVER:PORT>] but has failed to stop it. This is very likely to create a memory leak. Stack trace of thread:
- java.net.SocketInputStream.socketRead0(Native Method)
- java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
- java.net.SocketInputStream.read(SocketInputStream.java:171)
- java.net.SocketInputStream.read(SocketInputStream.java:141)
- sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
- sun.security.ssl.InputRecord.read(InputRecord.java:503)
- sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:983)
- sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:940)
- sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
- java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
- java.io.BufferedInputStream.read(BufferedInputStream.java:265)
- com.unboundid.asn1.ASN1StreamReader.read(ASN1StreamReader.java:1157)
- com.unboundid.asn1.ASN1StreamReader.readType(ASN1StreamReader.java:331)
- com.unboundid.asn1.ASN1StreamReader.beginSequence(ASN1StreamReader.java:1077)
- com.unboundid.ldap.protocol.LDAPMessage.readLDAPResponseFrom(LDAPMessage.java:1146)
- com.unboundid.ldap.sdk.LDAPConnectionReader.run(LDAPConnectionReader.java:225)
dirmgr commented 6 years ago

LDAP is an asynchronous protocol, which means that multiple requests can be in progress at any time over the same connection (although most clients only use it in a synchronous manner, where they send a single request and then wait for the response before doing anything else on that connection). To support this asynchronous mode, whenever a new LDAP connection is created, the default LDAP SDK behavior is to create a dedicated thread to read responses from the server over that connection and make sure that they get to the thread that is waiting on them.

From this message, it sounds like a number of these connection reader threads are sticking around for a long time. There are a couple of potential reasons for this:

Although it’s conceivable that there is a bug in the LDAP SDK that results in connection whose reader thread is still running even after the connection has been closed, that is very unlikely in this case. Closing a connection closes its associated input stream, and that in turn will cause any blocked call waiting on that stream to immediately become unblocked and indicate that the connection has been closed (either by returning a negative value or by throwing an exception). In this case, the stack trace shows that the thread is blocked on the input stream, which means that the stream is still open.

I would recommend using an OS-level tool like netstat to see how many connections are established between the system on which the application is running and the system on which the directory server is running, or like lsof to see how many network file descriptors your application is using. Or you could use jstack on the Tomcat process and count the number of connection reader threads there are, which should correspond to the number of open connections. If the number of connections is unexpectedly large, then that probably means that you have a case in which something is opening connections and neglecting to close them properly. If the number of open connections is reasonable, then it may be that they are just legitimately open for a long time and Tomcat’s warning probably isn’t something to be too concerned about.

If you find that it doesn’t look like your application is leaking connections, and if it doesn’t try to process multiple operations simultaneously on the same connection (either by using the LDAPConnection.async* methods or by sharing the same connection across multiple threads), then you could open connections in synchronous mode. In this mode, the LDAP SDK will not create a dedicated reader thread for each connection, but will instead use the thread that sends a request to wait for and read the response. To do that, you will need to create an LDAPConnectionOptions object, call its setUseSynchonousMode(true) method, and then provide that LDAPConnectionOptions object to the LDAPConnection constructor when you’re creating the connection. This will ensure that the separate reader thread isn’t created, but if your application is leaking connections, then it probably won’t do too much to help in the long run.

On the other hand, if you think there may be a leak, then you could try using the LDAP SDK’s debugging framework to keep track of when and where connections are being created. Whenever the LDAP SDK establishes a connection, it will record a debug message using the CONNECT debug type, and it will record a similar message whenever a connection is closed. The debug message created when the connection is established will include the internal ID that the LDAP SDK has assigned to the connection (for example, in your case, it’s connection 40), and it can also include a stack trace that can let you see where in your application the connection was created. By comparing the connection IDs from the stack traces with the location in the code where the corresponding connections were created, you might be able to find a case where a connection is established but then not properly closed. See the documentation for the com.unboundid.util.Debug class for more information about the LDAP SDK’s debugging framework.

If you suspect that there may be a connection leak in your application but you can’t find where it is, then another option might be to configure the directory server to close connections after they have been idle for too long. Many LDAP servers provide support for this kind of feature. While it’s not a great solution to the problem, it can be a stopgap measure until you finally figure out where the leak is.

I hope this helps you figure out what the problem is. Please let me know if you still suspect that this is a problem in the LDAP SDK itself, but the stack trace indicates that the associated connection is still open, which suggests an application-level problem.

HarlemSquirrel commented 6 years ago

Thank you very much for all of this detail! I'm going to check on my app code to make sure we are closing connections properly first.

HarlemSquirrel commented 6 years ago

We currently have to following in place after each search request. Below c is expected to be a Java::ComUnboundidLdapSdk::LDAPConnectionPool instance when we are using pools (which we pretty much always do).

def search_using_filter(base, filter, controls={})
  c = connection(base, controls)
  # perform search
ensure
  if c && c.respond_to?("getHealthCheck")
    # We need to check on this...!
    #c.close()
  elsif c && controls[:bind_dn]
    # Note: we are assuming that the config is not set up for
    # a connection pool, because if it is, this is going to
    # mess things. up. 
    c.close()
  elsif c
    c.close()
  end
end

This is how we setup connections

def connection(base, controls=nil)
  source_name = source_name_from_base(base)
  if !source_name
    puts "no source"
    return nil
  end
  server_set = server_set_for_source_name(source_name)
  source = config['sources'][source_name]

  # We can pass in a bind_dn and bind_password in the controls, so we
  # now have to see if they exist, and if so, we use them instead
  # of using the bind info in the configuration file.
  if controls && controls[:bind_dn] && controls[:bind_password]
    bind_dn = controls[:bind_dn]
    bind_password = controls[:bind_password]
  else
    bind_dn = source['bind_dn']
    bind_password = source['bind_password']
  end

  if source['use_connection_pool'].to_java == true
    if controls && (controls[:bind_dn] && controls[:bind_password])
      bind_request = SimpleBindRequest.new(bind_dn, bind_password)
      conn[source_name] = LDAPConnectionPool.new(server_set, bind_request, 1, 1)
    else
      if conn[source_name].nil? || conn[source_name].isClosed()
        if bind_dn
          bind_request =
            SimpleBindRequest.new(bind_dn, bind_password)
          conn[source_name] = LDAPConnectionPool.new(server_set, bind_request,
            source['initial_connections'].to_i, source['max_connections'].to_i)
        else
          bind_request = SimpleBindRequest.new()
          conn[source_name] = LDAPConnectionPool.new(server_set, bind_request,
              source['initial_connections'].to_i, source['max_connections'].to_i)
        end
      end
    end
    return conn[source_name]
  else
    local_conn = server_set.getConnection()
    # If there is a bindDN, then bind, otherwise, treat as
    # anonymous.
    if bind_dn
      local_conn.bind(bind_dn, bind_password)
    end
    return local_conn
  end
end

I looked at the docs on LDAPConnectionPool and invokeHealthCheck. I think I should update how we handle pools after each search by preforming a health check.

def search_using_filter(base, filter, controls={})
  c = connection(base, controls)
  # perform search
ensure
  if c.respond_to? :invoke_health_check
    c.invoke_health_check(nil, true)
  elsif c
    c.close
  end
end
HarlemSquirrel commented 6 years ago

Looking at the connections, it seems to be OK at the moment

$ netstat | grep ed.weill
tcp        0      0 <LOCAL_HOST>:58684 REMOTE_LDAP_SERVER:ldaps  ESTABLISHED 
tcp        0      0 <LOCAL_HOST>:53870 REMOTE_LDAP_SERVER:ldaps  ESTABLISHED 
tcp        0      0 <LOCAL_HOST>:58682 REMOTE_LDAP_SERVER:ldaps  ESTABLISHED 
tcp        0      0 <LOCAL_HOST>:53174 REMOTE_LDAP_SERVER:ldaps  ESTABLISHED 
tcp        0      0 <LOCAL_HOST>:53872 REMOTE_LDAP_SERVER:ldaps  ESTABLISHED

There looks to be 10 connection pool threads which also seems reasonable.

$ sudo jstack -m 13621 | grep java.util.concurrent.ThreadPoolExecutor.runWorker
* java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker) bci:26 line:1127 (Interpreted frame)
* java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker) bci:26 line:1127 (Interpreted frame)
* java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker) bci:26 line:1127 (Interpreted frame)
* java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker) bci:26 line:1127 (Interpreted frame)
* java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker) bci:26 line:1127 (Interpreted frame)
* java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker) bci:26 line:1127 (Interpreted frame)
* java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker) bci:26 line:1127 (Interpreted frame)
* java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker) bci:26 line:1127 (Interpreted frame)
* java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker) bci:26 line:1127 (Interpreted frame)
* java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker) bci:26 line:1127 (Interpreted frame)
dirmgr commented 6 years ago

Since it doesn’t seem like the number of established connections is out of line, then it may well be that these connections have just been established for a long time, and therefore their reader threads have been around for a while.

Since it looks like you are using a connection pool, you can use the LDAPConnectionPool.setMaxConnectionAgeMillis method to specify a maximum age for connections in the pool. If you do that, then once a connection has been established for more than that length of time, then the pool will close it and replace it with a newly-created connection (which will also shut down the former reader thread and create a new one to replace it).

Also, if you’re using a connection pool, then it’s usually a better practice to not actually check connections out of the pool and perform operations on the checked-out connections, but instead you can perform operations against the pool itself. For example, instead of doing something like:

LDAPConnection connection = pool.getConnection();
try
{
  SearchResult searchResult = connection.search(searchRequest);
  pool.releaseConnection(connection);
}
catch (LDAPSearchException ldapException)
{
  pool.releaseConnectionAfterException(connection, ldapException);
}

You can just do:

SearchResult searchResult = pool.search(searchRequest);

This does all of the necessary error handling to ensure that the connection properly gets released back to the pool if it’s still valid, and there are also other advantages like the ability to have the pool automatically re-try a failed operation on a newly-created connection if it seems like the failure might have been the result of a connection that is no longer valid.

If you want to try to share code that can use either a connection or a connection pool, both the LDAPConnection and LDAPConnectionPool classes implement the LDAPInterface interface, so you could potentially have the method you currently use to get a connection return an LDAPInterface instead. In the case of a single connection, you’d want to close the connection when you’re done with it, but you obviously don’t want to do that with the pool. however, I presume that your code already handles that difference since you don’t want to close a pooled connection because instead of releasing the connection back to the pool, it will close it and cause a new one to be created in its place.

I’m not all that familiar with Ruby, so I can’t help much with any Ruby-specific issues you might encounter. However, since you asked about invoking health checks, that’s not necessarily something that you want to do manually. The LDAPConnectionPool.invokeHealthCheck method can be used to force the connection pool to examine all connections that aren’t currently in use and try to replace any connections that no longer seem to be valid, along with optionally replacing any connections that have been established for longer than the maximum age. If you have configured the pool with a health check, then this is something that will automatically be done in the background at regular intervals (60 seconds by default, configurable by the setHealthCheckIntervalMillis method).

HarlemSquirrel commented 6 years ago

Thank you very much! This is very helpful. I did finally correct the issue we were having with the app using up memory and then not responding on one of our two servers. It seems it was an issue with cached files from tomcat so I suppose our memory leak warnings are not something to worry about at the moment. I will look deeper into the connection pools if we experience and issues as we move to the cloud.

One last question if you don't mind. I notice that after resting for a while with no activity for several hours or more there seems to be a very slow response in the next query to our Ping LDAP server. Is this due to establishing new pooled connections?

dirmgr commented 6 years ago

It’s hard to say whether that’s because it’s establishing a new connection. Most of the time, the LDAP SDK should be able to detect a case in which the server has closed a connection (for example, because it is idle for too long) right away and immediately re-establish a new connection to take its place.

It is possible that a connection could be invalidated by some networking equipment (for example, a firewall) that sits between the client and the server and cuts off connections that haven’t been used for too long. We have seen that in the past, and it can be a legitimate problem.

The LDAP SDK does provide a couple of ways to address this kind of thing.

HarlemSquirrel commented 6 years ago

OK Thank you!