Dropped packets when using connection pool

MichalCho commented 5 years ago

Hi! I have observed a disturbing issue and I wonder if it is related to the way how I use ldapsdk or there is a bug in the connection handling.

In my application I use a connection pool that is shared among many threads. When i receive a new input data I checkout connection from the pool and perform an asynch operation. I also store it in my own queue so I can use it again before the answer arrives - as I return connection to the pool after receiving a response to my requests (when ActiveOperation counter is <1). However during network problem (some packets lost) I can see that all connections are stuck with

    at java.io.BufferedOutputStream.write(BufferedOutputStream.java:117)
    - waiting to lock <0x00000000f18c4800> (a java.io.BufferedOutputStream)
    at com.unboundid.util.ByteStringBuffer.write(ByteStringBuffer.java:1661)
    at com.unboundid.asn1.ASN1Buffer.writeTo(ASN1Buffer.java:1011)
    at com.unboundid.ldap.sdk.LDAPConnectionInternals.sendMessage(LDAPConnectionInternals.java:561)
    at com.unboundid.ldap.sdk.LDAPConnection.sendMessage(LDAPConnection.java:4451)
    at com.unboundid.ldap.sdk.SearchRequest.processAsync(SearchRequest.java:1413)
    at com.unboundid.ldap.sdk.LDAPConnection.asyncSearch(LDAPConnection.java:4254)

while listeners are up and ok:

   java.lang.Thread.State: RUNNABLE
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
    at java.net.SocketInputStream.read(SocketInputStream.java:171)
    at java.net.SocketInputStream.read(SocketInputStream.java:141)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
    - locked <0x00000000f1b08358> (a java.io.BufferedInputStream)
    at com.unboundid.asn1.ASN1StreamReader.read(ASN1StreamReader.java:1159)
    at com.unboundid.asn1.ASN1StreamReader.readType(ASN1StreamReader.java:332)
    at com.unboundid.asn1.ASN1StreamReader.beginSequence(ASN1StreamReader.java:1079)
    at com.unboundid.ldap.protocol.LDAPMessage.readLDAPResponseFrom(LDAPMessage.java:1151)
    at com.unboundid.ldap.sdk.LDAPConnectionReader.run(LDAPConnectionReader.java:225)

Is there a way to abort the sending the request in case it takes too much time? As I can see that send-q in netstat are being full while recv-q are empty- it means that sending requests is not really working.

Thanks a lot for your answer. Br Michal

dirmgr commented 5 years ago

The write stack trace shows that the LDAP SDK is blocked on a call to OutputStream.write, which indicates that the socket's send buffer is full. This generally means one of two things:

Something is preventing the directory server from receiving the packet, and is preventing the client from getting any indication that the packet didn't arrive. This could be something like a firewall that's configured to silently drop packets, some broken network equipment, a system in which the OS is hung, or a system that's just not there at all.
The packet is making it to the directory server, but it's not processing things fast enough and its receive buffer is filling up. When that happens, outbound packets are queued up in the client's send buffer, and when that fills up, then the call to OutputStream.write will block.

The stack trace from the connection reader shows that it's waiting for any data to be received from the server. There's really nothing more that can be determined from that. All we know is that, at the instant you got the stack trace, it wasn't actively receiving any data or processing any data that it had received.

When the LDAP SDK sends data, it does use the SO_TIMEOUT socket option to place a limit on how long the operation should be allowed to block. The value set for the SO_TIMEOUT socket option comes from the response timeout for the request that you're sending. By default, it comes from LDAPConnectionOptions.getResponseTimeoutMillis(OperationType), but you can override it on a per-request basis by calling the request's setResponseTimeoutMilis method. If an attempt to write to the socket output stream takes longer than this, then the JVM should throw a SocketTimeoutException, which is a type of IOException. The LDAP SDK will catch that and throw an LDAPException with a SERVER_DOWN result code and a message like "An error occurred while attempting to send the LDAP message to server {host:port}: {cause}".

MichalCho commented 5 years ago

Hi Neil, as usual thanks a lot for a very fast answer. Unfortunately it does not seems to work like this. In case of regular request timeout I receive a timeout exception. But in a case where all my threads are BLOCKED in

at com.unboundid.util.ByteStringBuffer.write(ByteStringBuffer.java:1661)

no exception is thrown.

dirmgr commented 5 years ago

You're right. Unfortunately, it looks like SO_TIMEOUT only applies to reads and not to writes, and it doesn't look like Java provides a way for a call to SocketOutputStream.write to time out. It would be necessary to switch to NIO for that, which would be a complete overhaul of the LDAP SDK.

I think that it should be possible to work around this with a custom output stream that uses a background thread to close a connection if it's been blocked on a write attempt for too long, but it'll take some doing.

dirmgr commented 5 years ago

I've just committed an update that changes the way that the LDAP SDK guards against indefinitely blocked writes. It should be included in the next release, but you can use it now if you check out and build the LDAP SDK for yourself.

MichalCho commented 5 years ago

Hi Neil, thanks a lot for your commit. I've checked it and it helped. Unfortunately I've found another issue - in case when a ldap server for example gets killed or a network connection gets lost -and tcp keepalive settings are set to a big values - it may lead to deadlock anyway - when all connections try to write and timer thread can not flush

Thread: WriteWithTimeoutHandler Timer BLOCKED
 at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
 at java.io.FilterOutputStream.close(FilterOutputStream.java:158)
 at com.unboundid.util.WriteWithTimeoutTimerTask.run(WriteWithTimeoutTimerTask.java:74)
 at java.util.TimerThread.mainLoop(Timer.java:555)
 at java.util.TimerThread.run(Timer.java:505)

and

Thread: pool-8-thread-2 BLOCKED
 at java.io.BufferedOutputStream.write(BufferedOutputStream.java:117)
 at com.unboundid.util.WriteWithTimeout.write(WriteWithTimeout.java:166)
 at com.unboundid.util.ByteStringBuffer.write(ByteStringBuffer.java:1689)
 at com.unboundid.asn1.ASN1Buffer.writeTo(ASN1Buffer.java:1043)
 at com.unboundid.ldap.sdk.LDAPConnectionInternals.sendMessage(LDAPConnectionInternals.java:562)
 at com.unboundid.ldap.sdk.LDAPConnection.sendMessage(LDAPConnection.java:4452)
 at com.unboundid.ldap.sdk.SearchRequest.processAsync(SearchRequest.java:1413)
 at com.unboundid.ldap.sdk.LDAPConnection.asyncSearch(LDAPConnection.java:4255)

I've played with

net.ipv4.tcp_keepalive_time
net.ipv4.tcp_keepalive_intvl
net.ipv4.tcp_keepalive_probes
net.ipv4.tcp_retries2

it helps - if a socket timeouts before its buffer get full then it's correctly handled. Unfortunately we can not play with those parameters too much

dirmgr commented 5 years ago

Sorry it took a while to get to this, but I just committed a second change that I think should address this. It now attempts to close the socket rather than the output stream (which may be wrapped in another type of output stream, like a BufferedStream, that could attempt to flush buffered as part of the close).

MichalCho commented 5 years ago

Hi Neil,

thank you! That was exactly my idea - to try to close sockets. I was trying to implement it myself but you were faster. We will test it and provide you a feedback. Br Michal

pingidentity / ldapsdk

Dropped packets when using connection pool #63