sshnet / SSH.NET

SSH.NET is a Secure Shell (SSH) library for .NET, optimized for parallelism.
http://sshnet.github.io/SSH.NET/
MIT License
3.88k stars 919 forks source link

Hung Dispose() caused by infinite socket timeout #1275

Closed jkillingsworth closed 6 months ago

jkillingsworth commented 6 months ago

Problem

From time to time, my application hangs indefinitely. My application is a headless system that runs on a schedule. When the system hangs, it requires a human operator to notice that the system is hung and then kill the process manually. I have asked the operator to capture process dumps before killing the process. I have analyzed several of these dumps with WinDbg. They each indicate that the system hangs during an SFTP operation.

Based on the information I have gathered, it looks like the problem is a combination of two separate threads that are hanging indefinitely. One thread is waiting for a lock. The other thread is waiting for a socket read operation to complete.

Evidence from WinDbg that a thread is waiting for a lock:

OS Thread Id: 0x27d0 (12)
        Child SP               IP Call Site
00000038d72b8e28 00007ff9f4066974 [GCFrame: 00000038d72b8e28] 
00000038d72b8f48 00007ff9f4066974 [GCFrame: 00000038d72b8f48] 
00000038d72b8fa8 00007ff9f4066974 [HelperMethodFrame_1OBJ: 00000038d72b8fa8] System.Threading.Monitor.Enter(System.Object)
00000038d72b90a0 00007ff98bbbbd1a Renci.SshNet.Session.IsSocketConnected()
00000038d72b9110 00007ff98bbbba9d Renci.SshNet.Channels.Channel.Close()
00000038d72b9160 00007ff98bbbb9f0 Renci.SshNet.Channels.ChannelSession.Close()
00000038d72b9190 00007ff98bbbb6cd Renci.SshNet.Channels.Channel.Dispose(Boolean)
00000038d72b91d0 00007ff98bbbb417 Renci.SshNet.Channels.ChannelSession.Dispose(Boolean)
00000038d72b9210 00007ff98bbbb3ca Renci.SshNet.Channels.Channel.Dispose()
00000038d72b9240 00007ff98bbbb02a Renci.SshNet.SubsystemSession.Disconnect()
00000038d72b9280 00007ff98bbbaec0 Renci.SshNet.SubsystemSession.Dispose(Boolean)
00000038d72b92b0 00007ff98bbbae67 Renci.SshNet.Sftp.SftpSession.Dispose(Boolean)
00000038d72b92f0 00007ff98bbbae1a Renci.SshNet.SubsystemSession.Dispose()
00000038d72b9320 00007ff98bbbadc5 Renci.SshNet.SftpClient.OnDisconnecting()
00000038d72b9350 00007ff98bbbad3d Renci.SshNet.BaseClient.Disconnect()
00000038d72b9380 00007ff98bbbacc0 Renci.SshNet.BaseClient.Dispose(Boolean)
00000038d72b93b0 00007ff98bbbac57 Renci.SshNet.SftpClient.Dispose(Boolean)
00000038d72b93f0 00007ff98bbbac15 ////////////////////REDACTED////////////////////.Dispose()
00000038d72b9430 00007ff98bb7219b ////////////////////REDACTED////////////////////
00000038d72bcef8 00007ff9e6e3db92 [HelperMethodFrame: 00000038d72bcef8] 
00000038d72bcfe0 00007ff98bbb6fe5 Renci.SshNet.SubsystemSession.WaitOnHandle(System.Threading.WaitHandle, Int32)
00000038d72bd030 00007ff98bbb9432 Renci.SshNet.Sftp.SftpSession.RequestReadDir(Byte[])
00000038d72bd0b0 00007ff98bbb8992 Renci.SshNet.SftpClient.InternalListDirectory(System.String, System.Action`1)
--------------------TRUNCATED--------------------

Evidence from WinDbg that another thread is waiting for a socket read operation to complete:

OS Thread Id: 0x1414 (10)
        Child SP               IP Call Site
00000038d633e778 00007ff9f4065ea4 [InlinedCallFrame: 00000038d633e778] System.Net.UnsafeNclNativeMethods+OSSOCK.recv(IntPtr, Byte*, Int32, System.Net.Sockets.SocketFlags)
00000038d633e778 00007ff98a4ac501 [InlinedCallFrame: 00000038d633e778] System.Net.UnsafeNclNativeMethods+OSSOCK.recv(IntPtr, Byte*, Int32, System.Net.Sockets.SocketFlags)
00000038d633e750 00007ff98a4ac501 DomainBoundILStubClass.IL_STUB_PInvoke(IntPtr, Byte*, Int32, System.Net.Sockets.SocketFlags)
00000038d633e800 00007ff98a4ae819 System.Net.Sockets.Socket.Receive(Byte[], Int32, Int32, System.Net.Sockets.SocketFlags, System.Net.Sockets.SocketError ByRef)
00000038d633e880 00007ff98bb94922 Renci.SshNet.Abstractions.SocketAbstraction.Read(System.Net.Sockets.Socket, Byte[], Int32, Int32, System.TimeSpan)
00000038d633e8f0 00007ff98bb95c5f Renci.SshNet.Session.TrySocketRead(System.Net.Sockets.Socket, Byte[], Int32, Int32)
00000038d633e920 00007ff98bb95a40 Renci.SshNet.Session.ReceiveMessage(System.Net.Sockets.Socket)
00000038d633e9b0 00007ff98bb9517e Renci.SshNet.Session.MessageListener()
00000038d633ea00 00007ff9e3ec00b2 System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
00000038d633ead0 00007ff9e3ebff35 System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
00000038d633eb00 00007ff9e3e8bc06 System.Threading.QueueUserWorkItemCallback.System.Threading.IThreadPoolWorkItem.ExecuteWorkItem()
00000038d633eb40 00007ff9e3e8ae46 System.Threading.ThreadPoolWorkQueue.Dispatch()
--------------------TRUNCATED--------------------

My application code looks something like this:

using (var client = new SftpClient(connectionInfo))
{
    client.OperationTimeout = TimeSpan.FromSeconds(30);
    client.Connect();

    var items = client.ListDirectory("/");

    // ...
}

[!NOTE] There is an implicit call to client.Dispose() at the end of the using block in the code above.

Reproduction

My theory is that the following sequence of events occurs:

  1. The client connects to the server successfully.
  2. The client attempts to do a list directory operation.
  3. The main thread invokes a secondary thread to perform the operation.
  4. The secondary thread acquires a lock.
  5. The secondary thread attempts a socket read operation.
  6. The socket read operation never completes.
  7. The main thread throws an SshOperationTimeoutException exception.
  8. The Dispose() method of the SftpClient class is called implicitly at the end of the using block.
  9. The Dispose() method attempts to acquire the lock in the main thread.
  10. The main thread blocks because the secondary thread still has the lock.
  11. The secondary thread never releases the lock.
  12. Both threads are blocked indefinitely.

The evidence I have gathered so far supports this theory. I have not been able to reproduce this problem the way it happens in the wild. However, I have been able to simulate this scenario with some small code changes. I have created a bugdemo-01 branch in my fork that demonstrates this scenario. You can find the code here:

https://github.com/jkillingsworth/SSH.NET/commit/d8e46c8cc6d8714d50178130fc103aa15ac11f47

If you want a local copy, you can clone it like this:

git clone --branch bugdemo-01 https://github.com/jkillingsworth/SSH.NET.git

There is a RunBugDemo project I added to the solution. You can build and run the code to reproduce this scenario.

screenshot-01

When you run it, the process hangs indefinitely. You can use task manager to capture a dump of the hung process.

screenshot-02

You can open the dump file with WinDbg for analysis. If you enter the !threads command, you can see all the threads in the process.

screenshot-03

Keep in mind the thread identifiers might be different for each unique dump. Also notice the lock counts in the screenshot above.

Enter ~0s and then !clrstack to see the stack trace for the main thread.

screenshot-04

Observe System.Threading.Monitor.Enter where it is waiting for the lock.

Enter ~4s and then !clrstack to see the stack trace for the secondary thread.

screenshot-05

Observe System.Net.Sockets.Socket.Receive where it is waiting to read data from the socket.

Debugging

I have included three hardcoded breakpoints in my demo code that simulates this bug. These breakpoints are placed at the key points necessary to understand this issue. You can run the RunBugDemo project with a debugger attached to hit the breakpoints. If using Visual Studio, you might find the thread debugging tools helpful.

From the menu bar

Debug ⟶ Windows ⟶ Parallel Stacks Debug ⟶ Windows ⟶ Threads

You can open these tool windows when the debugger is running.

Breakpoint #1

Line 1204. While performing the list directory operation, the secondary thread acquires a lock on the _socketReadLock object.

screenshot-06

Line 241. The main thread waits while the secondary thread continues.

screenshot-07

Breakpoint #2

Line 256. The main thread throws an exception after the timeout interval has elapsed.

screenshot-08

Line 309. The secondary thread is trying to read from a socket when the timeout exception is thrown in the main thread.

screenshot-09

Breakpoint #3

Line 1798. The main thread tries to acquire a lock on the _socketReadLock object. It blocks here because the secondary thread still has the lock. Notice in the Parallel Stacks pane that this is deep inside the Dispose() call.

screenshot-10

Line 309. The secondary thread is still trying to read from a socket. The socket read operation never completes, and so the secondary thread never releases the lock.

screenshot-11

Solution

I don't fully understand the threading model for this library, so I am not in a position to propose the best possible solution. However, I think it would be wise to use finite timeouts when performing socket operations. Right now, the socket timeouts are hardcoded to use an infinite timeout value with no way for the user to override. A hardcoded timeout of, say, 30 seconds would be a more reasonable value in my estimation.

The following screenshot shows the expected behavior of my demo program.

screenshot-12

Instead of hanging indefinitely, I would expect it to throw an exception if there is a problem reading from the socket. My application can handle exceptions such as these without manual intervention. It is the infinite hangs that are causing problems.

jkillingsworth commented 6 months ago

The issue reported by @aniknaemmm in #688 looks like it might have the same root cause.

Rob-Hague commented 6 months ago

Beautiful description.

The "secondary thread" in your description is the message listener which runs for the duration of the session, so while infinite timeout could be a smell, there is at least some reason to it.

I don't know whether Session.IsSocketConnected really needs to poll the socket in addition to checking the Connected property:

https://github.com/sshnet/SSH.NET/blob/326ce14c4caff3fdb8bc6bf4c29eeb535890dbcb/src/Renci.SshNet/Session.cs#L1783-L1798

I wonder if it could only attempt to enter the lock: if it can't, then the message listener is still running and we are still connected. And if it can, then the message listener might be stopped and we can perform the poll to get a clearer answer.

Something like this:


diff --git a/src/Renci.SshNet/Session.cs b/src/Renci.SshNet/Session.cs
index 5bf6d8ee..d68257f5 100644
--- a/src/Renci.SshNet/Session.cs
+++ b/src/Renci.SshNet/Session.cs
@@ -1789,11 +1789,20 @@ namespace Renci.SshNet
                     return false;
                 }

-                lock (_socketReadLock)
+                if (!Monitor.TryEnter(_socketReadLock))
+                {
+                    return true;
+                }
+
+                try
                 {
                     var connectionClosedOrDataAvailable = _socket.Poll(0, SelectMode.SelectRead);
                     return !(connectionClosedOrDataAvailable && _socket.Available == 0);
                 }
+                finally
+                {
+                    Monitor.Exit(_socketReadLock);
+                }
             }
         }
jkillingsworth commented 6 months ago

Hi @Rob-Hague. Thanks for the response. I like your suggestion. I tested this solution with my bug demo program and it seems to work. I believe this will solve the problem I am having in my production environment as well.

I have created a pull request. See #1280.

Rob-Hague commented 6 months ago

Fixed by #1280

WojciechNagorski commented 6 months ago

The 2023.0.1 version has been released to Nuget: https://www.nuget.org/packages/SSH.NET/2023.0.1