robinrodricks / FluentFTP

An FTP and FTPS client for .NET & .NET Standard, optimized for speed. Provides extensive FTP commands, File uploads/downloads, SSL/TLS connections, Automatic directory listing parsing, File hashing/checksums, File permissions/CHMOD, FTP proxies, FXP support, UTF-8 support, Async/await support, Powershell support and more. Written entirely in C#.
MIT License
3.14k stars 656 forks source link

Timeout Exception in production service after Dotnet 8 Upgrade + FluentFtp 49 -> 50 #1624

Closed funkel1989 closed 1 month ago

funkel1989 commented 3 months ago

FTP Server OS: unknown

FTP Server Type: no sure, connecting to HostedFtp.com SaaS service

Client Computer OS: Windows

FluentFTP Version: 50.1.0

Framework: .NET 8

After updating to dotnet 8 on azure isolated functions when connecting to our SaaS Ftp server while deployed we are occasionally seeing the below error

System.TimeoutException:
   at FluentFTP.Client.Modules.ConnectModule+<AutoDetectAsync>d__2.MoveNext (FluentFTP, Version=50.0.1.0, Culture=neutral, PublicKeyToken=f4af092b1d8df44f)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at FluentFTP.AsyncFtpClient+<AutoDetect>d__3.MoveNext (FluentFTP, Version=50.0.1.0, Culture=neutral, PublicKeyToken=f4af092b1d8df44f)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at FluentFTP.AsyncFtpClient+<AutoConnect>d__2.MoveNext (FluentFTP, Version=50.0.1.0, Culture=neutral, PublicKeyToken=f4af092b1d8df44f)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at UniteConnect.Core.IO.Ftp.FluentFtpService+<>c__DisplayClass11_0+<<GetOrCreateClientAsync>b__4>d.MoveNext (UniteConnect.Core.IO.Ftp, Version=2024.7.30.23, Culture=neutral, PublicKeyToken=null: C:\BuildAgents\agent1\_work\1\s\src\UniteConnect.Core.IO.Ftp\FluentFtpService.cs:283)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Polly.Retry.AsyncRetryEngine+<ImplementationAsync>d__0`1.MoveNext (Polly, Version=7.0.0.0, Culture=neutral, PublicKeyToken=c8a3ffc3f8f825cc)

Our usage looks like this:

private static readonly ConcurrentDictionary<string, AsyncFtpClient> ClientCache = new();
private static readonly ConcurrentDictionary<string, SemaphoreSlim> ClientSemaphoreCache = new();

private const int ConnectionTimeoutInSeconds = 10 * 1000;
private const int MaxRetryAttempts = 7;

 private static AsyncFtpClient CreateClient(FtpCredential credential, bool useEncryption, ILogger logger)
 {
     var config = new FtpConfig
     {
         EncryptionMode = useEncryption ? FtpEncryptionMode.Auto : FtpEncryptionMode.None,
         ConnectTimeout = ConnectionTimeoutInSeconds,  // Pend a login attempt for 10 seconds before failing
         RetryAttempts = 3
     };

     var ftpClient = credential.Port.HasValue
         ? new AsyncFtpClient(credential.Host, credential.Username, credential.Password, credential.Port.Value, config)
         : new AsyncFtpClient(credential.Host, credential.Username, credential.Password, config: config);

     // Below added to improve logging
     ftpClient.Config.LogUserName = true;
     ftpClient.Config.LogHost = true;
     ftpClient.Config.LogDurations = true;
     ftpClient.Config.LogToConsole = true;

     return ftpClient;
 }

 private async static Task<AsyncFtpClient> GetOrCreateClientAsync(FtpCredential credential, bool useEncryption, ILogger logger)
 {
     var key = GetClientCacheKey(credential);

     var client = ClientCache.GetOrAdd(key, _ => CreateClient(credential, useEncryption, logger));
     var semaphore = ClientSemaphoreCache.GetOrAdd(key, _ => new SemaphoreSlim(1, 1));

     await semaphore.WaitAsync();

     try
     {
         // IsStillConnected validates the state of the connection to the server by sending a NOOP and expecting a reply.
         // If the server is non-responsive we will re-connect. This allows us to not have to worry about maintaining connections
         // to the server as the max amount of connections at any one time is equal to the number of active hosts running FTP actions per tenant.
         if (await client.IsStillConnected(ConnectionTimeoutInSeconds)) return client;

         var retryPolicy = Policy.Handle<Exception>().WaitAndRetryAsync(
             MaxRetryAttempts,
             retryAttempt => TimeSpan.FromSeconds(Math.Pow(2, retryAttempt)),
             (exception, timeSpan, retryCount, context) =>
             {
                 logger.LogError(exception, "GetOrCreateClientAsync: Error AutoConnecting client on attempt {RetryCount}. Waiting {TimeSpan} before next retry. Exception: {Message}",
                     retryCount,
                     timeSpan,
                     exception.Message);
             });

         await retryPolicy.ExecuteAsync(async () => await client.AutoConnect());

         return client;
     }
     finally
     {
         semaphore.Release();
     }
 }

Previous version for FluentFtp was 49.0.2. Does AutoConnect still work in the same way? I feel like AutoConnect used to re-initialize the connection if it was closed but that is where I'm getting the Timeouts.

Logs :


I've just enabled verbose logging and pushed the code up. I'll update here when I have it.
FanDjango commented 3 months ago

Ok, I have some questions:

we are occasionally seeing the below error

After updating, you occasionally see the error. Before updating, the error was never to be seen?

49.x.x -> 50.x.x There were certainly a lot of fixes and changes.

But:

I specifically remember (but I could be wrong) that async connect itself was not correctly honoring the timeout time that the user set - it was always the system default (which is longer that your 10 seconds). Thus now with fixed timeout handling, your 10 seconds are actually taking effect and might be on the edge of being to short.

Have you tried increasing the timeout a little?

Beyond that I am looking forward to a verbose log.

I feel like AutoConnect used to re-initialize the connection if it was closed but that is where I'm getting the Timeouts.

You might want to elaborate on the statement a little bit. Huh?

funkel1989 commented 3 months ago

@FanDjango Here is a snippet of the logs we are getting when this issue is occurring.

[2024-08-01T15:58:03.041Z] # AutoConnect()
[2024-08-01T15:58:03.042Z] # AutoDetect(CloneConnection = False, FirstOnly = True, IncludeImplicit = True, AbortOnTimeout = True, RequireEncryption = False, ProtocolPriority = [Tls11, Tls12])
[2024-08-01T15:58:03.043Z] # Disconnect()
[2024-08-01T15:58:03.044Z] Command:  QUIT
[2024-08-01T15:58:03.046Z] Warning:  AsyncFtpClient.Disconnect(): Exception caught and discarded while closing control connection : Unable to write data to the transport connection: An established connection was aborted by the software in your host machine..
[2024-08-01T15:58:03.047Z] Status:   Auto-Detect trying encryption mode "Auto" with "Tls11, Tls12"
[2024-08-01T15:58:03.048Z] # Connect(False)
[2024-08-01T15:58:03.053Z] Status:   FluentFTP 50.1.0.0(.NET 6.0) AsyncFtpClient
[2024-08-01T15:58:03.055Z] # Disconnect()
[2024-08-01T15:58:03.057Z] Command:  QUIT
[2024-08-01T15:58:03.058Z] Warning:  AsyncFtpClient.Disconnect(): Exception caught and discarded while closing control connection : Unable to write data to the transport connection: An established connection was aborted by the software in your host machine..
[2024-08-01T15:58:03.060Z] Status:   Connecting(async) AsyncFtpClient.FtpSocketStream(control) IP #1 = 204.236.238.164:21
[2024-08-01T15:58:03.080Z] Status:   Waiting for a response
[2024-08-01T15:58:18.180Z] # Disconnect()
[2024-08-01T15:58:18.181Z] Command:  QUIT
[2024-08-01T15:58:18.184Z] Warning:  AsyncFtpClient.Disconnect(): Exception caught and discarded while closing control connection : Unable to write data to the transport connection: An established connection was aborted by the software in your host machine..
[2024-08-01T15:58:22.104Z] # AutoConnect()
[2024-08-01T15:58:22.105Z] # AutoDetect(CloneConnection = False, FirstOnly = True, IncludeImplicit = True, AbortOnTimeout = True, RequireEncryption = False, ProtocolPriority = [Tls11, Tls12])
[2024-08-01T15:58:22.106Z] # Disconnect()
[2024-08-01T15:58:22.107Z] Command:  QUIT
[2024-08-01T15:58:22.109Z] Warning:  AsyncFtpClient.Disconnect(): Exception caught and discarded while closing control connection : Unable to write data to the transport connection: An established connection was aborted by the software in your host machine..
[2024-08-01T15:58:22.110Z] Status:   Auto-Detect trying encryption mode "Auto" with "Tls11, Tls12"
[2024-08-01T15:58:22.111Z] # Connect(False)
[2024-08-01T15:58:22.112Z] Status:   FluentFTP 50.1.0.0(.NET 6.0) AsyncFtpClient
[2024-08-01T15:58:22.114Z] # Disconnect()
[2024-08-01T15:58:22.115Z] Command:  QUIT
[2024-08-01T15:58:22.116Z] Warning:  AsyncFtpClient.Disconnect(): Exception caught and discarded while closing control connection : Unable to write data to the transport connection: An established connection was aborted by the software in your host machine..
[2024-08-01T15:58:22.117Z] Status:   Connecting(async) AsyncFtpClient.FtpSocketStream(control) IP #1 = 204.236.238.164:21
[2024-08-01T15:58:22.152Z] Status:   Waiting for a response
[2024-08-01T15:58:37.249Z] # Disconnect()
[2024-08-01T15:58:37.250Z] Command:  QUIT
[2024-08-01T15:58:37.253Z] Warning:  AsyncFtpClient.Disconnect(): Exception caught and discarded while closing control connection : Unable to write data to the transport connection: An established connection was aborted by the software in your host machine..

One thing i noticed, its referencing dotnet 6 when we are running dotnet 8.

funkel1989 commented 3 months ago

@FanDjango One more thing to note, We are still waiting to see if there is any more reproduction when deployed but we found that this issue was able to be re-created on our local machines across multiple engineers and when downgrading to 49.0.2 (our previous) we are no longer encountering the issues locally but can't guarantee this is the fix yet as the issue is sporadic to begin with. We are going to deploy with FluentFtp V49 and see what happens overnight and I'll update you on that tomorrow.

FanDjango commented 3 months ago

One thing i noticed, its referencing dotnet 6 when we are running dotnet 8.

You are being serviced by the .NET 6 targeted assembly of FluentFTP - we have no higher than that in NuGet, currently.

FanDjango commented 3 months ago

and when downgrading to 49.0.2 (our previous) we are no longer encountering the issues locally but can't guarantee this is the fix yet as the issue is sporadic to begin with

Fair enough. Let's give it a day or too now that you are focusing on this.

this issue was able to be re-created on our local machines across multiple engineers (sic)

That means that I could give you our bleeding edge branch to run that against - it certainly has more and clearer logging of the things that are happening. Would you be available for that? You current logging snippet is singularly devoid of any serious failure messages (Disregard this one: [2024-08-01T15:58:03.058Z] Warning: AsyncFtpClient.Disconnect(): Exception caught and discarded while closing control connection : Unable to write data to the transport connection: An established connection was aborted by the software in your host machine..).

And yes:

I feel like AutoConnect used to re-initialize the connection if it was closed but that is where I'm getting the Timeouts.

Phew, I have checked and now I understand what you meant by your comment: In V50.1.0 there actually is a failure to reuse a Client for multiple connect-disconnect sequences and this certainly might make AutoConnect fail if a retry is needed. This I have fixed for in the current master and it is waiting for release. So that might be the first thing to investigate: Use the current master locally and report back, please.

Means: AutoConnect succeeds when the first try connects. Subsequent attempts fail?

funkel1989 commented 3 months ago

@FanDjango Can confirm. reverting back to 49.0.2 has resolved all of our timeout issues. We have not had an opportunity to test from the main branch yet.

FanDjango commented 3 months ago

Ok.

Let's divide this into topics possibly contributing to the problem:

  1. Re-using a client for another connect was flawed between 49.0.2 and 50.1.0. I discovered this myself a few weeks ago. Since AutoConnect will use this method for its retries, it will fail on a retry due to the flaw.
  2. I see in your code that you yourself will also retry, which also suffers from the flaw.
  3. A user reported that connect did not correctly honour the timeout set by the user, taking the system default instead (which is quite long). We fixed this for 50.1.0, it wasn't a big thing worth noting. But now your effective timeout might be shorter than it was before - maybe you never noticed that your timeout was quite gracious instead of the one that you were setting in your code.

Since azure stuff is notoriously slow in some areas, if you combine the above 3 points you can see that there might be an explanation - and some avenues for you to investigate.

Best would be to wait for the upcoming 51.0.0 (which fixes the "flaw" and gives more information in logging) on Nuget and stay with 49.0.2 until then. We will keep this open and then re-evaluate the issue you reported.

That work for you?

funkel1989 commented 3 months ago

@FanDjango That works for me. Thanks for being so diligent!

FanDjango commented 3 months ago

@funkel1989 V51.0.0 has been released to Nuget. At your convenience, you may want to check it out locally. I would appreciate verbose logs that bracket the failure and we can go on from there.