Timeout exeption during reconnect is preventing re-auth to happen?

Adhara3 commented 2 months ago

FTP Server OS: Embedded

FTP Server Type: Custom

Client Computer OS: Windows 10

FluentFTP Version: 50.1.0.0

Framework: .NET Framework 4.7.2

This is a working sequence when Reconnect needed due to disconnected control connection. As you can see there are USER and PASS commands

09-18 19:46:50.114 DEBUG FtpLogger Using client 9bd8a9aa3af444e2b04765423276ff84 
09-18 19:46:50.114 WARN   FtpLogger Reconnect needed due to disconnected control connection 
09-18 19:46:50.114 INFO   FtpLogger Command stashed:  *************** 
09-18 19:46:50.114 INFO  FtpLogger >         Connect(True) 
09-18 19:46:50.114 WARN   FtpLogger Reconnect (Count: 1) 
09-18 19:46:50.114 DEBUG  FtpLogger FluentFTP 50.1.0.0(.NET Framework 4.7.2) FtpClient 
09-18 19:46:50.114 DEBUG  FtpLogger Connecting(sync) FtpClient.FtpSocketStream(control) IP #1 = ***:21 
09-18 19:46:50.114 DEBUG FtpLogger Waiting for a response 
09-18 19:46:50.114 INFO  FtpLogger Response: 220 Daq FTP Server ready. [812ms] 
09-18 19:46:50.114 INFO FtpLogger Command:  USER *** 
09-18 19:46:50.114 DEBUG FtpLogger Waiting for response to: USER *** 
09-18 19:46:50.114 INFO  FtpLogger Response: 331 Password required for ***. [<1ms] 
09-18 19:46:50.114 INFO  FtpLogger Command:  PASS *** 
09-18 19:46:50.114 DEBUG FtpLogger Waiting for response to: PASS *** 
09-18 19:46:50.114 INFO  FtpLogger Response: 230 User *** logged in. [<1ms]

With another client (ID 119a61a653d84b53a8d6b16083bd4c20) I issued a Command: RETR that went Timeout, ok fine. The problem is that Tthen I issued another command through this second client and

09-18 19:46:57.162 DEBUG FtpLogger Using client 119a61a653d84b53a8d6b16083bd4c20
09-18 19:46:49.099 WARN  FtpLogger Reconnect needed due to disconnected control connection 
09-18 19:46:49.099 INFO  FtpLogger Command stashed: *************** 
09-18 19:46:49.099 INFO  FtpLogger >         Connect(True) 
09-18 19:46:49.099 WARN  FtpLogger Reconnect (Count: 3) 
09-18 19:46:49.099 DEBUG FtpLogger FluentFTP 50.1.0.0(.NET Framework 4.7.2) FtpClient 
09-18 19:46:49.099 DEBUG FtpLogger Connecting(sync) FtpClient.FtpSocketStream(control) IP #1 = ***:21 
09-18 19:46:49.099 DEBUG FtpLogger Waiting for a response 
09-18 19:46:49.614 ERROR FtpLogger Exception while executing FTP command. 
System.TimeoutException: Timed out trying to read data from the socket stream!

So it fails while reconnecting. And later on

09-18 19:46:57.162 DEBUG FtpLogger Using client 119a61a653d84b53a8d6b16083bd4c20 
09-18 19:46:57.162 INFO  FtpLogger Command:  *************** 
09-18 19:46:57.162 DEBUG FtpLogger Waiting for response to: ***************  
09-18 19:46:57.162 DEBUG FtpLogger Response: 20 Daq FTP Server ready. 
09-18 19:46:57.162 INFO  FtpLogger Response: 530 Please login with USER and PASS. [<1ms] 
09-18 19:46:57.162 ERROR FtpLogger FtpCommandException while executing FTP command. Error code 530, Response Type PermanentNegativeCompletion.

So despite not being able to complete the reconnect sequence it tries to execute the command but fails. It goes on failing with 530, then later on, a Testing connectivity using Socket.Poll() kicks in which retriggers the reconnect behavior

09-18 19:50:45.045 DEBUG FtpLogger Using client 119a61a653d84b53a8d6b16083bd4c20 
09-18 19:50:45.045 DEBUG FtpLogger Testing connectivity using Socket.Poll()... 
09-18 19:50:45.545 INFO  FtpLogger Command: ***************  
09-18 19:50:45.545 DEBUG FtpLogger Waiting for response to: *************** 
09-18 19:50:45.545 ERROR FtpLogger Exception while executing FTP command. 
System.IO.IOException: Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host. ---> System.Net.Sockets.SocketException: An existing connection was forcibly closed by the remote host

09-18 19:50:46.326 INFO  FtpLogger FTP client interactions took 264 
09-18 19:50:47.076 DEBUG FtpLogger Using client 119a61a653d84b53a8d6b16083bd4c20 
09-18 19:50:47.076 WARN  FtpLogger Reconnect needed due to disconnected control connection 
09-18 19:50:47.076 INFO  FtpLogger Command stashed: ***************
09-18 19:50:47.076 INFO  FtpLogger >         Connect(True) 
09-18 19:50:47.076 WARN  FtpLogger Reconnect (Count: 5) 
09-18 19:50:47.076 DEBUG FtpLogger FluentFTP 50.1.0.0(.NET Framework 4.7.2) FtpClient 
09-18 19:50:47.076 DEBUG FtpLogger Connecting(sync) FtpClient.FtpSocketStream(control) IP #1 = ***:21 
09-18 19:50:47.326 ERROR FtpLogger Exception while executing FTP command. 
System.TimeoutException: Timed out trying to connect!

(The above 3 times more, then)

09-18 19:50:50.107 DEBUG FtpLogger Using client 119a61a653d84b53a8d6b16083bd4c20 
09-18 19:50:50.107 INFO  FtpLogger Command:   *************** 
09-18 19:50:50.107 ERROR FtpLogger Exception while executing FTP command. 
System.IO.IOException: Unable to write data to the transport connection: An existing connection was forcibly closed by the remote host. ---> System.Net.Sockets.SocketException: An existing connection was forcibly closed by the remote host

Do you think there could be an issue there? Is this expected behavior? Has it been fixed in 51.x (I couldn't find any evidence in the release notes)?

Thanks a lot A

FanDjango commented 2 months ago

Hi there.

Let's see, here are some first observations and some information.

Very redacted log and only excerpts. So I can't really see the total picture, it's up to you to use the following info and hints so that we can both progress on this.

09-18 19:46:57.162 DEBUG FtpLogger Response: 20 Daq FTP Server ready. 
09-18 19:46:57.162 INFO  FtpLogger Response: 530 Please login with USER and PASS. [<1ms]

20 Daq FTP Server ready. should be 220... Are you sure this is verbatim or is it a typo?

This "20 Daq...." response is not recognised by FluentFTP, thus no USER and PASS command is issued, thus the error 530 is caused.

I don't know this server. Some servers, after repeated unsuccessful attempts (whoever is at fault) will block an IP address for a while as protection against attacks
As an information for you to debug, note that the reconnect attempts are counted.
```
09-18 19:50:47.076 INFO  FtpLogger >         Connect(True) 
09-18 19:50:47.076 WARN  FtpLogger Reconnect (Count: 5) 
```
so that you can get an idea of the sequence that is happening.
If a reconnect is needed, the command that fails because of a close control connection is "stashed". queued, pushed, whatever you want to call it. On successful reconnect, the command environment is restored and then the stashed command is attempted again. You should see this in the log.

Unless you post a complete log, this is all I can do at this moment given the available information. But maybe we can go step by step...

Adhara3 commented 2 months ago

Hi,

first of all, thanks a lot for the reply. I understand that the provided logs are not detailed, but this is all I've got at the moment.

A bit of context now. The server is on an embedded custom system (i do NOT own) and it is very fragile so I have a client pool to avoid spawning to many connections. The GUIDs you see in the logs are identifying such pooled clients. 09-18 19:46:57.162 DEBUG FtpLogger Using client 119a61a653d84b53a8d6b16083bd4c20

Clients are removed from the pool after 30 seconds of inactivity. I usually have 1, max 2 clients active in the pool. What happens here is that one of them is basically stuck, every request is failing but being the only one in the pool I'm screwed (I will add a policy that a client is removed from the pool after N consecutive failures). So logs are are not redacted, there is nothing else going on, those are the successive calls to the same client I'm talking about and they all fail.

I use a client mainly to:

send custom periodic commands (e.g status)
request FileList and download file (this happens every N minutes, so quite rare)

09-18 19:46:57.162 DEBUG FtpLogger Response: 20 Daq FTP Server ready. 
09-18 19:46:57.162 INFO  FtpLogger Response: 530 Please login with USER and PASS. [<1ms]

This is exacly how it appears in the log. Other occurrences in the log of the Daq FTP Server ready do return 220. Could it be that we lost a byte in the reading buffer? (just wondering, it could be that the server returned the wrong code)

3. I don't know this server. Some servers, after repeated unsuccessful attempts (whoever is at fault) will block an IP address for a while as protection against attacks

It could be, however while the above client was failing there was a second one that sometimes got created and added to the pool and was working fine, this is why it looks like the issue is in the state of the failing client.

I posted this request just to see if something obvious could come to your mind. I am aware that the provided information are not detailed, unfortunately as a user/client I'm also on the blind side.

Thanks anyway! A

FanDjango commented 2 months ago

Could it be that we lost a byte in the reading buffer? (just wondering, it could be that the server returned the wrong code)

That would be a possibility, definitely, especially because of this sequence, which is difficult to understand with so small context:

09-18 19:46:57.162 INFO  FtpLogger Command:  *************** 
09-18 19:46:57.162 DEBUG FtpLogger Waiting for response to: ***************  
09-18 19:46:57.162 DEBUG FtpLogger Response: 20 Daq FTP Server ready.

What was the command (***)?

The response seems to be out of sync. It should have been a response to a connect.

Do you have logs separated for each client?

FanDjango commented 2 months ago

it could be, however while the above client was failing there was a second one that sometimes got created and added to the pool and was working fine, this is why it looks like the issue is in the state of the failing client.

SslStream could be unclean as to multiple simultaneous auth processes in an old .NET (4.7.2 !!!), I remember some such difficult to resolve issues from one or two years ago. You could perhaps at least disprove that, by locking your parallel processes around a connect - and this, I realise, you cannot do so easily, as the reconnect happens internally beyond your control.

Maybe, try to answer the obvious first question: Why does the initial reconnect happen in the first place? What goes wrong and then causes the whole set of subsequent problems?

FanDjango commented 2 months ago

I use a client mainly to: send custom periodic commands (e.g status) request FileList and download file (this happens every N minutes, so quite rare)

Yes, I remember your setup.

FanDjango commented 2 months ago

09-18 19:46:57.162 DEBUG FtpLogger Response: 20 Daq FTP Server ready.

I would add a field after "DEBUG", showing the client ID, for each line. And then post the entire log for all clients in sequence. I would like to see what happens between the simultaneous clients in exact sequence to identify what might be interfering with each other.

Adhara3 commented 2 months ago

Maybe, try to answer the obvious first question: Why does the initial reconnect happen in the first place? What goes wrong and then causes the whole set of subsequent problems?

I may know this one. For historical reasons a file download request is preceded by a custom stop command and followed by a start. Those should not impact FTP server activity but, sometimes, for let's call them security reasons the sequence may trigger a system reboot that of course impacts the FTP server that basically goes offline. I know, this is total shit, I agree and I do not expect this library to solve this mess. Moreover, the library is actually doing a great job 95% of the times, recovering well from such a disgraceful behavior. So the fact that it does not recover here is of course an edge case FluentFTP could probably ignore but it's also weird and I was just wondering if it was triggering some internal hidden bug/race condition. Again: I do not expect this library to solve the above mess.

Possible mitigations on my side:

avoid the stop start thing. I already have the option but it's ture that it could lead to other unknown (business related) behaviors
avoid keeping in the pool a broken client a.k.a trust FluentFTP ability to recover but be also kind, if it can't, Dispose the client and create a new one

processes in an old .NET (4.7.2 !!!)

I know, I know. So many legacy components that migrating is not easy... (actually we are on 4.8, but ok, still Framework)

I would add a field after "DEBUG", showing the client ID, for each line. And then post the entire log for all clients in sequence. I would like to see what happens between the simultaneous clients in exact sequence to identify what might be interfering with each other.

For readability I removed the ThreadID/name that follows severity in logs. That would do the same job because client pool is thread safe but a client is used by a thread at time. I'll try to post some logs with thread name, should help

Thanks again A

FanDjango commented 2 months ago

the sequence may trigger a system reboot that of course impacts the FTP server that basically goes offline.

Ah, I see. And you don't know it happens, so the control connection dies. And thus the problem begins and then escalates.

A suggestion: You could do a API "IsStillConnected" for some 30-60 seconds, every 5 seconds, before the next REAL command to the server. If it fails, then you wait, say, 90 seconds for the server to "reboot" and be ready?

FanDjango commented 2 months ago

Any news on this? @Adhara3 I'm sure we could debug this in more detail.

robinrodricks / FluentFTP

Timeout exeption during reconnect is preventing re-auth to happen? #1646