Open Adhara3 opened 2 months ago
Hi there.
Let's see, here are some first observations and some information.
The log is heavily redacted and consists only of excerpts, so I can't really see the whole picture. It's up to you to use the following information and hints so that we can both make progress on this.
09-18 19:46:57.162 DEBUG FtpLogger Response: 20 Daq FTP Server ready.
09-18 19:46:57.162 INFO FtpLogger Response: 530 Please login with USER and PASS. [<1ms]
"20 Daq FTP Server ready." should be 220... Are you sure this is verbatim, or is it a typo?
FluentFTP does not recognise this "20 Daq...." response, so no USER and PASS commands are issued, which in turn causes the 530 error.
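To illustrate why a two-digit code derails the login, here is a minimal sketch of reply-line parsing under the RFC 959 rule that a reply starts with a three-digit code. The `parse_reply` name and the regex are assumptions made for this example; this is not FluentFTP's actual parser.

```python
import re

# RFC 959 reply lines start with a three-digit code followed by a space
# (final line) or a hyphen (more lines follow).
_REPLY = re.compile(r"^(\d{3})([ -])(.*)$")


def parse_reply(line):
    """Return (code, is_final, text), or None if the line has no valid code."""
    match = _REPLY.match(line.rstrip("\r\n"))
    if match is None:
        return None
    code, sep, text = match.groups()
    return int(code), sep == " ", text


print(parse_reply("220 Daq FTP Server ready."))  # parsed as code 220
print(parse_reply("20 Daq FTP Server ready."))   # None: not a valid reply
```

A client that cannot parse the greeting has nothing to react to, so it plausibly never moves on to USER and PASS, and the server keeps answering 530.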
I don't know this server. Some servers, after repeated unsuccessful attempts (whoever is at fault), will block an IP address for a while as protection against attacks.
To help you debug, note that the reconnect attempts are counted:
09-18 19:50:47.076 INFO FtpLogger > Connect(True)
09-18 19:50:47.076 WARN FtpLogger Reconnect (Count: 5)
so that you can get an idea of the sequence that is happening.
If a reconnect is needed, the command that fails because of a closed control connection is "stashed" (queued, pushed, whatever you want to call it). On successful reconnect, the command environment is restored and then the stashed command is attempted again. You should see this in the log.
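The stash-and-retry idea can be sketched roughly as follows. Everything here (`execute_with_reconnect`, `ControlConnectionClosed`, `FakeServer`) is hypothetical and written only for illustration; it is not how FluentFTP implements the feature internally.

```python
class ControlConnectionClosed(Exception):
    """Raised by the hypothetical send() when the control connection is gone."""


def execute_with_reconnect(send, reconnect, command, max_reconnects=1):
    """Send a command; on a closed control connection, reconnect and retry it."""
    stashed = command  # keep the failed command so it can be replayed
    for attempt in range(max_reconnects + 1):
        try:
            return send(stashed)
        except ControlConnectionClosed:
            if attempt == max_reconnects:
                raise  # give up once the reconnect budget is spent
            reconnect()  # expected to log in again and restore the session


class FakeServer:
    """Stand-in server whose control connection starts out dead."""

    def __init__(self):
        self.connected = False
        self.reconnects = 0

    def send(self, command):
        if not self.connected:
            raise ControlConnectionClosed(command)
        return "200 OK " + command

    def reconnect(self):
        self.reconnects += 1
        self.connected = True


server = FakeServer()
print(execute_with_reconnect(server.send, server.reconnect, "STATUS"))
```

In this sketch the stashed command is replayed only after `reconnect()` has returned normally, so a reconnect that does not complete should surface as an error rather than as a retried command.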
Unless you post a complete log, this is all I can do at this moment given the available information. But maybe we can go step by step...
Hi,
first of all, thanks a lot for the reply. I understand that the provided logs are not detailed, but this is all I've got at the moment.
A bit of context now.
The server is on an embedded custom system (which I do NOT own) and it is very fragile, so I have a client pool to avoid spawning too many connections. The GUIDs you see in the logs identify such pooled clients.
09-18 19:46:57.162 DEBUG FtpLogger Using client 119a61a653d84b53a8d6b16083bd4c20
Clients are removed from the pool after 30 seconds of inactivity. I usually have 1, max 2 clients active in the pool. What happens here is that one of them is basically stuck: every request fails, and since it is the only one in the pool, I'm screwed (I will add a policy that removes a client from the pool after N consecutive failures). So the logs are not redacted; there is nothing else going on. Those are the successive calls to the same client I'm talking about, and they all fail.
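The "remove after N consecutive failures" policy mentioned above could look roughly like this. `FailureTracker` and its methods are hypothetical names for this sketch, and the threshold of 3 is an arbitrary assumption.

```python
class FailureTracker:
    """Counts consecutive failures per pooled client (hypothetical helper)."""

    def __init__(self, max_consecutive_failures=3):
        self.max_consecutive_failures = max_consecutive_failures
        self._failures = {}

    def record_success(self, client_id):
        # Any success resets the streak for that client.
        self._failures[client_id] = 0

    def record_failure(self, client_id):
        """Return True when the client should be evicted from the pool."""
        count = self._failures.get(client_id, 0) + 1
        self._failures[client_id] = count
        return count >= self.max_consecutive_failures

    def forget(self, client_id):
        # Call after eviction so a reused ID starts clean.
        self._failures.pop(client_id, None)


tracker = FailureTracker(max_consecutive_failures=3)
stuck = "119a61a653d84b53a8d6b16083bd4c20"
results = [tracker.record_failure(stuck) for _ in range(3)]
print(results)
```

When `record_failure` returns True, the pool would dispose of that client and create a fresh one instead of handing the stuck instance out again.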
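I use a client mainly to:
send custom periodic commands (e.g. status)
request FileList and download file (this happens every N minutes, so quite rare)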
09-18 19:46:57.162 DEBUG FtpLogger Response: 20 Daq FTP Server ready.
09-18 19:46:57.162 INFO FtpLogger Response: 530 Please login with USER and PASS. [<1ms]
This is exactly how it appears in the log. Other occurrences of "Daq FTP Server ready" in the log do return 220. Could it be that we lost a byte in the reading buffer? (Just wondering; it could also be that the server returned the wrong code.)
3. I don't know this server. Some servers, after repeated unsuccessful attempts (whoever is at fault) will block an IP address for a while as protection against attacks
It could be. However, while the above client was failing, there was a second one that sometimes got created and added to the pool and was working fine. This is why it looks like the issue is in the state of the failing client.
I posted this request just to see if something obvious could come to your mind. I am aware that the provided information is not detailed; unfortunately, as a user/client, I'm also on the blind side.
Thanks anyway! A
Could it be that we lost a byte in the reading buffer? (just wondering, it could be that the server returned the wrong code)
That would be a possibility, definitely, especially because of this sequence, which is difficult to understand with so little context:
09-18 19:46:57.162 INFO FtpLogger Command: ***************
09-18 19:46:57.162 DEBUG FtpLogger Waiting for response to: ***************
09-18 19:46:57.162 DEBUG FtpLogger Response: 20 Daq FTP Server ready.
What was the command (***)?
The response seems to be out of sync. It should have been a response to a connect.
Do you have logs separated for each client?
it could be, however while the above client was failing there was a second one that sometimes got created and added to the pool and was working fine, this is why it looks like the issue is in the state of the failing client.
SslStream could be unclean with regard to multiple simultaneous auth processes in an old .NET (4.7.2 !!!). I remember some such difficult-to-resolve issues from one or two years ago. You could perhaps at least rule that out by locking your parallel processes around a connect. This, I realise, you cannot do so easily, as the reconnect happens internally, beyond your control.
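For the explicit connects that the application does control, serialising them could be sketched as below. `connect_serialized` and the shared lock are assumptions for illustration; as noted above, this would not cover reconnects that happen inside the library.

```python
import threading

_connect_lock = threading.Lock()  # shared by every pooled client


def connect_serialized(connect):
    """Run connect() while holding the shared lock, so connects never overlap."""
    with _connect_lock:
        return connect()


print(connect_serialized(lambda: "connected"))
```

If the problem still appears with all explicit connects serialised this way, overlapping application-level connects are at least a less likely cause.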
Maybe, try to answer the obvious first question: Why does the initial reconnect happen in the first place? What goes wrong and then causes the whole set of subsequent problems?
I use a client mainly to:
send custom periodic commands (e.g. status)
request FileList and download file (this happens every N minutes, so quite rare)
Yes, I remember your setup.
09-18 19:46:57.162 DEBUG FtpLogger Response: 20 Daq FTP Server ready.
I would add a field after "DEBUG", showing the client ID, for each line. And then post the entire log for all clients in sequence. I would like to see what happens between the simultaneous clients in exact sequence to identify what might be interfering with each other.
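As a rough illustration of a per-line client ID field, here is a sketch using Python's standard `logging` module. The real setup is a .NET logger, so `client_logger` and the format string are assumptions that only show the shape of the output.

```python
import io
import logging


def client_logger(client_id, stream):
    """Return a logger whose lines carry client_id right after the level."""
    base = logging.getLogger("FtpLogger." + client_id)
    base.setLevel(logging.DEBUG)
    base.propagate = False  # avoid duplicate lines via the root logger
    if not base.handlers:
        handler = logging.StreamHandler(stream)
        handler.setFormatter(logging.Formatter(
            "%(asctime)s %(levelname)s [%(client_id)s] FtpLogger %(message)s",
            datefmt="%m-%d %H:%M:%S"))
        base.addHandler(handler)
    # LoggerAdapter injects client_id into every record it emits.
    return logging.LoggerAdapter(base, {"client_id": client_id})


out = io.StringIO()
log = client_logger("119a61a653d84b53a8d6b16083bd4c20", out)
log.debug("Response: 220 Daq FTP Server ready.")
print(out.getvalue(), end="")
```

With the ID on every line, the interleaved output of two pooled clients can be split into per-client sequences and still be read in exact chronological order.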
Maybe, try to answer the obvious first question: Why does the initial reconnect happen in the first place? What goes wrong and then causes the whole set of subsequent problems?
I may know this one. For historical reasons, a file download request is preceded by a custom stop command and followed by a start. Those should not impact FTP server activity but sometimes, for let's call them security reasons, the sequence may trigger a system reboot that of course impacts the FTP server, which basically goes offline.
I know, this is total shit, I agree, and I do not expect this library to solve this mess. Moreover, the library is actually doing a great job 95% of the time, recovering well from such disgraceful behavior. So the fact that it does not recover here is of course an edge case FluentFTP could probably ignore, but it's also weird, and I was just wondering whether it was triggering some hidden internal bug or race condition. Again: I do not expect this library to solve the above mess.
Possible mitigations on my side:
Skip the stop/start thing. I already have the option, but it's true that it could lead to other unknown (business-related) behaviors.
Dispose the client and create a new one.
processes in an old .NET (4.7.2 !!!)
I know, I know. So many legacy components that migrating is not easy... (actually we are on 4.8, but ok, still Framework)
I would add a field after "DEBUG", showing the client ID, for each line. And then post the entire log for all clients in sequence. I would like to see what happens between the simultaneous clients in exact sequence to identify what might be interfering with each other.
For readability I removed the ThreadID/name that follows the severity in the logs. That would do the same job, because the client pool is thread-safe but a client is used by only one thread at a time. I'll try to post some logs with the thread name; that should help.
Thanks again A
the sequence may trigger a system reboot that of course impacts the FTP server that basically goes offline.
Ah, I see. And you don't know it happens, so the control connection dies. And thus the problem begins and then escalates.
A suggestion: You could call an API such as "IsStillConnected" every 5 seconds for some 30-60 seconds before the next REAL command to the server. If it fails, you then wait, say, 90 seconds for the server to "reboot" and be ready?
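The probing idea could be sketched like this. `wait_until_ready` is a hypothetical helper; `is_connected` stands in for whatever liveness check the client offers, and the 30/5/90-second defaults are simply the numbers floated in the suggestion.

```python
import time


def wait_until_ready(is_connected, probe_window=30.0, probe_interval=5.0,
                     reboot_wait=90.0, sleep=time.sleep):
    """Probe the connection before the next real command.

    Returns True if every probe succeeded. On the first failed probe, waits
    reboot_wait seconds and returns False so the caller can reconnect.
    """
    probes = int(probe_window // probe_interval)
    for _ in range(probes):
        if not is_connected():
            sleep(reboot_wait)  # give the server time to come back up
            return False
        sleep(probe_interval)
    return True


# Demo with a recorded sleep so the example finishes instantly.
answers = iter([True, True, False])
slept = []
ok = wait_until_ready(lambda: next(answers), sleep=slept.append)
print(ok, slept)
```

A False result means the server most likely rebooted during the window, so the next step would be a fresh connect rather than the pending command.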
Any news on this? @Adhara3 I'm sure we could debug this in more detail.
FTP Server OS: Embedded
FTP Server Type: Custom
Client Computer OS: Windows 10
FluentFTP Version: 50.1.0.0
Framework: .NET Framework 4.7.2
This is a working sequence when "Reconnect needed due to disconnected control connection" occurs. As you can see, there are USER and PASS commands.
With another client (ID 119a61a653d84b53a8d6b16083bd4c20) I issued a "Command: RETR" that went "Timeout", ok fine. The problem is that I then issued another command through this second client and
So it fails while reconnecting. And later on
So despite not being able to complete the reconnect sequence, it tries to execute the command but fails. It goes on failing with 530; then, later on, a "Testing connectivity using Socket.Poll()" kicks in, which retriggers the reconnect behavior.
Do you think there could be an issue there? Is this expected behavior? Has it been fixed in 51.x (I couldn't find any evidence in the release notes)?
Thanks a lot A