ronf / asyncssh

AsyncSSH is a Python package which provides an asynchronous client and server implementation of the SSHv2 protocol on top of the Python asyncio framework.
Eclipse Public License 2.0

Transmission terminates midway / Files corrupt #510

Closed swalkner closed 2 years ago

swalkner commented 2 years ago

Hey,

I've been using asyncssh 2.11.0 for a while now and it works just great. With 2.12.0 the transfer often gets stopped (not always) after a few writes and put() raises 'Internal server error.'. Sometimes the transfer finishes, but the file on the server is corrupt anyway (it starts with a bunch of null bytes).

I am not sure how to describe the issue better. What I understand is that in 2.11.0, blocks were transmitted sequentially from offset 0 to the end of the file, while 2.12.0 introduced (async) 'parallel' writes. But I'm not even sure that is related to the problem. Attached are two log files with asyncssh DEBUG output (both for the same local file, same server/auth, etc.). "internal_server_error.log" shows that after two writes the connection is closed and put() raises 'Internal server error.'. "successful.log" shows a successful transmission, but the file on the server is corrupt anyway.

Sorry that I cannot provide a better description; I'll try to debug further. If you need more information, please do not hesitate to ask :-)

thanks, stefan

Attachments: internal_server_error.log, successful.log

swalkner commented 2 years ago

Could this perhaps be a limitation of the SFTP/SSH server? I tested it now on a local SSH server, and there the file is never corrupt, nor is the transmission closed prematurely. In the environment where this library is used, I have to rely on a provided SFTP server (where I cannot check any logs or do any further debugging via the command line). :-(

ronf commented 2 years ago

Your description sounds similar to #506. Even if you can't access it, do you have any idea what type of SFTP server this is? From the log, it looks like it might be a JSCAPE server, as it identified itself as "SSH-2.0-JSCAPE". Does that sound right to you?

The other thing I notice is that the file data writes appear to be going out "out of order". The first one that shows up in the log after the response to the open() call is:

[2022-09-14 05:41:28,475] DEBUG [asyncssh.sftp] [conn=0, chan=0] Sending write for 16384 bytes at offset 278528 in handle 00000001 [logging.py:92]

The server may not work when requests arrive out of order (which could be caused by sending multiple parallel requests).

A possible workaround would be to disable parallelism. Could you try that and see if the problem goes away? You can do this by setting max_requests=1, though this might really slow down the transfers.
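For illustration, a minimal sketch of that workaround (the host, credentials, and file names below are placeholders, not values from this thread):

```python
import asyncio
import asyncssh

async def upload() -> None:
    async with asyncssh.connect('sftp.example.com',
                                username='user', password='secret') as conn:
        async with conn.start_sftp_client() as sftp:
            # max_requests=1 disables parallel I/O, so each write must be
            # acknowledged before the next is sent (slower, but in order).
            await sftp.put('local.bin', 'remote.bin', max_requests=1)

asyncio.run(upload())
```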

I'm not entirely sure why it would start at a non-zero offset like this. I did make some changes to add a new mechanism for triggering parallel reads or writes, but for writes in particular I would have expected it to work similarly to how it did before. I'm also not sure what is printing the debug message "Internal server error". Is that your message? If so, do you have any kind of traceback or other details about the error which triggers it?

swalkner commented 2 years ago

Thank you for your fast response. Sorry, I missed #506, as it referred to v2.11.0 in the subject. "SSH-2.0-JSCAPE" is the actual string returned by the server. Unfortunately, I do not have any more information about that server/provider [a huge governmental/public company is behind it, with slow/no response] :-(

I also noticed the out-of-order writes [I wanted to put that in the subject, but wasn't sure whether it was the issue or a feature]. However, they also appear in the 'successful' logs and with my test SSH server, where everything works out fine, so I thought they were on purpose :-|

I was not sure about 'Internal server error' either. It seems to be a response from the server: at sftp.py:834, 'reason' contains this string. The content of _packet is: b'e\x00\x00\x00\x06\x00\x00\x00\x04\x00\x00\x00\x16Internal server error.\x00\x00\x00\x02en' Maybe it's not a fully compliant SFTP server then, or it has its own behavior on errors?
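For reference, that packet parses as an ordinary SSH_FXP_STATUS response. A minimal decoding sketch (not from the original thread, and assuming the leading byte is the SFTP packet type):

```python
import struct

packet = (b'e\x00\x00\x00\x06\x00\x00\x00\x04\x00\x00\x00\x16'
          b'Internal server error.\x00\x00\x00\x02en')

pkt_type = packet[0]                                  # 0x65 = 101 = SSH_FXP_STATUS
request_id, code = struct.unpack('>II', packet[1:9])  # request 6, code 4 = SSH_FX_FAILURE
msg_len, = struct.unpack('>I', packet[9:13])          # 22-byte error message
message = packet[13:13 + msg_len].decode()            # 'Internal server error.'
language = packet[13 + msg_len + 4:].decode()         # 'en'

print(pkt_type, request_id, code, message, language)
```

So the server is reporting a generic SSH_FX_FAILURE rather than anything more specific.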

You are correct: max_requests=1 really gets rid of this issue, and I could not detect any errors anymore.

thx, stefan

swalkner commented 2 years ago

But _pending contains all the 'tasks' for writing chunks (offset, size). It is defined as a set - wouldn't a set mix up the order of the write commands?

ronf commented 2 years ago

Good point on the #506 title - I've cleaned that up.

Parallel I/O in SFTP is actually not new in 2.12.0; in fact, it goes back to version 1.5.3, released in 2016. However, some changes were made in 2.12.0 to make the parallel I/O visible to callers wanting to do something other than just copy files between local & remote SFTP servers, potentially with responses to reads coming in out of order.

There was no intention to change the ordering of write requests here, though, so I need to figure out how that's happening. If I can return to submitting those writes in order, it may avoid problems with this server. That would still allow the writes to complete out of order, and should also still get the speed benefit of having multiple requests in flight, rather than waiting for each write to be acknowledged before sending the next.

While it's true that _pending is a set() (and has been one for a long time), things are scheduled based on an ordering that doesn't involve pulling tasks from _pending. Similarly, the order in which responses are delivered is based on the order of the server's acknowledgements, at which point things are removed from the pending set. However, I'll take a closer look at that to see if something changed there.

ronf commented 2 years ago

Thanks for confirming the internal server error is in fact a server response. My guess is this server might be sending that as soon as a client attempts a non-sequential write (which is not what you'd expect from an SFTP server).

swalkner commented 2 years ago

Oh - now I finally understand the intentions; I wasn't fully sure. If there is anything I can do or debug, please let me know. Thanks so much for the insights.

ronf commented 2 years ago

Thanks - I really appreciate that, as I can't really reproduce the issue here. As soon as I have a potential change to try, I'll let you know!

ronf commented 2 years ago

Which OS are you running AsyncSSH on? I've tried both macOS and Windows here so far, but when using put() I still always see the writes go out in order, or at least start out that way, whereas your test seems to have the very first write out of order.

In your tests, are you doing anything more than just starting up an SFTP client and calling put() on it?

swalkner commented 2 years ago

I'm using Ubuntu 20.04 on my systems. Usually lots of asyncio tasks are running, but for the test - to reproduce the error - asyncssh was the only asyncio-ish task running. So no - just connecting and put()-ing one file. I'll experiment with max_requests and see if, e.g., with 2/3/4 the order stays more correct.

swalkner commented 2 years ago

I think the problem is the asynchronous reading. The reads of the local file also happen asynchronously, and they finish out of order. This means iter() already returns blocks out of order for the writes, so offset, size, count, result = task.result() does not yield offsets in order. Isn't that a general problem of async/parallel reading and then writing? Maybe I'm following/understanding the code/intentions wrong...
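For illustration (not from the original thread), a minimal sketch of the underlying effect: when tasks complete at different speeds, iterating the set returned by asyncio.wait() yields results in arbitrary order, not offset order:

```python
import asyncio
import random

async def read_block(offset: int) -> int:
    # Simulate a local file read whose latency varies per block.
    await asyncio.sleep(random.uniform(0, 0.05))
    return offset

async def main() -> None:
    tasks = {asyncio.create_task(read_block(o)) for o in range(0, 81920, 16384)}
    done, _pending = await asyncio.wait(tasks)
    # 'done' is a set: iterating it gives completed reads in an
    # arbitrary order, so the offsets come back shuffled.
    print([task.result() for task in done])

asyncio.run(main())
```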

ronf commented 2 years ago

Ok - I was able to reproduce this finally this morning. It doesn't happen on macOS or Windows, but I was able to get it to happen on Ubuntu, which is surprising. While I have run into network stack differences between Linux and macOS before, I haven't run into differences in asyncio scheduling between them.

I haven't drilled down to find the root cause of the difference yet, but I'll try to dig into this over the weekend.

It also looks like I probably need to update my unit tests to include a mock server that doesn't accept out-of-order writes, as currently all my tests pass on macOS, Windows, and Linux.

Since allowing reads to complete out of order is actually a feature, I'm not quite sure how to reliably fix this in all cases. I may need to add an optional argument indicating whether to force write requests to be sent in order. This would still allow parallel writes in flight, but it would wait for the earliest outstanding read to complete before triggering the corresponding write.
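A minimal sketch of that idea (a hypothetical helper, not AsyncSSH code): buffer reads that complete early in a heap and only issue writes once the data is contiguous with what has already been written:

```python
import heapq
from typing import AsyncIterator, Tuple

async def write_in_order(dst, completed_reads: AsyncIterator[Tuple[int, bytes]],
                         start_offset: int = 0) -> None:
    """Issue writes strictly in offset order, even if reads finish early.

    'completed_reads' yields (offset, data) pairs in completion order;
    'dst.write(data, offset)' mirrors AsyncSSH's SFTPClientFile.write().
    """
    buffered = []                 # min-heap of (offset, data) awaiting their turn
    next_offset = start_offset    # next byte position we may write
    async for offset, data in completed_reads:
        heapq.heappush(buffered, (offset, data))
        # Drain every buffered block that is now contiguous.
        while buffered and buffered[0][0] == next_offset:
            offset, data = heapq.heappop(buffered)
            await dst.write(data, offset)
            next_offset += len(data)
```

Writes can still overlap with in-flight reads, but they always leave in offset order.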

ronf commented 2 years ago

You're absolutely right about the issue being with the reads returning out of order. That's probably related in part to the fact that the return values from wait() are both sets (not only pending, but also done, which we do iterate over), but also one of the points of this parallel I/O class was to allow out-of-order completion like this.

What I couldn't figure out was why this hasn't been a problem all along, as even the 2.11 and earlier code iterated over the "done" list. After a closer inspection, I realized what it was: in the previous SFTPCopier code, the read() and write() were both done in the same task, with the task not being considered complete until both had finished, and new read/write tasks were only created (in order) when previous writes completed. In the 2.12 SFTPCopier code, though, we only complete the read() portion before treating a task as complete, and then perform the write in a parent task. This means the writes occur in the order reads are returned from the iteration over the set returned by wait(), which is unpredictable in its ordering. It also means that writes don't actually get parallelized as well as they did before. This isn't a big deal for get(), where the writes go to a local file and always complete "immediately", but it is probably hurting put() performance.

In the old code, writes were basically guaranteed to be in order as long as the server didn't return write acknowledgments out of order, which is probably a safe bet on a system that can't handle random access file I/O.

I'll look into putting the SFTPCopier code back to more like what it was in 2.11 where tasks do both the read & write, rather than iterating over tasks as just the reads complete. I'll let you know when I have a potential fix.

ronf commented 2 years ago

Ok - I have an initial cut at a fix:

diff --git a/asyncssh/sftp.py b/asyncssh/sftp.py
index f7befde..5cc8498 100644
--- a/asyncssh/sftp.py
+++ b/asyncssh/sftp.py
@@ -733,7 +733,7 @@ class _SFTPFileWriter(_SFTPParallelIO[int]):
         async for _ in self.iter():
             pass

-class _SFTPFileCopier(_SFTPParallelIO):
+class _SFTPFileCopier(_SFTPParallelIO[int]):
     """SFTP file copier

     This class performs an SFTP file copy, initiating multiple
@@ -760,14 +760,17 @@ class _SFTPFileCopier(_SFTPParallelIO):
         self._total_bytes = total_bytes
         self._progress_handler = progress_handler

-    async def run_task(self, offset: int, size: int) -> Tuple[int, bytes]:
-        """Read a block of the source file"""
+    async def run_task(self, offset: int, size: int) -> Tuple[int, int]:
+        """Copy a block of the source file"""

         assert self._src is not None
+        assert self._dst is not None

         data = await self._src.read(size, offset)
+        await self._dst.write(data, offset)
+        datalen = len(data)

-        return len(data), data
+        return datalen, datalen

     async def run(self) -> None:
         """Perform parallel file copy"""
@@ -779,8 +782,8 @@ class _SFTPFileCopier(_SFTPParallelIO):
             if self._progress_handler and self._total_bytes == 0:
                 self._progress_handler(self._srcpath, self._dstpath, 0, 0)

-            async for offset, data in self.iter():
-                if not data:
+            async for offset, datalen in self.iter():
+                if not datalen:
                     exc = SFTPFailure('Unexpected EOF during file copy')

                     setattr(exc, 'filename', self._srcpath)
@@ -788,10 +791,8 @@ class _SFTPFileCopier(_SFTPParallelIO):

                     raise exc

-                await self._dst.write(data, offset)
-
                 if self._progress_handler:
-                    self._bytes_copied += len(data)
+                    self._bytes_copied += datalen
                     self._progress_handler(self._srcpath, self._dstpath,
                                            self._bytes_copied,
                                            self._total_bytes)

This should cover the common case where the selected block size is small enough for all reads to return the full amount of requested data (except at EOF). If a read returns less than the full amount and another read needs to be issued to get the rest of the data, that may still result in out-of-order writes, but that can be avoided by simply lowering the block size.
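For example (the value here is illustrative; 16384 bytes is AsyncSSH's default SFTP block size), a transfer could lower the block size like this:

```python
# Continuing the earlier connection sketch: a smaller block_size makes
# short reads (and the re-issued partial reads that can reorder writes)
# less likely.
await sftp.put('local.bin', 'remote.bin', block_size=8192)
```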

If reads actually complete out of order, it's also possible that this will trigger out of order writes, but this shouldn't ever happen with a put() call (since you are reading from the local filesystem), and from what I've seen it doesn't happen on copy() either, as most SFTP servers will still process requests in the order they are received. There's actually wording in the SFTP spec about processing requests in order, but only when dealing with overlapping reads & writes, which isn't happening here, and even then responses don't have to be delivered in order. In practice, though, I think this is pretty safe.

To cover these kinds of servers completely, I'd probably need to introduce a new sequential boolean argument on the get/put/copy calls which would guarantee that writes always go out in order, regardless of the order in which reads complete. However, I think I'll hold off on that more complex change until someone actually reports it as an issue.

Could you give this change a try and see if it works for you?

swalkner commented 2 years ago

Wow - thanks for your detailed investigation and explanations. I think these assumptions/fixes should tackle the problem. I'll test the changes ASAP tomorrow and report back. Thanks so much!

swalkner commented 2 years ago

This looks very good to me and I was not able to reproduce the error anymore.

ronf commented 2 years ago

Thanks for the report and your help on this! The code above is now checked into the "develop" branch as commit ee1baff.

swalkner commented 2 years ago

Thanks so much again. I'll close the issue now :-)

ronf commented 1 year ago

This change is now available in AsyncSSH 2.13.0.