Closed: mxmlnkn closed this issue 1 month ago.
I think what you're running into here is the fact that SFTP reads are parallelized for speed. However, when you tell it to read more data than what's actually in the file, it tries to parallelize a lot more read operations, extending all the way out to whatever chunk size you gave it. Eventually many of those reads will fail to return any data, but it still has to wait for those responses to come in.
On the other hand, if you pass in -1 (or let it default to that), the AsyncSSH SFTP client will schedule only the exact reads necessary to read the file data which is actually present, and it will return as soon as those complete.
In your example with a 2 GB chunk size and a 60 MB file, you are scheduling something like 30x the number of reads you actually need, and it takes something like 12x as long. If you drop to 1 GB, it's about twice as fast, because only half as many reads are scheduled. This continues to improve until your chunk size is roughly equal to the real file size.
It would be possible to have AsyncSSH check the file size before blindly scheduling the unnecessary reads, but it's probably better to not waste the time to do the stat() to read the file size unless it absolutely has to (when no read size is specified). Otherwise, that stat() would happen on every read() call you make, and that's not something you'd really want.
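As a rough sketch (not AsyncSSH's actual code; the helper name is hypothetical), the difference between clamping the request to the real file size first, which is effectively what `read(-1)` achieves, and blindly scheduling the full requested range looks like this:

```python
def plan_reads(read_size, file_size, offset=0, block_size=16 * 1024):
    """Block-sized reads needed once the request is clamped to EOF,
    i.e. only reads for data that is actually present."""
    end = min(offset + read_size, file_size)
    return [(o, min(block_size, end - o)) for o in range(offset, end, block_size)]

MiB = 1024 * 1024
clamped = plan_reads(read_size=2048 * MiB, file_size=60 * MiB)
blind = plan_reads(read_size=2048 * MiB, file_size=2048 * MiB)  # no clamping

print(len(clamped))  # 3840 reads actually needed for the 60 MiB of data
print(len(blind))    # 131072 reads scheduled for the full 2 GiB request
```

With clamping, only the ~3840 reads covering the real data are scheduled; without it, the remaining ~127k reads all have to come back empty before the call can return.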
> It would be possible to have AsyncSSH check the file size before blindly scheduling the unnecessary reads, but it's probably better to not waste the time to do the stat() to read the file size unless it absolutely has to (when no read size is specified). Otherwise, that stat() would happen on every read() call you make, and that's not something you'd really want.
I agree that calling stat inside read is not desirable. Paramiko has a `max_concurrent_requests` option to limit the problem of exuberant requests. Although the default for that option is unlimited, which also already made me run into that problem, even worse, the problem also appears there with `read(-1)`. There is an enlightening transfer-size-over-time plot in the middle of this large post.
AsyncSSH has a `max_requests` argument which can be passed into the `open()` call, which will limit how many outstanding parallel requests it will create. There's also a `block_size` argument to control the maximum number of bytes in a single read or write request. However, when you specify a large total number of bytes, it schedules whatever `max_requests` and `block_size` will allow and then just locally tracks the remaining bytes it needs to read. Once the requests start to finish, it schedules more of them, until it has read all the bytes you requested. This is why it takes so much longer.
I suppose one option might be to stop as soon as one of the individual reads returns EOF. It would still need to wait for all the reads it still has outstanding, but that could be much less than the total number of requests it would have scheduled. The default block size is 16 KB and max requests is 128, so this would limit the "overshoot" to no more than 2 MiB, if you didn't change either of these values, and you could tune that if you wanted to (though turning it down too much would decrease the parallelism and actually make it slower).
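That 2 MiB bound follows directly from the two defaults; a tiny helper (hypothetical name, just to make the bound explicit):

```python
def max_overshoot(block_size: int, max_requests: int) -> int:
    """Worst-case bytes still in flight past EOF once reads stop at the
    first EOF result: at most max_requests reads of block_size bytes."""
    return block_size * max_requests

# Defaults mentioned above: 16 KiB block size, 128 max requests.
print(max_overshoot(16 * 1024, 128))  # 2097152 bytes, i.e. 2 MiB
```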
Try the following change:
```diff
diff --git a/asyncssh/sftp.py b/asyncssh/sftp.py
index 825dba1..3eb115c 100644
--- a/asyncssh/sftp.py
+++ b/asyncssh/sftp.py
@@ -670,7 +670,7 @@ class _SFTPParallelIO(Generic[_T]):
                 self._pending.add(asyncio.ensure_future(
                     self._start_task(offset+count, size-count)))
             except SFTPEOFError:
-                pass
+                self._bytes_left = 0
             except (OSError, SFTPError) as exc:
                 exceptions.append(exc)
```
Does that put the time more in line with the `read(-1)` case?
> Does that put the time more in line with the `read(-1)` case?
Yes, it does. And the diff looks beautifully simple enough that it shouldn't introduce errors.
Terrific - thanks for confirming and for putting together that performance measurement here!
As expected, performance will initially increase as you do larger reads. Once you get above 2 MiB, though, you don't really get much benefit as it will already have the maximum number of parallel reads running at once. With this change, the performance will basically top out after that and be close to what you get in the -1 case, instead of slowing down due to attempting to repeatedly read past the end of the file.
It's also nice to see the much better performance with fsspec with AsyncSSH vs. sshfs (which is using paramiko, I think).
This change is now available as commit 8fdce8a in the "develop" branch.
> It's also nice to see the much better performance with fsspec with AsyncSSH vs. sshfs (which is using paramiko, I think).
Yes, it is using Paramiko. I am doing a performance comparison in order to decide between the two SSH fsspec implementations.
Note that Paramiko can reach 100 MB/s (3x faster than asyncssh), but only when reading a file fully, because for that use case it prefetches the whole file in one go. However, this does not work for my use case, because the prefetching is too simple and does not work with the file object abstraction, and especially not when combining it with seeking inside the file.
Because of that, I am currently leaning towards using the asyncssh-based fsspec implementation.
However, I am also currently doing comparison benchmarks with other command line tools such as scp, lftp, rsync, and rclone. Especially the latter can reach up to 300 MB/s, i.e., 10x faster than asyncssh. It would be really interesting to find out where that performance gets lost in asyncssh. I doubt that this is bottlenecked by CPU time spent on encryption, as hard drive encryption can reach multiple gigabytes per second. Maybe something as simple as increasing the maximum number of concurrent requests might help to increase the bandwidth further, of course with the possible detriment of overreading.
Thanks for the links...
If you'd like to play with different numbers of parallel requests or different block sizes for the raw reads, you can do that by setting `max_requests` and `block_size` as mentioned above. These arguments can be passed into `open()`, or they can be passed to higher-level calls like `get()` which can be used to copy between local and remote files without having to do read or write calls yourself.
With regard to prefetching, I'm not sure I understand what it would mean to do a timing measurement in that case. Once the data is in a local cache, it should be blazingly fast, with the only cost potentially being a memory-to-memory copy to get all the bytes into a single contiguous block. If you describe a bit more about how you might want to use prefetching, I can think about it as a possible feature, if I can generalize it enough.
In a quick test here with file data cached in memory but still going through the ParallelIO code, I can get about 1 GiB/sec (between 900 and 1200 MiB/sec on multiple runs) for reading from a 64 MiB file. That doesn't include the cost of getting the file data into memory the first time, though.
> This change is now available as commit 8fdce8a in the "develop" branch.
Thanks again. I'm closing the issue then.
> With regard to prefetching, I'm not sure I understand what it would mean to do a timing measurement in that case. Once the data is in a local cache, it should be blazingly fast, with the only cost potentially being a memory-to-memory copy to get all the bytes into a single contiguous block. If you describe a bit more about how you might want to use prefetching, I can think about it as a possible feature, if I can generalize it enough.
I already count the concurrent get requests as prefetching, i.e., instead of triggering the get requests in order after each previous request has finished, they are started in "parallel". It would be even better, and more aptly called prefetching, if this request starting would even work over multiple read calls, i.e., when doing a `read(128 KiB)`, already start requests up to `max_concurrent_requests` (2 MiB) instead of limiting it to 128 KiB. Of course, this can be detrimental for many small random-seek scenarios. That's when a "smarter prefetching strategy", as I requested for Paramiko, would help. So for example:

- `read(128 KiB)` -> prefetch nothing because we cannot anticipate any pattern, or maybe do a full prefetch with `max_concurrent_requests` because it is the very first call and sequential access is the more likely / more supported use case.
- `read(128 KiB)` -> prefetch the next 256 KiB (up to `max_concurrent_requests`) because the read followed sequentially from the last.
- `read(128 KiB)` -> prefetch the next 512 KiB (up to `max_concurrent_requests`) because the read followed sequentially from the last, i.e., a doubling strategy for each consecutive read.
- `seek(0)`
- `read(128 KiB)` -> prefetch nothing because the sequential pattern has been broken.
- `read(128 KiB)` -> prefetch the next 256 KiB (up to `max_concurrent_requests`) because the read followed sequentially from the last.

> That doesn't include the cost of getting the file data into memory the first time, though.
The paramiko benchmark I mentioned does include that cost. It is not a pure memory-copy-through-abstractions benchmark. The prefetching is done during the benchmark, not before it.
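The doubling strategy sketched above could be tracked with something as small as this (a hypothetical sketch, neither Paramiko nor AsyncSSH code; `cap` plays the role of the `max_concurrent_requests` byte budget, and the "prefetch nothing on the first call" variant is used):

```python
class SequentialPrefetcher:
    """Doubling readahead: nothing after a seek (pattern broken), then
    twice the previous prefetch on each sequential read, capped."""

    def __init__(self, cap=2 * 1024 * 1024):
        self._cap = cap
        self._next_offset = None  # offset a sequential read would start at
        self._prefetch = 0        # size of the last prefetch

    def on_read(self, offset, size):
        """Return how many bytes to prefetch after read(offset, size)."""
        if offset == self._next_offset:
            # Sequential: double the readahead, starting from 2x the read size.
            self._prefetch = min(max(2 * size, 2 * self._prefetch), self._cap)
        else:
            # First read or seek: no pattern to anticipate yet.
            self._prefetch = 0
        self._next_offset = offset + size
        return self._prefetch
```

Replaying the example sequence from the list above (128 KiB reads): 0, 256 KiB, 512 KiB, then 0 after the seek back to offset 0, then 256 KiB again.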
Here are the alluded-to benchmark results for file reading with other tools. The upper 4 or 5 may share the same backend, for them to all be somewhat equally fast. I am reading from and to /dev/shm (`scp localhost:/dev/shm/silesia.tar.gz /dev/shm/downloaded.tar.gz`) to largely exclude I/O bottlenecks and only benchmark the network stack and encryption. fsspec/sshfs would be the asyncssh benchmark (with abstraction layer) and fsspec.sftp would be the paramiko benchmark with `open` + `read(-1)`.
> If you'd like to play with different numbers of parallel requests or different block sizes for the raw reads, you can do that by setting `max_requests` and `block_size` as mentioned above.
Thanks. I'll do that next.
I am a bit stumped. Except for the almost pathological case of `max_requests = 1`, I see basically no performance differences at all when varying `max_requests`. It gets quite a bit slower for 1 and slightly slower for 4096, but aside from that it is pretty stable at ~30 MB/s. Could it be that something is not working as intended?
I have looked at the source code of fsspec/sshfs, and all arguments to `SSHFileSystem.open` get forwarded to `SSHFile`, which does have explicit `max_requests` and `block_size` arguments, which do get further forwarded to asyncssh. I even added debug output to `asyncssh.sftp.SFTPClientFile.read` to print out `self._max_requests`, and it does indeed print the values I have specified.
I tried to trace it, but there are no obvious time-intensive functions. Some observations:

- A lot of time is spent in `asyncio.tasks.wait` called from `asyncssh.sftp._SFTPFileReader.iter`. This kinda suggests to me that the requests are not actually fully overlapped, which would explain `max_requests` having almost no influence on the performance. Fixing that would maybe speed up the transfer by 2x, simply eyeballing the amount of time shown to be spent inside `asyncio.tasks.wait` in the trace.
- It is surprising that `asyncssh.compression._ZLibDecompress:decompress` via `asyncssh.connection.SSHClientConnection:_recv_packet` appears at all. I did not specify compression, and I am pretty sure that the default for SSH is no compression.
- Much of the remaining time is spent in `asyncssh.connection.SSHClientConnection:_recv_packet` -> `asyncssh.packet.SSHClientConnection:process_packet` -> `asyncssh.connection.SSHClientConnection:send_packet` and `asyncssh.sftp.SFTPClientHandler:read` -> ... -> `asyncssh.sftp.SFTPClientHandler:_send_request` -> ... -> `asyncssh.connection.SSHClientConnection:send_packet`, so simply sending and receiving / encrypting and decrypting network packets. I'd guess that overlapping that in two threads might yield another ~2x speedup. This would also be roughly consistent with my observation that removing threading from paramiko yielded half the speed. Although, I am still surprised that sending the assumedly very small requests would even be in the same ballpark as receiving the 48 KiB large replies... And in the paramiko trace, the requests did indeed take up only a very small portion of the time... Maybe I am interpreting the asyncssh / sshfs trace wrong, or there is some performance problem with the request forming.

Assuming these two speedups are achievable, the resulting eyeballed ~4x speedup would be surprisingly consistent with the observed speed in paramiko, but that might also be a coincidence, considering that paramiko without concurrent requests / prefetching yields only 4 MB/s.

The rest of the performance, compared with scp etc., might then "simply" be C vs. Python performance...
The parallelism of max_requests is really only needed when going over a network connection with moderate to high latency. If you are connecting to something like localhost where the latency is practically zero, you get the result you saw where a max_requests of 1 is a little less than 2x slower, and there's a slight hit with max_requests of 2 or 4, but by the time you get to max_requests of 8 or 16 it has pretty much done everything it can.
With a latency as small as 5 msec between the client and server, max_requests makes a much bigger difference. With a max_requests of 1 in this case, the performance is only around 2 MiB/sec. It ramps up from there, but doesn't hit full performance (~60-70 MiB/sec on my system here) until max_requests is at least 128. Above that, no additional gains are seen.
The requests are definitely overlapped, and you can see this if you turn on the AsyncSSH debug output. They're sent in batches, and the results are received in batches on low-latency systems, but there's much more interleaving as the latency goes up.
Regarding threads, keep in mind that AsyncSSH (and asyncio applications in general) generally do everything in a single thread. If you want to leverage multiple threads or processes, you pretty much need multiple event loops, and really even then you are at the mercy of the Python global interpreter lock, so you really need each event loop to be running in a separate process with little or no communication back to the initial process. I think that would be difficult to do in a context like fsspec, though. While it might be possible to push some compute-intensive stuff to other processes via an executor, the results would have to filter back through the main process for asyncio's cooperative multi-tasking to work properly, and the inter-process communication cost might offset some or all of the gain. This is one area where Paramiko has a potential advantage, though I would expect it to also be at the mercy of the Python global interpreter lock if it used threads.
Regarding compression, I agree that it's odd for that to show up, as AsyncSSH should definitely default to compression being disabled. You might want to check the debug output and make sure that's not getting turned on by fsspec. There was a time when it was enabled by default, and where it actually did speed things up because network connections were so bandwidth-limited (vs. compression making things CPU-limited).
Regarding pre-fetch, would your application ever read data from the same offset within a file more than once, and benefit from caching those previous read results, or are you just expecting the pre-fetch to help reduce the impact of latency by getting reads started a bit earlier? In the case of read() with no length, it's already effectively starting all the read requests it will ever need, so there'd really be no benefit to pre-fetching there if you don't cache read results to be returned multiple times.
Previously, you mentioned that your application can't easily use the read() with no length -- could you say a bit more about that? AsyncSSH should be able to do this even when you are seeking around in the file, and it should be very efficient at skipping over arbitrary amounts of data and starting a read with no length from anywhere.
Have you tried varying the block size? I didn't think that very large block sizes were supported in OpenSSH, so when I originally wrote the code I set the default size of an individual read to 16 KB. However, in a test I did just now, I was able to get OpenSSH to accept read requests up to 32 MiB, and saw some nice speed improvements:
```
64.000 MiB in 3.392 sec with block_size of 4096, 18.867 MiB/sec
64.000 MiB in 1.749 sec with block_size of 8192, 36.593 MiB/sec
64.000 MiB in 0.950 sec with block_size of 16384, 67.359 MiB/sec
64.000 MiB in 0.577 sec with block_size of 32768, 110.825 MiB/sec
64.000 MiB in 0.445 sec with block_size of 65536, 143.927 MiB/sec
64.000 MiB in 0.360 sec with block_size of 131072, 177.921 MiB/sec
64.000 MiB in 0.352 sec with block_size of 262144, 181.945 MiB/sec
64.000 MiB in 0.319 sec with block_size of 524288, 200.399 MiB/sec
64.000 MiB in 0.327 sec with block_size of 1048576, 195.450 MiB/sec
64.000 MiB in 0.304 sec with block_size of 2097152, 210.777 MiB/sec
64.000 MiB in 0.314 sec with block_size of 4194304, 204.142 MiB/sec
64.000 MiB in 0.301 sec with block_size of 8388608, 212.900 MiB/sec
64.000 MiB in 0.318 sec with block_size of 16777216, 200.971 MiB/sec
64.000 MiB in 0.348 sec with block_size of 33554432, 183.685 MiB/sec
```
The sweet spot appears to be around 8 MiB in my test here. This ended up being about 3x faster than using a 16 KB block size. Note that such large block sizes may not be supported on all SFTP servers, though, and there wasn't a good way to determine what maximum read size an SFTP server would support. Most servers would simply return a truncated result if you exceeded the limit.
> I would expect it to also be at the mercy of the Python global interpreter lock if it used threads.
If the main computation time is spent in C code, then the C code can also release the GIL to allow multi-threading to work. I do the same for the rapidgzip Python bindings, although it was quite a hassle and error-prone at first.
> Previously, you mentioned that your application can't easily use the read() with no length -- could you say a bit more about that?
I simply want to add support for SSH to ratarmount, i.e., something like `ratarmount ssh://server/path-to-archive.tar.gz /media/local-mountpoint`, and then access the contained files via the mount point, to avoid having users stack two different FUSE mounts (sshfs and ratarmount) to achieve the same. Accessing files in this mount point translates to seeks and reads inside the original TAR archive reachable via SSH/SFTP, and I'm never interested in reading the whole archive with a single unbounded read call.
> AsyncSSH should be able to do this even when you are seeking around in the file, and it should be very efficient at skipping over arbitrary amounts of data and starting a read with no length from anywhere.
Yes, that seemed to work just fine, and this is what I was doing in the linked fsspec/sshfs thread. The chart posted in this thread was created by modifying the code in the very first post in this issue and reads the chunks in order, while the benchmark in the linked fsspec/sshfs thread shuffles the chunk indices and therefore calls read(chunk_size) and then seeks to a random different offset before reading the next chunk. It is not sequential reading. This is intended to simulate accesses to files via the ratarmount mount point, which are then downloaded from subranges inside the archive file.
> Have you tried varying the block size?
Only briefly. I began with a double for loop but then got side-tracked benchmarking `max_requests` because I did not see any change, even when varying the `block_size`. It seems that there are again only significant effects for almost pathologically small block sizes:
```
[SFTPClientFile.read] self._block_size: 1024, self._max_requests: 128
[fsspec/sshfs] Block size: 1 KiB Read 68238807 in 17.78 s -> 3.84 MB/s
[SFTPClientFile.read] self._block_size: 4096, self._max_requests: 128
[fsspec/sshfs] Block size: 4 KiB Read 68238807 in 4.51 s -> 15.13 MB/s
[SFTPClientFile.read] self._block_size: 8192, self._max_requests: 128
[fsspec/sshfs] Block size: 8 KiB Read 68238807 in 2.23 s -> 30.64 MB/s
[SFTPClientFile.read] self._block_size: 16384, self._max_requests: 128
[fsspec/sshfs] Block size: 16 KiB Read 68238807 in 2.10 s -> 32.46 MB/s
[SFTPClientFile.read] self._block_size: 32768, self._max_requests: 128
[fsspec/sshfs] Block size: 32 KiB Read 68238807 in 2.11 s -> 32.40 MB/s
[SFTPClientFile.read] self._block_size: 131072, self._max_requests: 128
[fsspec/sshfs] Block size: 128 KiB Read 68238807 in 2.07 s -> 32.96 MB/s
[SFTPClientFile.read] self._block_size: 524288, self._max_requests: 128
[fsspec/sshfs] Block size: 512 KiB Read 68238807 in 2.05 s -> 33.28 MB/s
[SFTPClientFile.read] self._block_size: 2097152, self._max_requests: 128
[fsspec/sshfs] Block size: 2048 KiB Read 68238807 in 2.05 s -> 33.24 MB/s
```
> I set the default size of an individual read to 16 KB.
Note that fsspec/sshfs overrides these defaults to be 48 KiB (64 KiB minus a "safety" of 16 KiB). I also saw the size limitations being mentioned there, which made me wary of varying them. Considering the speed improvements you observed, I was wondering whether scp and the other fast tools maybe use higher block sizes, maybe even choosing the block size based on the greeting / identification information that might get sent during the connection setup. The manual of sftp mentions a default buffer size of 32 KiB (option `-B`) and 64 outstanding requests (option `-R`), which can also be found in the source code. However, there does indeed seem to be some limit-querying logic to adjust the download request size. Support for this limit querying seems to be checked during the connection initialization. Running `scp -vvv` does not display these limits, so I may have a different version or implementation, or the server (OpenSSH 8.9) does not support it. But there seems to be some dynamic block size increase / trial-and-error going on:
```
[...]
debug2: fd 3 setting TCP_NODELAY
debug3: set_sock_tos: set socket 3 IP_TOS 0x08
debug2: client_session2_setup: id 0
[...]
debug1: Sending command: scp -v -f /dev/shm/sftp_shared/silesia.tar.gz
[...]
debug2: exec request accepted on channel 0
debug2: channel 0: rcvd ext data 39
scp: debug1: fd 3 clearing O_NONBLOCK
[...]
scp: debug2: fd 7 setting O_NONBLOCK
debug2: channel 0: window 1998729 sent adjust 98423
debug2: channel 0: window 1900544 sent adjust 196608
debug2: channel 0: window 1966080 sent adjust 131072
debug2: channel 0: window 1966080 sent adjust 131072
debug2: channel 0: window 1966080 sent adjust 131072
debug2: channel 0: window 1966080 sent adjust 131072
debug2: channel 0: window 1966080 sent adjust 131072
debug2: channel 0: window 1966080 sent adjust 131072
[...]
debug2: channel 0: window 1966080 sent adjust 131072
debug2: channel 0: rcvd ext data 39
scp: debug1: fd 1 clearing O_NONBLOCK
debug2: channel 0: written 39 to efd 6
scp: debug1: fd 7 clearing O_NONBLOCK
[...]
Transferred: sent 13676, received 68283588 bytes, in 0.3 seconds
Bytes per second: sent 49347.4, received 246389150.9
debug1: Exit status 0
```
It first tries with ~98 kB, then 192 KiB, which probably fails, and then settles to 128 KiB. I included the lines with `NODELAY` and `NONBLOCK` because they looked a bit like performance optimizations.
Looking more closely at the results, it appears that OpenSSH's maximum read size is 261120 bytes (256 KiB minus 1 KiB of overhead for the SFTP message headers). So, all my tests with a larger read size than that were being compensated for in my parallel I/O code, effectively scheduling additional out-of-order reads when the code tried to read more than the server could return in a single request. I think this interacted with my fix to prevent reads past EOF, so the performance improvement I saw may not be accurate. I'm redoing the tests keeping that in mind, and may also need to change that read fix.
I'm also looking at implementing the OpenSSH 'limits' request. I had run across it before, but had kind of forgotten about it until you mentioned it. My goal is to have the code default to whatever max the server supports for read & write lengths, if the server supports this extension. Callers could still override this by setting `block_size`, but if they don't, AsyncSSH should automatically pick values which maximize performance.
I'll let you know when I have something working here.
Ok - I've implemented the OpenSSH "limits" request, and I am seeing some nice improvements when no block size is set. Even with 128 parallel requests, small block sizes really kill the throughput. So, increasing from my previous default of 16 KiB to OpenSSH's default of 255 KiB is a big jump:
```
64.000 MiB in 52.129 sec with read len of 256, 1.228 MiB/sec
64.000 MiB in 26.371 sec with read len of 512, 2.427 MiB/sec
64.000 MiB in 13.381 sec with read len of 1024, 4.783 MiB/sec
64.000 MiB in 6.826 sec with read len of 2048, 9.376 MiB/sec
64.000 MiB in 3.529 sec with read len of 4096, 18.133 MiB/sec
64.000 MiB in 1.866 sec with read len of 8192, 34.304 MiB/sec
64.000 MiB in 1.020 sec with read len of 16384, 62.744 MiB/sec
64.000 MiB in 0.615 sec with read len of 32768, 104.150 MiB/sec
64.000 MiB in 0.471 sec with read len of 65536, 136.012 MiB/sec
64.000 MiB in 0.382 sec with read len of 131072, 167.341 MiB/sec
64.000 MiB in 0.377 sec with read len of 262144, 169.610 MiB/sec
64.000 MiB in 0.343 sec with read len of 524288, 186.543 MiB/sec
64.000 MiB in 0.325 sec with read len of 1048576, 196.692 MiB/sec
64.000 MiB in 0.331 sec with read len of 2097152, 193.196 MiB/sec
64.000 MiB in 0.326 sec with read len of 4194304, 196.500 MiB/sec
64.000 MiB in 0.308 sec with default read len, 207.740 MiB/sec
```
The last row here is with the OpenSSH max of 261120 bytes (255 KiB), picked up by making a "limits" request. This actually gives slightly better results than the larger read lengths, as the larger reads end up getting broken down into multiple 255 KiB requests when the server fails to respond with the full amount of data requested.
Since there was no over-reading in these tests, the performance was about the same with or without the one-line fix I previously proposed to reduce the cost of over-reading. Fixing that properly will be a bit more complicated due to potential out-of-order read requests, though. These can be triggered when a client exceeds the server's max read length, but it shouldn't be too difficult to handle. I just need to keep track of the offset at the time EOF was returned and only allow reads earlier than that, rather than stopping all reads after EOF is seen.
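The per-offset EOF tracking described here could be sketched like this (hypothetical names, not the actual asyncssh code):

```python
class EOFTracker:
    """Remember the lowest offset at which EOF was returned. Reads starting
    at or past that offset can be skipped, while earlier out-of-order reads
    (e.g. re-reads below a server's max read length) still proceed."""

    def __init__(self):
        self.eof_offset = None

    def saw_eof(self, offset):
        # Keep the smallest offset where EOF was observed.
        if self.eof_offset is None or offset < self.eof_offset:
            self.eof_offset = offset

    def may_schedule(self, offset):
        # Allow reads only strictly before the known EOF position.
        return self.eof_offset is None or offset < self.eof_offset
```

For example, after seeing EOF at 64 MiB, a read starting at 32 MiB would still be scheduled, but one starting at 65 MiB would not.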
I still need to do some testing and cleanup before checking this in, but it's looking promising.
Regarding SCP, what you're seeing there is not a change in request sizes. With SCP, you request a file and the server just sends the entire contents of the file to you in a single streamed response. There are multiple writes happening, but those writes are all coming from the server, without any additional requests being sent by the client. As such, the dynamics are different than with SFTP. What you're seeing is the result of the stream flow-control messages in SSH. By default, the window size is 2 MiB. The client periodically acknowledges that it has read the data as it processes it, and this opens up the window for the server to send more. If a client ever stops reading or can't keep up with the rate data is being sent, the window will eventually fill up and the server will block until the client can catch up. That's not happening here, though -- the adjustments are being made long before the window has filled up, so the server is basically pumping data as fast as it can.
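The adjust values in the earlier scp debug log bear this out: in each "window N sent adjust M" line, N + M equals exactly the 2 MiB window, i.e., the client is refilling the flow-control window, not changing request sizes.

```python
WINDOW = 2 * 1024 * 1024  # 2097152: the default SSH channel window size

# (remaining window, adjust sent) pairs taken from the debug log above
pairs = [(1998729, 98423), (1900544, 196608), (1966080, 131072)]

for remaining, adjust in pairs:
    # Each adjustment tops the window back up to the full 2 MiB.
    assert remaining + adjust == WINDOW
print("all adjusts refill the window to 2 MiB")
```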
I've just checked in support for the OpenSSH "limits" extension in commit 125ea08 in the "develop" branch.
When acting as a client, AsyncSSH will default its read and write lengths to match the maximum value advertised by servers which support this extension, falling back to "safe" values of 16 KiB for both read & write (the previous default) when the extension is not supported.
When acting as a server, AsyncSSH will advertise a maximum read and write size of 4 MiB, and a packet size large enough to hold writes of that size.
As before, you can set `block_size` to pick different values, but generally speaking you shouldn't need to do that.
The test numbers are still pretty much what you see above (about 200 MiB/second on my system) when running against OpenSSH with default values for block size and a max requests of 128 on a 64 MiB file. In this case, it is using the OpenSSH max of 255 KiB for the block size. I got similar speed for both a read of the entire file and a read where I skipped past the first 32 MiB and read the second 32 MiB (accomplished in half the time of the full file read). Even smaller reads like 4 MiB in the middle of the file got close to this same speed.
I am seeing a bit of weirdness when the total file size is smaller than the read I'm attempting. I need to do a little more digging into that -- there may be a bug there. In the meantime, though, I'm hoping you have enough in this commit to do some performance testing of your own.
Found it - the problem happened only when the application-specified `block_size` was set larger than both the server-supported maximum read length and the total size of the `read()` call. In this case, the code disabled the parallel I/O, thinking it could do a single read(), but that didn't handle partial read results properly.
The fix makes sure to keep parallel I/O enabled to deal with partial read results unless either `block_size` is set to 0 or the read size is below the server-supported max read length.
The fix is in commit 1f9c8ab in the "develop" branch.
I've also done some testing with the latest code on the over-read case. I was able to keep the original fix there after all, though I did end up fixing another unrelated issue when a file changes size in the middle of a transfer as I did the testing.
There is a small performance drop as it has to wait for its initial burst of 128 requests to complete, but the hit remains small even with a massive over-read.
Thank you for implementing that!
I tried the develop branch, and the block size detection seems to work (after ensuring that fsspec/sshfs does not specify a hardcoded default), but just as with my manual block size variation benchmarks before, the bandwidth is not affected very much:
```
[_SFTPParallelIO.iter] self._block_size: 261120, self._max_requests: 1
[fsspec/sshfs] Max requests: 1 Read 32997376 in 1.25 s -> 26.38 MB/s
[_SFTPParallelIO.iter] self._block_size: 261120, self._max_requests: 2
[fsspec/sshfs] Max requests: 2 Read 32997376 in 0.99 s -> 33.27 MB/s
[_SFTPParallelIO.iter] self._block_size: 261120, self._max_requests: 4
[fsspec/sshfs] Max requests: 4 Read 32997376 in 0.96 s -> 34.54 MB/s
[_SFTPParallelIO.iter] self._block_size: 261120, self._max_requests: 16
[fsspec/sshfs] Max requests: 16 Read 32997376 in 0.94 s -> 35.11 MB/s
[_SFTPParallelIO.iter] self._block_size: 261120, self._max_requests: 64
[fsspec/sshfs] Max requests: 64 Read 32997376 in 0.94 s -> 35.18 MB/s
[_SFTPParallelIO.iter] self._block_size: 261120, self._max_requests: 128
[fsspec/sshfs] Max requests: 128 Read 32997376 in 0.96 s -> 34.45 MB/s
[_SFTPParallelIO.iter] self._block_size: 261120, self._max_requests: 256
[fsspec/sshfs] Max requests: 256 Read 32997376 in 0.97 s -> 33.94 MB/s
[_SFTPParallelIO.iter] self._block_size: 261120, self._max_requests: 512
[fsspec/sshfs] Max requests: 512 Read 32997376 in 0.94 s -> 34.94 MB/s
[_SFTPParallelIO.iter] self._block_size: 261120, self._max_requests: 1024
[fsspec/sshfs] Max requests: 1024 Read 32997376 in 0.93 s -> 35.30 MB/s
[_SFTPParallelIO.iter] self._block_size: 261120, self._max_requests: 4096
[fsspec/sshfs] Max requests: 4096 Read 32997376 in 0.95 s -> 34.89 MB/s
```
Compared with my earlier benchmarks here, which are still reproducible on my system, it can be 1-2 MB/s faster; although measurable, that is not as much as you are observing. I am testing with OpenSSH as the server.
Yeah - it seems like you're running into some other bottleneck, perhaps somewhere in the fsspec code.
Also, at one point you mentioned compression showing up in the profiling. Did you ever get to the bottom of that? Could fsspec be trying to enable compression? With compression enabled, I see very little benefit from the parallelism, and a substantial performance drop, to around 45 MiB/sec:
64.000 MiB in 1.484 sec with read len of 16384, 43.131 MiB/sec
64.000 MiB in 1.441 sec with read len of 32768, 44.401 MiB/sec
64.000 MiB in 1.444 sec with read len of 65536, 44.327 MiB/sec
64.000 MiB in 1.416 sec with read len of 131072, 45.196 MiB/sec
64.000 MiB in 1.428 sec with read len of 262144, 44.833 MiB/sec
64.000 MiB in 1.416 sec with read len of 524288, 45.187 MiB/sec
64.000 MiB in 1.411 sec with read len of 1048576, 45.359 MiB/sec
64.000 MiB in 1.412 sec with read len of 2097152, 45.336 MiB/sec
64.000 MiB in 1.420 sec with read len of 4194304, 45.073 MiB/sec
64.000 MiB in 1.412 sec with default read len, 45.334 MiB/sec
Also, do you know what size reads that code is doing? For the AsyncSSH parallel I/O to be helpful, you need very large read sizes. Otherwise, the amount of parallelism drops to the point where it doesn't buy you much, unless parallel reads are scheduled by the application code. With OpenSSH's 255 KiB max read length and 128 requests, each read call would need to be at least around 32 MiB in size to get the full parallelism. You'd probably get some benefit with smaller sizes, but I'm guessing you'd need the reads to be on the order of 4 MiB or more for it to provide any significant benefit.
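That ~32 MiB figure is just the product of the two limits described above; a quick sketch of the arithmetic:

```python
# Minimum read size needed to keep every parallel SFTP request busy,
# given the server's max read length and the client's request limit.
def min_read_for_full_parallelism(max_read_len: int, max_requests: int) -> int:
    return max_read_len * max_requests

MAX_READ_LEN = 255 * 1024   # OpenSSH's SFTP max read length (255 KiB)
MAX_REQUESTS = 128          # AsyncSSH's default parallel request limit

needed = min_read_for_full_parallelism(MAX_READ_LEN, MAX_REQUESTS)
print(f"{needed / 2**20:.1f} MiB")  # -> 31.9 MiB, i.e. the ~32 MiB mentioned above
```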
Regarding testing, I am also testing against OpenSSH here. The client is running on macOS and the server is either localhost on that same system or in some cases a Linux VM running on that system. In some cases, I added some forced delays to simulate higher network latency.
Also, at one point you mentioned compression showing up in the profiling. Did you ever get to the bottom of that?
No, thank you for reminding me again!
After I added Compression No to my ~/.ssh/config for that particular connection, I got these results:
[_SFTPParallelIO.iter] self._block_size: 261120, self._max_requests: 1
[fsspec/sshfs] Max requests: 1 Read 68238807 in 0.66 s -> 102.63 MB/s
[_SFTPParallelIO.iter] self._block_size: 261120, self._max_requests: 2
[fsspec/sshfs] Max requests: 2 Read 68238807 in 0.38 s -> 177.32 MB/s
[_SFTPParallelIO.iter] self._block_size: 261120, self._max_requests: 4
[fsspec/sshfs] Max requests: 4 Read 68238807 in 0.36 s -> 190.73 MB/s
[_SFTPParallelIO.iter] self._block_size: 261120, self._max_requests: 16
[fsspec/sshfs] Max requests: 16 Read 68238807 in 0.36 s -> 192.13 MB/s
[_SFTPParallelIO.iter] self._block_size: 261120, self._max_requests: 64
[fsspec/sshfs] Max requests: 64 Read 68238807 in 0.35 s -> 193.29 MB/s
[_SFTPParallelIO.iter] self._block_size: 261120, self._max_requests: 128
[fsspec/sshfs] Max requests: 128 Read 68238807 in 0.36 s -> 191.36 MB/s
[_SFTPParallelIO.iter] self._block_size: 261120, self._max_requests: 256
[fsspec/sshfs] Max requests: 256 Read 68238807 in 0.36 s -> 188.36 MB/s
[_SFTPParallelIO.iter] self._block_size: 261120, self._max_requests: 512
[fsspec/sshfs] Max requests: 512 Read 68238807 in 0.35 s -> 192.61 MB/s
[_SFTPParallelIO.iter] self._block_size: 261120, self._max_requests: 1024
[fsspec/sshfs] Max requests: 1024 Read 68238807 in 0.36 s -> 191.75 MB/s
[_SFTPParallelIO.iter] self._block_size: 261120, self._max_requests: 4096
[fsspec/sshfs] Max requests: 4096 Read 68238807 in 0.36 s -> 188.78 MB/s
As can be seen, this benchmark includes the patch for querying the block size limit and using it as the default.
Varying block sizes:
[_SFTPParallelIO.iter] self._block_size: 1024, self._max_requests: 128
[fsspec/sshfs] Block size: 1 KiB Read 68238807 in 16.89 s -> 4.04 MB/s
[_SFTPParallelIO.iter] self._block_size: 2048, self._max_requests: 128
[fsspec/sshfs] Block size: 2 KiB Read 68238807 in 8.66 s -> 7.88 MB/s
[_SFTPParallelIO.iter] self._block_size: 4096, self._max_requests: 128
[fsspec/sshfs] Block size: 4 KiB Read 68238807 in 4.38 s -> 15.57 MB/s
[_SFTPParallelIO.iter] self._block_size: 8192, self._max_requests: 128
[fsspec/sshfs] Block size: 8 KiB Read 68238807 in 2.32 s -> 29.45 MB/s
[_SFTPParallelIO.iter] self._block_size: 16384, self._max_requests: 128
[fsspec/sshfs] Block size: 16 KiB Read 68238807 in 1.19 s -> 57.17 MB/s
[_SFTPParallelIO.iter] self._block_size: 65536, self._max_requests: 128
[fsspec/sshfs] Block size: 64 KiB Read 68238807 in 0.50 s -> 135.45 MB/s
[_SFTPParallelIO.iter] self._block_size: 131072, self._max_requests: 128
[fsspec/sshfs] Block size: 128 KiB Read 68238807 in 0.40 s -> 170.95 MB/s
[_SFTPParallelIO.iter] self._block_size: 262144, self._max_requests: 128
[fsspec/sshfs] Block size: 256 KiB Read 68238807 in 0.41 s -> 168.15 MB/s
[_SFTPParallelIO.iter] self._block_size: 524288, self._max_requests: 128
[fsspec/sshfs] Block size: 512 KiB Read 68238807 in 0.39 s -> 176.18 MB/s
[_SFTPParallelIO.iter] self._block_size: 1048576, self._max_requests: 128
[fsspec/sshfs] Block size: 1024 KiB Read 68238807 in 0.38 s -> 181.49 MB/s
[_SFTPParallelIO.iter] self._block_size: 4194304, self._max_requests: 128
[fsspec/sshfs] Block size: 4096 KiB Read 68238807 in 0.40 s -> 171.86 MB/s
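The near-linear scaling at small block sizes fits a crude latency model in which 128 requests are in flight per round trip and bandwidth is ignored. A sketch, fitting the per-batch overhead from the 1 KiB data point above and predicting the 2-8 KiB points:

```python
import math

# Rough model: time ~= batches * per_batch_overhead, where a "batch" is
# max_requests reads in flight at once. Bandwidth is deliberately ignored,
# so the model should only hold while small blocks dominate.
SIZE = 68238807        # bytes read in the benchmark above
MAX_REQUESTS = 128

def batches(block_size: int) -> int:
    return math.ceil(math.ceil(SIZE / block_size) / MAX_REQUESTS)

overhead = 16.89 / batches(1024)   # seconds per batch, from the 1 KiB run
for bs, measured in [(2048, 8.66), (4096, 4.38), (8192, 2.32)]:
    predicted = batches(bs) * overhead
    print(f"{bs // 1024} KiB: predicted {predicted:.2f} s, measured {measured} s")
```

The predictions land within a few percent of the measurements up to 8 KiB, which supports the idea that per-request round trips, not bandwidth, dominate in that regime.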
Decompression on the client is probably less of a bottleneck than the compression on the server.
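Whether server-side zlib is the ceiling can be sanity-checked locally. This sketch times Python's stdlib zlib at level 6, the level sshd reports enabling; absolute numbers are machine-dependent and a real payload such as a .tar.gz would compress worse:

```python
import os
import time
import zlib

# Semi-compressible test data: each 1 KiB unit is half random (incompressible)
# and half zeros (highly compressible), without repeating units.
data = b"".join(os.urandom(512) + bytes(512) for _ in range(16 * 1024))  # 16 MiB

t0 = time.time()
compressed = zlib.compress(data, 6)   # sshd logs "Enabling compression at level 6"
elapsed = time.time() - t0
print(f"zlib level 6: {len(data) / elapsed / 1e6:.0f} MB/s, "
      f"ratio {len(compressed) / len(data):.2f}")
```

If the printed throughput is in the low hundreds of MB/s, single-stream compression on the server can plausibly cap an otherwise-fast SFTP transfer.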
The next question now becomes: where does the compression get enabled? Checking the stack:
- /etc/ssh/sshd_config has a Compression option. My servers do not have a value configured for it. The default is delayed (a legacy synonym for yes) or yes in newer versions.
- ~/.ssh/config can also specify a Compression option, which I did not configure either. Therefore, the default used should be no.
- scp does not use compression. If compression is forced with Compression Yes in the SSH client config, the verbose output contains a line such as: debug1: Enabling compression at level 6.
import asyncio
import logging
import sys
import time

import asyncssh

logging.basicConfig(filename="asyncssh.log", level=logging.DEBUG)
asyncssh.set_debug_level(2)

async def run_client():
    # hostname as defined elsewhere in the benchmark setup
    async with asyncssh.connect(hostname, 22) as conn:
        async with conn.start_sftp_client() as sftp:
            file = await sftp.open("silesia.tar.gz", "rb")
            t0 = time.time()
            data = await file.read()
            size = len(data)
            t1 = time.time()
            print(f"Read {size / 1e6:.2f} MB in {t1 - t0:.2f} s -> {size / (t1 - t0) / 1e6:.2f} MB/s")

try:
    asyncio.run(run_client())
except (OSError, asyncssh.Error) as exc:
    sys.exit('SSH connection failed: ' + str(exc))
I ran this simple example while adjusting the Compression option inside ~/.ssh/config.
Compression Yes
[SSHConfig.get_compression_algs] compression: True -> zlib@openssh.com,zlib,none
[SSHConnectionOptions.prepare] self.compression_algs: [b'zlib@openssh.com', b'zlib', b'none']
[_ZLibDecompress.__init__]
[_SFTPParallelIO.iter] self._block_size: 16384, self._max_requests: 128
Read 68.23 MB in 2.12 s -> 32.18 MB/s
Check the sshd logs with journalctl -xet sshd | grep -i compr:
sshd: debug1: kex: client->server cipher: chacha20-poly1305@openssh.com MAC: <implicit> compression: zlib@openssh.com [preauth]
sshd: debug1: kex: server->client cipher: chacha20-poly1305@openssh.com MAC: <implicit> compression: zlib@openssh.com [preauth]
sshd: debug1: Enabling compression at level 6. [preauth]
sshd: debug1: Enabling compression at level 6.
Checking the asyncssh log output with grep -n -C 3 -i compr asyncssh.log:
9-DEBUG:asyncssh:[conn=0] Requesting key exchange
14:DEBUG:asyncssh:[conn=0] Compression algs: zlib@openssh.com,zlib,none
15-DEBUG:asyncssh:[conn=0] Received key exchange request
18-DEBUG:asyncssh:[conn=0] Client to server:
21:DEBUG:asyncssh:[conn=0] Compression algs: none,zlib@openssh.com
22-DEBUG:asyncssh:[conn=0] Server to client:
25:DEBUG:asyncssh:[conn=0] Compression algs: none,zlib@openssh.com
26-DEBUG:asyncssh:[conn=0] Beginning key exchange
28-DEBUG:asyncssh:[conn=0] Client to server:
31:DEBUG:asyncssh:[conn=0] Compression alg: zlib@openssh.com
32-DEBUG:asyncssh:[conn=0] Server to client:
35:DEBUG:asyncssh:[conn=0] Compression alg: zlib@openssh.com
Compression No
[SSHConfig.get_compression_algs] compression: False -> none,zlib@openssh.com,zlib
[SSHConnectionOptions.prepare] self.compression_algs: [b'none', b'zlib@openssh.com', b'zlib']
[_SFTPParallelIO.iter] self._block_size: 16384, self._max_requests: 128
Read 68.23 MB in 2.12 s -> 32.16 MB/s
Check the sshd logs with journalctl -xet sshd | grep -i compr:
sshd: debug1: kex: client->server cipher: chacha20-poly1305@openssh.com MAC: <implicit> compression: none [preauth]
sshd: debug1: kex: server->client cipher: chacha20-poly1305@openssh.com MAC: <implicit> compression: none [preauth]
Checking the asyncssh log output with grep -n -C 3 -i compr asyncssh.log:
9-DEBUG:asyncssh:[conn=0] Requesting key exchange
14:DEBUG:asyncssh:[conn=0] Compression algs: none,zlib@openssh.com,zlib
15-DEBUG:asyncssh:[conn=0] Received key exchange request
18-DEBUG:asyncssh:[conn=0] Client to server:
21:DEBUG:asyncssh:[conn=0] Compression algs: none,zlib@openssh.com
22-DEBUG:asyncssh:[conn=0] Server to client:
25:DEBUG:asyncssh:[conn=0] Compression algs: none,zlib@openssh.com
26-DEBUG:asyncssh:[conn=0] Beginning key exchange
28-DEBUG:asyncssh:[conn=0] Client to server:
31:DEBUG:asyncssh:[conn=0] Compression alg: none
32-DEBUG:asyncssh:[conn=0] Server to client:
35:DEBUG:asyncssh:[conn=0] Compression alg: none
Compression not specified (default)
[SSHConfig.get_compression_algs] compression: None -> ()
[SSHConnectionOptions.prepare] self.compression_algs: [b'zlib@openssh.com', b'none']
[_ZLibDecompress.__init__]
[_SFTPParallelIO.iter] self._block_size: 16384, self._max_requests: 128
Read 68.23 MB in 2.20 s -> 30.95 MB/s
Check the sshd logs with journalctl -xet sshd | grep -i compr:
sshd: debug1: kex: client->server cipher: chacha20-poly1305@openssh.com MAC: <implicit> compression: zlib@openssh.com [preauth]
sshd: debug1: kex: server->client cipher: chacha20-poly1305@openssh.com MAC: <implicit> compression: zlib@openssh.com [preauth]
sshd: debug1: Enabling compression at level 6. [preauth]
sshd: debug1: Enabling compression at level 6.
Checking the asyncssh log output with grep -n -C 3 -i compr asyncssh.log:
9-DEBUG:asyncssh:[conn=0] Requesting key exchange
14:DEBUG:asyncssh:[conn=0] Compression algs: zlib@openssh.com,none
15-DEBUG:asyncssh:[conn=0] Received key exchange request
18-DEBUG:asyncssh:[conn=0] Client to server:
21:DEBUG:asyncssh:[conn=0] Compression algs: none,zlib@openssh.com
22-DEBUG:asyncssh:[conn=0] Server to client:
25:DEBUG:asyncssh:[conn=0] Compression algs: none,zlib@openssh.com
26-DEBUG:asyncssh:[conn=0] Beginning key exchange
28-DEBUG:asyncssh:[conn=0] Client to server:
31:DEBUG:asyncssh:[conn=0] Compression alg: zlib@openssh.com
32-DEBUG:asyncssh:[conn=0] Server to client:
35:DEBUG:asyncssh:[conn=0] Compression alg: zlib@openssh.com
Applying this patch fixes the default behavior for me. I don't know whether it is the best fix. It is complicated by the fact that the server advertises compression as enabled by default, while the client should request no compression by default.
--- /home/user/.local/lib/python3.10/site-packages/asyncssh/connection.py 2024-09-30 21:34:23.470034470 +0200
+++ /home/user/.local/lib/python3.10/site-packages/asyncssh/connection.py 2024-09-30 21:34:28.986055153 +0200
@@ -622,8 +622,7 @@
def _validate_algs(config: SSHConfig, kex_algs_arg: _AlgsArg,
enc_algs_arg: _AlgsArg, mac_algs_arg: _AlgsArg,
- cmp_algs_arg: _AlgsArg, sig_algs_arg: _AlgsArg,
- allow_x509: bool) -> \
+ sig_algs_arg: _AlgsArg, allow_x509: bool) -> \
Tuple[Sequence[bytes], Sequence[bytes], Sequence[bytes],
Sequence[bytes], Sequence[bytes]]:
"""Validate requested algorithms"""
@@ -638,10 +637,6 @@
mac_algs = _select_algs('MAC', mac_algs_arg,
cast(_AlgsArg, config.get('MACs', ())),
get_mac_algs(), get_default_mac_algs())
- cmp_algs = _select_algs('compression', cmp_algs_arg,
- cast(_AlgsArg, config.get_compression_algs()),
- get_compression_algs(),
- get_default_compression_algs(), b'none')
allowed_sig_algs = get_x509_certificate_algs() if allow_x509 else []
allowed_sig_algs = allowed_sig_algs + get_public_key_algs()
@@ -654,7 +649,7 @@
config.get('CASignatureAlgorithms', ())),
allowed_sig_algs, default_sig_algs)
- return kex_algs, enc_algs, mac_algs, cmp_algs, sig_algs
+ return kex_algs, enc_algs, mac_algs, sig_algs
class SSHAcceptor:
@@ -7139,7 +7134,7 @@
proxy_command: DefTuple[_ProxyCommand], family: DefTuple[int],
local_addr: DefTuple[HostPort], tcp_keepalive: DefTuple[bool],
kex_algs: _AlgsArg, encryption_algs: _AlgsArg,
- mac_algs: _AlgsArg, compression_algs: _AlgsArg,
+ mac_algs: _AlgsArg,
signature_algs: _AlgsArg, host_based_auth: _AuthArg,
public_key_auth: _AuthArg, kbdint_auth: _AuthArg,
password_auth: _AuthArg, x509_trusted_certs: CertListArg,
@@ -7184,10 +7179,9 @@
else config.get('TCPKeepAlive', True))
self.kex_algs, self.encryption_algs, self.mac_algs, \
- self.compression_algs, self.signature_algs = \
+ self.signature_algs = \
_validate_algs(config, kex_algs, encryption_algs, mac_algs,
- compression_algs, signature_algs,
- x509_trusted_certs is not None)
+ signature_algs, x509_trusted_certs is not None)
self.host_based_auth = \
cast(bool, host_based_auth if host_based_auth != () else
@@ -7900,10 +7894,15 @@
keepalive_count_max: int
+ self.compression_algs = _select_algs(
+ 'compression', compression_algs,
+ cast(_AlgsArg, config.get_compression_algs()),
+ get_compression_algs(), [b'none'], b'none')
+
super().prepare(config, client_factory or SSHClient, client_version,
host, port, tunnel, passphrase, proxy_command, family,
local_addr, tcp_keepalive, kex_algs, encryption_algs,
- mac_algs, compression_algs, signature_algs,
+ mac_algs, signature_algs,
host_based_auth, public_key_auth, kbdint_auth,
password_auth, x509_trusted_certs,
x509_trusted_cert_paths, x509_purposes, rekey_bytes,
@@ -8531,10 +8530,16 @@
keepalive_count_max: int
+ self.compression_algs = _select_algs(
+ 'compression', compression_algs,
+ cast(_AlgsArg, config.get_compression_algs()),
+ get_compression_algs(),
+ get_default_compression_algs(), b'none')
+
super().prepare(config, server_factory or SSHServer, server_version,
host, port, tunnel, passphrase, proxy_command, family,
local_addr, tcp_keepalive, kex_algs, encryption_algs,
- mac_algs, compression_algs, signature_algs,
+ mac_algs, signature_algs,
host_based_auth, public_key_auth, kbdint_auth,
password_auth, x509_trusted_certs,
x509_trusted_cert_paths, x509_purposes,
Now, without specifying any compression option, it will not use compression, making it consistent with the five other tools I am comparing against.
Hmm, you're right - I thought at some point I had changed the default for compression to off, and indeed it was off by default in all of my testing. However, I see now that I have "Compression no" in my .ssh/config file, and that's why that was the case.
I think it's reasonable to change the default here. That wouldn't impact any code which explicitly enables or disables compression, via either the OpenSSH config file or the compression_algs argument. Also, OpenSSH actually expects the default to be "no" here, so anyone really wanting compression would need to have requested it explicitly in their config, in which case AsyncSSH would pick that up if it was done through the config file.
The change should be a lot simpler than what you suggested, as defaults can be set when algorithms are registered. The following should work:
diff --git a/asyncssh/compression.py b/asyncssh/compression.py
index f014760..1a3a84b 100644
--- a/asyncssh/compression.py
+++ b/asyncssh/compression.py
@@ -149,9 +149,9 @@ def get_decompressor(alg: bytes) -> Optional[Decompressor]:
return _cmp_decompressors[alg]()
+register_compression_alg(b'none',
+ _none, _none, False, True)
register_compression_alg(b'zlib@openssh.com',
_ZLibCompress, _ZLibDecompress, True, True)
register_compression_alg(b'zlib',
_ZLibCompress, _ZLibDecompress, False, False)
-register_compression_alg(b'none',
- _none, _none, False, True)
By moving up the 'none' entry, it will now come first in the list of default algorithms and become the preferred value when nothing is specified.
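The order-dependence is easy to illustrate with a toy registry (a simplified sketch; the real register_compression_alg also takes compressor/decompressor classes and an after-auth flag, as the diff above shows):

```python
# Toy algorithm registry where registration order decides preference:
# the first entry registered with default=True becomes the preferred default.
_registry: list[tuple[bytes, bool]] = []   # (name, is_default)

def register_compression_alg(name: bytes, default: bool) -> None:
    _registry.append((name, default))

def get_default_compression_algs() -> list[bytes]:
    return [name for name, default in _registry if default]

# Registering b'none' first makes it the preferred (first) default:
register_compression_alg(b'none', True)
register_compression_alg(b'zlib@openssh.com', True)
register_compression_alg(b'zlib', False)   # not offered by default at all

print(get_default_compression_algs())  # -> [b'none', b'zlib@openssh.com']
```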
Ah, you are right. This patch works just as well for me. I didn't think about changing the order.
I think it's reasonable to change the default here. That wouldn't impact any code which explicitly enables or disables compression, via either the OpenSSH config file or the compression_algs argument. Also, OpenSSH actually expects the default to be "no" here, so anyone really wanting compression would need to have requested it explicitly in their config, in which case AsyncSSH would pick that up if it was done through the config file.
I completely agree on all points.
This change is now available as commit 46636d6 in the "develop" branch.
Thank you so much for all your work!
This fix is now available in AsyncSSH 2.18.0.
This may arguably be a questionable use case, but I noticed that read(1 << 30) (trying to read 1 GiB from a 60 MB file) can be ~3x slower than read(-1). I would have expected both to be equally fast. Consider this benchmark:
Output:
So, trying to read 2 GiB, but only getting 60 MB, takes ~25 s, while it only takes 2 s with read(-1). I do not know whether this is an issue with this wrapper or with asyncssh directly, because I was unable to adjust my (synchronously running) benchmark to use asyncssh directly. I would guess that some code tries to do something O(size) for whatever reason, even though it should iterate in chunks according to the buffer size. I don't see memory usage spiking, so at least nothing that large seems to get allocated.
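For scale, the over-scheduling explanation can be put into numbers (a sketch; 255 KiB, i.e. 261120 bytes, matches the self._block_size values in the logs above):

```python
import math

BLOCK = 255 * 1024        # 261120 bytes, OpenSSH's max SFTP read length
FILE_SIZE = 60_000_000    # the ~60 MB file

def scheduled_reads(requested_size: int) -> int:
    # How many block-sized read requests a read() of this size schedules
    # when the client does not know the real file size in advance.
    return math.ceil(requested_size / BLOCK)

useful = scheduled_reads(FILE_SIZE)        # reads that actually return data
blind = scheduled_reads(2 * 1024 ** 3)     # reads scheduled for a 2 GiB request
print(useful, blind, f"{blind / useful:.1f}x over-scheduled")
```

Roughly 230 useful reads versus over 8000 scheduled ones, which is in the same ballpark as the ~30x over-scheduling mentioned above.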