pkg / sftp

SFTP support for the go.crypto/ssh package

Performance regression from v1.12.0 to v1.13.0 (and master) #426

Closed ncw closed 3 years ago

ncw commented 3 years ago

While investigating https://github.com/rclone/rclone/issues/5197 I discovered that transfer speed to rsync.net had gone from 300 KB/s in v1.12.0 to 100 KB/s in v1.13.0.

The rsync.net servers run FreeBSD, which is one unusual thing about them; the other is that they are a long way away from me (150ms), so they have the usual problems with long fat TCP pipes. Rclone uses the ReadFrom interface in the sftp client so that the sftp library can increase the number of outstanding packets to help with this.
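
For context, a minimal sketch of the upload path in question (names and paths are illustrative, not rclone's actual code): because *sftp.File implements io.ReaderFrom, io.Copy hands the whole transfer over to its ReadFrom method, which is where the outstanding-packet logic lives.

    // Illustrative only: create the remote file and let io.Copy dispatch to
    // (*sftp.File).ReadFrom, the method that can keep multiple packets in flight.
    dst, err := sftpClient.Create("/remote/499.91M")
    if err != nil {
        return err
    }
    defer dst.Close()

    _, err = io.Copy(dst, localFile) // calls dst.ReadFrom(localFile) under the hood
    return err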

Here are some tests (my upload is capable of 2MB/s).

make && rclone version && rclone copy -vv --stats 10s /tmp/499.91M rsyncnet: 2>&1 | grep 499.91M:

v1.12.0

 *                                       499.91M:  0% /499.910M, 312.432k/s, 27m8s
 *                                       499.91M:  1% /499.910M, 309.264k/s, 27m15s
 *                                       499.91M:  1% /499.910M, 281.361k/s, 29m49s
 *                                       499.91M:  2% /499.910M, 312.820k/s, 26m37s
 *                                       499.91M:  3% /499.910M, 347.230k/s, 23m48s

v1.13.0

 *                                       499.91M:  0% /499.910M, 81.331k/s, 1h44m44s
 *                                       499.91M:  0% /499.910M, 97.554k/s, 1h27m8s
 *                                       499.91M:  0% /499.910M, 98.655k/s, 1h25m59s
 *                                       499.91M:  0% /499.910M, 122.371k/s, 1h9m7s
 *                                       499.91M:  1% /499.910M, 122.173k/s, 1h9m4s

master

 *                                       499.91M:  0% /499.910M, 109.761k/s, 1h17m33s
 *                                       499.91M:  0% /499.910M, 102.153k/s, 1h23m11s
 *                                       499.91M:  0% /499.910M, 91.254k/s, 1h32m58s
 *                                       499.91M:  0% /499.910M, 85.305k/s, 1h39m18s
 *                                       499.91M:  0% /499.910M, 76.641k/s, 1h50m23s

I bisected this and discovered that the commit below is probably the problem. I'm reasonably sure it is the problem commit, and it is certainly touching the code in question. Before this commit, at 0be6950c0e91d5cb73a4d690270d3a5010ac9808, the performance is definitely OK, and after it the performance is bad. However, there is some variation after the commit, so there may be more commits involved.

commit fc156996912168806d33b55794e2f5436fae2c3d
Author: Cassondra Foesch
Date:   Sun Feb 21 20:32:09 2021 +0000

    fixed concurrent writes, mid-polish

 client.go | 614 ++++++++++++++++++++++++++++++++++++++------------------------
 1 file changed, 373 insertions(+), 241 deletions(-)

Cc: @puellanivis

drakkan commented 3 years ago

@ncw, do you have concurrent writes enabled?

ncw commented 3 years ago

Do you have concurrent writes enabled?

No, concurrent writes are not enabled.

puellanivis commented 3 years ago

Writes are by default not handled concurrently, and instead are handled entirely synchronously, which makes it quite slow for long fat pipes. We opted the default to be “safe, but slow” rather than “dangerous, but fast”.

After having read the documentation, we should be able to queue things up in an ordered manner, and without the free concurrency that is being used. This would at least speed up the long fat pipes but also retain a lot of safety. 🤔 Hm.

ncw commented 3 years ago

Writes are by default not handled concurrently, and instead are handled entirely synchronously, which makes it quite slow for long fat pipes.

Rclone uses the ReadFrom interface

https://github.com/pkg/sftp/blob/f5f52ff56bd8711f52e51caf51cee29f0f093b9d/client.go#L1623-L1630

This is specifically designed to avoid the problems in the Write interface

https://github.com/pkg/sftp/blob/f5f52ff56bd8711f52e51caf51cee29f0f093b9d/client.go#L1324-L1332

We opted the default to be “safe, but slow” rather than “dangerous, but fast”.

From a user's point of view this speedup used to work and now doesn't :-(

Can we revert the changes that caused the regression? I'm guessing not easily since there looked to be a lot of changes in that code.

puellanivis commented 3 years ago

No, we cannot easily change this. The problem is that with parallel writes being done it’s possible that a write at n+1 could succeed, but a write at n could fail. Then the file is n+2 long, even though the nᵗʰ block is empty. It’s the responsibility of the caller to ensure that the final end length is only valid data, there is no way for us to really ensure this.

If you’re fine with that risk, then turn on concurrent writes, and see your performance go up.
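
For reference, a minimal sketch of opting in when building the client (error handling trimmed; sshConn stands for an already-established *ssh.Client):

    // Opt in to concurrent writes explicitly; the default stays the safe,
    // sequential behaviour described above.
    client, err := sftp.NewClient(sshConn, sftp.UseConcurrentWrites(true))
    if err != nil {
        return err
    }
    defer client.Close()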

puellanivis commented 3 years ago

As I mentioned, we might be able to do a “safer” version, where we just push everything in order sequentially, but as the writes would be non-overlapping, the end server could still parallelize them anyways, so we’re basically right back to the problem of writes potentially leaving a hole of data in the file… I think we could overlap the writes by n-bytes and a server would then be forced by the standard to handle the writes sequentially, and not out-of-order, but that would be a particularly weird and overly clever hack…

I think it’s still way safer to just require users to actively turn on potentially unsafe behavior, even if that unsafe behavior is pretty unlikely in the normal case.

drakkan commented 3 years ago

Hi, happy Easter,

@ncw with concurrent writes enabled if an upload fails and you try to resume it, you will likely end up with a corrupted file, so we basically fixed a bug.

Last week, an SFTPGo user, using the SFTP proxy feature with rsync.net as a backend, reported slow uploads. I added a setting to enable concurrent writes (I have disabled resuming uploads when this setting is enabled) and the upload speed went from 5 MB/s to 15 MB/s. So pkg/sftp still has this feature; it is just not enabled by default, to prevent sneaky bugs.

ncw commented 3 years ago

So what I'm understanding from the above is that concurrent writes used to be enabled (for v1.12.0) but is no longer enabled (for v1.13.0) for safety reasons and it is necessary to enable them specifically now?

I'll give it a go in a moment.

drakkan commented 3 years ago

So what I'm understanding from the above is that concurrent writes used to be enabled (for v1.12.0) but is no longer enabled (for v1.13.0) for safety reasons and it is necessary to enable them specifically now?

I'll give it a go in a moment.

Yes, this should work, please let us know your results, thank you

I'll add a note to the changelog

ncw commented 3 years ago

Setting the flag had no immediate effect.

To make it work I had to add this patch:

diff --git a/client.go b/client.go
index 81b2d6c..6f2c79e 100644
--- a/client.go
+++ b/client.go
@@ -1641,8 +1641,8 @@ func (f *File) ReadFrom(r io.Reader) (int64, error) {
        case *io.LimitedReader:
            remain = r.N

-       case *os.File:
-           // For files, always presume max concurrency.
+       default:
+           // For everything else, always presume max concurrency.
            remain = math.MaxInt64
        }

After that it did work very well

 *                                       499.91M:  1% /499.910M, 735.491k/s, 11m26s
 *                                       499.91M:  2% /499.910M, 620.767k/s, 13m25s
 *                                       499.91M:  3% /499.910M, 522.179k/s, 15m47s
 *                                       499.91M:  5% /499.910M, 694.876k/s, 11m39s
 *                                       499.91M:  7% /499.910M, 949.944k/s, 8m17s
 *                                       499.91M: 11% /499.910M, 1.307M/s, 5m40s

So much faster than v1.12.0 even.

Let me know if you'd like me to send the above patch as a pull request.

My preferred patch would be to delete all of the code here that tries to guess the size from the io.Reader - trying to snoop the length out of io.Readers is a losing battle IMHO.

https://github.com/pkg/sftp/blob/f5f52ff56bd8711f52e51caf51cee29f0f093b9d/client.go#L1634-L1652

to make it say

if f.c.useConcurrentWrites {
    return f.readFromConcurrent(r, remain)
}
puellanivis commented 3 years ago

The problem is that for short buffers, using concurrency is actually slower and more memory intensive than handling things sequentially, due to all the overhead of setting up the goroutines, etc.

So, we’re kind of “stuck”: we want to use concurrency when it’s enabled, but we also don’t want to shoot ourselves by always using it. As a simple example, anyone doing an io.Copy(f, buf) at every newline would incur a whole lot of overhead on something that could normally just be handled with one writeChunkAt.

So, if you were to submit a patch that deletes the whole attempt to snoop the length of the io.Reader, then we almost certainly would not accept it. The question of whether we should default to max concurrency when there is no type match is an interesting one though, and not one that I would immediately reject.

However, now that I think about it, an interface{ Stat() (os.FileInfo, error) } type match would at least be a better choice than just *os.File.
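
Sketched against the same type switch, that match might look something like this (untested, just to show the shape of the idea):

+       case interface{ Stat() (os.FileInfo, error) }:
+           if fi, err := r.Stat(); err == nil {
+               remain = fi.Size()
+           }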

drakkan commented 3 years ago

@ncw is the interface{ Stat() (os.FileInfo, error) } type match useful for your use case?

For my use case (small packets coming from a network stream) I used a bufio writer, with a configurable size, to accumulate contiguous data and activate concurrency here. I can also confirm that concurrent writes are only useful on high-latency networks: if the client and server are on a local network I get better performance by disabling concurrency.
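
Roughly, that accumulation approach looks like the following sketch (the buffer size and names are illustrative, not SFTPGo's actual code):

    // Coalesce many small network writes into larger contiguous chunks before
    // they reach the sftp layer; remember to Flush before closing.
    remote, err := client.Create("upload.bin")
    if err != nil {
        return err
    }
    defer remote.Close()

    bw := bufio.NewWriterSize(remote, 256*1024) // buffer size would be configurable
    if _, err := io.Copy(bw, networkStream); err != nil {
        return err
    }
    return bw.Flush()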

ncw commented 3 years ago

For me, a ReadFromN interface would be the best where you pass in the size explicitly.

Rclone knows the size of the file it is transferring and it calls ReadFrom directly so it could call ReadFromN just as easily.

I could send a patch to do that refactoring if you want, and leave the size guessing in ReadFrom.

As for the interface, Size() int64 would be my preference. Having to make up a FileInfo would be a pain.

drakkan commented 3 years ago

So basically could something like this

diff --git a/client.go b/client.go
index de48d19..d139a61 100644
--- a/client.go
+++ b/client.go
@@ -1637,6 +1637,9 @@ func (f *File) ReadFrom(r io.Reader) (int64, error) {
                case interface{ Len() int }:
                        remain = int64(r.Len())

+               case interface{ Size() int64 }:
+                       remain = r.Size()
+
                case *io.LimitedReader:
                        remain = r.N

already be enough?
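
To illustrate how a caller such as rclone might hit that new case, a hypothetical wrapper (not part of either codebase) could look like:

    // sizedReader exposes the known transfer size to the type switch above.
    type sizedReader struct {
        io.Reader
        size int64
    }

    func (s sizedReader) Size() int64 { return s.size }

    // remote.ReadFrom(sizedReader{Reader: src, size: fileSize}) would then
    // derive the write concurrency from fileSize.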

I think we could also consider adding a ReadFromN method. @puellanivis, what do you think?

puellanivis commented 3 years ago

I still like the idea of using a Stat instead of the os.File even if it isn’t the best choice here. I like the Size() int64 interface idea as well.

drakkan commented 3 years ago

@ncw, please let me know if this is enough, thank you

ncw commented 3 years ago

@ncw, please let me know if this is enough, thank you

Yes that would work for me.

What should Size() return if the size of the file is unknown? Rclone uses -1 for this which is a bit non-standard.

puellanivis commented 3 years ago

I think following net/http.Request.ContentLength in this case would be useful:

    // ContentLength records the length of the associated content.
    // The value -1 indicates that the length is unknown.
    // Values >= 0 indicate that the given number of bytes may
    // be read from Body.
    //
    // For client requests, a value of 0 with a non-nil Body is
    // also treated as unknown.
    ContentLength int64

P.S.: looks like we might want an if remain < 0 { remain = math.MaxInt64 } or something, so that it maxes out concurrency on the assumption that the stream is going to be long enough.
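
Putting the two ideas together, the relevant part of ReadFrom might end up looking something like this sketch (other cases elided; not the final code):

    switch r := r.(type) {
    case interface{ Len() int }:
        remain = int64(r.Len())

    case interface{ Size() int64 }:
        remain = r.Size()

    case *io.LimitedReader:
        remain = r.N
    }

    if remain < 0 {
        // Unknown length (e.g. a caller reporting -1, as net/http does):
        // assume the stream is long enough to warrant full concurrency.
        remain = math.MaxInt64
    }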

ncw commented 3 years ago

looks like we might want an if remain < 0 { remain = math.MaxInt64 } or something, so that it maxes out concurrency on the assumption that the stream is going to be long enough.

That would work for me, definitely :-)

drakkan commented 3 years ago

@ncw, can you please confirm that git master works fine for rclone usage? Thank you

ncw commented 3 years ago

I can report that this fix does work - thank you :-)

I've updated rclone to use the new code.

If you did make a ReadFromN interface then I'd use that instead which saves the rather fragile interface conversions trying to read the size from the stream.

Thanks

puellanivis commented 3 years ago

ReadFromN

I think I missed this at some point. 🤔 I’d probably call it ReadFromWithLen(), and I don’t think it would be too hard to use it… it is basically already what we are using.

Although, now that I think about it, I think there might even be a better solution. Instead of focusing on how to get the value the heuristic is using to pick concurrency, we could instead provide ReadFromWithConcurrency(src io.Reader, concurrency int) which more directly controls the actual value we want to control, rather than just more tweaking of the heuristics. With such a function the caller could then strictly control how much concurrency they would like to use regardless of what we might heuristically choose for them otherwise. Then we wouldn’t even need to worry about providing some mock length to trigger full concurrency either, we could just directly specify it.

EDITED: to include proposed parameters to the ReadFromWithConcurrency rather than leaving it ambiguous about whether we’re just exporting any similarly named functions.

ncw commented 3 years ago

I think I missed this at some point. I’d probably call it ReadFromWithLen(), and I don’t think it would be too hard to use it… it is basically already what we are using.

Perhaps just making this function public would be enough to satisfy everyone? It would certainly work for me.

https://github.com/pkg/sftp/blob/2b80967078b846fb8a47aea993b8c294d2daa95c/client.go#L1503-L1504

Although, now that I think about it, I think there might even be a better solution. Instead of focusing on how to get the value the heuristic is using to pick concurrency, we could instead provide ReadFromWithConcurrency(src io.Reader, concurrency int) which more directly controls the actual value we want to control, rather than just more tweaking of the heuristics. With such a function the caller could then strictly control how much concurrency they would like to use regardless of what we might heuristically choose for them otherwise. Then we wouldn’t even need to worry about providing some mock length to trigger full concurrency either, we could just directly specify it.

I guess the question is whether the client could improve on this heuristic

https://github.com/pkg/sftp/blob/2b80967078b846fb8a47aea993b8c294d2daa95c/client.go#L1524-L1527

maxConcurrentRequests is settable via

https://github.com/pkg/sftp/blob/2b80967078b846fb8a47aea993b8c294d2daa95c/client.go#L90

So the user already has reasonable control over the concurrency.
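
For reference, that knob is set like any other client option; a sketch, with 64 as an arbitrary example value:

    // Cap how many concurrent requests the library may issue per file,
    // and opt in to concurrent writes as discussed above.
    client, err := sftp.NewClient(sshConn,
        sftp.MaxConcurrentRequestsPerFile(64),
        sftp.UseConcurrentWrites(true),
    )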

puellanivis commented 3 years ago

The problem with exposing readFromConcurrent is that the parameter we’re passing in, remain, is not actually a meaningful dial or knob for a user to control. That value only controls the heuristic driving how much concurrency to use. It serves no other purpose in the function.

So, forcing users to read the code and figure out which value makes the heuristic do what they really want (control the level of concurrency) just makes for a poor interface.

Knobs and dials exposed should directly control behavior, not just feed into a heuristic to produce a behavior. Because then the user has to know the implementation details of how that heuristic works, rather than just exposing the knob that they really want to dial.

And MaxConcurrentRequestsPerFile only controls the limits of concurrency, not the actual concurrency used. Rather than cryptically asking users to pass in math.MaxInt64 to get max concurrency, we should just tell them they can set any arbitrary value, and we’ll cap it at max concurrency, or even have 0 default to max concurrency…

ncw commented 3 years ago

From a client's perspective I know the size of the file I want to send.

I'd be thinking, OK I'll set concurrency to the maximum each time - that will make it go fast - job done.

But then I might look at the code and see that I just created 64 goroutines for a 1k file, and that is definitely too many, as the total size of all the buffers is far greater than the file size.

Then I'll end up doing a very similar sum to the one we have already - we never want concurrency * buffer size to be greater than the file size, otherwise we are starting too many goroutines.

concurrency64 := remain/int64(f.c.maxPacket) + 1 // a bad guess, but better than no guess
if concurrency64 > int64(f.c.maxConcurrentRequests) || concurrency64 < 1 {
    concurrency64 = int64(f.c.maxConcurrentRequests)
}

That code tries to maximise the concurrency to maxConcurrentRequests but if the file is shorter than the total buffer size then it uses fewer buffers.
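
For example, assuming the default 32 KiB packet size and a cap of 64 concurrent requests, a 1 KiB file gives 1024/32768 + 1 = 1 worker, while a 500 MB file gives roughly 16,000, which is then capped back down to 64.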

My proposal would be to rename remain to be size and document it like this

// size should be the size of the file if known or -1 if unknown.
//
// The value of size will be used to calculate the number of buffers and concurrency
// used to transfer the file up to a maximum of that set with MaxConcurrentRequestsPerFile.
// If size is < 0 then the maximum concurrency will be used.

That then makes the parameter meaningful to the user.

I agree with you that fewer knobs is better. I'm just not sure that the user will have enough information to set a sensible value for concurrency without knowing how the internals of the sftp library works, whereas the user is very likely to know how big the file is they want to transfer.

puellanivis commented 3 years ago

I just really don’t see how exposing a “size” argument that isn’t used for anything other than calculating concurrency would be anything more than just confusing. ReadFromWithConcurrency(…) implies that I’m making this call with a specific concurrency, not a “file size that predicts how many concurrent workers to use”.

If someone is using this proposed function, then they’re doing so because ReadFrom() itself lacks the concurrency level that the caller deems necessary, not because it calculates the size wrong. The latter is really only one situation in which someone might want to use more or fewer concurrent threads than the heuristic would pick.

If we include a size argument, then the function is ReadFromWithSize(), not ...WithConcurrency. And WithSize is going to imply that the size is an absolute limiting factor on the read, à la io.LimitedReader.

ncw commented 3 years ago

I just really don’t see how exposing a “size” argument that isn’t used for anything other than calculating concurrency would be anything more than just confusing. ReadFromWithConcurrency(…) implies that I’m making this call with a specific concurrency, not a “file size that predicts how many concurrent workers to use”.

That is certainly a nice clear interface.

What should we write in the help text about the concurrency parameter?

Is it safe/efficient to pass in a fixed (large) number even when the files are small? Or would the advice be to keep bufferSize * concurrency <= fileSize, because that is what the current code does?

puellanivis commented 3 years ago

It’s definitely safe to pass in a large number, but the help should note that doing so could end up issuing spurious requests if bufferSize * concurrency > fileSize.

The real question is: should we cap it at our MaxConcurrentRequestsPerFile? (I think we should; the value here is important. But if a user demands a higher number, should we also respect that?)

ncw commented 3 years ago

I think capping is probably a good idea, to prevent "accidents" :-)

puellanivis commented 3 years ago

Also, it makes it convenient to set a particularly large value that you might be OK with, short of “max concurrency”, which will then get capped down to whatever the client insists on.

drakkan commented 3 years ago

@ncw, can you please give ReadFromWithConcurrency (https://github.com/pkg/sftp/commit/5b98d05076b8ac5e6e762559e7c2d69efe1676ee) a try? Thank you
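
For anyone following along, the new call is used along these lines (16 is just an example concurrency; the variable names are illustrative):

    // Bypass the size heuristic and request a specific level of concurrency
    // for this one transfer.
    n, err := remoteFile.ReadFromWithConcurrency(localReader, 16)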

drakkan commented 3 years ago

Fixed in v1.13.1