Closed - ncw closed this issue 3 years ago
@ncw, do you have concurrent writes enabled?
> Do you have concurrent writes enabled?

No, concurrent writes are not enabled.
Writes are by default not handled concurrently, and instead are handled entirely synchronously, which makes it quite slow for long fat pipes. We opted the default to be “safe, but slow” rather than “dangerous, but fast”.
After having read the documentation, we should be able to queue things up in an ordered manner, without the free concurrency that is currently being used. This would at least speed up the long fat pipes but also retain a lot of safety. 🤔 Hm.
> Writes are by default not handled concurrently, and instead are handled entirely synchronously, which makes it quite slow for long fat pipes.
Rclone uses the ReadFrom interface
https://github.com/pkg/sftp/blob/f5f52ff56bd8711f52e51caf51cee29f0f093b9d/client.go#L1623-L1630
This is specifically designed to avoid the problems in the Write interface
https://github.com/pkg/sftp/blob/f5f52ff56bd8711f52e51caf51cee29f0f093b9d/client.go#L1324-L1332
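A minimal sketch of that usage pattern, for context (placeholder names; `client` is assumed to be an established `*sftp.Client`, and this is an illustration rather than rclone's actual code):

```go
package example

import (
	"io"

	"github.com/pkg/sftp"
)

// upload hands src to File.ReadFrom so that pkg/sftp itself can keep
// multiple write requests in flight, which is what helps on long fat pipes.
func upload(client *sftp.Client, src io.Reader, remotePath string) error {
	dst, err := client.Create(remotePath)
	if err != nil {
		return err
	}
	defer dst.Close()

	_, err = dst.ReadFrom(src)
	return err
}
```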
> We opted the default to be “safe, but slow” rather than “dangerous, but fast”.
From a user's point of view this speedup used to work and now doesn't :-(
Can we revert the changes that caused the regression? I'm guessing not easily since there looked to be a lot of changes in that code.
No, we cannot easily change this. The problem is that with parallel writes being done, it’s possible that a write at n+1 could succeed, but the write at n could fail. Then the file is n+2 blocks long, even though the nᵗʰ block is empty. It’s the responsibility of the caller to ensure that the final length covers only valid data; there is no way for us to really ensure this.
If you’re fine with that risk, then turn on concurrent writes, and see your performance go up.
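For anyone landing here later, a minimal sketch of opting in (assuming an existing `*ssh.Client`; `UseConcurrentWrites` is the client option being discussed):

```go
package example

import (
	"github.com/pkg/sftp"
	"golang.org/x/crypto/ssh"
)

// newFastClient opts in to "dangerous, but fast": writes may complete out of
// order, so a failed transfer must be restarted from scratch, never resumed.
func newFastClient(conn *ssh.Client) (*sftp.Client, error) {
	return sftp.NewClient(conn,
		sftp.UseConcurrentWrites(true), // off by default, for the reasons above
	)
}
```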
As I mentioned, we might be able to do a “safer” version, where we just push everything in order sequentially; but since the writes would be non-overlapping, the end server could still parallelize them anyway, so we’re basically right back to the problem of writes potentially leaving a hole of data in the file… I think we could overlap the writes by n bytes, and a server would then be forced by the standard to handle the writes sequentially rather than out of order, but that would be a particularly weird and overly clever hack…
I think it’s still way safer to just require users to actively turn on potentially unsafe behavior, even if that unsafe behavior is pretty unlikely in the normal case.
Hi, happy Easter,
@ncw, with concurrent writes enabled, if an upload fails and you try to resume it, you will likely end up with a corrupted file, so we basically fixed a bug.

Last week an SFTPGo user, using the SFTP proxy feature with rsync.net as a backend, reported slow uploads. I added a setting to enable concurrent writes (I have disabled resuming uploads when this setting is enabled) and the upload speed went from 5 MB/s to 15 MB/s. So pkg/sftp still has this feature; it is just not enabled by default, to prevent sneaky bugs.
So what I'm understanding from the above is that concurrent writes used to be enabled (in v1.12.0) but are no longer enabled (in v1.13.0) for safety reasons, and it is necessary to enable them explicitly now?
I'll give it a go in a moment.
> So what I'm understanding from the above is that concurrent writes used to be enabled (in v1.12.0) but are no longer enabled (in v1.13.0) for safety reasons, and it is necessary to enable them explicitly now?
>
> I'll give it a go in a moment.
Yes, this should work, please let us know your results, thank you
I'll add a note to the changelog
There was no immediate effect from setting the flag. In order to make it work I had to add this patch:

```diff
diff --git a/client.go b/client.go
index 81b2d6c..6f2c79e 100644
--- a/client.go
+++ b/client.go
@@ -1641,8 +1641,8 @@ func (f *File) ReadFrom(r io.Reader) (int64, error) {
 	case *io.LimitedReader:
 		remain = r.N
-	case *os.File:
-		// For files, always presume max concurrency.
+	default:
+		// For everything else, always presume max concurrency.
 		remain = math.MaxInt64
 	}
```
After that it did work very well:

```
 * 499.91M:  1% /499.910M, 735.491k/s, 11m26s
 * 499.91M:  2% /499.910M, 620.767k/s, 13m25s
 * 499.91M:  3% /499.910M, 522.179k/s, 15m47s
 * 499.91M:  5% /499.910M, 694.876k/s, 11m39s
 * 499.91M:  7% /499.910M, 949.944k/s,  8m17s
 * 499.91M: 11% /499.910M,   1.307M/s,  5m40s
```
So much faster than v1.12.0 even.
Let me know if you'd like me to send the above patch as a pull request.
My preferred patch would be to delete all of the code here that tries to guess the size from the io.Reader - trying to snoop the length out of io.Readers is a losing battle IMHO:

https://github.com/pkg/sftp/blob/f5f52ff56bd8711f52e51caf51cee29f0f093b9d/client.go#L1634-L1652

and to make it say:

```go
if f.c.useConcurrentWrites {
	return f.readFromConcurrent(r, remain)
}
```
The problem is that for short buffers, using concurrency is actually slower and more memory-intensive than handling things sequentially, due to all the overhead of setting up the goroutines, etc.

So, we’re kind of “stuck”: we want to use concurrency when it’s enabled, but we also don’t want to shoot ourselves in the foot by always using it. As a simple example, anyone doing an io.Copy(f, buf) at every newline would incur a whole lot of overhead on something that could normally just be handled with one writeChunkAt.
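As a concrete illustration of that pathological case (a sketch calling ReadFrom directly, which is where io.Copy ends up for readers that don't implement io.WriterTo):

```go
package example

import (
	"strings"

	"github.com/pkg/sftp"
)

// writeLines hits File.ReadFrom once per newline-terminated chunk, the
// pattern described above: every call carries only a few bytes, so always
// presuming max concurrency would pay the goroutine setup cost repeatedly
// for work that a single sequential writeChunkAt handles fine.
func writeLines(f *sftp.File, lines []string) error {
	for _, line := range lines {
		if _, err := f.ReadFrom(strings.NewReader(line + "\n")); err != nil {
			return err
		}
	}
	return nil
}
```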
So, if you were to submit a patch that deletes the whole trying-to-snoop-the-length-of-io.Reader logic, then we almost certainly would not accept it. The question of whether we should default to max concurrency if there is no type match is an interesting one though, and not one that I would immediately reject.

However, I think an interface{ Stat() (os.FileInfo, error) } type match would at least be a better choice than just *os.File, now that I think about it.
@ncw, is the interface{ Stat() (os.FileInfo, error) } type match useful for your use case?

For my use case (small packets coming from a network stream) I used a bufio writer, with a configurable size, to accumulate contiguous data and activate concurrency here. I can also confirm that concurrent writes are only useful on high-latency networks: if the client and server are on a local network I get better performance by disabling concurrency.
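A sketch of that buffering approach (the 256 KiB size is purely illustrative, not SFTPGo's actual setting):

```go
package example

import (
	"bufio"
	"io"

	"github.com/pkg/sftp"
)

// bufferedUpload accumulates small incoming packets into larger contiguous
// chunks before they reach the SFTP layer, so that concurrent writes (when
// enabled) get usefully sized requests instead of many tiny ones.
func bufferedUpload(f *sftp.File, src io.Reader) error {
	w := bufio.NewWriterSize(f, 256*1024) // buffer size is configurable
	if _, err := io.Copy(w, src); err != nil {
		return err
	}
	return w.Flush() // push out whatever is left in the buffer
}
```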
For me, a ReadFromN interface, where you pass in the size explicitly, would be the best.

Rclone knows the size of the file it is transferring, and it calls ReadFrom directly, so it could call ReadFromN just as easily.

I could send a patch to do that refactoring if you want, and leave the size guessing in ReadFrom.

As for the interface, Size() int64 would be my preference. Having to make up an os.FileInfo would be a pain.
So, basically, would something like this

```diff
diff --git a/client.go b/client.go
index de48d19..d139a61 100644
--- a/client.go
+++ b/client.go
@@ -1637,6 +1637,9 @@ func (f *File) ReadFrom(r io.Reader) (int64, error) {
 	case interface{ Len() int }:
 		remain = int64(r.Len())
 
+	case interface{ Size() int64 }:
+		remain = r.Size()
+
 	case *io.LimitedReader:
 		remain = r.N
```

already be enough?
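If that landed, a caller could satisfy the new case with a tiny wrapper; a hypothetical illustration, not part of the library:

```go
package example

import "io"

// sizedReader pairs any io.Reader with a known total size, so it would match
// the proposed interface{ Size() int64 } case in ReadFrom's type switch.
type sizedReader struct {
	io.Reader
	size int64
}

func (r *sizedReader) Size() int64 { return r.size }

// WithSize is a hypothetical helper attaching a known size (or -1 for
// unknown) to a reader before handing it to File.ReadFrom.
func WithSize(r io.Reader, size int64) io.Reader {
	return &sizedReader{Reader: r, size: size}
}
```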
I think we could also consider adding a ReadFromN method. @puellanivis, what do you think about it?
I still like the idea of using a Stat instead of the os.File, even if it isn’t the best choice here. I like the Size() int64 interface idea as well.
@ncw, please let me know if this is enough, thank you
Yes, that would work for me.

What should Size() return if the size of the file is unknown? Rclone uses -1 for this, which is a bit non-standard.
I think following net/http.Request.ContentLength in this case would be useful:

```go
// ContentLength records the length of the associated content.
// The value -1 indicates that the length is unknown.
// Values >= 0 indicate that the given number of bytes may
// be read from Body.
//
// For client requests, a value of 0 with a non-nil Body is
// also treated as unknown.
ContentLength int64
```
P.S.: looks like we might want an if remain < 0 { remain = math.MaxInt64 } or something, so that it maxes out concurrency on the assumption that it’s going to be long enough.
> looks like we might want an if remain < 0 { remain = math.MaxInt64 } or something, so that it maxes out concurrency on the assumption that it’s going to be long enough.
That would work for me, definitely :-)
@ncw, can you please confirm that git master works fine for rclone usage? Thank you
I can report that this fix does work - thank you :-)

I've updated rclone to use the new code.

If you did make a ReadFromN interface then I'd use that instead, as it would save the rather fragile interface conversions trying to read the size from the stream.

Thanks
> ReadFromN

I think I missed this at some point. :thinking: I’d probably call it ReadFromWithLen(), and I don’t think it would be too hard to use it… it is basically already what we are using.

Although, now that I think about it, there might even be a better solution. Instead of focusing on how to feed the heuristic the value it uses to pick concurrency, we could instead provide ReadFromWithConcurrency(src io.Reader, concurrency int), which directly controls the actual value we want to control, rather than just more tweaking of the heuristics. With such a function, the caller could strictly control how much concurrency they would like to use, regardless of what we might heuristically choose for them otherwise. Then we wouldn’t even need to worry about providing some mock length to trigger full concurrency; we could just specify it directly.

EDITED: to include proposed parameters for ReadFromWithConcurrency, rather than leaving it ambiguous about whether we’re just exporting some similarly named function.
> I think I missed this at some point. I’d probably call it ReadFromWithLen(), and I don’t think it would be too hard to use it… it is basically already what we are using.

Perhaps just making this function public would be enough to satisfy everyone? It would certainly work for me.

https://github.com/pkg/sftp/blob/2b80967078b846fb8a47aea993b8c294d2daa95c/client.go#L1503-L1504

> Although, now that I think about it, there might even be a better solution. Instead of focusing on how to feed the heuristic the value it uses to pick concurrency, we could instead provide ReadFromWithConcurrency(src io.Reader, concurrency int), which directly controls the actual value we want to control, rather than just more tweaking of the heuristics. With such a function, the caller could strictly control how much concurrency they would like to use, regardless of what we might heuristically choose for them otherwise. Then we wouldn’t even need to worry about providing some mock length to trigger full concurrency; we could just specify it directly.

I guess the question is whether the client could improve on this heuristic:

https://github.com/pkg/sftp/blob/2b80967078b846fb8a47aea993b8c294d2daa95c/client.go#L1524-L1527

maxConcurrentRequests is settable via

https://github.com/pkg/sftp/blob/2b80967078b846fb8a47aea993b8c294d2daa95c/client.go#L90

so the user already has reasonable control over the concurrency.
The problem with exposing readFromConcurrent is that the parameter we’re passing in, remain, is not actually a meaningful dial or knob for a user to control. That value only drives the heuristic for how much concurrency to use; it serves no other purpose in the function.

So, forcing users to read the code and figure out which value makes the heuristic do what they really want (control the level of concurrency) is just a poor interface. Knobs and dials we expose should directly control behavior, not merely feed into a heuristic that produces a behavior, because then the user has to know the implementation details of that heuristic rather than being given the knob they actually want to turn.

And MaxConcurrentRequestsPerFile only controls the upper limit of concurrency, not the actual concurrency used. Rather than cryptically asking users to pass in math.MaxInt64 to get max concurrency, we should just tell them they can set any arbitrary value and we’ll cap it at max concurrency, or even have 0 default to max concurrency…
From a client's perspective, I know the size of the file I want to send.

I'd be thinking: OK, I'll set concurrency to the maximum each time - that will make it go fast - job done.

But then I might look at the code and see that I just created 64 goroutines for a 1k file, which is definitely too many, as the total size of all the buffers is greater than the file size. Then I'll end up doing a very similar sum to the one we have already - we never want concurrency * buffer size to be greater than the file size, otherwise we are starting too many goroutines.

```go
concurrency64 := remain/int64(f.c.maxPacket) + 1 // a bad guess, but better than no guess
if concurrency64 > int64(f.c.maxConcurrentRequests) || concurrency64 < 1 {
	concurrency64 = int64(f.c.maxConcurrentRequests)
}
```

That code tries to maximise the concurrency up to maxConcurrentRequests, but if the file is shorter than the total buffer size then it uses fewer buffers.
My proposal would be to rename remain to size and document it like this:

```go
// size should be the size of the file if known, or -1 if unknown.
//
// The value of size will be used to calculate the number of buffers and the
// concurrency used to transfer the file, up to a maximum of that set with
// MaxConcurrentRequestsPerFile. If size is < 0 then the maximum concurrency
// will be used.
```

That then makes the parameter meaningful to the user.

I agree with you that fewer knobs is better. I'm just not sure that the user will have enough information to set a sensible value for concurrency without knowing how the internals of the sftp library work, whereas the user is very likely to know how big the file they want to transfer is.
I just really don’t see how exposing a “size” argument that isn’t used for anything other than calculating concurrency would be anything more than confusing. ReadFromWithConcurrency(…) implies that I’m making this call with a specific concurrency, not with a “file size that predicts how many concurrent workers to use”.

If someone is using this proposed function, they’re doing so because ReadFrom() itself lacks the concurrency level that the caller deems necessary, not because it calculates the size wrong. The latter is really only one situation in which someone might want more or fewer concurrent workers than the heuristic would pick.

If we include a size argument, then the function is ReadFromWithSize(), not “with concurrency”. And WithSize is going to imply that the size is an absolute limiting factor on the read, à la io.LimitedReader.
> I just really don’t see how exposing a “size” argument that isn’t used for anything other than calculating concurrency would be anything more than confusing. ReadFromWithConcurrency(…) implies that I’m making this call with a specific concurrency, not with a “file size that predicts how many concurrent workers to use”.

That is certainly a nice clear interface.

What should we write in the help text about the concurrency parameter?

Is it safe/efficient to pass in a fixed (large) number even when the files are small? Or would the advice be that bufferSize * concurrency <= fileSize, because that is what the current code does?
It’s definitely safe to pass in a large number, but the help should note that doing so could end up producing spurious requests if bufferSize * concurrency > fileSize.

The real question is: should we cap it at our MaxConcurrentRequestsPerFile? (I think we should - the value there is important - but if a user demands a higher number, should we also respect that?)
I think capping is probably a good idea, to avoid "accidents" :-)

Also, it makes it convenient to just set a particularly large value that you might be OK with, short of “max concurrency”, and have it capped down to whatever the client insists on.
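Roughly, the capping being agreed on here, as a sketch in the spirit of the existing heuristic rather than the committed code:

```go
// Clamp the caller's requested concurrency: non-positive values and anything
// above the per-file limit both collapse to the configured maximum, so
// "just pass something big" stays safe.
if concurrency < 1 || concurrency > f.c.maxConcurrentRequests {
	concurrency = f.c.maxConcurrentRequests
}
```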
@ncw, can you please give ReadFromWithConcurrency a try? https://github.com/pkg/sftp/commit/5b98d05076b8ac5e6e762559e7c2d69efe1676ee Thank you
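For reference, a minimal sketch of calling the new method (names are placeholders; per the discussion above, the library caps the value at MaxConcurrentRequestsPerFile):

```go
package example

import (
	"io"

	"github.com/pkg/sftp"
)

// uploadWithConcurrency lets the caller pick the worker count directly,
// bypassing ReadFrom's size-guessing heuristics entirely. A generous value
// is safe: it gets capped at the per-file maximum.
func uploadWithConcurrency(dst *sftp.File, src io.Reader, concurrency int) (int64, error) {
	return dst.ReadFromWithConcurrency(src, concurrency)
}
```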
Fixed in v1.13.1
While investigating https://github.com/rclone/rclone/issues/5197 I discovered that transfers to rsync.net had gone from 300 KB/s in v1.12.0 to 100 KB/s in v1.13.0.

The rsync.net servers run FreeBSD, which is one unusual thing about them; the other is that they are a long way away from me (150ms), so they have the usual problems with long fat TCP pipes. Rclone uses the ReadFrom interface in the sftp client so that the sftp library can increase the number of outstanding packets to help with this. Here are some tests (my upload is capable of 2MB/s).
v1.12.0
v1.13.0
master
I bisected this change and discovered that this commit is probably the problem. I'm reasonably sure it is the problem commit, and it is certainly touching the code in question. Before this commit, at 0be6950c0e91d5cb73a4d690270d3a5010ac9808, the performance is definitely OK, and after it, it is bad. However, there is some variation after the commit, so there may be more commits involved.
Cc: @puellanivis