drimmeer opened 1 month ago
While using wrapio.NewHashReader() would definitely make it easy enough to get both a hash and the file at the same time, it would break any ReadFrom or WriteTo that we implement, which is usually how we ensure the best performance, rather than reading individual blocks and then waiting for the response. If you’re using WithConcurrentReads(true) with v1 of the API, or if you’re using v2 of the API, then you will see significantly reduced performance with a wrapped IO compared with downloading the file locally and generating the SHA hash after it has arrived.
Good to know, thanks for the tip!
Thank you Cassondra very much for the tips.
We are indeed relying on WithConcurrentReads(true) for the required performance, because we might transfer hundreds of GBs, or even TBs.
However, we also need a way to get both the hash and the file at the same time, because the source is not a physical file but a pipe, while the destination is a remote file. So there is no way to generate the SHA from a local file, only from the remote file, and re-reading the remote file would double the transferred data!
Do you have any suggestions on how we could overcome this? Is it possible to wrap the ReadFrom() or WriteTo() that you implement to get the hash out of them, without affecting the performance of your code?
fSource, err := os.Open(pipeNameLocal)
fDestination, err := sftpClient.Create(destNameRemote)
s := sha256.New()
hr := NewHashReader(s, fSource)
io.Copy(fDestination, hr)
// Use the sum; in this case we just print it out.
fmt.Println(hex.EncodeToString(s.Sum(nil)))
I tested the above code and compared it with another approach, and confirmed that it has significantly reduced performance. Here is the benchmark:
But as I said, we cannot use approach #3 above, because in our case the local file is a pipe. What can I do then?
I think, yeah, you should be able to wrap the ReadFrom or WriteTo functions on your own to perform a double write: once to a hash, then to the underlying writer.
Hm, you might also be able to use a NewHashWriter to wrap the local file operation, and still get the WriteTo functionality from this package?
Thanks Cassondra for the confirmation.
But as seen from my testing above, using the NewHashReader significantly reduced the performance. NewHashReader is also a wrapper around ReadFrom(), isn't it? What would be different if I did the wrapping in my own code?
Or did you mean that using NewHashWriter won't affect the performance the way NewHashReader does?
Yeah, I meant using NewHashWriter, as that would allow our WriteTo to still work.
Hi Cassondra, I tried wrapping the writer myself, but it was also very slow. Here is how I did it:
I created a wrapper like this:
// sha256_writer implements the io.Writer interface.
type sha256_writer struct {
	sha256_calculator hash.Hash
	origin_writer     io.Writer
}

// Write implements the io.Writer interface.
func (this_writer *sha256_writer) Write(write_out_buffer []byte) (int, error) {
	this_writer.sha256_calculator.Write(write_out_buffer)
	return this_writer.origin_writer.Write(write_out_buffer)
}

func NewSha256Writer(writer_to_be_wrap io.Writer) *sha256_writer {
	return &sha256_writer{sha256_calculator: sha256.New(), origin_writer: writer_to_be_wrap}
}

func (this_writer *sha256_writer) GetHashSum() []byte {
	return this_writer.sha256_calculator.Sum(nil)
}
Then I used it like this:
fSource, err := os.Open(srcNameLocal)
fDestination, err := sftpClient.Create(destNameRemote)

hashWriter := NewSha256Writer(fDestination)
_, err = io.Copy(hashWriter, fSource)

log.Println("File sent with sha256 checksum: " + hex.EncodeToString(hashWriter.GetHashSum()))
But it still took 12 minutes to transfer that 900M file, while it took only 1 minute without the wrapper...
Using the NewHashWriter like this is even slower, taking 14 minutes to transfer:
fSource, err := os.Open(srcNameLocal)
fDestination, err := sftpClient.Create(destNameRemote)

// Renamed from "sha256" to "h" to avoid shadowing the crypto/sha256 package.
h := sha256.New()
hashWriter := wrapio.NewHashWriter(h, fDestination)
_, err = io.Copy(hashWriter, fSource)

log.Println("File sent with sha256 checksum: " + hex.EncodeToString(h.Sum(nil)))
Good news! I finally understood what you meant by 'WriteTo', and wrapped my reader around the os.File instead of the sftp side. That way it worked perfectly, and the performance drop was gone!
Thank you for the help!
So, yeah, whichever way the io.Copy goes, you want to ensure that you’re wrapping the other reader/writer, not our sftp.File object. This will ensure the sftp implementation of either WriteTo or ReadFrom will be used.
I played around with this some more, and I actually got the best performance with this wrapper, the key being to implement the Size() int64 method.
type HashReader struct {
	size int64
	r    io.ReadCloser
	h    hash.Hash
}

func NewHashReader(r io.ReadCloser, size int64, h hash.Hash) HashReader {
	return HashReader{size, r, h}
}

func NewHashReaderFile(path string, h hash.Hash) (HashReader, error) {
	fh, err := os.Open(path)
	if err != nil {
		return HashReader{}, err
	}

	stat, err := fh.Stat()
	if err != nil {
		fh.Close() // don't leak the handle if Stat fails
		return HashReader{}, err
	}

	return NewHashReader(fh, stat.Size(), h), nil
}

func (ha HashReader) Read(p []byte) (int, error) {
	n, err := ha.r.Read(p)
	if n > 0 {
		_, _ = ha.h.Write(p[:n])
	}
	return n, err
}

func (ha HashReader) Hash() []byte {
	return ha.h.Sum(nil)
}

func (ha HashReader) HashStr() string {
	return hex.EncodeToString(ha.Hash())
}

func (ha HashReader) Size() int64 {
	return ha.size
}

func (ha HashReader) Close() error {
	return ha.r.Close()
}
Then just make sure you use fh.ReadFrom(wrapper), where fh is an sftp.File.
Also, do not wrap the os.File object with bufio; that also tanks performance.
I strongly recommend making your HashReader a pointer. Since it contains interfaces, its functionality is already non-value-like, and mixing value-like and reference-like behavior can lead to very unexpected and difficult-to-debug bugs.
You also might want to use the WriteTo or ReadFrom calls on the sftp.File directly, which will ensure you get the SFTP implementation, and not the other side’s implementation. (os.File’s WriteTo is faster than a naïve read-to-buffer, write-from-buffer loop, but it doesn’t put multiple reads/writes in flight to the SFTP file.)
I actually started with it as a pointer, but making it a non-pointer led to a consistent 30-40 MB/s increase in transfer speed (which surprised me; I only tried it on a whim). And yeah, it was necessary to use ReadFrom to get the best speed.
Huh… interesting. That’s something I wouldn’t have expected.
Please advise on how to generate a digest (specifically, the SHA-256 value) of the file being transferred via this sftp tool. The counterpart in Java is like this:
I know that to generate the SHA-256 of a local file, the Go code would be like this:
But how do I get that during the sftp transfer?
BTW, I currently wrote code like the below to do the file transfer (error handling removed just to show the major logic):
or
Thank you in advance.