pkg/sftp

SFTP support for the go.crypto/ssh package
BSD 2-Clause "Simplified" License

Question about generating a digest value during sftp transfer #598

Open drimmeer opened 1 month ago

drimmeer commented 1 month ago

Please advise on how to generate a digest (specifically, the SHA-256 value) of the file being transferred via this sftp tool. The counterpart in Java looks like this:

    MessageDigest messageDigestSha256 = MessageDigest.getInstance("SHA-256");
    while ((bytesRead = inStream.read(dataBytes)) != -1) {
        outStream.write(dataBytes, 0, bytesRead);
        messageDigestSha256.update(dataBytes, 0, bytesRead); // update the digest as we copy
    }
    String encodedDigestSha256 = Hex.encodeHexString(messageDigestSha256.digest());
...

I know that to generate the SHA-256 of a local file, the Go code would look like this:

    file, err := os.Open("file.txt")
    hash := sha256.New()
    _, err = io.Copy(hash, file) // reuse err; ":=" here would fail to compile
    hashBytes := hash.Sum(nil)

But how do I get that during the sftp transfer?

BTW, I currently write code like the below to do the file transfer (error handling removed to show just the main logic):

    fDestination, err := sftpClient.Create(destNameRemote)
    fSource, err := os.Open(srcNameLocal)
    _, err = io.Copy(fDestination, fSource)

or

    fSource, err := sftpClient.Open(srcNameRemote)
    fDestination, err := os.Create(destNameLocal)
    _, err = io.Copy(fDestination, fSource)

Thank you in advance.

lespea commented 1 month ago

I've used https://pkg.go.dev/github.com/icub3d/gop/wrapio

puellanivis commented 1 month ago

While using wrapio.NewHashReader() would definitely make it easy enough to get both a hash and the file at the same time, it would break any ReadFrom or WriteTo that we implement, which is usually how we ensure the best performance, rather than reading individual blocks and then waiting for each response. If you’re using WithConcurrentReads(true) with v1 of the API, or if you’re using v2 of the API, you will see significantly reduced performance with wrapped IO compared to downloading the file locally and generating the SHA hash after it has arrived.
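
For context, io.Copy’s fast-path selection works roughly like the sketch below (a simplified version of the standard library’s behaviour, not code from this package). A plain hash wrapper satisfies neither interface, so the sequential fallback runs and the concurrent SFTP requests are lost:

    func copySketch(dst io.Writer, src io.Reader) (int64, error) {
        if wt, ok := src.(io.WriterTo); ok {
            return wt.WriteTo(dst) // e.g. sftp.File.WriteTo on a download
        }
        if rf, ok := dst.(io.ReaderFrom); ok {
            return rf.ReadFrom(src) // e.g. sftp.File.ReadFrom on an upload
        }

        // Fallback: one block at a time, waiting on each round trip.
        var written int64
        buf := make([]byte, 32*1024)
        for {
            n, rerr := src.Read(buf)
            if n > 0 {
                m, werr := dst.Write(buf[:n])
                written += int64(m)
                if werr != nil {
                    return written, werr
                }
            }
            if rerr == io.EOF {
                return written, nil
            }
            if rerr != nil {
                return written, rerr
            }
        }
    }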

lespea commented 1 month ago

Good to know, thanks for the tip!

drimmeer commented 1 month ago

Thank you very much, Cassondra, for the tips.

We are indeed relying on WithConcurrentReads(true) for the required performance, because we might transfer hundreds of GBs, or even TBs.

However, we also need a way to get both a hash and the file at the same time, because the source is not a physical file but a pipe, while the destination is a remote file. So there is no way to generate the SHA from a local file, only from the remote file, which would then double the transferred data!

Do you have any suggestions on how we could overcome this? Is it possible to wrap the ReadFrom() or WriteTo() that you implement to get the hash out of them?

drimmeer commented 1 month ago

Is it possible to do it this way without affecting the performance of your code?

fSource, err := os.Open(pipeNameLocal)
fDestination, err := sftpClient.Create(destNameRemote)

s := sha256.New()
hr := NewHashReader(s, fSource)

io.Copy(fDestination, hr)
// Use the Sum(); in this case we just print it out.
fmt.Println(hex.EncodeToString(s.Sum(nil)))

drimmeer commented 1 month ago

I tested the above code, compared it with other approaches, and confirmed that it significantly reduces performance. Here is the benchmark:

  1. To transfer a 900M file (not a pipe) from z/OS to Linux, the above code took 14~16 minutes;
  2. Without any sha256 hash calculation, it took only 1 minute;
  3. When I tried calculating the hash beforehand instead, the transfer took only 1 minute and the hash calculation took 1~6 seconds.

But as I said, we cannot use approach #3 above, because in our case the local file is a pipe. What can I do then?

puellanivis commented 1 month ago

I think, yeah, you should be able to wrap the ReadFrom or WriteTo functions on your own to perform a double write: once to a hash, and then to the underlying writer.

Hm, you might also be able to use a NewHashWriter to wrap the local file operation, and still get the WriteTo functionality from this package?
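
Read for the download direction, that suggestion could look roughly like the sketch below (reusing srcNameRemote/destNameLocal from the snippets above and assuming wrapio.NewHashWriter(hash, writer); error handling elided). Because the sftp.File stays unwrapped on the source side, io.Copy still finds its WriteTo:

    fSource, err := sftpClient.Open(srcNameRemote) // sftp.File, left unwrapped
    fDestination, err := os.Create(destNameLocal)  // local file
    h := sha256.New()
    hw := wrapio.NewHashWriter(h, fDestination)    // hash wraps the local side only
    _, err = io.Copy(hw, fSource)                  // io.Copy picks fSource.WriteTo
    fmt.Println(hex.EncodeToString(h.Sum(nil)))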

drimmeer commented 1 month ago

Thanks, Cassondra, for the confirmation. But as seen from my testing above, using NewHashReader significantly reduced the performance. NewHashReader also wraps ReadFrom(), doesn't it? What would be different if I did the wrapping in my own code?

Or did you mean that using NewHashWriter won't affect the performance the way NewHashReader does?

puellanivis commented 1 month ago

Yeah, I meant using NewHashWriter as that would allow our WriteTo to still work.

drimmeer commented 1 month ago

Hi Cassondra, I tried wrapping the writer myself, but it was also very slow. Here is how I did it:

I created a wrapper like this:

// sha256_writer.go: a writer wrapper that implements the io.Writer interface
// and updates a SHA-256 hash with everything written through it.
type sha256Writer struct {
    hash   hash.Hash
    writer io.Writer
}

// Write implements the io.Writer interface.
func (w *sha256Writer) Write(p []byte) (int, error) {
    w.hash.Write(p) // hash.Hash.Write never returns an error
    return w.writer.Write(p)
}

func NewSha256Writer(wrapped io.Writer) *sha256Writer {
    return &sha256Writer{hash: sha256.New(), writer: wrapped}
}

func (w *sha256Writer) GetHashSum() []byte {
    return w.hash.Sum(nil)
}

Then I used it like this:

    fSource, err := os.Open(srcNameLocal)
    fDestination, err := sftpClient.Create(destNameRemote)
    hashWriter := NewSha256Writer(fDestination)
    _, err = io.Copy( hashWriter, fSource)
    log.Println("File sent with sha256 checksum: " + hex.EncodeToString(hashWriter.GetHashSum()))

But it still took 12 minutes to transfer that 900M file, while it took only 1 minute without the wrapper...

drimmeer commented 1 month ago

Using the NewHashWriter like this is even slower, taking 14 minutes to transfer.


    fSource, err := os.Open(srcNameLocal)
    fDestination, err := sftpClient.Create(destNameRemote)
    h := sha256.New()
    hashWriter := wrapio.NewHashWriter(h, fDestination)
    _, err = io.Copy(hashWriter, fSource)
    log.Println("File sent with sha256 checksum: " + hex.EncodeToString(h.Sum(nil)))

drimmeer commented 1 month ago

Good news! I finally understood what you meant by 'WriteTo', and wrapped my reader around the os.File instead of a plain io.Reader. That way it worked perfectly! The performance no longer dropped!

Thank you for the help!

puellanivis commented 1 month ago

So, yeah, whichever way the io.Copy goes, you want to make sure you're wrapping the reader/writer on the other side from our sftp.File object. That ensures the sftp implementation of either WriteTo or ReadFrom is used.
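
Roughly, the two directions look like this (hashReaderAroundLocal and hashWriterAroundLocal stand in for whatever wrapper you use):

    // Upload: wrap the local reader only. io.Copy finds no WriterTo on the
    // wrapper, so it falls through to the sftp.File's ReadFrom (the fast path).
    // For concurrent uploads the wrapper should also expose the source size
    // (e.g. a Size() int64 method), as in the HashReader below.
    _, err = io.Copy(fDestination, hashReaderAroundLocal) // fDestination is the sftp.File

    // Download: wrap the local writer only. io.Copy finds WriteTo on the
    // sftp.File source and uses it.
    _, err = io.Copy(hashWriterAroundLocal, fSource) // fSource is the sftp.File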

lespea commented 5 days ago

I played around with this some more, and I actually got the best performance with this wrapper, the key being to implement the Size() int64 method.

type HashReader struct {
    size int64
    r    io.ReadCloser
    h    hash.Hash
}

func NewHashReader(r io.ReadCloser, size int64, h hash.Hash) HashReader {
    return HashReader{size, r, h}
}

func NewHashReaderFile(path string, h hash.Hash) (HashReader, error) {
    fh, err := os.Open(path)
    if err != nil {
        return HashReader{}, err
    }

    if stat, err := fh.Stat(); err != nil {
        return HashReader{}, err
    } else {
        return NewHashReader(fh, stat.Size(), h), nil
    }
}

func (ha HashReader) Read(p []byte) (int, error) {
    n, err := ha.r.Read(p)
    if n > 0 {
        _, _ = ha.h.Write(p[:n])
    }
    return n, err
}

func (ha HashReader) Hash() []byte {
    return ha.h.Sum(nil)
}

func (ha HashReader) HashStr() string {
    return hex.EncodeToString(ha.Hash())
}

func (ha HashReader) Size() int64 {
    return ha.size
}

func (ha HashReader) Close() error {
    return ha.r.Close()
}

Then just make sure you use fh.ReadFrom(wrapper), where fh is an sftp.File.
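
Putting it together, usage might look like this (reusing srcNameLocal/destNameRemote and sftpClient from the snippets above; error handling elided):

    hr, err := NewHashReaderFile(srcNameLocal, sha256.New())
    fh, err := sftpClient.Create(destNameRemote)
    _, err = fh.ReadFrom(hr) // the sftp ReadFrom directly; hr's Size() is what lets the concurrent writes kick in
    log.Println("File sent with sha256 checksum: " + hr.HashStr())
    _ = hr.Close()
    _ = fh.Close()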

lespea commented 5 days ago

Also, do not wrap the os.File object with bufio; that also tanks performance.

puellanivis commented 5 days ago

I strongly recommend making your HashReader a pointer. Since it contains interfaces, its functionality is already non-value-like, and mixed value-like and reference-like behavior can lead to very unexpected and difficult-to-debug bugs.

You also might want to use the WriteTo or ReadFrom calls on the sftp.File directly, which will ensure you get the SFTP implementation and not the other side's implementation. (os.File's WriteTo is faster than a naïve read-to-buffer, write-from-buffer loop, but it doesn't put multiple reads/writes in flight to the SFTP file.)
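
On a download, for example, that could be as simple as the following sketch (remoteFile being the sftp.File and localFile the os.File; io.MultiWriter is just one way to feed both the local file and the hash):

    h := sha256.New()
    // Calling WriteTo on the sftp.File directly guarantees the SFTP
    // implementation runs, regardless of what io.Copy would have chosen.
    _, err = remoteFile.WriteTo(io.MultiWriter(localFile, h))
    fmt.Println(hex.EncodeToString(h.Sum(nil)))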

lespea commented 5 days ago

I actually started with it as a pointer, but making it a non-pointer led to a consistent 30-40 MB/s increase in transfer speed (which surprised me; I only tried it on a whim). And yeah, it was necessary to use ReadFrom to get the best speed.

puellanivis commented 5 days ago

Huh… interesting. That’s something I wouldn’t have expected.