Implementation with a progress bar

Agagamand commented 5 years ago

I need to calculate checksums of large files with a progress bar. Is the following code correct for this purpose?

        private ulong GetXXHash(string filename, IProgress<double> progress = null)
        {
            ulong hash = 0;
            var buffer = new byte[1048576]; // 1MB buffer
            using (var entryStream = File.OpenRead(filename))
            {
                int currentBlockSize = 0;

                while ((currentBlockSize = entryStream.Read(buffer, 0, buffer.Length)) > 0)
                {
                    hash += XXHash.Hash64(buffer, (ulong)currentBlockSize);
                    progress?.Report(currentBlockSize);
                }
            }
            return hash;
        }

ssg commented 5 years ago

Hi @Agagamand, great question. There is currently no easy way to do this other than subclassing FileStream and overriding Read to report progress. An example would be:

public class XXHashFileStream: FileStream
{
    private IProgress<double> progress;

    public XXHashFileStream(string filename, IProgress<double> progress): base(filename, FileMode.Open, FileAccess.Read)
    {
        this.progress = progress;
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        int result = base.Read(buffer, offset, count);
        progress?.Report(result);
        return result;
    }
}

private ulong GetXXHash(string filename, IProgress<double> progress = null)
{
    ulong hash;
    using (var entryStream = new XXHashFileStream(filename, progress))
    {
        hash = XXHash.Hash64(entryStream);
    }
    return hash;
}

Note that progress will then be reported at XXHash's own buffer size, which is far smaller than 1 MB, so it may not be ideal for your use case.
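If the small read size makes progress reporting too chatty, one option is to throttle the reports before they reach the UI. The following is only a sketch: ThrottledProgress is a hypothetical helper (not part of HashDepot), and it assumes the stream reports the byte count of each Read call, as the XXHashFileStream example above does.

```csharp
using System;

// Hypothetical helper, not part of HashDepot: accumulates per-read byte counts
// and forwards the running total only after at least `step` more bytes arrive.
public sealed class ThrottledProgress : IProgress<double>
{
    private readonly Action<double> onProgress;
    private readonly double step;
    private double total;
    private double nextThreshold;

    public ThrottledProgress(Action<double> onProgress, double step = 1048576)
    {
        if (step <= 0) throw new ArgumentOutOfRangeException(nameof(step));
        this.onProgress = onProgress ?? throw new ArgumentNullException(nameof(onProgress));
        this.step = step;
        this.nextThreshold = step;
    }

    // `value` is the number of bytes returned by a single Read call.
    public void Report(double value)
    {
        total += value;
        if (total >= nextThreshold)
        {
            onProgress(total);
            // Skip ahead past every threshold the total has already crossed.
            nextThreshold = (Math.Floor(total / step) + 1) * step;
        }
    }
}
```

It could be passed in place of the original reporter, for example new XXHashFileStream(filename, new ThrottledProgress(bytes => Console.Write($"Progress: {bytes}\r"))). The last partial step is not forwarded, so the caller may want to report completion itself once hashing returns.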

I haven't considered update-based APIs because I specifically targeted performant in-memory operations but that might be an important use case. Let me think about it.

And please let me know if the workaround is good enough for you.
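For reference, this is roughly the shape an update-based API takes. The sketch below uses .NET's built-in IncrementalHash with SHA-256 purely as a stand-in, since no such API exists for HashDepot's XXHash in this thread; StreamingHashSketch and HashWithProgress are hypothetical names. It also shows why the loop in the original question is not equivalent: adding per-chunk hashes together generally does not give the hash of the whole file, whereas an update-based hasher carries its state from chunk to chunk.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

public static class StreamingHashSketch
{
    // Hashes `input` chunk by chunk, reporting total bytes consumed after each chunk.
    // SHA-256 is only a stand-in for illustration; it is not XXHash.
    public static byte[] HashWithProgress(Stream input, IProgress<double> progress = null,
                                          int bufferSize = 1048576)
    {
        var buffer = new byte[bufferSize]; // allocated once and reused
        long totalBytesRead = 0;
        using (var hasher = IncrementalHash.CreateHash(HashAlgorithmName.SHA256))
        {
            int bytesRead;
            while ((bytesRead = input.Read(buffer, 0, buffer.Length)) > 0)
            {
                // The hasher keeps its internal state between calls, so the result
                // matches hashing the whole input in one go.
                hasher.AppendData(buffer, 0, bytesRead);
                totalBytesRead += bytesRead;
                progress?.Report(totalBytesRead);
            }
            return hasher.GetHashAndReset();
        }
    }
}
```

With this shape the caller picks the buffer size, so progress granularity and memory use are both under its control.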

Agagamand commented 5 years ago

I need XXHash for very large files. Your code does not work for that: it fails with an out-of-memory error.

I started using this: https://github.com/differentrain/YYProject.XXHash/blob/master/XXHash/YYProject.XXHash/XXHash.cs

private string GetYYP_XXHash(string filename, IProgress<double> progress = null)
{
    byte[] buffer;
    byte[] oldBuffer;
    int bytesRead;
    int oldBytesRead;
    long size;
    long totalBytesRead = 0;

    using (Stream stream = File.OpenRead(filename))
    using (YYProject.XXHash.XXHash64 hashAlgorithm = YYProject.XXHash.XXHash64.Create())
    {
        size = stream.Length;
        buffer = new byte[1048576]; // 1MB buffer

        bytesRead = stream.Read(buffer, 0, buffer.Length);
        totalBytesRead += bytesRead;

        do
        {
            oldBytesRead = bytesRead;
            oldBuffer = buffer;

            buffer = new byte[1048576];
            bytesRead = stream.Read(buffer, 0, buffer.Length);

            totalBytesRead += bytesRead;

            if (bytesRead == 0)
            {
                hashAlgorithm.TransformFinalBlock(oldBuffer, 0, oldBytesRead);
            }
            else
            {
                hashAlgorithm.TransformBlock(oldBuffer, 0, oldBytesRead, oldBuffer, 0);
            }

            progress?.Report(totalBytesRead);

        } while (bytesRead != 0);

        StringBuilder sBuilder = new StringBuilder();
        for (int i = 0; i < hashAlgorithm.Hash.Length; i++)
        {
            sBuilder.Append(hashAlgorithm.Hash[i].ToString("x2"));
        }

        return sBuilder.ToString();
    }
}

Progress is not displayed perfectly, but it is suitable for very large files.

But I am not happy with the speed of YYProject.XXHash. I will probably have to switch to a native console application.
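On the progress display: the loop above reports the raw running byte count and never uses the size variable it reads from the stream. If the imperfect display comes from that, a small conversion to a percentage may help. This is a sketch only; ProgressMath and ToPercent are hypothetical names, not part of either library.

```csharp
using System;

public static class ProgressMath
{
    // Converts a running byte count into a percentage in [0, 100].
    // An empty file is treated as already complete to avoid dividing by zero.
    public static double ToPercent(long totalBytesRead, long size)
    {
        if (size <= 0) return 100.0;
        return Math.Min(100.0, 100.0 * totalBytesRead / size);
    }
}
```

Inside the loop this would be used as progress?.Report(ProgressMath.ToPercent(totalBytesRead, size)); so the receiver gets a value it can bind directly to a progress bar.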

ssg commented 5 years ago

Did you try my workaround with a custom stream?

Agagamand commented 5 years ago

I copied and pasted your code and tried to hash a 1 GB file. Result: System.OutOfMemoryException at progress?.Report(result);

ssg commented 5 years ago

Ok let me try to reproduce this locally tonight. I'm opening the issue to keep track of this. Thanks!

ssg commented 5 years ago

I wrote a simple console app using the workaround I mentioned. It works fine without any errors:

using System;
using System.IO;
using HashDepot;

namespace xxhasher
{
    class XXHashFileStream: FileStream
    {
        private IProgress<double> progress;

        public XXHashFileStream(string fileName, IProgress<double> progress)
            : base(fileName, FileMode.Open, FileAccess.Read)
        {
            this.progress = progress;
        }

        public override int Read(byte[] array, int offset, int count)
        {
            int readBytes = base.Read(array, offset, count);
            progress?.Report(this.Position);
            return readBytes;
        }
    }

    class MyProgressReporter : IProgress<double>
    {
        public void Report(double value)
        {
            if (value % 1000000 == 0)
            {
                Console.Write($"Progress: {value}\r");
            }
        }
    }

    class Program
    {
        static void Main(string[] args)
        {
            if (args.Length != 1)
            {
                Console.WriteLine("Usage: xxhasher <filename>");
                Environment.Exit(1);
            }
            string fileName = args[0];
            using (var stream = new XXHashFileStream(fileName, new MyProgressReporter()))
            {
                ulong hash = XXHash.Hash64(stream);
                Console.WriteLine();
                Console.WriteLine($"Hash value: {hash:X}");
            }
        }
    }
}

Here is the output for a 2 GB file created with fsutil file createnew dummy 2000000000:

Progress: 2000000000
Hash value: 350FCE3512A2E8A9

Process memory usage stays steady at 9 MB, nowhere close to an out-of-memory situation.

Agagamand commented 5 years ago

This code works, but your implementation is slower than YYProject.XXHash: hashing a 1 GB file takes about one second longer.

ssg commented 5 years ago

Ok, I'm closing this, as there is apparently no bug causing an "out of memory" error. I'll also consider an update-based API to handle these scenarios with high performance if there is more demand for it.

ssg / HashDepot

Implementation with a progress bar #4