Recommendations for performance

jpmckinney commented 3 years ago

I'm making a hash of all files in a directory as follows. The total size might be GBs. Individual files are mostly hundreds of KBs, but can be MBs or GBs in some cases (in which case I don't want to read them entirely into memory).

hasher = blake3()
for root, dirs, files in os.walk(directory):
    dirs.sort()
    for file in sorted(files):
        with open(os.path.join(root, file), 'rb') as f:
            for chunk in iter(partial(f.read, 8192), b''):
                hasher.update(chunk)
return hasher.hexdigest()

I'm wondering:

What are the recommendation around chunk size? I'm not looking for optimal performance, but I also don't want to choose a chunk size that is obviously worse to someone who knows blake3 better.

The readme has "As a rule of thumb, don't use multithreading for inputs shorter than 1 MB", so I'm also wondering if I should read in a large-ish chunk size and then use hasher.update(chunk, multithreading=True) – or if it's faster to read small chunks without multithreading. There might be an obvious answer...

oconnor663 commented 3 years ago

Important side questions first: If you're effectively concatenating all the files together, do you need to worry about ambiguous cases where the end of one file is moved to the beginning of another? Similarly, do you care if filenames change? Do you care about iterating over them in a stable order? Etc.

Back to your question: I might recommend a buffer size of 16384 bytes (2^14). This is the smallest buffer size where a machine that supports AVX-512 is able to make full use of 512-bit vectors to hash 16 chunks in parallel on a single thread. That's the largest vector size currently available on Intel hardware, though it's pretty new, and most consumer hardware doesn't support it yet. (AWS servers do, though, so you'll see substantial differences in your benchmarks in the cloud vs on your laptop for that reason.)

I'm also wondering if I should read in a large-ish chunk size and then use hasher.update(chunk, multithreading=True) – or if it's faster to read small chunks without multithreading. There might be an obvious answer...

You really have to benchmark it for your use case on your hardware. Especially once multithreading gets involved, there are just too many different factors to give a good general answer. BLAKE3 is fast enough that limits on IO bandwidth often become an issue, and there are many, many factors that affect IO. Here are some considerations:

First of all, you might want to make sure that hashing performance really matters for your application in practice. The most common reason it wouldn't matter is if you tend to hash files that aren't cached in RAM. (As opposed to files that have been read recently, like if you hash them repeatedly.) If not, you're probably bottlenecked by disk read throughput, and these CPU performance optimizations might not matter.
If you're going to read a large buffer into memory (from a file that's cached in RAM), the time it takes to copy that buffer can end up being a bottleneck too given enough worker threads. I don't know much about memory bandwidth limits myself, but I think essentially you have many worker threads blocked waiting on a single main thread to do the copy, and that's never a good thing. The best way to solve this issue is to memory-map the file, in which case you no longer have to worry about a buffer size. This is what b3sum does.

For most applications, a small-ish fixed size like 16384 is probably fine. Or if you want, you could copy b3sum's simple heuristic of "try to memory-map anything >= 16 KiB, and use multithreading on anything that successfully memory-maps." (The performance downsides of using multithreading on e.g. a 32 KiB file are masked by the startup time of b3sum, so we don't bother to optimize that mid-sized case. But your application might be different!)

jpmckinney commented 3 years ago

Important side questions first: If you're effectively concatenating all the files together, do you need to worry about ambiguous cases where the end of one file is moved to the beginning of another? Similarly, do you care if filenames change? Do you care about iterating over them in a stable order? Etc.

For my use case (valid JSON objects/arrays), I don't need to disambiguate the case where data is moved from the end of a file to the start of another file, since that would mean each file has invalid JSON content. A good catch, though!
The code does try to iterate over files and directories in a stable order. However, if a file is renamed and its position changes, then the hash will be different even though the data is the same; I'm presently assuming that this case is rare, and I'm okay with miscalculating the checksum in this case. An alternative might be to compare two lists of file checksums, but in some directories I might have millions of files, and I want the comparison to be fast.

Thanks for all your pointers! In my case:

The CPU's flags include AVX-512 (Intel Xeon W-2145).
The files will not have been cached in RAM.
The SSDs are SATA 3.2 (up to 6.0 Gb/s) in case that changes things.

I'll start with a buffer of 16384 bytes in a single thread. As you say, the bottleneck will likely be I/O.

Thanks again for explaining the different considerations!

xkortex commented 3 years ago

Just wanted to add, I cycled through various powers of 2 chunk sizes, and this is what I settled on for a file hashing function. 2^22 is 4194 MiB. That size seemed to work best for me on my i7 macbook.

def blake3sum_file(filename, chunksize=2 ** 22):
    hasher = blake3()
    with open(filename, 'rb') as fp:
        hasher.update(fp.read(chunksize), multithreading=True)

    return hasher.hexdigest(), filename

If you're absolutely concerned about processing thousands of files, you may be interested in instead using the b3sum binary from the Rust blake3 repo and calling that via subprocess, just speculation.

oconnor663 commented 3 years ago

@xkortex for reference:

4096 bytes is enough to take advantage of SSE4.1. (These are 128-bit vectors, and we use 32-bit words, so that's 4 words per vector, which means 4 chunks in parallel, which given our 1024-byte chunk size means 4 * 1024 = 4096 bytes of buffer needed.)
8192 bytes is enough to take advantage of AVX2. (Ditto with 256-bit vectors.)
16384 bytes is enough to take advantage of AVX-512. (Ditto with 512-bit vectors.)

Unless your macbook is quite old, I'd be surprised if it didn't support AVX2, and so I'm also surprised that you didn't see a noticeable performance difference going from 4096 to 8192. (EDIT: I now realize you were talking about _mega_bytes above, so this sentence doesn't make sense.) But as we've been discussing, various measurement issues could be masking these effects.

oconnor663 commented 3 years ago

For my use case (valid JSON objects/arrays), I don't need to disambiguate the case where data is moved from the end of a file to the start of another file, since that would mean each file has invalid JSON content.

@jpmckinney interesting use case! Yes, I think JSON objects and arrays being self-delimiting helps you here. But notably (sounds like you might've already thought about this) JSON integers are not self-delimiting. Presumably your JSON parsing code is happy to read and write bare integers, and it's up to application code to avoid emitting them or to reject them after parsing? In that case, what looks like run-of-the-mill application code might actually be enforcing important security requirements, and this is the sort of thing that can break over time under maintenance. Might at least warrant some XXX: comments :)

jpmckinney commented 3 years ago

I'll add some comments :) The hashing is part of an algorithm to decide whether to archive a dataset or not (if it's identical to a dataset already archived, we skip it), so it doesn't end up being a security issue.

Having said that, I just realized that I can use non-cryptographic hashes for this use case, like xxHash used by LZ4.

xkortex commented 3 years ago

Alright, I just can't help my self, I profiled different chunk sizes for grinnies. This is a 2-year old macbook pro with ssd and these cpu features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX SMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C

SSE SSE2 SSE3 SSSE3 SSE4.1 SSE4.2 AVX1.0

Function:

def blake3sum_file(filename, chunksize=2 ** 22):
    with lock: 
        start = time.time()
        hasher = blake3.blake3()
        with open(filename, 'rb') as fp:
            while True:
                data = fp.read(chunksize)
                if not data:
                    break
                hasher.update(data, multithreading=True)
        elapsed = time.time() - start
        return hasher.hexdigest(), elapsed

Data is urandom blobs on ssd. So in fact 2**20 seems to be the sweet spot here.

oconnor663 commented 3 years ago

@xkortex a megabyte buffer for multi-threaded hashing sounds reasonable to me, yeah. I'd be careful with fp.read though: that's allocating a new buffer every time, which could be some non-trivial allocation/caching overhead given your large buffer size. You might want to play around with the readinto method and reusing the same buffer repeatedly, to see if it affects your results.

But yeah, at these giant buffer sizes, we start to see why memory mapping is helpful here. Waiting for a single thread to fill a megabyte buffer (not to mention a gigabyte buffer) means spending a lot of time with the rest of your threads/cores doing nothing.

xkortex commented 3 years ago

I gave readinto a shot. Not only did I not see any major difference in performance, I also started seeing nondeterminism at larger sizes, even with the lock.

def blake3sum_file_into(filename, chunksize=2**22):
    with lock: 
        start = time.time()
        buf = bytearray(chunksize)
        hasher = blake3.blake3()
        with open(filename, 'rb') as fp:
            while True:
                sz = fp.readinto(buf)
                if not sz:
                    break
                hasher.update(buf, multithreading=True)
        end = time.time()
        elapsed = end - start
        return hasher.hexdigest(), expected_hash[filename],  elapsed

oconnor663 commented 3 years ago

Interesting. It's possible there's genuinely no effect here. (Shooting from the hip: Could be because your allocator is good at reusing memory when you have short-lived allocations in a loop like this?) I did notice that this line is probably a bug:

hasher.update(buf, multithreading=True)

Because it passes in the entire buf rather than just the first sz bytes of it. To get that right in Python without accidentally introducing another allocation by slicing, you have to start using memoryview. But I assume sz == len(buf) most of the time for us here, so it probably isn't affecting the results by much.

oconnor663 / blake3-py

Recommendations for performance #14