oconnor663 / blake3-py

Python bindings for the BLAKE3 cryptographic hash function

Recommendations for performance #14

Closed jpmckinney closed 3 years ago

jpmckinney commented 3 years ago

I'm making a hash of all files in a directory as follows. The total size might be GBs. Individual files are mostly hundreds of KBs, but can be MBs or GBs in some cases (in which case I don't want to read them entirely into memory).

import os
from functools import partial

from blake3 import blake3

def hash_directory(directory):
    hasher = blake3()
    for root, dirs, files in os.walk(directory):
        dirs.sort()  # visit subdirectories in a stable order
        for file in sorted(files):
            with open(os.path.join(root, file), 'rb') as f:
                for chunk in iter(partial(f.read, 8192), b''):
                    hasher.update(chunk)
    return hasher.hexdigest()

I'm wondering what the recommended way to do this is, performance-wise.

The readme has "As a rule of thumb, don't use multithreading for inputs shorter than 1 MB", so I'm also wondering if I should read in a large-ish chunk size and then use hasher.update(chunk, multithreading=True) – or if it's faster to read small chunks without multithreading. There might be an obvious answer...

oconnor663 commented 3 years ago

Important side questions first: If you're effectively concatenating all the files together, do you need to worry about ambiguous cases where the end of one file is moved to the beginning of another? Similarly, do you care if filenames change? Do you care about iterating over them in a stable order? Etc.
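(For anyone reading along, a minimal sketch of one way to sidestep those ambiguities, if they matter to you: hash a length-prefixed record of each file's relative path and contents, so that moving bytes across a file boundary or renaming a file changes the digest. hash_tree is a hypothetical name, not part of this library.)

import os
from blake3 import blake3

def hash_tree(directory):
    hasher = blake3()
    for root, dirs, files in os.walk(directory):
        dirs.sort()
        for name in sorted(files):
            path = os.path.join(root, name)
            rel = os.path.relpath(path, directory).encode()
            # Length-prefix the path and the contents so file
            # boundaries and names are unambiguous in the hashed stream.
            hasher.update(len(rel).to_bytes(8, 'little'))
            hasher.update(rel)
            hasher.update(os.path.getsize(path).to_bytes(8, 'little'))
            with open(path, 'rb') as f:
                for chunk in iter(lambda: f.read(16384), b''):
                    hasher.update(chunk)
    return hasher.hexdigest()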

Back to your question: I might recommend a buffer size of 16384 bytes (2^14). This is the smallest buffer size where a machine that supports AVX-512 is able to make full use of 512-bit vectors to hash 16 chunks in parallel on a single thread. That's the largest vector size currently available on Intel hardware, though it's pretty new, and most consumer hardware doesn't support it yet. (AWS servers do, though, so you'll see substantial differences in your benchmarks in the cloud vs on your laptop for that reason.)

> I'm also wondering if I should read in a large-ish chunk size and then use hasher.update(chunk, multithreading=True) – or if it's faster to read small chunks without multithreading. There might be an obvious answer...

You really have to benchmark it for your use case on your hardware. Especially once multithreading gets involved, there are just too many different factors to give a good general answer. BLAKE3 is fast enough that limits on IO bandwidth often become an issue, and there are many, many factors that affect IO. Here are some considerations:

For most applications, a small-ish fixed size like 16384 is probably fine. Or if you want, you could copy b3sum's simple heuristic of "try to memory-map anything >= 16 KiB, and use multithreading on anything that successfully memory-maps." (The performance downsides of using multithreading on e.g. a 32 KiB file are masked by the startup time of b3sum, so we don't bother to optimize that mid-sized case. But your application might be different!)
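For example, a rough sketch of that heuristic in Python (hash_file and MMAP_THRESHOLD are hypothetical names, and this assumes the multithreading=True API discussed in this thread):

import mmap
import os

from blake3 import blake3

MMAP_THRESHOLD = 16384  # 16 KiB, mirroring b3sum's cutoff

def hash_file(path):
    hasher = blake3()
    size = os.path.getsize(path)
    with open(path, 'rb') as f:
        if size >= MMAP_THRESHOLD:
            # Map the whole file and hand it to BLAKE3 in one update,
            # letting the library split the work across threads.
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
                hasher.update(mm, multithreading=True)
        else:
            # Small files: a plain single-threaded read is cheaper.
            hasher.update(f.read())
    return hasher.hexdigest()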

jpmckinney commented 3 years ago

> Important side questions first: If you're effectively concatenating all the files together, do you need to worry about ambiguous cases where the end of one file is moved to the beginning of another? Similarly, do you care if filenames change? Do you care about iterating over them in a stable order? Etc.

Thanks for all your pointers! In my case:

For my use case (valid JSON objects/arrays), I don't need to disambiguate the case where data is moved from the end of a file to the start of another file, since that would mean each file has invalid JSON content.

I'll start with a buffer of 16384 bytes in a single thread. As you say, the bottleneck will likely be I/O.

Thanks again for explaining the different considerations!

xkortex commented 3 years ago

Just wanted to add, I cycled through various power-of-2 chunk sizes, and this is what I settled on for a file hashing function. 2^22 bytes is 4 MiB. That size seemed to work best for me on my i7 MacBook.

from blake3 import blake3

def blake3sum_file(filename, chunksize=2 ** 22):
    hasher = blake3()
    with open(filename, 'rb') as fp:
        # Read in chunksize pieces so huge files stay out of memory.
        while True:
            data = fp.read(chunksize)
            if not data:
                break
            hasher.update(data, multithreading=True)
    return hasher.hexdigest(), filename

If you're really concerned about processing thousands of files, you may be interested in instead using the b3sum binary from the Rust BLAKE3 repo and calling it via subprocess – just speculation.
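For instance, a hypothetical sketch of that approach (assumes b3sum is installed and on PATH; it prints one "<hex digest>  <path>" line per file and handles multithreading and memory mapping internally):

import subprocess

def b3sum_files(paths):
    # Hand the whole batch to the b3sum CLI in one invocation.
    result = subprocess.run(
        ['b3sum', *paths], capture_output=True, text=True, check=True
    )
    # Split each "<hex digest>  <path>" output line into a pair.
    return [line.split(maxsplit=1) for line in result.stdout.splitlines()]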

oconnor663 commented 3 years ago

@xkortex for reference: with 1024-byte chunks, making full use of 128-bit vectors (SSE4.1) on a single thread takes a 4096-byte buffer, 256-bit vectors (AVX2) take 8192 bytes, and 512-bit vectors (AVX-512) take 16384 bytes.

Unless your macbook is quite old, I'd be surprised if it didn't support AVX2, and so I'm also surprised that you didn't see a noticeable performance difference going from 4096 to 8192. (EDIT: I now realize you were talking about _mega_bytes above, so this sentence doesn't make sense.) But as we've been discussing, various measurement issues could be masking these effects.

oconnor663 commented 3 years ago

> For my use case (valid JSON objects/arrays), I don't need to disambiguate the case where data is moved from the end of a file to the start of another file, since that would mean each file has invalid JSON content.

@jpmckinney interesting use case! Yes, I think JSON objects and arrays being self-delimiting helps you here. But notably (sounds like you might've already thought about this) JSON integers are not self-delimiting. Presumably your JSON parsing code is happy to read and write bare integers, and it's up to application code to avoid emitting them or to reject them after parsing? In that case, what looks like run-of-the-mill application code might actually be enforcing important security requirements, and this is the sort of thing that can break over time under maintenance. Might at least warrant some XXX: comments :)
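A tiny illustration of the bare-integer ambiguity (my example, not from the thread): two different file pairs can concatenate to the same byte stream, with every piece valid JSON on its own.

import json

a1, b1 = '12', '34'
a2, b2 = '1', '234'
# Same concatenated bytes either way...
assert a1 + b1 == a2 + b2 == '1234'
# ...and all four pieces parse as valid JSON integers.
assert [json.loads(s) for s in (a1, b1, a2, b2)] == [12, 34, 1, 234]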

jpmckinney commented 3 years ago

I'll add some comments :) The hashing is part of an algorithm to decide whether to archive a dataset or not (if it's identical to a dataset already archived, we skip it), so it doesn't end up being a security issue.

Having said that, I just realized that I can use a non-cryptographic hash for this use case, like xxHash (which LZ4 uses).

xkortex commented 3 years ago

Alright, I just can't help myself, I profiled different chunk sizes for grinnies. This is a 2-year-old MacBook Pro with an SSD and these cpu features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX SMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C

The SIMD-relevant ones: SSE SSE2 SSE3 SSSE3 SSE4.1 SSE4.2 AVX1.0 (no AVX2).

Function:

import time
import threading

import blake3

lock = threading.Lock()  # serializes the timed runs

def blake3sum_file(filename, chunksize=2 ** 22):
    with lock:
        start = time.time()
        hasher = blake3.blake3()
        with open(filename, 'rb') as fp:
            while True:
                data = fp.read(chunksize)
                if not data:
                    break
                hasher.update(data, multithreading=True)
        elapsed = time.time() - start
        return hasher.hexdigest(), elapsed

The data is urandom blobs on an SSD. So in fact 2**20 (1 MiB) seems to be the sweet spot here.

[plot: hashing time vs. chunk size]

oconnor663 commented 3 years ago

@xkortex a megabyte buffer for multi-threaded hashing sounds reasonable to me, yeah. I'd be careful with fp.read though: that's allocating a new buffer every time, which could be some non-trivial allocation/caching overhead given your large buffer size. You might want to play around with the readinto method and reusing the same buffer repeatedly, to see if it affects your results.

But yeah, at these giant buffer sizes, we start to see why memory mapping is helpful here. Waiting for a single thread to fill a megabyte buffer (not to mention a gigabyte buffer) means spending a lot of time with the rest of your threads/cores doing nothing.

xkortex commented 3 years ago

I gave readinto a shot. Not only did I not see any major difference in performance, but I also started seeing nondeterministic results at larger sizes, even with the lock.

def blake3sum_file_into(filename, chunksize=2 ** 22):
    with lock:
        start = time.time()
        buf = bytearray(chunksize)
        hasher = blake3.blake3()
        with open(filename, 'rb') as fp:
            while True:
                sz = fp.readinto(buf)
                if not sz:
                    break
                hasher.update(buf, multithreading=True)
        elapsed = time.time() - start
        # expected_hash is a precomputed dict of known-good digests,
        # used to spot mismatches.
        return hasher.hexdigest(), expected_hash[filename], elapsed

[plots: readinto timings across chunk sizes, including the nondeterministic results]

oconnor663 commented 3 years ago

Interesting. It's possible there's genuinely no effect here. (Shooting from the hip: Could be because your allocator is good at reusing memory when you have short-lived allocations in a loop like this?) I did notice that this line is probably a bug:

hasher.update(buf, multithreading=True)

Because it passes in the entire buf rather than just the first sz bytes of it. To get that right in Python without accidentally introducing another allocation by slicing, you have to start using memoryview. But I assume sz == len(buf) most of the time for us here, so it probably isn't affecting the timing results by much. (It would, however, hash stale bytes from the previous iteration whenever a read comes up short, which could explain the nondeterministic results you saw.)
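For reference, a sketch of the fixed loop using memoryview (same hypothetical lock/timing scaffolding as the earlier benchmarks):

import time
import threading

import blake3

lock = threading.Lock()

def blake3sum_file_into(filename, chunksize=2 ** 22):
    with lock:
        start = time.time()
        buf = bytearray(chunksize)
        view = memoryview(buf)  # slicing a memoryview doesn't copy
        hasher = blake3.blake3()
        with open(filename, 'rb') as fp:
            while True:
                sz = fp.readinto(buf)
                if not sz:
                    break
                # Hash only the bytes actually read this iteration.
                hasher.update(view[:sz], multithreading=True)
        elapsed = time.time() - start
        return hasher.hexdigest(), elapsed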