uktrade / stream-unzip

Python function to stream unzip all the files in a ZIP archive on the fly
https://stream-unzip.docs.trade.gov.uk/
MIT License

Archives encrypted with the Zip Crypto algorithm (weak encryption) are extremely slow to unzip with stream-unzip #91

Open auto-Dog opened 1 month ago

auto-Dog commented 1 month ago

I use this tool to stream-unzip ZIP archives. Some of them are encrypted with the Zip Crypto algorithm. I see this might trigger the weak decrypter in stream_unzip.py: https://github.com/uktrade/stream-unzip/blob/4e19403af143cf59adbcba1f96857a5a0d8d2838/stream_unzip.py#L211-L221

However, throughput is extremely low with this pure-Python loop (approximately 5MB/minute). Is there any way to speed it up?
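
For readers unfamiliar with the algorithm, here is a minimal sketch of traditional Zip Crypto as generally described in the ZIP format documentation. It is not the code from stream_unzip.py; the names `ZipCryptoKeys`, `zipcrypto_encrypt` and `zipcrypto_decrypt` are made up for illustration, and the 12-byte encryption header that real archives carry is left out. It shows why a pure-Python version is slow: every single byte goes through several interpreter-level operations.

```python
def _make_crc_table():
    """Build the standard CRC-32 lookup table (polynomial 0xEDB88320)."""
    table = []
    for i in range(256):
        c = i
        for _ in range(8):
            c = (c >> 1) ^ 0xEDB88320 if c & 1 else c >> 1
        table.append(c)
    return table


CRC_TABLE = _make_crc_table()


class ZipCryptoKeys:
    """Hypothetical helper holding the three 32-bit keys of Zip Crypto."""

    def __init__(self, password):
        self.k0, self.k1, self.k2 = 0x12345678, 0x23456789, 0x34567890
        # The keys are initialised by feeding in the password bytes
        for byte in password:
            self.update(byte)

    def update(self, byte):
        # Mix one plaintext byte into the key state
        self.k0 = (self.k0 >> 8) ^ CRC_TABLE[(self.k0 ^ byte) & 0xFF]
        self.k1 = (self.k1 + (self.k0 & 0xFF)) & 0xFFFFFFFF
        self.k1 = (self.k1 * 134775813 + 1) & 0xFFFFFFFF
        self.k2 = (self.k2 >> 8) ^ CRC_TABLE[(self.k2 ^ (self.k1 >> 24)) & 0xFF]

    def stream_byte(self):
        # Next keystream byte, derived from the low 16 bits of k2
        temp = (self.k2 | 2) & 0xFFFF
        return ((temp * (temp ^ 1)) >> 8) & 0xFF


def zipcrypto_decrypt(password, data):
    keys = ZipCryptoKeys(password)
    out = bytearray()
    for cipher_byte in data:  # one Python-level iteration per byte
        plain_byte = cipher_byte ^ keys.stream_byte()
        keys.update(plain_byte)
        out.append(plain_byte)
    return bytes(out)


def zipcrypto_encrypt(password, data):
    keys = ZipCryptoKeys(password)
    out = bytearray()
    for plain_byte in data:
        out.append(plain_byte ^ keys.stream_byte())
        keys.update(plain_byte)
    return bytes(out)
```

Because the key state depends on every previous plaintext byte, the loop cannot simply be split into independent chunks; any speed-up has to come from reducing the per-byte overhead.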

michalc commented 3 weeks ago

Yes, there is probably a way to speed it up.

But 5MB/minute is slower than I would expect even for the code as it is right now. Do you have a short snippet of code that I could run to show it is that slow?

michalc commented 3 weeks ago

Ah, here's an example zipping a 100MB file of pseudo-random data, so pretty much the worst case in terms of compression:

import datetime
import subprocess
import random

from stream_unzip import stream_unzip

# Always deal with chunks of at most 64 KiB (65536 bytes)
max_chunk = 65536

# Create 100MB file of pseudo-random data
print('Creating uncompressed file...')
total = 100_000_000
remaining = total
random.seed(0)
with open('random.txt', 'wb') as f:
    while remaining:
        chunk_size = min(max_chunk, remaining)
        f.write(random.randbytes(chunk_size))
        remaining -= chunk_size
print('Done')

# ZIP the file
print('Creating password-protected ZIP...')
subprocess.check_output(['zip', '-P', 'mypassword', 'random.zip', 'random.txt'])
print('Done')

# UnZIP
print('Unzipping with stream_unzip')
start = datetime.datetime.now()
with open('random.zip', 'rb') as f:
    zipped_chunks = iter(lambda: f.read(max_chunk), b'')
    for file_name, size, chunks in stream_unzip(zipped_chunks, password=b'mypassword'):
        for _ in chunks:
            pass
end = datetime.datetime.now()
taken = end - start
print('Done:', taken)

For me, the unzipping takes just under a minute, so it's more like 100MB/min. Not the speediest thing in the world, but more than an order of magnitude faster than 5MB/min. (And I'm just on a fairly regular laptop I think?)

So it would be good to see an example where it's 5MB/min.
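
To make numbers from different machines easier to compare, a small helper along these lines could report throughput directly in MB/min. This is only a sketch: `measure_throughput` is a hypothetical name, not part of stream-unzip, and it works on any iterable of bytes chunks.

```python
import time


def measure_throughput(chunks):
    """Consume an iterable of bytes chunks.

    Returns (total_bytes, seconds, megabytes_per_minute).
    """
    start = time.perf_counter()
    total_bytes = 0
    for chunk in chunks:
        total_bytes += len(chunk)
    seconds = time.perf_counter() - start
    # Guard against a zero elapsed time on very small inputs
    if seconds > 0:
        mb_per_minute = (total_bytes / 1_000_000) / (seconds / 60)
    else:
        mb_per_minute = float('inf')
    return total_bytes, seconds, mb_per_minute
```

In the script above, the inner `for _ in chunks: pass` loop could be replaced with a call such as `print(file_name, measure_throughput(chunks))`, so each member file reports its own rate of uncompressed output.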

michalc commented 3 weeks ago

Comparing with Python's zipfile: zipfile is about 10% faster than stream_unzip for me.

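# Continues the script above (reuses datetime and max_chunk)
import zipfile

print('Unzipping with zipfile')
start = datetime.datetime.now()
with zipfile.ZipFile('random.zip') as myzip:
    myzip.setpassword(b'mypassword')
    with myzip.open('random.txt') as f:
        # Read max_chunk at a time; chunk_size only holds the size of the last chunk written earlier
        unzipped_chunks = iter(lambda: f.read(max_chunk), b'')
        for _ in unzipped_chunks:
            pass
end = datetime.datetime.now()
taken = end - start
print('Done:', taken)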

So while stream_unzip could probably be made faster (if zipfile can do it, why not stream_unzip?), I suspect the 5MB/min pain comes from something else.

michalc commented 3 weeks ago

Found a few ways to improve stream_unzip's ZipCrypto decrypting: https://github.com/uktrade/stream-unzip/pull/92, changing it from ~10% slower than Python's zipfile, to ~10% faster, at least for my tests

michalc commented 3 weeks ago

https://github.com/uktrade/stream-unzip/pull/92 is now released in v0.0.92

michalc commented 3 weeks ago

One thing crosses my mind... could the Zip Crypto thing be a red herring? Could the 5MB/min in fact be due to the file using Deflate64, which is known to be incredibly slow in stream-unzip: https://github.com/uktrade/stream-unzip/issues/82
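
One way to check this, sketched here with Python's standard zipfile module rather than stream-unzip, is to list each member's compression method and encryption flag. In the ZIP format, method 8 is Deflate and method 9 is Deflate64, and bit 0 of the general purpose flags marks an encrypted member. `describe_members` is a hypothetical helper name, and since it reads the central directory it needs a seekable file, so it is a diagnostic only and not a streaming approach.

```python
import zipfile

DEFLATE = 8     # compression method ID for Deflate
DEFLATE64 = 9   # compression method ID for Deflate64


def describe_members(zip_source):
    """Return (filename, compression_method, is_encrypted) for each member.

    zip_source is a path or a seekable binary file object.
    """
    with zipfile.ZipFile(zip_source) as archive:
        return [
            (info.filename, info.compress_type, bool(info.flag_bits & 0x1))
            for info in archive.infolist()
        ]
```

If any member of the slow archive reports method 9, the Deflate64 path would be the first thing to look at rather than the Zip Crypto loop.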