uktrade / stream-unzip

Python function to stream unzip all the files in a ZIP archive on the fly
https://stream-unzip.docs.trade.gov.uk/
MIT License

Infinite loop using stream_unzip() #82

Open DwayneDriskill opened 4 months ago

DwayneDriskill commented 4 months ago

Hi,

I am using stream-unzip 0.0.91 to unzip a file that was zipped using the deflate64 algorithm. The file was zipped using Windows 11's built-in zipper/unzipper, and I am able to unzip it with that same tool. The files look fine after being unzipped, so I don't think they are corrupt. Unfortunately, I cannot share the file because it contains data that I am not allowed to share. I can tell you that the zipped file contains 7 .json files. 6 of the files unzip without any issues. stream_unzip() gets stuck in an infinite loop when trying to unzip the 7th file, which is over 2GB. I confirmed this with logging and by letting the program run for several hours: the size of the unzipped file was not increasing, and stream_unzip() was iterating over the same code. I stepped into the code for stream_unzip(), but I don't have time to understand it well enough to pinpoint the issue; it uses yield in several places, and I have never used yield myself.

My code is below. Perhaps I'm making a mistake in how I'm using stream_unzip().

    def _unzip_file_using_stream_unzip(self, file_path, temp_dir):
        with open(file_path, 'rb') as f:
            # To search for duplicate file names. This is specific to my use case.
            # Theoretically, a zip file can have 2 files with the same name in
            # different folders, and my code doesn't handle that scenario.
            seen_files = set()
            unzipper = stream_unzip(f)
            for file_info in unzipper:
                file_name, file_size, file_content_gen = file_info
                file_name = file_name.decode('utf-8')
                # I borrowed this check from a function I wrote to unzip files
                # using the Python standard library (I'm using Python 3.11): a
                # trailing '/' identifies a folder. I'm not sure if stream_unzip()
                # works the same way, but the code is not getting stuck here.
                if not file_name.endswith('/'):
                    if os.path.basename(file_name.lower()) in seen_files:
                        raise RuntimeError("Error: There are multiple files named '" +
                                           f"{os.path.basename(file_name)}' in {file_path}.")
                    seen_files.add(os.path.basename(file_name.lower()))
                    extracted_file_path = os.path.join(temp_dir, os.path.basename(file_name))
                    with open(extracted_file_path, 'wb') as out_file:
                        buffered_writer = io.BufferedWriter(out_file, buffer_size=65536)

I proved with logging that when PROD_04182024_EXPORTS.zip is passed in:

                        # This code gets stuck in an infinite loop (in the for chunk
                        # loop). 6 of the 7 files are successfully unzipped; the 7th
                        # file is the one that causes the infinite loop. The following
                        # 3 lines execute in an infinite loop. The value of chunk
                        # changes during each iteration of the loop. I did not confirm
                        # this, but I assume that at some point the value of chunk gets
                        # set back to the value it had when it first entered the loop.
                        for chunk in file_content_gen:
                            buffered_writer.write(chunk)
                        # Ensure all data is written to disk. This might not be
                        # necessary; I added it after I identified the infinite loop,
                        # hoping it would fix it.
                        buffered_writer.flush()
                # This executes for the first 6 files, but not the 7th.
                self._logger.info("Finished unzipping %s", file_name)

The size of the file that contains the unzipped data got stuck at 57,222 KB, so I'm assuming the loop is infinite. The file unzips using Windows 11 in less than 5 minutes; I let the program run for at least 2 hours before I stopped it. I also tried running the code in a Docker container running Linux, and in an AWS Batch process (also running Linux), with the same results. I paused the program in PyCharm and copied the call stack:

    _paginate, stream_inflate.py:315
    _run, stream_inflate.py:322
    _decompress, stream_unzip.py:146
    decrypt_none_decompress, stream_unzip.py:281 (279)
    _iter, stream_unzip.py:295
    checked_from_local_header, stream_unzip.py:304
    _unzip_file_using_stream_unzip, aws_utils.py:474
    unzip_file, aws_utils.py:432
    load_new_files, aws_utils.py:279

I noticed that _run() was also calling other methods at different points, such as _get_bit(), inflate(get_bits, get_bytes, yield_bytes), and get_bits(num_bits). I noticed that some exceptions were being thrown and caught:

    def up_to_page_size(num):
        nonlocal chunk, offset

        while num:
            if offset == len(chunk):
                try:
                    chunk = next(it)
                except StopIteration:
                    break

Another exception:

    def get_byte_readers(iterable):
        '''Return functions to return/"replace" bytes from/to the iterable

        - _yield_all: yields chunks as they come up (often for a "body")
        - _get_num: returns a single bytes of a given length
        - _return_num_unused: puts a number of "unused" bytes "back", to be retrieved by a yield/get call
        - _return_bytes_unused: return a bytes instance "back" into the stream, to be retrieved later
        - _get_offset_from_start: get the zero-indexed offset from the start of the stream
        '''
        chunk = b''
        offset = 0
        offset_from_start = 0
        queue = list()  # Will typically have at most 1 element, so a list is fine
        it = iter(iterable)

        def _next():
            try:
                return queue.pop(0)
            except IndexError:
                return (next_or_truncated_error(it), 0)
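If I understand the pattern, it's a byte reader that can "push back" bytes it has over-read, so they get re-read later. A minimal self-contained sketch of the same idea (the names here are illustrative, not the library's actual API):

    def make_pushback_reader(iterable):
        # Sketch only: a reader over an iterable of bytes objects that can
        # "return" bytes to the stream, to be handed out again on the next read
        it = iter(iterable)
        pushed_back = []  # stack of bytes objects returned to the stream

        def get_num(num):
            # Return exactly `num` bytes, preferring pushed-back bytes
            parts = []
            while num:
                chunk = pushed_back.pop() if pushed_back else next(it)
                parts.append(chunk[:num])
                if len(chunk) > num:
                    pushed_back.append(chunk[num:])  # keep the surplus for later
                num -= len(parts[-1])
            return b''.join(parts)

        def return_unused(unused):
            # Push bytes back so the next get_num call sees them first
            pushed_back.append(unused)

        return get_num, return_unused

    get_num, return_unused = make_pushback_reader([b'PK\x03\x04', b'rest'])
    signature = get_num(2)    # b'PK'
    return_unused(signature)  # put it back
    assert get_num(4) == b'PK\x03\x04'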

This is not an urgent issue for me, but I thought you would want to know. I'm happy to help debug it if it's important enough for you to spend time on. Thanks,

Dwayne

michalc commented 4 months ago

Hi @DwayneDriskill,

Thanks for the report... I think we would like to sort it, but suspect it'll be tricky without a reproducible case. Is there any chance you can create a file with data that can be shared that shows the issue?

(Also, suspect the issue is most likely to be in https://github.com/michalc/stream-inflate, since it handles the Deflate64 aspect. Maybe it could be changed to detect if it gets stuck in an infinite loop? Not sure...)
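As a sketch of that kind of guard (nothing like this is implemented; it's just an illustration), a caller could put a time budget on each file's chunk iterator, so that an apparent hang becomes a hard error:

    import time

    def with_time_budget(chunks, max_seconds=600):
        # Illustrative only: yield chunks unchanged, but raise if one file's
        # decompression runs longer than max_seconds
        start = time.monotonic()
        for chunk in chunks:
            if time.monotonic() - start > max_seconds:
                raise TimeoutError(f'Decompression exceeded {max_seconds}s')
            yield chunk

    # for file_name, file_size, chunks in stream_unzip(zipped_chunks):
    #     for chunk in with_time_budget(chunks):
    #         ...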

Thanks,

Michal

michalc commented 4 months ago

Two other things...

> The size of the file that contains the unzipped data got stuck at 57,222 KB, so I'm assuming the loop is infinite.

Does the input iterable definitely not progress during this? (Wondering if it's extremely poor performance, rather than an infinite loop)
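One way to check, as a debugging sketch rather than anything in stream-unzip's API, is to wrap the input iterable in a counter that reports how many bytes have been consumed:

    def counted(chunks, report_every=100 * 2**20):
        # Debugging sketch: report cumulative input bytes consumed, to tell
        # a genuinely stalled source apart from one that's just very slow
        total = 0
        next_report = report_every
        for chunk in chunks:
            total += len(chunk)
            if total >= next_report:
                print(f'{total:,} input bytes consumed')
                next_report += report_every
            yield chunk

    # unzipper = stream_unzip(counted(f))

If the reported total keeps increasing, the input is progressing, and it's a (severe) performance problem rather than a true infinite loop.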

And this code I realise is unexpected:

    with open(file_path, 'rb') as f:
        unzipper = stream_unzip(f)

This works (er, other than the issue you're encountering), because f here is an iterable of bytes. But each value of the iterable is a "line", and so ends with the newline character \n, which occurs at very arbitrary places in a binary file like a zip file. Testing a deflate64 file I have, this makes each chunk quite short, so it could have a negative performance impact. Not sure if that's enough to make it seem like it's gotten stuck, but it's something to rule out, maybe.
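For example, iterating over any binary stream splits on \n bytes wherever they happen to fall:

    import io

    f = io.BytesIO(b'PK\x03\x04\n\x00\x01data\nmore')
    print(list(f))
    # [b'PK\x03\x04\n', b'\x00\x01data\n', b'more']
    # Each "chunk" ends at a \n byte, however short that makes it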

Instead, can you try this and see if you get the same behaviour?

    with open(file_path, 'rb') as f:
        unzipper = stream_unzip(iter(lambda: f.read(65536), b''))

It'll make each input chunk 64k, ignoring where \n characters happen to be in the input data.
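(The two-argument form of iter() calls f.read(65536) repeatedly until it returns the sentinel b''. If that idiom is unfamiliar, an explicit generator does the same thing:)

    def read_in_chunks(f, chunk_size=65536):
        # Yield fixed-size chunks until EOF, regardless of newline positions
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk

    # with open(file_path, 'rb') as f:
    #     unzipper = stream_unzip(read_in_chunks(f))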

michalc commented 4 months ago

> Does the input iterable definitely not progress during this? (Wondering if it's extremely poor performance, rather than an infinite loop)

I am starting to suspect extremely poor performance when unzipping deflate64 - taking approximately 1000 times the amount of time in some cases...

This is from https://github.com/uktrade/stream-unzip/pull/83, which (just for investigative purposes) uses https://github.com/michalc/stream-inflate for everything, whereas in the published stream-unzip it is only used for deflate64 files. The test_infozip_zip64_with_descriptors test specifically takes much, much longer.
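For anyone wanting to measure this locally, a rough harness along these lines would do (the file name is just a placeholder for a local deflate64-compressed test file):

    import time
    from stream_unzip import stream_unzip

    start = time.perf_counter()
    with open('deflate64-test.zip', 'rb') as f:  # placeholder test file
        for file_name, file_size, chunks in stream_unzip(iter(lambda: f.read(65536), b'')):
            for chunk in chunks:
                pass  # discard the output; only elapsed time matters here
    print(f'{time.perf_counter() - start:.1f}s')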

ddriskillcssi commented 4 months ago

Hi Michal,

You were right that it seems to be very poor performance. I made the change you suggested and let it run overnight; it finished in about 8.5 hours. When I unzip it with the Windows 11 built-in tool, it just takes about a minute. Thanks,

Dwayne

michalc commented 4 months ago

Ah good to know... in a way at least!

Now to solve the performance problem...

michalc commented 1 week ago

Hi @DwayneDriskill / @ddriskillcssi,

To give an update on this: I've made some performance improvements to https://github.com/michalc/stream-inflate. From some rough tests, v0.0.18 increases the speed of unzipping deflate64 files by about 50% compared to v0.0.15. I know this is still pretty slow, especially for larger files. It might need some sort of more drastic re-architecture... or maybe throwing Cython at it...

Michal

michalc commented 1 week ago

And some more speed improvements in v0.0.19. For "easy" deflate64 files it actually increases speed by a factor of 50 or so compared to v0.0.15. (Easy files would be roughly those with a lot of repetition close together in the file.)

michalc commented 1 week ago

A few more speed improvements done in stream-inflate (which is now up to 0.0.23).