uktrade / stream-unzip

Python function to stream unzip all the files in a ZIP archive on the fly
https://stream-unzip.docs.trade.gov.uk/
MIT License

20gb zip of type Zip64 results in multiple errors #49

Closed kapenga closed 1 year ago

kapenga commented 1 year ago

The code:

from stream_unzip import stream_unzip

chunk_size = 1 << 16

with open('big20gb.zip', 'rb') as f:
    for file_name, file_size, unzipped_chunks in stream_unzip(iter(lambda: f.read(chunk_size), b''), chunk_size=chunk_size):
        # unzipped_chunks must be iterated to completion or UnfinishedIterationError will be raised
        for chunk in unzipped_chunks:
            print(chunk)

results in multiple errors. The 'best_matches' bool array has True, False, False, False, False as values. Resulting in a:

To be sure, I tested the zip's integrity with 7-Zip, and it found no errors. Python's standard zip library can read the file too, but the archive contains another zip, and the standard library has problems parsing that inner one in memory, so I was hoping this library would provide a streaming alternative.

Is there something wrong in my code? Am I using the library wrong? Sadly I really can't provide the zip file because it contains a lot of sensitive information. Thanks for your answer!

Edit: After some more debugging I found that it fails to detect the compressed and uncompressed file sizes, and therefore does not register the file as Zip64. It reads one 64 KB chunk and fails.
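For anyone who wants to check the same thing on their own file, here is a minimal sketch that reads only the first local file header and reports the fields a streaming reader depends on. It assumes the standard ZIP local file header layout (30 fixed bytes, then name and extra field), that bit 3 of the general purpose flags means the sizes and CRC are deferred to a trailing data descriptor, and that extra field ID 0x0001 is the Zip64 record. The function name `inspect_first_local_header` is hypothetical and is not part of stream-unzip.

```python
import struct

LOCAL_HEADER_SIGNATURE = b"PK\x03\x04"
LOCAL_HEADER_FORMAT = "<4sHHHHHIIIHH"  # 30 bytes, little-endian
ZIP64_EXTRA_ID = 0x0001                # Zip64 extended information extra field
DATA_DESCRIPTOR_FLAG = 0x0008          # bit 3: sizes/CRC follow the file data


def inspect_first_local_header(stream):
    """Return the first local file header's fields as a dict (hypothetical helper)."""
    fixed_size = struct.calcsize(LOCAL_HEADER_FORMAT)
    fixed = stream.read(fixed_size)
    if len(fixed) < fixed_size:
        raise ValueError("stream is too short to contain a local file header")

    (signature, _version, flags, method, _mod_time, _mod_date, crc32,
     compressed_size, uncompressed_size, name_len, extra_len) = struct.unpack(
        LOCAL_HEADER_FORMAT, fixed)
    if signature != LOCAL_HEADER_SIGNATURE:
        raise ValueError("stream does not start with a local file header")

    name = stream.read(name_len)
    extra = stream.read(extra_len)

    # The extra field is a sequence of (id, size, data) records.
    extra_ids = []
    offset = 0
    while offset + 4 <= len(extra):
        field_id, field_size = struct.unpack_from("<HH", extra, offset)
        extra_ids.append(field_id)
        offset += 4 + field_size

    return {
        "name": name,
        "compression_method": method,  # 0 = stored, 8 = deflate
        "crc32": crc32,
        "compressed_size": compressed_size,
        "uncompressed_size": uncompressed_size,
        "uses_data_descriptor": bool(flags & DATA_DESCRIPTOR_FLAG),
        "has_zip64_extra": ZIP64_EXTRA_ID in extra_ids,
    }
```

To use it, open the archive in binary mode and pass the file object in. Zero sizes together with `uses_data_descriptor` being true would mean the real sizes only appear after the member's data, which is one possible reason a forward-only reader cannot tell where a member ends. Whether that is what happens in this particular file is not confirmed here.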

michalc commented 1 year ago

Hi,

It'll be tricky to debug without the file itself, but: do you have any information on how it was generated?

Also, can you post the whole stack trace with the errors in? (I'm especially wondering how multiple errors seemed to be raised)

Thanks,

Michal

michalc commented 1 year ago

I realise I didn’t answer your questions!

> Is there something wrong in my code? Am I using the library wrong?

No - not as far as I can tell

michalc commented 1 year ago

> To be sure I tested the zip with 7zip on integrity and the application found no errors in the zip. Normal Python zip library works too

I wonder if there is something about the file that makes it not friendly to streaming - the format is a tricky one and not all ZIPs can be stream-read...

Is it parseable by other stream readers? For example the libarchive+Python one defined in https://stackoverflow.com/a/74986842/1319998, or using cpio via something like

cat big20gb.zip | cpio -i

or https://github.com/madler/sunzip?

kapenga commented 1 year ago

Thanks for the fast and very extensive answer!

Normally I do not invest a lot of time in problems like this and choose a more pragmatic approach, in this case using a temporary file on disk as an intermediate. But since you invested a lot of time helping me, I did not want to fall short.

First I tried to recreate a problematic Zip64 archive. This failed. Every archive I made worked just fine in your library. I tried to find out the source (archiver) of the file but that's a bit of a problem. The government organization I work for gets data from all sorts of organizations and it follows a long string of contacts before it ends up on my desk.

So I tried your suggestions. I am not a big C(++) hero and more of a Windows user (shame on me ;) )

So I used the next best thing I could think of to test the streaming abilities of another respected archive library: C#.

My first attempt was:

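using System;
using System.IO;
using System.IO.Compression;

// Open the outer archive from a seekable FileStream and list each nested .zip.
using (FileStream fis = File.OpenRead("big20gb.zip"))
{
    ZipArchive archive = new ZipArchive(fis);
    foreach (ZipArchiveEntry entry in archive.Entries)
    {
        if (entry.Name.EndsWith(".zip"))
        {
            // The nested archive is opened from the entry's (non-seekable) stream.
            ZipArchive subArchive = new ZipArchive(entry.Open());
            foreach (ZipArchiveEntry subEntry in subArchive.Entries)
            {
                Console.WriteLine(subEntry.Name);
            }
        }
    }
}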

However, the built-in .NET library takes advantage of the fact that a FileStream supports random access, so reading the outer archive works. Opening the archive inside the zip fails because it is larger than 2 GB and ZipArchive uses a hidden MemoryStream to open it.

My second attempt was SharpCompress:

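using System;
using System.IO;
using SharpCompress.Readers;

// Read the archive front to back with SharpCompress's forward-only reader.
using (FileStream fis = File.OpenRead("big20gb.zip"))
{
    var reader = ReaderFactory.Open(fis);
    while (reader.MoveToNextEntry())
    {
        if (!reader.Entry.IsDirectory)
        {
            Stream s = reader.OpenEntryStream();
            try
            {
                // Print the number of bytes returned by each read of the entry.
                int len;
                byte[] buffer = new byte[1024];
                while ((len = s.Read(buffer, 0, buffer.Length)) > 0)
                {
                    Console.WriteLine(len);
                }
            }
            catch (Exception e)
            {
                Console.WriteLine(e);
            }
        }
    }
}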

This one was more interesting. Opening the zip file and reading the entries works, but parsing the content goes wrong. According to SharpCompress the big inner zip file is 0 bytes, so it returns 0 bytes from the Stream because the entry is supposedly empty anyway. That's a bit of cheating, but it is very similar to what your library makes of the zip file. The Crc is also 0.

Anyway... It seems that this file is somehow not streamable. Other packages fail too.

I think I am going to implement the temp file solution and thanks for your time.
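A minimal sketch of what the temp file approach could look like in Python, assuming the outer archive can be opened by the standard `zipfile` module from a seekable file (as reported earlier in this thread) and that nested archives are recognisable by a `.zip` suffix. The function name `iter_nested_zip_names` is hypothetical. Each nested archive is copied to a temporary file on disk in 64 KB chunks, so it never has to fit in memory, and is then opened from that file.

```python
import shutil
import tempfile
import zipfile


def iter_nested_zip_names(outer_zip):
    """Yield (nested_zip_name, member_name) pairs for every nested .zip entry.

    outer_zip is a path or a seekable binary file object.
    """
    with zipfile.ZipFile(outer_zip) as outer:
        for entry in outer.infolist():
            if entry.is_dir() or not entry.filename.lower().endswith(".zip"):
                continue
            # Spool the nested archive to disk instead of holding it in memory.
            with tempfile.TemporaryFile() as tmp:
                with outer.open(entry) as src:
                    shutil.copyfileobj(src, tmp, length=1 << 16)
                tmp.seek(0)
                # The temporary file is seekable, so zipfile can read it normally.
                with zipfile.ZipFile(tmp) as inner:
                    for name in inner.namelist():
                        yield entry.filename, name
```

Calling `list(iter_nested_zip_names('big20gb.zip'))` would list the members of each nested archive. The temporary file is deleted automatically when its `with` block exits, at the cost of needing enough free disk space for the largest nested archive.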

michalc commented 1 year ago

> It seems that this file is somehow not streamable. Other packages fail too.

Understood. In that case, I'll close this.