mxmlnkn / ratarmount

Access large archives as a filesystem efficiently, e.g., TAR, RAR, ZIP, GZ, BZ2, XZ, ZSTD archives
MIT License

Progress bar for zip file indexing #121

Closed: mxmlnkn closed this issue 7 months ago

mxmlnkn commented 1 year ago

The newly added conversion of the ZIP file table to the SQLite index does not have a progress bar, and it can still take quite a long time, e.g., for this real dataset (a sketch for such a progress bar follows the listing below). This dataset is also a nice benchmark for the rapidgzip integration for ZIP because all of the files are deflate-compressed:

> zipinfo rsna-intracranial-hemorrhage-detection.zip | grep -v defN
Archive:  rsna-intracranial-hemorrhage-detection.zip
Zip file size: 194671162157 bytes, number of entries: 874037
874037 files, 458972270355 bytes uncompressed, 194442560761 bytes compressed:  57.6%

195 GB compressed -> 459 GB uncompressed
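
The progress bar itself would be straightforward to add because `zipfile.infolist()` returns all entries upfront, so the total is known. A minimal sketch, assuming a tqdm-style wrapper; the row conversion here is only a stand-in for the real `_convertToRow` logic:

```python
import zipfile

from tqdm import tqdm

def createIndexWithProgress(archivePath):
    """Sketch: show progress while converting ZIP entries to index rows."""
    with zipfile.ZipFile(archivePath) as archive:
        infos = archive.infolist()  # Fast: only parses the central directory.
        rows = []
        # tqdm knows the total upfront and therefore can show an ETA.
        for info in tqdm(infos, desc="Creating offset dictionary", unit="files"):
            rows.append((info.filename, info.header_offset, info.file_size))
        return rows
```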

Also, there is basically no CPU usage to speak of, but with Ctrl+C I do see it breaking out of:

  File "ratarmountcore/ZipMountSource.py", line 323, in _createIndex
    fileInfos.append(self._convertToRow(info))
  File "ratarmountcore/ZipMountSource.py", line 292, in _convertToRow
    dataOffset = self._findDataOffset(info.header_offset)
  File "ratarmountcore/ZipMountSource.py", line 264, in _findDataOffset
    LocalFileHeader(self.rawFileObject)
  File "ratarmountcore/ZipMountSource.py", line 161, in __init__
    result = LocalFileHeader.FixedLocalFileHeader.unpack(fileObject.read(LocalFileHeader.FixedLocalFileHeader.size))
KeyboardInterrupt

There probably is still a lot of room for optimization here, e.g., batching the SQLite insertions (see the sketch below) or moving the ZIP metadata reading to native code, e.g., integrating it into rapidgzip.
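
For the batching idea, batched insertion could look roughly like this. This is only a sketch with a made-up table schema, not the real index layout:

```python
import sqlite3

def insertBatched(indexPath, rows, batchSize=10000):
    """Sketch: insert metadata rows in batches inside a single transaction."""
    connection = sqlite3.connect(indexPath)
    # Hypothetical schema for illustration only.
    connection.execute(
        "CREATE TABLE IF NOT EXISTS files (path TEXT, headeroffset INTEGER, size INTEGER)"
    )
    with connection:  # A single transaction avoids per-row commit overhead.
        for i in range(0, len(rows), batchSize):
            connection.executemany("INSERT INTO files VALUES (?, ?, ?)", rows[i : i + batchSize])
    connection.close()
```

(As the timings further below show, the SQLite insertion turned out not to be the bottleneck, though.)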

Finally done:

> time ratarmount rsna-intracranial-hemorrhage-detection.zip
Creating new SQLite index database at rsna-intracranial-hemorrhage-detection.zip.index.sqlite
Creating offset dictionary for rsna-intracranial-hemorrhage-detection.zip ...
Creating offset dictionary for rsna-intracranial-hemorrhage-detection.zip took 993.41s

real    16m39.042s
user    0m24.294s
sys 0m9.148s

> time find rsna-intracranial-hemorrhage-detection/rsna-intracranial-hemorrhage-detection//stage_2_test/ | wc -l
121233

real    0m1.540s
user    0m0.041s
sys 0m0.040s

More granular timings show that SQLite index insertion is definitely not the bottleneck:

Create SQLite database for 874037 items
zipfile.infolist() took 0.000s
conversion to file infos took 1060.196s
conversion to file infos without calling findDataOffset took 2.401s
conversion to file infos without calling struct.Struct inside findDataOffset took 
insertion into index took 6.220s
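
Such phase timings can be gathered with simple wall-clock measurements; a sketch with illustrative names:

```python
import time

def timed(label, function, *args):
    """Sketch: time one indexing phase and print it like the log above."""
    start = time.time()
    result = function(*args)
    print(f"{label} took {time.time() - start:.3f}s")
    return result

# Illustrative usage:
# infos = timed("zipfile.infolist()", archive.infolist)
```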

The problem seems to be that we have to seek to each local file header and read it to find out the data offset. With 874037 entries, this amounts to 874037 small random accesses, which of course takes some time, especially on slow hard drives.
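
The data offset cannot be computed from the central directory alone because the local file header's file name and extra field lengths may differ from the copies in the central directory, so each entry requires a seek and a read. A simplified stand-in for `_findDataOffset`, using the fixed local file header layout from the ZIP specification:

```python
import struct

# Fixed 30-byte part of a ZIP local file header (little-endian): signature,
# version, flags, method, mod time, mod date, CRC-32, compressed size,
# uncompressed size, file name length, extra field length.
LOCAL_FILE_HEADER = struct.Struct("<4sHHHHHIIIHH")

def findDataOffset(fileObject, headerOffset):
    """Sketch: seek to the local file header and compute where the data starts."""
    fileObject.seek(headerOffset)  # One random access per archive member!
    fields = LOCAL_FILE_HEADER.unpack(fileObject.read(LOCAL_FILE_HEADER.size))
    assert fields[0] == b"PK\x03\x04", "Expected a local file header signature!"
    fileNameLength, extraFieldLength = fields[-2], fields[-1]
    return headerOffset + LOCAL_FILE_HEADER.size + fileNameLength + extraFieldLength
```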

So yeah, it seems like the best option would be to set all data offsets to 0 and deprecate them. They are useless anyway without knowing the compression / encryption that was applied to the actual data, and this information is not stored in the index.

mxmlnkn commented 7 months ago

The slow code was removed in fb058eb. A real solution should be implemented in rapidgzip, see https://github.com/mxmlnkn/rapidgzip/issues/23.