mxmlnkn / ratarmount

Access large archives as a filesystem efficiently, e.g., TAR, RAR, ZIP, GZ, BZ2, XZ, ZSTD archives
MIT License

Progress bar for zip file indexing #121

Closed: mxmlnkn closed this issue 7 months ago

mxmlnkn commented 1 year ago

The newly added conversion of the ZIP file table to the SQLite index does not have a progress bar, and it can still take quite a long time, e.g., for this real dataset (a sketch for such a progress bar follows the listing below). This dataset is also a nice benchmark for the rapidgzip integration for ZIP because all of the files are deflate-compressed:

> zipinfo rsna-intracranial-hemorrhage-detection.zip | grep -v defN
Archive:  rsna-intracranial-hemorrhage-detection.zip
Zip file size: 194671162157 bytes, number of entries: 874037
874037 files, 458972270355 bytes uncompressed, 194442560761 bytes compressed:  57.6%

195 GB compressed -> 459 GB uncompressed
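
The progress bar itself would be straightforward to add because `zipfile.infolist()` returns all entries upfront, so the total is known. A minimal sketch, assuming a tqdm-style wrapper; the row conversion here is only a stand-in for the real `_convertToRow` logic:

```python
import zipfile

from tqdm import tqdm

def createIndexWithProgress(archivePath):
    """Sketch: show progress while converting ZIP entries to index rows."""
    with zipfile.ZipFile(archivePath) as archive:
        infos = archive.infolist()  # Fast: only parses the central directory.
        rows = []
        # tqdm knows the total upfront and therefore can show an ETA.
        for info in tqdm(infos, desc="Creating offset dictionary", unit="files"):
            rows.append((info.filename, info.header_offset, info.file_size))
        return rows
```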

Also, there is basically no CPU usage to speak of, but with Ctrl+C I do see it breaking out of:

  File "ratarmountcore/ZipMountSource.py", line 323, in _createIndex
    fileInfos.append(self._convertToRow(info))
  File "ratarmountcore/ZipMountSource.py", line 292, in _convertToRow
    dataOffset = self._findDataOffset(info.header_offset)
  File "ratarmountcore/ZipMountSource.py", line 264, in _findDataOffset
    LocalFileHeader(self.rawFileObject)
  File "ratarmountcore/ZipMountSource.py", line 161, in __init__
    result = LocalFileHeader.FixedLocalFileHeader.unpack(fileObject.read(LocalFileHeader.FixedLocalFileHeader.size))
KeyboardInterrupt

There probably is still a lot of room for optimization here, e.g., batching the SQLite insertions (see the sketch below) or moving the ZIP metadata reading to native code, e.g., integrating it into rapidgzip.
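
For the batching idea, batched insertion could look roughly like this. This is only a sketch with a made-up table schema, not the real index layout:

```python
import sqlite3

def insertBatched(indexPath, rows, batchSize=10000):
    """Sketch: insert metadata rows in batches inside a single transaction."""
    connection = sqlite3.connect(indexPath)
    # Hypothetical schema for illustration only.
    connection.execute(
        "CREATE TABLE IF NOT EXISTS files (path TEXT, headeroffset INTEGER, size INTEGER)"
    )
    with connection:  # A single transaction avoids per-row commit overhead.
        for i in range(0, len(rows), batchSize):
            connection.executemany("INSERT INTO files VALUES (?, ?, ?)", rows[i : i + batchSize])
    connection.close()
```

(As the timings further below show, the SQLite insertion turned out not to be the bottleneck, though.)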

Finally done:

> time ratarmount rsna-intracranial-hemorrhage-detection.zip
Creating new SQLite index database at rsna-intracranial-hemorrhage-detection.zip.index.sqlite
Creating offset dictionary for rsna-intracranial-hemorrhage-detection.zip ...
Creating offset dictionary for rsna-intracranial-hemorrhage-detection.zip took 993.41s

real    16m39.042s
user    0m24.294s
sys 0m9.148s

> time find rsna-intracranial-hemorrhage-detection/rsna-intracranial-hemorrhage-detection//stage_2_test/ | wc -l
121233

real    0m1.540s
user    0m0.041s
sys 0m0.040s

More granular timings show that SQLite index insertion is definitely not the bottleneck:

Create SQLite database for 874037 items
zipfile.infolist() took 0.000s
conversion to file infos took 1060.196s
conversion to file infos without calling findDataOffset took 2.401s
conversion to file infos without calling struct.Struct inside findDataOffset took 
insertion into index took 6.220s
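
Such phase timings can be gathered with simple wall-clock measurements; a sketch with illustrative names:

```python
import time

def timed(label, function, *args):
    """Sketch: time one indexing phase and print it like the log above."""
    start = time.time()
    result = function(*args)
    print(f"{label} took {time.time() - start:.3f}s")
    return result

# Illustrative usage:
# infos = timed("zipfile.infolist()", archive.infolist)
```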

The problem seems to be that we have to seek to each local file header and read it to find out the data offset. With 874037 entries, this amounts to 874037 small random accesses, which of course takes some time, especially on slow hard drives.
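
The data offset cannot be computed from the central directory alone because the local file header's file name and extra field lengths may differ from the copies in the central directory, so each entry requires a seek and a read. A simplified stand-in for `_findDataOffset`, using the fixed local file header layout from the ZIP specification:

```python
import struct

# Fixed 30-byte part of a ZIP local file header (little-endian): signature,
# version, flags, method, mod time, mod date, CRC-32, compressed size,
# uncompressed size, file name length, extra field length.
LOCAL_FILE_HEADER = struct.Struct("<4sHHHHHIIIHH")

def findDataOffset(fileObject, headerOffset):
    """Sketch: seek to the local file header and compute where the data starts."""
    fileObject.seek(headerOffset)  # One random access per archive member!
    fields = LOCAL_FILE_HEADER.unpack(fileObject.read(LOCAL_FILE_HEADER.size))
    assert fields[0] == b"PK\x03\x04", "Expected a local file header signature!"
    fileNameLength, extraFieldLength = fields[-2], fields[-1]
    return headerOffset + LOCAL_FILE_HEADER.size + fileNameLength + extraFieldLength
```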

So yeah, it seems like the best option would be to set all data offsets to 0 and deprecate them. They are useless anyway without knowing the compression / encryption that was applied to the actual data, and this information is not stored in the index.

mxmlnkn commented 7 months ago

The slow code was removed in fb058eb. A real solution should be implemented in rapidgzip, see https://github.com/mxmlnkn/rapidgzip/issues/23.