srikanth-lingala / zip4j

A Java library for zip files and streams
Apache License 2.0

Improve header performance #457

Open slp091020 opened 1 year ago

slp091020 commented 1 year ago

My use case is storing many tiny files, using zip4j to pack them so they are easier to back up and move around. Currently there are close to 100M files split across 256 archives, keyed by the first two nibbles of each file's SHA1. Periodic updates scan a directory of loose files and append them to the existing archives using zip4j.ZipFile.addFiles().

Profiling with VisualVM showed some hot spots that this PR aims to improve, notably HeaderUtil.getFileHeaderWithExactMatch and HeaderReader.readCentralDirectory.

23759d1074271eaf0c2165c73390effa99747e0e preempts calls to HeaderUtil.getFileHeaderWithExactMatch in HeaderUtil.getFileHeader if the replacement operation resulted in no change to the file name being looked up.
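
To illustrate the idea behind this commit, here is a minimal sketch and not the zip4j code itself. It assumes the "replacement operation" is a path-separator swap tried as a fallback after a failed exact match. The class `SeparatorFallbackLookup`, its `findExact` helper, and the `scans` counter are hypothetical names made up for this example.

```java
import java.util.List;

final class SeparatorFallbackLookup {
    // Counts linear scans so the effect of skipping redundant lookups is visible.
    static int scans = 0;

    // Hypothetical stand-in for an exact-match scan over all entry names.
    static String findExact(List<String> names, String name) {
        scans++;
        for (String candidate : names) {
            if (candidate.equals(name)) {
                return candidate;
            }
        }
        return null;
    }

    // Tries the name as given, then with '/' separators, then with '\' separators,
    // skipping any variant that is identical to one already tried.
    static String find(List<String> names, String name) {
        String hit = findExact(names, name);
        if (hit != null) {
            return hit;
        }

        String forward = name.replace('\\', '/');
        if (!forward.equals(name)) {
            hit = findExact(names, forward);
            if (hit != null) {
                return hit;
            }
        }

        String backward = forward.replace('/', '\\');
        if (!backward.equals(forward) && !backward.equals(name)) {
            hit = findExact(names, backward);
        }
        return hit;
    }
}
```

For a name without any separators that is not in the archive, this sketch performs one scan instead of three, which matters when each scan walks hundreds of thousands of entries.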

abab5190bf7d57888d506ea0b5fcf768059256df changes the logic to read the central directory in larger chunks: instead of reading each field through the RandomAccessFile interface, the fixed-size portions of the header are read into a byte array and RawIO is used to deserialize the data into fields.
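
A rough sketch of the chunked-read approach, for readers who want to see its shape. It uses `java.nio.ByteBuffer` in little-endian mode as a stand-in for RawIO, so it is not the code in the commit. The field offsets follow the central directory file header layout in APPNOTE.TXT; the class `CentralHeaderChunkReader` and its `Fixed` holder are hypothetical, and only a few fields are decoded.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class CentralHeaderChunkReader {
    static final int FIXED_SIZE = 46;         // fixed-size part of a central directory file header
    static final int SIGNATURE = 0x02014b50;  // bytes 50 4B 01 02 read as a little-endian int

    // Hypothetical holder for the handful of fields this sketch decodes.
    static final class Fixed {
        int compressionMethod;
        long compressedSize;
        long uncompressedSize;
        int fileNameLength;
        int extraFieldLength;
        int commentLength;
        long localHeaderOffset;
    }

    // One readFully call replaces a separate file read per field.
    static Fixed read(RandomAccessFile raf) throws IOException {
        byte[] buf = new byte[FIXED_SIZE];
        raf.readFully(buf);
        return decode(buf);
    }

    // Decodes fields from the in-memory copy; unsigned values are widened with masks.
    static Fixed decode(byte[] buf) {
        if (buf.length < FIXED_SIZE) {
            throw new IllegalArgumentException("need at least " + FIXED_SIZE + " bytes");
        }
        ByteBuffer bb = ByteBuffer.wrap(buf).order(ByteOrder.LITTLE_ENDIAN);
        if (bb.getInt(0) != SIGNATURE) {
            throw new IllegalArgumentException("not a central directory file header");
        }
        Fixed f = new Fixed();
        f.compressionMethod = bb.getShort(10) & 0xFFFF;
        f.compressedSize = bb.getInt(20) & 0xFFFFFFFFL;
        f.uncompressedSize = bb.getInt(24) & 0xFFFFFFFFL;
        f.fileNameLength = bb.getShort(28) & 0xFFFF;
        f.extraFieldLength = bb.getShort(30) & 0xFFFF;
        f.commentLength = bb.getShort(32) & 0xFFFF;
        f.localHeaderOffset = bb.getInt(42) & 0xFFFFFFFFL;
        return f;
    }
}
```

The variable-length file name, extra field, and comment that follow the fixed part would still be read separately, using the three lengths decoded here.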

e10dce33449c5f8597cfbf283748d97fa4e05453 adds a HashMap of file names to perform lookups on instead of traversing the CentralDirectory fileHeaders list for each lookup.
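
The difference between the two lookup strategies can be sketched as follows. This is an illustration with hypothetical types (`NameIndexedDirectory`, `Entry`), not the structure used in the commit; it assumes the map has to return the same entry a front-to-back list scan would, which is why it uses `putIfAbsent`.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class NameIndexedDirectory {
    // Hypothetical, minimal stand-in for a file header.
    static final class Entry {
        final String name;
        final long offset;

        Entry(String name, long offset) {
            this.name = name;
            this.offset = offset;
        }
    }

    private final List<Entry> entries = new ArrayList<>();
    private final Map<String, Entry> byName = new HashMap<>();

    // Keeps list and index in sync; the first entry with a given name wins,
    // matching what a front-to-back scan of the list would return.
    void add(Entry entry) {
        entries.add(entry);
        byName.putIfAbsent(entry.name, entry);
    }

    // Linear lookup: cost grows with the number of entries.
    Entry findByScan(String name) {
        for (Entry entry : entries) {
            if (entry.name.equals(name)) {
                return entry;
            }
        }
        return null;
    }

    // Hashed lookup: expected constant time regardless of archive size.
    Entry findByIndex(String name) {
        return byName.get(name);
    }
}
```

With roughly 387,000 entries per archive and about 1,600 files added per run, the scan-based lookup does on the order of hundreds of millions of string comparisons, while the map does one hash lookup per file. The cost is extra memory and the need to update the map wherever the list changes.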

869ca8a9856f8328fcac9f1facbae4cde3bfa493 is less thoroughly tested, as none of my archives have comments. Instead of iterating backwards using the RandomAccessFile interface, the entire "End Zone" of the zip file, where the end of central directory record is located, is read into a byte array, and the byte array is iterated to perform the end of central directory signature search.
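
As a sketch of the in-memory signature search (an illustration only, not the code in the commit): the end of central directory record is at least 22 bytes and may be followed by a comment of up to 65535 bytes, so the tail to read is at most 22 + 65535 bytes. The class and method names below are hypothetical.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

final class EndOfCentralDirectoryFinder {
    static final int EOCD_MIN_SIZE = 22;   // record size without a comment
    static final int MAX_COMMENT = 0xFFFF; // the comment length field is 16 bits

    // Scans the buffer backwards for the signature bytes 50 4B 05 06.
    // Returns the index within the buffer, or -1 if no signature is found.
    static int lastIndexOfSignature(byte[] tail) {
        for (int i = tail.length - EOCD_MIN_SIZE; i >= 0; i--) {
            if (tail[i] == 0x50 && tail[i + 1] == 0x4B
                    && tail[i + 2] == 0x05 && tail[i + 3] == 0x06) {
                return i;
            }
        }
        return -1;
    }

    // Reads the tail of the file with a single readFully call, then searches in memory.
    // Returns the absolute file offset of the record, or -1 if none is found.
    static long find(RandomAccessFile raf) throws IOException {
        long length = raf.length();
        int tailSize = (int) Math.min(length, EOCD_MIN_SIZE + MAX_COMMENT);
        byte[] tail = new byte[tailSize];
        raf.seek(length - tailSize);
        raf.readFully(tail);
        int index = lastIndexOfSignature(tail);
        return index < 0 ? -1 : length - tailSize + index;
    }
}
```

Note that this simple scan accepts the first signature it meets from the end, so a comment that happens to contain the same four bytes would be matched first. Checking the record's comment length field against the remaining bytes is one way to guard against that, and is the kind of case that archives with comments would exercise.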

Time spent is reduced by roughly 85% for ZipFile.isValidZipFile() and 40% for ZipFile.addFiles().

VisualVM Trace

May be tainted by ZFS caching.

Original logic, showing sample data for 122 archives processed:

![Before](https://user-images.githubusercontent.com/6669567/181115051-d6d949cf-9282-48b9-a3d6-124dcb8591bd.png)

Modified logic, showing sample data for 225 archives processed:

![After](https://user-images.githubusercontent.com/6669567/181115072-b82b8636-2d78-42cc-9a0a-f0d63eef9adc.png)

Instrumented Log Output Sample

Testing was performed with identical data using ZFS snapshots to revert to original state. "Zip opening time" represents the time for ZipFile.isValidZipFile() to return. "Add time" represents the time for ZipFile.addFiles() to return. VisualVM trace above does not represent this run, it was from a different date / dataset.

Original Logic

```
Looking in /mnt/test/loose/00
Adding 1648 files to /mnt/test/zip/00.zip
Archive filecount: 387130
Zip opening time: 7410
Add time: 55451
Files added: 1648
ms/f: 33
Looking in /mnt/test/loose/01
Adding 1588 files to /mnt/test/zip/01.zip
Archive filecount: 387458
Zip opening time: 7226
Add time: 51670
Files added: 1588
ms/f: 32
Looking in /mnt/test/loose/02
Adding 1646 files to /mnt/test/zip/02.zip
Archive filecount: 387135
Zip opening time: 7157
Add time: 53017
Files added: 1646
ms/f: 32
Looking in /mnt/test/loose/03
Adding 1637 files to /mnt/test/zip/03.zip
Archive filecount: 386631
Zip opening time: 7151
Add time: 52027
Files added: 1637
ms/f: 31
```

Modified Logic

```
Looking in /mnt/test/loose/00
Adding 1648 files to /mnt/test/zip/00.zip
Archive filecount: 387130
Zip opening time: 1337
Add time: 28924
Files added: 1648
ms/f: 17
Looking in /mnt/test/loose/01
Adding 1588 files to /mnt/test/zip/01.zip
Archive filecount: 387458
Zip opening time: 1137
Add time: 29550
Files added: 1588
ms/f: 18
Looking in /mnt/test/loose/02
Adding 1646 files to /mnt/test/zip/02.zip
Archive filecount: 387135
Zip opening time: 1134
Add time: 31182
Files added: 1646
ms/f: 18
Looking in /mnt/test/loose/03
Adding 1637 files to /mnt/test/zip/03.zip
Archive filecount: 386631
Zip opening time: 1123
Add time: 31958
Files added: 1637
ms/f: 19
```
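
As a cross-check of the summary figures against this sample, the time reduction can be computed from the four runs logged above. The snippet below is only a small worked calculation; the class name `SampleRunReduction` is made up for the example, and the numbers are copied from the log output.

```java
final class SampleRunReduction {
    // Percentage by which the summed "after" times are lower than the summed "before" times.
    static double percentReduction(long[] before, long[] after) {
        long beforeTotal = 0;
        long afterTotal = 0;
        for (long value : before) {
            beforeTotal += value;
        }
        for (long value : after) {
            afterTotal += value;
        }
        return 100.0 * (beforeTotal - afterTotal) / beforeTotal;
    }

    public static void main(String[] args) {
        // "Zip opening time" and "Add time" values for archives 00 through 03.
        long[] openBefore = {7410, 7226, 7157, 7151};
        long[] openAfter = {1337, 1137, 1134, 1123};
        long[] addBefore = {55451, 51670, 53017, 52027};
        long[] addAfter = {28924, 29550, 31182, 31958};

        System.out.printf("open: %.1f%%%n", percentReduction(openBefore, openAfter));
        System.out.printf("add:  %.1f%%%n", percentReduction(addBefore, addAfter));
    }
}
```

For these four archives the reduction works out to about 84% for opening and about 43% for adding, in line with the rough figures quoted earlier.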

References Used:
https://users.cs.jmu.edu/buchhofp/forensics/formats/pkzip.html
https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT