Open Marv51 opened 3 years ago
Unfortunately, third-party tools for ZIP archives are quite a mixed bag. A high-quality tool that is fast, secure, user-friendly, and free does not seem to exist.
See this thread: it turns out the ancient code from 1998 isn't very efficient; it's reading one byte at a time.
There are a number of core performance problems with the ZIP code. The one related to "Move" operations (mentioned just above) is described more fully here: https://textslashplain.com/2021/06/02/leaky-abstractions/
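To make the "one byte at a time" problem concrete, here is a small Python sketch (not the actual Explorer code, which is C/C++) comparing unbuffered single-byte reads against buffered chunked reads. The file name and sizes are made up for the demo; the point is that per-read overhead dominates when the read size is 1 byte.

```python
# Demo: why 1-byte reads are slow. Each unbuffered 1-byte read pays the
# full per-call overhead (a syscall per byte); a chunked read amortizes it.
import os
import tempfile
import time

def read_one_byte_at_a_time(path):
    # Roughly analogous to calling ReadFile with a 1-byte buffer in a loop.
    data = bytearray()
    with open(path, "rb", buffering=0) as f:  # unbuffered on purpose
        while True:
            b = f.read(1)
            if not b:
                break
            data += b
    return bytes(data)

def read_buffered(path, chunk=64 * 1024):
    # The trivial fix: read in 64 KiB chunks.
    data = bytearray()
    with open(path, "rb") as f:
        while True:
            b = f.read(chunk)
            if not b:
                break
            data += b
    return bytes(data)

# Create a 256 KiB test file and compare.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(256 * 1024))
    path = tmp.name

t0 = time.perf_counter()
slow = read_one_byte_at_a_time(path)
t1 = time.perf_counter()
fast = read_buffered(path)
t2 = time.perf_counter()

print(f"1-byte reads: {t1 - t0:.4f}s, 64 KiB reads: {t2 - t1:.4f}s")
os.remove(path)
```

Even on a small file the gap is typically orders of magnitude, which matches the extraction times reported in this thread.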
I would say libarchive fits the job. It has already been part of Windows since tar and curl were added: https://techcommunity.microsoft.com/t5/containers/tar-and-curl-come-to-windows/ba-p/382409
See also https://github.com/microsoft/Windows-Dev-Performance/issues/27
This ticket has a good sample zip file to work with.
https://github.com/microsoft/Windows-Dev-Performance/issues/27#issuecomment-670613369
My personal experience, on my work Dell Precision laptop with an SSD, unzipping a 50 MB file with 25k files inside:
The gap is ... yawning.
I disabled all the AV and the Windows time came down to 6 minutes. It might be that BitLocker is affecting the speed (or one of the other security agents).
I doubt BitLocker is affecting performance to that degree: on modern CPUs (IIRC Haswell+, a.k.a. 4th-gen Intel Core) the perf hit should be <1%, since the encryption can use specialized CPU instructions (I think I read this on AnandTech somewhere). On less-than-modern CPUs, I'd guess 5 or maybe 10% at worst. Definitely not two orders of magnitude!
I tried unzipping the Flutter zip from #27 above; it took around 20 minutes.
Inside a Linux VM on the same laptop: a few seconds.
These kinds of basic slowdowns are the most valuable and likely the hardest to fix in Windows.
Well, this is not 100% related to Windows ZIP, but I thought I'd post it here for others that might have a similar issue.
I got our IT department to temporarily disable BeyondTrust, as the Avecto Defendpoint Service (defendpointservice.exe) was consuming a large amount of CPU during unzip and particularly during delete and copy operations.
The difference was dramatic on my Dell Precision laptop with a 2 TB SSD. This was most obvious when deleting or copying the files once unzipped:
It also impacts ZIP speed, but not as dramatically; I suspect the previously noted poor implementation of Windows Explorer's ZIP somewhat masks the BT issue.
Interestingly, I have a separate antivirus installed, Cortex, which seems to have a much smaller filesystem performance impact than BT. Oh, the irony!
https://twitter.com/ericlaw/status/1399856554690715659 https://devblogs.microsoft.com/oldnewthing/20180515-00/?p=98755
Yikes. I've always avoided doing any IO-heavy tasks under the shell for reasons like this. It would be nice to see this overhauled, perhaps with support for some modern compression algorithms such as zstd. Time for @PlummersSoftwareLLC to come out of retirement.
I'd say the zip tools need a proper overhaul, not only performance fixes; the UX alone is bad enough to make me install third-party tools. The absolute minimum would be a 7-Zip-equivalent context menu. Adding smart compression that skips compressing files where it doesn't reduce the size enough, and double-click-to-extract (with an option to remove the source file when done), would be another step.
And I absolutely never wanted the wizard that adds a few extra clicks to my flow.
> adding smart compression that skips compression of files that doesn't reduce size enough and double-click-to-extract
God, no. The current implementation is already problematic in enough ways. Making compression inherently slower by forcing an additional large section read after spending CPU time on a deflate operation, just because the result didn't meet some arbitrary (probably hard-coded "sane") compression ratio, deployed across the most potato spinning disks and low-end Intel chips, just to save a few milliseconds-to-seconds when deflating massive blobs on high-end machines, is a bad trade. Do you want to be responsible for the hack that throws away and recalculates archive dictionaries because of a late, very much arbitrary deflate/store selection? Note also that an optimal multi-threaded implementation would have an even harder time rolling back. It's a nice idea, but in reality it's not practical: it would require far too many man-hours to implement, the UX would be too hard for the average user to understand, and it would make the likes of your mom compressing family photos more stressed, because her budget laptop would be wasting resources meeting an arbitrary store-instead-of-deflate threshold instead of just archiving with the intent to compress.
PS: do you have any data whatsoever suggesting this method of "smart" compression would benefit anyone, or is it just an idea?
Regarding data: I performed a quick-and-dirty experiment using Bandizip to compress my Downloads folder (4,946,928 KB), containing mostly poorly compressible files (installers, videos, and so on), with what they call "High speed archiving" both active and not, using Deflate at the maximum available compression setting (apart from the HSA switch).
With that out of the way, I decided to try some well-compressible data: a source tree with almost no binaries, 1,029,669 KB.
After that I added some video files and archives to the mix:
So for a mixed bag it definitely works and is worth it; for everything else, it looks like the way they determine whether to use Deflate or Store is not efficient enough to be worth it, at least on my machine.
To dive deeper, I tried constraining the process to a single thread while compressing my Downloads folder:
So it looks like sometimes it's worth it and sometimes not; with improved detection it could possibly be made worth it almost every time, but I'm not intending to push this idea any further.
But it seems that the weaker machines you are worried about would be punished the least in the cases where it doesn't work as well.
> so it looks like sometimes it's worth it and sometimes not, possibly with improved detection it could be made to be worth it almost every time, but I'm not intending to push this idea any further
The problem I see with this is that the detection requires you to run the deflate algorithm anyway. One could probably buffer the raw streams to solve the double-read issue; however, you're still stuck with this problem of "detection." The only non-hack way to solve it would be to accept the CPU overhead of compressing once and buffering the output plus the changes to the compression dictionary before committing. Most compressors don't scale to multiple threads, and the non-hack solution would toll the CPU just as much. Worse, I suspect those few multi-threaded compressors would suffer the most if their dictionaries have to be thrown away to account for chunks dropped by a late file omission. From my tests with zstd, flushing at the end of each substream doesn't obliterate the compression ratio (+1 GB; end: 18 GB +/- 1 GB; total in: ~30 GB raw assets), which might indicate the extent to which this could scale.
My concern with these numbers is that they all come from proprietary software whose baseline and implementation efficiencies are unknown; we have no idea what they did to fudge those numbers. Needless to say, I would suspect Bandizip of inefficiently compressing files, delegating throwaway compression operations to an otherwise wasted thread to solve the store/deflate question, and using preemptive hacks to throw away likely-incompressible files (not so smart; you wouldn't want this advertised as the "smart" method that uninformed users will treat as the de facto compress button).
I think we should be focused on the state of parallelization in the world of archives rather than cheap hacks to skip probably-incompressible files. Considering that accurate detection could be as expensive as the deflation itself, those resources would be better spent compressing another chunk or file.
Off the top of my head, algorithms / main implementations:

- zstd: full MT support
- Igor's 7-Zip: partial MT support, nasty codebase, LZMA is awfully slow, some build configurations delegate IO to the main thread
- liblzma: theoretically compatible with MT-compressed .xz files, no MT compression support
- stock gzip: single-threaded
- zlib: single-threaded
- lz4: single-threaded; the API requires large static caller-provided const buffers between API calls
Containers:

- .zip: no multithread support; read-only access could be parallelized
- .gzip: no multithread support; read-only access could be parallelized
- .xz: native multithread support
- .tar: can be multithreaded
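The ".zip: read-only access could be parallelized" point is worth illustrating, since it applies directly to Explorer's extraction problem. A minimal sketch in Python: the central directory is read once, then each worker opens its own handle on the archive and inflates a disjoint subset of entries. In CPython, zlib releases the GIL during decompression, so threads genuinely overlap here. Function names are mine, not any library's API.

```python
# Sketch: parallel read-only extraction of a .zip. Each worker gets its own
# ZipFile handle so workers never share a file offset.
import os
import zipfile
from concurrent.futures import ThreadPoolExecutor

def _extract_some(archive, names, dest):
    with zipfile.ZipFile(archive) as zf:
        for name in names:
            zf.extract(name, dest)

def parallel_extract(archive, dest, workers=4):
    with zipfile.ZipFile(archive) as zf:
        names = zf.namelist()  # central directory read once, up front
    # Pre-create all directories so workers never race on makedirs.
    for name in names:
        os.makedirs(os.path.join(dest, os.path.dirname(name)), exist_ok=True)
    chunks = [names[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(_extract_some, archive, c, dest) for c in chunks]
        for f in futures:
            f.result()  # re-raise any worker exception
```

Whether this wins in practice depends on the storage and entry sizes, but it shows the container itself permits parallel reads even though sequential-write tooling treats it as single-threaded.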
I believe the constraints of the containers are a bit of a problem if you want to take multi-threaded compression seriously. An entity as large as Microsoft could look into rolling its own container to solve this UX issue once and for all. Even an OSP solution to the problem would be nice.
15 competing standards
The zip container, despite all its shortcomings, is used for basically everything nowadays and is the go-to container format for anyone intending to create their own.
But as there already exists a decent MT-capable algorithm and container, it would be nice to have support for that too.
It is kind of hilarious when I have to wait ~10 minutes to extract a specific folder from a 1 GB zip file. The Explorer window shows transfer rates of a few KB/s. On a 6-core laptop with an NVMe drive.
It's also kind of ridiculous that numerous folks at Microsoft know about this, they also talk and complain about it on Twitter, and yet none of them has just gone in and fixed it.
Even changing the `ReadFile` calls to operate on more than 1 byte at a time, or using any form of user-space caching (re: the comment above, https://github.com/microsoft/Windows-Dev-Performance/issues/91#issuecomment-853081988), would be a low-risk change with a large performance gain.
(If nobody at Microsoft is going to take action on the issues filed in this repo, why does this repo even exist?)
Actually, this will probably be the reason to migrate to Linux.
This is not the first time Linux is faster, but in my new project all data is in zipped CSV files, updated daily. I can't believe unzipping is so slow.
7-Zip is a lot faster at decompressing zip files than Windows Explorer. I had hoped for a change with the recent Windows 11 support for RAR/7Z/libarchive, but it is still quite slow. Eclipse and other IDEs, Jenkins, and Artifactory are good examples of large zips.
@nmoreaud, are you running a version of Windows 11 that supports decompressing `.rar`, `.7z`, etc.?
@Bosch-Eli-Black I thought so but I was wrong. I'll wait for the update.
@nmoreaud Sounds good :) Would be awesome if this update also made ZIP files much faster! :D
Libarchive support for extraction was added to File Explorer in the Sept 2023 & 23H2 update. This complements the support added at the command line via tar.exe.
I would keep this issue open, as apparently PowerShell's Expand-Archive is still faster than Explorer.exe's extraction (even on the current Win11 build, 22631.3007):
https://www.reddit.com/r/PowerShell/comments/1972k40/why_is_expandarchive_significantly_faster_than/
With this zip: https://www.eclipse.org/downloads/download.php?file=/technology/epp/downloads/release/2023-12/R/eclipse-java-2023-12-R-win32-x86_64.zip

- 7-Zip: 3.7-8 s
- Explorer: 21 s
- PowerShell: 30 s

Build 22621.2715, Windows Feature Experience Pack 1000.22677.1000.0. The antivirus could alter the results.
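For anyone wanting to reproduce timings like these consistently, here is a minimal Python harness sketch. It times `zipfile.extractall` as a neutral baseline and shells out to `tar` for the libarchive path; it assumes a libarchive-based `tar` (as shipped with Windows 10 1803+) is on PATH, and the archive path is up to you.

```python
# Minimal timing harness for comparing extraction paths on the same archive.
import shutil
import subprocess
import tempfile
import time
import zipfile

def time_python_unzip(archive):
    """Time a plain zipfile.extractall as a neutral baseline."""
    dest = tempfile.mkdtemp()
    t0 = time.perf_counter()
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)
    elapsed = time.perf_counter() - t0
    shutil.rmtree(dest)  # clean up so repeated runs start cold
    return elapsed

def time_tar_unzip(archive):
    """Time extraction via tar (libarchive on Windows); assumes tar is on PATH."""
    dest = tempfile.mkdtemp()
    t0 = time.perf_counter()
    subprocess.run(["tar", "-xf", archive, "-C", dest], check=True)
    elapsed = time.perf_counter() - t0
    shutil.rmtree(dest)
    return elapsed
```

Running each function a few times and taking the median reduces noise from the filesystem cache and any antivirus scanning, which (as noted above) can dominate these measurements.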
Appreciate the metrics!
Extracting https://github.com/dotnet/msbuild/archive/refs/tags/v17.10.4.zip (10 MB, ~2500 files)
| Command | Duration |
|---|---|
| `tar -xf` (libarchive) | 0.8 seconds |
| 7-Zip `7z x` | 1.3 seconds |
| PowerShell's `Expand-Archive` | 4 seconds |
| Windows Explorer "Extract all" | 45 seconds |
Windows 11 Pro 23H2 22631.3593
@AdamBraden @AvriMSFT Please could you reopen the issue?
Re-opening to track File Explorer zip perf. Was incorrectly closed earlier.
Tagging @DHowett for visibility.
Issue description
The zip decompression in Windows Explorer is not very performant. Depending on the archive, it is so painfully slow that it is unusable. It does not utilize modern multi-core CPUs and does IO in incredibly inefficient ways.
This is a well-known issue (https://twitter.com/BruceDawson0xB/status/1399931424585650180, https://devblogs.microsoft.com/oldnewthing/20180515-00/?p=98755) and has been for years. There is no real diagnosis needed here: the library doing the unzipping is from 1998, and nobody at Microsoft knows how it works.
Developers work with zip files very often. Want to quickly look at an archive of zipped log files? Might as well make a coffee while you wait.
For me the process often goes like this: start the unzip from Explorer, get frustrated that it's taking so long, open 7-Zip (or similar) and extract there instead; the third-party tool finishes while Explorer is still at around 10% done.
Steps to reproduce
Uncompress a zip file, especially one containing many small files
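A reproduction archive matching this description can be generated with a short script; the sketch below uses Python's `zipfile`, and the 25,000-file default loosely mirrors the 50 MB / 25k-file example reported earlier in the thread (file names and payloads are made up).

```python
# Generate a worst-case reproduction archive: one zip holding many small
# files, maximizing per-file overhead during extraction.
import zipfile

def make_repro_zip(path, count=25_000):
    with zipfile.ZipFile(path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for i in range(count):
            zf.writestr("dir%02d/file%05d.txt" % (i % 100, i),
                        ("small payload %d\n" % i) * 20)
```

Usage: `make_repro_zip("repro.zip")`, then extract the result with Explorer's "Extract all" versus `tar -xf repro.zip` or 7-Zip to observe the gap.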
Expected Behavior
Unzip should be very fast, to enable a developer to stay in the flow and not be slowed down by unnecessary wait times.
Actual Behavior
The zip decompression in Explorer is so painfully slow that for most devs it is unusable.
Windows Build Number
10.0.19043.0
Processor Architecture
AMD64
Memory
8GB
Storage Type, free / capacity
512GB SSD
Relevant apps installed
None
Traces collected via Feedback Hub
I can if needed