outr / uberzip

Very fast multi-threaded unzipping utility.
3 stars 1 forks source link

Need to create one ZipFile instance per thread #1

Open lukehutch opened 6 years ago

lukehutch commented 6 years ago

UberZip is currently a lot less efficient than it could be. You're currently creating only one ZipFile instance, and sharing that with all worker tasks. There's a lock around all ZipFile methods (check out the source code) that prevents more than one thread from doing work at any given time. This is why in your blog post you weren't getting the performance you expected: http://www.matthicks.com/2008/01/multithreaded-unzip.html

If you want multiple threads to read from the same zipfile in a scalable way, you must open one ZipFile instance per thread. That way, the per-thread lock in the ZipFile methods does not block all but one thread from reading from the zipfile at one time. (It also means that when each thread closes the ZipFile after they're done reading, they close their own instance, not the shared instance, so you don't get an exception on the second and subsequent close.)

darkfrog26 commented 6 years ago

Ah, a very good optimization suggestion. I keep meaning to come back around to this but keep putting it off. Now, I have extra incentive. :)

Thanks!

lukehutch commented 6 years ago

You're welcome. Another optimization is to read all the ZipEntry objects just once for a single ZipFile instance, and then have the workers use those ZipEntry objects with their own ZipFile instance. This means you only have to read the zip directory once.

darkfrog26 commented 6 years ago

I've actually considered that but was concerned about memory consumption on ZIP files with an extremely high number of entries.

lukehutch commented 6 years ago

'ZipEntry' objects are a few dozen bytes each, they only contain metadata (filename, modification time, offset, length, etc.), nothing more.

lukehutch commented 6 years ago

You inspired me to write my own version as a proof of concept -- see the code here: https://github.com/lukehutch/quickunzip

darkfrog26 commented 6 years ago

Nice work! When I have some more time I'll take a better look.

If only it were written in Scala. ;)

lukehutch commented 6 years ago

Feel free to port it :D

I started using Scala a few years ago and gave up. I found it way too unnecessarily complicated, and quite unpredictable and arbitrary at times. It is the result of one too many of Martin Odersky's students being told they needed to bolt a feature onto the language in order to graduate! You should check out Ceylon as a much simpler and more sane alternative to Java.

lukehutch commented 6 years ago

PS there are quite a few corner cases that you have to check for to ensure that a zipfile is safely unzipped (sanitizing paths to ensure they don't break out of the destination dir using "..", for one thing), and your code doesn't do any checking right now. Since your code seems to have gotten some traction, you might want to take a look at all the corner cases I'm checking for, even if you don't fully port the code.

darkfrog26 commented 6 years ago

I appreciate that. Will do.

With regard to Scala, I couldn't disagree more. Yes, the initial learning is more complex, but having spent most of my career in Java I would say the "bolt a feature" on issue is a much bigger problem there. I've used both Ceylon and Kotlin and though they are simpler, they are also simpler languages. :-p Programming is hard, and learning a complex language can make programming hard things a little easier. I believe most of the complexity in Scala improves it instead of compromising for lesser coders. In addition, I would say that in many ways Scala is more logical and simple than most languages once you actually understand it.

lukehutch commented 6 years ago

All fair comments re. languages, and I'm glad you have tried others.

BTW I just found out that some of the unzip vulnerabilities I protected against now have a name, Zip Slip:

https://snyk.io/research/zip-slip-vulnerability

This is not a new vulnerability, I remember first learning about it around 20 years ago, but apparently it has now entered the Internet consciousness, and a large number of unzip code snippets out there are vulnerable, including uberzip.

darkfrog26 commented 6 years ago

@lukehutch, very good information. I have intentions of bringing this project into my Batcher (https://github.com/outr/batcher) project in the near future when I find more time to work on it. If you'd be interested in collaborating and can get past your distaste for Scala then I'd happily add you as a contributor. The idea of Batcher is a module-based command-line tool to allow batching of operations to both allow sanity checks before running as well as parallelizing and managing of jobs sequentially and asynchronously. This includes disk/network operations like copying/moving files as well as more CPU intensive operations like extracting a ZIP file.

lukehutch commented 6 years ago

@darkfrog26 That actually looks like a useful thing to build. However, I'm currently starting work on my own automatically parallelizing programming language that will do a lot of the same things as Batcher, only in language form, rather than library form. The core idea is to build a DAG of data dependencies between operations, similar to Spark/Flink or TensorFlow, but do this at a very fine-grained level. I'm hoping to make it really easy to build data analytics pipelines, and to provide parallelization for free. I'll be pretty tied up on this project for the foreseeable future. But maybe I can recruit you, since you are thinking about similar things ;-)

darkfrog26 commented 6 years ago

@lukehutch, it sounds like a really interesting endeavor. I generally cringe when I hear that yet another language is being created, but I promise to keep an open mind. :) Feel free to connect with me on LinkedIn and we can talk further (https://www.linkedin.com/in/matthicksok/).

lukehutch commented 6 years ago

Oh, there are definitely a lot of languages in the world, maybe too many. But very few really great languages, so I think there's room for more!