strider1551 / djvubind

Combine multiple image files into an optimally compressed djvu file.

Parallel tests #17

Open boredland opened 9 years ago

boredland commented 9 years ago

Hi there, since I have a lot of processor cores I thought djvubind could perhaps be used in a more parallel fashion, so I started trying to parallelize it externally with GNU Parallel. I think my results are worth mentioning: splitting the work into a bunch of smaller djvu jobs that are then merged with djvm cuts the time needed for a job by about a third, and the more compression is needed, the more time you save. I assume you don't do that natively because of your shape-library compression. Surprisingly, my results are the same size or smaller; I don't know why. I put this together into a script with some example files. You can try it by simply unpacking the archive and running ./testscript. Make sure to set the variables at the top of the script according to your system. Critique would be kindly appreciated. http://www.file-upload.net/download-10445190/test.tar.bz2.html Boredland
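
Reduced to a minimal sketch, the idea is roughly the following (assumed file names, and a per-page cjb2 encode for simplicity rather than the chunked jobs the testscript actually runs):

    # Encode each bitonal page on its own core via GNU Parallel;
    # {.} is the input without its extension, so page_001.tif -> page_001.djvu.
    parallel cjb2 -lossless {} {.}.djvu ::: page_*.tif

    # Merge the single-page documents into one bundled book.
    djvm -c book.djvu page_*.djvu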

strider1551 commented 9 years ago

You are correct that threading the encoding step could reduce the execution time, as your test script and files demonstrate.

The reason I never did that in the first place was minidjvu, which is an alternative to cjb2 that creates a shape dictionary shared across multiple pages, resulting in a smaller file. With that encoder, instead of looking to do things in parallel, you're looking to give it as many pages as possible so it can build that shared dictionary.

But just because I can't thread that portion in all cases doesn't mean djvubind shouldn't be smart enough to thread it when that's a good idea...
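
In shell terms, the rule would look something like this (just an illustration of the decision, with assumed file names, not djvubind's actual code; the encoder options match the ones used in the tests later in this thread):

    # Parallelize only when the encoder keeps an independent shape dictionary per page.
    ENCODER=cjb2   # or: minidjvu

    if [ "$ENCODER" = cjb2 ]; then
        # cjb2 pages are independent, so encode them in parallel and bundle after.
        parallel cjb2 -lossless {} {.}.djvu ::: page_*.tif
        djvm -c book.djvu page_*.djvu
    else
        # minidjvu builds its shared dictionary from whatever pages it sees in one run,
        # so hand a single invocation every page instead of splitting the job up.
        minidjvu --aggression 50 --pages-per-dict 100 page_*.tif book.djvu
    fi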

boredland commented 9 years ago

But wouldn't that mean that, in terms of size, files compressed sequentially with the standard method should come out smaller than those compressed in clusters and merged afterwards? I have used "my" method quite often and never once saw that happen.

strider1551 commented 9 years ago

I'm not quite sure I'm following what you are saying. Let me explain, to the best of my understanding, how djvu works in this regard.

One thing that makes djvu files so efficient is the shape dictionary. If "s" appears 20 times in an image, the shape is compressed just once and the twenty places to put it are recorded. When cjb2 is used, each page has its own shape dictionary, and those dictionaries do not reference each other; so if you have 50 pages that all contain the letter "s", that shape is stored fifty times. When you combine the pages with djvm, those dictionaries are not combined; you still have fifty dictionaries, each with its own shape for "s". With minidjvu, all 50 pages reference the same, single dictionary, so the shape for "s" is stored once across all fifty pages. cjb2 can be run in parallel because the pages don't reference each other's dictionaries; minidjvu has to be run as one instance and given as many pages as possible so that it can build the best shared shape dictionary.
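
If you want to verify which of the two situations a finished book is in, djvudump from DjVuLibre shows it (this is just standard tooling, nothing specific to djvubind):

    # List the chunk structure of the bundled book.
    djvudump book.djvu | grep -E 'FORM|INCL|Djbz|Sjbz'

    # minidjvu output shows FORM:DJVI components containing a Djbz chunk (the
    # shared shape dictionary) plus an INCL chunk in each page pointing at it.
    # cjb2 + djvm output shows only a per-page Sjbz chunk, i.e. every page
    # keeps its own private dictionary.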

As far as your test goes, when djvubind is configured to use cjb2 --lossless I do see a decrease in execution time (201 -> 151), but the same file size (7.7MB). When using minidjvu --aggression 50 --pages-per-dict 100, I see a significant drop in execution time (545 -> 191) and a slight increase in file size (4.4MB -> 4.5MB).