mozilla / pdf.js

PDF Reader in JavaScript
https://mozilla.github.io/pdf.js/
Apache License 2.0
48.32k stars 9.97k forks

Major performance issues for a simple PDF file #6961

Closed timvandermeij closed 8 years ago

timvandermeij commented 8 years ago

I have created a simple PDF file using Scribus 1.5.0svn. Notice that the file size is large for a two-page PDF file, so I can only suspect that Scribus is doing something really inefficient when exporting the PDF file. Nevertheless, the PDF file below renders instantly (within 0.5 seconds) with both Adobe Acrobat Reader DC and Foxit Reader, whereas PDF.js takes 27 seconds to render only the first page. I have no idea why PDF.js is taking such an excessive amount of time, but we need to do better here, given that other viewers have no problems with this file. Are there perhaps inefficient patterns in this file that the optimizer could remove? I notice that the PDF file contains an excessive number of resources in the Resources dictionary of each page (XObject, Font, Pattern and ExtGState).

Below is the PDF file. I made this myself, so anyone is free to use this as a test case in a PR that addresses this issue: test.pdf

Rob--W commented 8 years ago

The CPU profiler shows that most of the time is spent in PartialEvaluator_hasBlendModes: https://github.com/mozilla/pdf.js/blob/a0aa781c394a8e5dedbf14c5db0195ab72e8ff46/src/core/evaluator.js#L144-L200

To profile (using Chrome), simply:

  1. Download test.pdf from above.
  2. Visit https://mozilla.github.io/pdf.js/web/viewer.html?file=
  3. Open the developer tools, go to the Profiles tab.
  4. Open test.pdf with PDF.js.
  5. Select "Target: pdf.worker.js" and click on Start to profile.
  6. Whenever you feel ready (e.g. when the CPU is idle), stop the capture.
  7. Analyse the results.

timvandermeij commented 8 years ago

Thank you for profiling this! This code seems to loop over (potentially) all ExtGState and XObject dictionaries, and there are a lot of them in this PDF file. I'm afraid the in operator might cause delays here.

timvandermeij commented 8 years ago

I have narrowed down the flaw a bit. For the first page, https://github.com/mozilla/pdf.js/blob/a0aa781c394a8e5dedbf14c5db0195ab72e8ff46/src/core/evaluator.js#L182 is triggered almost 31000 times, meaning that we attempt to process a lot of XObjects that we already processed before.

Snuffleupagus commented 8 years ago

This seems to do the trick, but I've not had time to run tests yet: https://github.com/mozilla/pdf.js/compare/master...Snuffleupagus:issue-6961. Also, I'm not sure what kind of test we can add for this, will look into this more later tonight.

timvandermeij commented 8 years ago

Funny, I also tried using getKeys() in a local test, but with no result, possibly because I missed https://github.com/mozilla/pdf.js/compare/master...Snuffleupagus:issue-6961#diff-0b94c2e77a5259f7a728122fdbf9f46aR182. Thanks for looking into this!

Snuffleupagus commented 8 years ago

So, my patch seems to pass all tests locally, but there are two problems:

First of all, how do we test this? Given that the run time is hardware/software dependent, I'm not sure how we can assert that a test doesn't run for too long!?

Second of all, I don't understand why the patch actually works ;-)

timvandermeij commented 8 years ago

The only test I can imagine is a unit test where we assert that the number of processed XObjects is less than it is currently. It's not great, but it's the only kind of test I can come up with since measuring the runtime is not an option. Otherwise I think it suffices to review the patch, test it manually and make sure that the test suite passes.

Regarding the second point, I would have to look into this more. If they are equal then the current code should do, so I'm not yet sure what the difference is with your patch.

Rob--W commented 8 years ago

> First of all, how do we test this? Given that the run time is hardware/software dependent, I'm not sure how we can assert that a test doesn't run for too long!?

27 seconds for such a simple PDF is excessive. We could create a suite of PDFs whose rendering time is measured (can be as simple as subtracting two time stamps), and then report the results to some central place. If we occasionally look at the results (e.g. a table, or a fancy graph), then we should get a good picture of what rendering times are normal and detect performance regressions.
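The "subtracting two time stamps" idea could be sketched as below. The document names and the dummy render function are placeholders, not real pdf.js calls; a real harness would wrap the actual page rendering and ship the table to whatever central reporting place we pick.

```javascript
// Minimal timing-harness sketch: measure each document's render time
// and collect the results in a table.
function timeRender(name, renderFn) {
  const start = Date.now();
  renderFn();
  return { name, elapsedMs: Date.now() - start };
}

const results = [];
for (const doc of ["test.pdf", "another.pdf"]) {
  results.push(
    timeRender(doc, () => {
      // Placeholder for the actual page rendering work.
      for (let i = 0; i < 1e6; i++) {}
    })
  );
}

// Report the measurements; occasional inspection of this table would
// reveal what rendering times are normal and flag regressions.
console.table(results);
```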

> Second of all, I don't understand why the patch actually works ;-)

What exactly is unclear? Replacing getAll with getKeys seems like an obvious boost, is there something else with your patch that magically improves the runtime?

Snuffleupagus commented 8 years ago

> What exactly is unclear? Replacing getAll with getKeys seems like an obvious boost, is there something else with your patch that magically improves the runtime?

The getKeys change is not noticeable in the grand scheme of things; the real improvement comes from https://github.com/mozilla/pdf.js/compare/master...Snuffleupagus:issue-6961#diff-0b94c2e77a5259f7a728122fdbf9f46aR186. In practice that check seems, for all intents and purposes, to be equal to an already existing check just below (as I commented above):

https://github.com/Snuffleupagus/pdf.js/blob/21a19c1a1cdeb1bb9ddae8a44e7da00c241c899e/src/core/evaluator.js#L193-L197 ought to be enough, since as far as I can tell xObject.dict.objId always seems to equal xObjects.getRaw(key).toString() in this PDF file.

Rob--W commented 8 years ago

Using getAll results in traversing the whole tree and fetching Refs, whereas using getKeys merely requires looking up all keys. But I think that if you call fetch on every Ref value, you get a similar runtime to using .getAll. By skipping .fetch if the Ref is already known, you save the overhead of resolving Refs.

(This is my guess; I didn't check whether it is really the reason.)

yurydelendik commented 8 years ago

The reasoning above is somewhat right. We had been thinking about disabling getAll(), not for performance reasons but because it can recursively pull unneeded data into the operator list. We probably need to review all getAll usages and remove them.

Snuffleupagus commented 8 years ago

@Rob--W You are absolutely correct, I don't know how I missed that myself. Thank you!