Closed bthorben closed 10 years ago
Update on our analysis: PDF.js seems to process the fonts in the document many times, essentially loading each font again for every page. This suggested a caching issue. As a test, we added a cache after the font is translated, and that made the document 10x faster.
@bthorben Nice! Could you make a pull request for that if it solves the PDF.js issue?
Since there has been a lot of focus on reducing memory consumption of PDF.js lately, it would also be interesting to know if, and how, this kind of caching impacts the memory consumption.
Our "solution" is really just a quick hack, something we added to test our theories about how PDF.js works. The way we cached is actually quite inefficient, and doing it right would probably make this document another 2-4 times faster. We will spend more time on finding an elegant solution.
@Snuffleupagus Can you give us some data that shows the problems with memory consumption? Regarding this issue: caching the fonts instead of generating them many times considerably reduces memory consumption when viewing this document.
> Can you give us some data that shows the problems with memory consumption?
Sorry, I don't think I expressed myself clearly enough! I just meant that it would be nice, when you submit a PR, to include a comment about the memory consumption before and after the patch. (Nothing complicated, just something like e.g. in #4355.)
@Snuffleupagus, ok, I see. It would be much nicer if we could have actual benchmarking.
We analysed the issue further. We wrote a small tool (available here) to gain insights into the document and its object graph. This is our conclusion, on which we will base a solution:
The document makes use of at least one Type 0 font. Type 0 fonts are basically composed of
In this particular case there are many Type 0 fonts (shown at [1]) which use the same CIDFont, as shown by this graph (extracted with our tool from an uncompressed version of the NASA budget):
The node on the left (177065 T6) is the font program of the CIDFont; above it you see its FontDescriptor and the CIDFont dictionary. We shortened the graph, but on the right you see three Type 0 fonts that use this font. The nodes 28, 46 and 10 are the CMap dictionaries, and the Type 0 fonts reference an array as their DescendantFonts that has our CIDFont as its sole element.
This situation shouldn't be that unusual (I guess it makes sense for a linearised document), but here is where it gets interesting: the CMaps are all the same, which means that the Type 0 fonts are effectively identical. Since PDF.js currently stores the translated Font object at the Type 0 font node (more precisely, at its parsed dictionary; compare [2]), a separate translated font is created for each of these Type 0 fonts. This is what makes the NASA budget document so slow in PDF.js.
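To illustrate the effect described above, here is a minimal sketch, not PDF.js code. It assumes a cache keyed by the reference of the Type 0 font, and every name in it (`translateFont`, `loadFont`, `cacheByRef`) is hypothetical. The three entries are modelled on objects 18490, 18496 and 18483 from [1].

```javascript
// Hypothetical stand-ins; none of these names are PDF.js APIs.
let translations = 0;

// Stands in for the expensive font translation step.
function translateFont(type0Dict) {
  translations++;
  return { name: type0Dict.baseFont, encoding: type0Dict.encoding };
}

// Three Type 0 fonts that differ only in their object reference.
const type0Fonts = [
  { ref: '18490R', baseFont: 'EZAGTP+Arial', encoding: 'Identity-H' },
  { ref: '18496R', baseFont: 'EZAGTP+Arial', encoding: 'Identity-H' },
  { ref: '18483R', baseFont: 'EZAGTP+Arial', encoding: 'Identity-H' },
];

// Cache keyed by the Type 0 font reference.
const cacheByRef = new Map();

function loadFont(type0Dict) {
  if (!cacheByRef.has(type0Dict.ref)) {
    cacheByRef.set(type0Dict.ref, translateFont(type0Dict));
  }
  return cacheByRef.get(type0Dict.ref);
}

for (const dict of type0Fonts) {
  loadFont(dict);
}

console.log(translations); // 3: one translation per Type 0 font
```

Because the key is the Type 0 reference, identical fonts never hit the cache, and the translation cost grows with the number of Type 0 font objects rather than with the number of distinct fonts.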
[1]
### CONTENT OF 18490 ###
18490 0 obj
<<
/BaseFont /EZAGTP+Arial
/DescendantFonts 13076 0 R
/Encoding /Identity-H
/Subtype /Type0
/ToUnicode 28 0 R
/Type /Font
>>
### END CONTENT 18490 ###
### CONTENT OF 18496 ###
18496 0 obj
<<
/BaseFont /EZAGTP+Arial
/DescendantFonts 13086 0 R
/Encoding /Identity-H
/Subtype /Type0
/ToUnicode 46 0 R
/Type /Font
>>
### END CONTENT 18496 ###
### CONTENT OF 18483 ###
18483 0 obj
<<
/BaseFont /EZAGTP+Arial
/DescendantFonts 13067 0 R
/Encoding /Identity-H
/Subtype /Type0
/ToUnicode 10 0 R
/Type /Font
>>
### END CONTENT 18483 ###
[2] this.fontCache.put(fontRef, font);
in src/core/evaluator.js
Our solution is relatively simple: we create a cache at the FontDescriptor of the CIDFont, indexed by encoding. This means that if the encoding is the same, the expensive font translation is done only once.
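A rough sketch of that idea, under stated assumptions: the descriptor is modelled as a plain object, the encoding as a string such as `Identity-H`, and `translate`, `getTranslatedFont` and `translatedByEncoding` are hypothetical names, not the PDF.js implementation. If the encoding were an embedded CMap stream, a different key would have to be derived from it.

```javascript
// Hypothetical stand-ins; none of these names are PDF.js APIs.
let translationCount = 0;

// Stands in for the expensive font translation step.
function translate(descriptor, encoding) {
  translationCount++;
  return { fontName: descriptor.fontName, encoding };
}

// Look up the translated font on the shared FontDescriptor, keyed by encoding.
function getTranslatedFont(descriptor, encoding) {
  if (!descriptor.translatedByEncoding) {
    descriptor.translatedByEncoding = new Map();
  }
  const cache = descriptor.translatedByEncoding;
  if (!cache.has(encoding)) {
    cache.set(encoding, translate(descriptor, encoding));
  }
  return cache.get(encoding);
}

// One FontDescriptor shared by all Type 0 fonts that use the same CIDFont.
const sharedDescriptor = { fontName: 'EZAGTP+Arial' };

const first = getTranslatedFont(sharedDescriptor, 'Identity-H');
const second = getTranslatedFont(sharedDescriptor, 'Identity-H');
const third = getTranslatedFont(sharedDescriptor, 'Identity-H');

console.log(translationCount); // 1
console.log(first === second && second === third); // true
```

Keeping the cache on the descriptor ties its lifetime to the descriptor object, so whatever releases the descriptor releases the translated fonts with it. A different encoding on the same descriptor still gets its own entry.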
Performance is extremely poor when viewing the NASA 2014 Budget request available at http://www.nasa.gov/pdf/750614main_NASA_FY_2014_Budget_Estimates-508.pdf