Closed igorbednar closed 3 years ago
Minimal reproduction: setCharSpacingInsane.pdf
Chrome crashes because the page runs out of memory. Firefox does not crash but tries to use lots of memory (Gigabytes of memory, high CPU usage, lots of time spent on garbage collection, eventually an "out of memory" error in getTextContent in the console). Here is the profiler data: https://profiler.firefox.com/public/zsq9tz8bssr3wdky41djv0caafkv2kxntns91r0/flame-graph/?globalTrackOrder=0-1-2-3-4&localTrackOrderByPid=143552-0~143652-0~&thread=3&v=5
According to the profile from Firefox (and the call stack of Chrome's crash), something attempts to repeatedly grow an array.
I also see that lots of time is spent in addFakeSpaces
:
https://github.com/mozilla/pdf.js/blob/403816040ef64ae17758439948771c2920f96ac4/src/core/evaluator.js#L1999-L2011
I attached a debugger and saw that width = 2147483649.949
and fakeSpaceMin = 6.831
, i.e. fakeSpaces = 314373247
= 300M+. This is a LOT, and easily explains the crash.
This large value for width
originates from textContentItem.lastAdvanceWidth
(as a very large negative value) @ https://github.com/mozilla/pdf.js/blob/403816040ef64ae17758439948771c2920f96ac4/src/core/evaluator.js#L2121
from width
@ https://github.com/mozilla/pdf.js/blob/403816040ef64ae17758439948771c2920f96ac4/src/core/evaluator.js#L1989
from charSpacing
@ https://github.com/mozilla/pdf.js/blob/403816040ef64ae17758439948771c2920f96ac4/src/core/evaluator.js#L1976
from textState.charSpacing
@ https://github.com/mozilla/pdf.js/blob/403816040ef64ae17758439948771c2920f96ac4/src/core/evaluator.js#L1963
from OPS.setCharSpacing
https://github.com/mozilla/pdf.js/blob/403816040ef64ae17758439948771c2920f96ac4/src/core/evaluator.js#L2189-L2190
When I decompress the PDF file with qpdf --stream-data=uncompress --object-streams=disable document.pdf uncompressed.pdf
, and open the PDF in a binary editor, I see that it does indeed contain multiple setCharSpacing (= Tc
) commands with excessive values (e.g. -2147483647
) (plus instructions to to trigger the addFakeSpaces
code path):
/R2 8 Tf -2147483647 Tc -3.146 Tw
53.006 0 Td
To reproduce, the above instructions are repeated twice. The first run initializes textContentItem
, which causes the second moveText
command to pass the condition and enter the addFakeSpaces
function with excessively high values:
@timvandermeij I am not sure "corrupted pdf" is a correct label here as adobe acrobat reference implementation and chrome built in pdf viewer display this file without any problem. We have lots of clients that upload such files every day. It is hard to blame pdf file as other viewers work.
fakeSpaces = 314373247
= 300M+. This is a LOT, and easily explains the crash.
I've seen similar figures on my end. This is longer than the supported string size by Spidermonkey of ~268M:
The resulting string can also not be larger than the maximum string size, which can differ in JavaScript engines. In Firefox (SpiderMonkey) the maximum string size is 2^28 -1 (0xFFFFFFF).
Yet somehow NOT limiting the length inside addFakeSpaces
blows the process heap to ~13GB. Capping to maxString/2
(see below) the process maxes at ~4.4GB during render.
Since I did some testing with String.p.repeat
, I had to limit the length to (0xFFFFFFF/2): it is ~5s faster (and non-blocking to JS caller) than the incremental push version. Using .repeat
should in theory greatly reduce GC pressure, however I lack tools to confirm.
_Faster as in the CPU load by the process drops to zero 5s earlier (27s vs ~33s)._
Another for this case worrisome discovery is that after render's finished & Full GC/minimise is triggered, pdf.js worker still has 2GB occupied for some Array objects (memory-report.txt), on top of DOM ("only" 770MB) and misc. Is this (worker cache/buffer) really needed for single page/idle documents?
Testing done on: Firefox 80b3 x64 Ryzen 1700 (OS-governed clock up to 3.8 GHz steady state)
Modified pdf.js version from gh-pages branch:
var fakeSpaces = Math.round(Math.min(width / textContentItem.spaceWidth, 0xFFFFFFF/2));
while (fakeSpaces-- > 0) {
strBuf.push(" ");
}
}
vs
var fakeSpaces = Math.round(Math.min(width / textContentItem.spaceWidth, 0xFFFFFFF/2));
strBuf.push(" ".repeat(fakeSpaces));
}
@igorbednar Until shown otherwise, this seems to be the right classification for PDF files with excessive setCharSpacing
commands. I'm not saying we shouldn't do anything about this on our end, but just because people upload such files doesn't mean it's correct. Other PDF viewers have to do the same as PDF.js, which is implementing workarounds for cases that the PDF specification often disallows but that do happen in practice due to bad PDF generators. So yes, even if other viewers work the PDF file can certainly be to blame. If we got a dollar for every spec-violating PDF file/generator we have seen, we might have been rich by now...
Testing this PDF with below pdfJS version still seems to crash. The PDF initially loads and looks good, but then crashes a couple seconds later.
const pdfjsVersion = '2.7.90';
const pdfjsBuild = '6728c8fa6';
Configuration:
Steps to reproduce the problem:
Expected: Pdf should be displayed Actual:
Chrome tab crashes with "Aw snap" error