mozilla / pdf.js

PDF Reader in JavaScript
https://mozilla.github.io/pdf.js/
Apache License 2.0
48.35k stars 9.97k forks source link

addFakeSpaces can easily trigger out-of-memory and cause Chrome to crash #12142

Closed igorbednar closed 3 years ago

igorbednar commented 4 years ago

Configuration:

Steps to reproduce the problem:

  1. Load attached document.pdf in https://mozilla.github.io/pdf.js/web/viewer.html document.pdf

Expected: Pdf should be displayed Actual: image

Chrome tab crashes with "Aw snap" error

Rob--W commented 4 years ago

Minimal reproduction: setCharSpacingInsane.pdf

Chrome crashes because the page runs out of memory. Firefox does not crash but tries to use lots of memory (Gigabytes of memory, high CPU usage, lots of time spent on garbage collection, eventually an "out of memory" error in getTextContent in the console). Here is the profiler data: https://profiler.firefox.com/public/zsq9tz8bssr3wdky41djv0caafkv2kxntns91r0/flame-graph/?globalTrackOrder=0-1-2-3-4&localTrackOrderByPid=143552-0~143652-0~&thread=3&v=5

According to the profile from Firefox (and the call stack of Chrome's crash), something attempts to repeatedly grow an array. I also see that lots of time is spent in addFakeSpaces : https://github.com/mozilla/pdf.js/blob/403816040ef64ae17758439948771c2920f96ac4/src/core/evaluator.js#L1999-L2011

I attached a debugger and saw that width = 2147483649.949 and fakeSpaceMin = 6.831, i.e. fakeSpaces = 314373247 = 300M+. This is a LOT, and easily explains the crash.

This large value for width originates from textContentItem.lastAdvanceWidth (as a very large negative value) @ https://github.com/mozilla/pdf.js/blob/403816040ef64ae17758439948771c2920f96ac4/src/core/evaluator.js#L2121 from width @ https://github.com/mozilla/pdf.js/blob/403816040ef64ae17758439948771c2920f96ac4/src/core/evaluator.js#L1989 from charSpacing @ https://github.com/mozilla/pdf.js/blob/403816040ef64ae17758439948771c2920f96ac4/src/core/evaluator.js#L1976 from textState.charSpacing @ https://github.com/mozilla/pdf.js/blob/403816040ef64ae17758439948771c2920f96ac4/src/core/evaluator.js#L1963 from OPS.setCharSpacing https://github.com/mozilla/pdf.js/blob/403816040ef64ae17758439948771c2920f96ac4/src/core/evaluator.js#L2189-L2190

When I decompress the PDF file with qpdf --stream-data=uncompress --object-streams=disable document.pdf uncompressed.pdf, and open the PDF in a binary editor, I see that it does indeed contain multiple setCharSpacing (= Tc) commands with excessive values (e.g. -2147483647) (plus instructions to to trigger the addFakeSpaces code path):

/R2 8 Tf -2147483647 Tc -3.146 Tw 
53.006 0 Td

To reproduce, the above instructions are repeated twice. The first run initializes textContentItem, which causes the second moveText command to pass the condition and enter the addFakeSpaces function with excessively high values:

https://github.com/mozilla/pdf.js/blob/403816040ef64ae17758439948771c2920f96ac4/src/core/evaluator.js#L2107-L2123

igorbednar commented 4 years ago

@timvandermeij I am not sure "corrupted pdf" is a correct label here as adobe acrobat reference implementation and chrome built in pdf viewer display this file without any problem. We have lots of clients that upload such files every day. It is hard to blame pdf file as other viewers work.

VADemon commented 4 years ago

fakeSpaces = 314373247 = 300M+. This is a LOT, and easily explains the crash.

I've seen similar figures on my end. This is longer than the supported string size by Spidermonkey of ~268M:

The resulting string can also not be larger than the maximum string size, which can differ in JavaScript engines. In Firefox (SpiderMonkey) the maximum string size is 2^28 -1 (0xFFFFFFF).

Yet somehow NOT limiting the length inside addFakeSpaces blows the process heap to ~13GB. Capping to maxString/2 (see below) the process maxes at ~4.4GB during render.

Since I did some testing with String.p.repeat, I had to limit the length to (0xFFFFFFF/2): it is ~5s faster (and non-blocking to JS caller) than the incremental push version. Using .repeat should in theory greatly reduce GC pressure, however I lack tools to confirm. _Faster as in the CPU load by the process drops to zero 5s earlier (27s vs ~33s)._

Another for this case worrisome discovery is that after render's finished & Full GC/minimise is triggered, pdf.js worker still has 2GB occupied for some Array objects (memory-report.txt), on top of DOM ("only" 770MB) and misc. Is this (worker cache/buffer) really needed for single page/idle documents?


Testing done on: Firefox 80b3 x64 Ryzen 1700 (OS-governed clock up to 3.8 GHz steady state)

Modified pdf.js version from gh-pages branch:

      var fakeSpaces = Math.round(Math.min(width / textContentItem.spaceWidth, 0xFFFFFFF/2));
      while (fakeSpaces-- > 0) {
        strBuf.push(" ");
      }
    }

vs

      var fakeSpaces = Math.round(Math.min(width / textContentItem.spaceWidth, 0xFFFFFFF/2));
      strBuf.push(" ".repeat(fakeSpaces));
    }
timvandermeij commented 4 years ago

@igorbednar Until shown otherwise, this seems to be the right classification for PDF files with excessive setCharSpacing commands. I'm not saying we shouldn't do anything about this on our end, but just because people upload such files doesn't mean it's correct. Other PDF viewers have to do the same as PDF.js, which is implementing workarounds for cases that the PDF specification often disallows but that do happen in practice due to bad PDF generators. So yes, even if other viewers work the PDF file can certainly be to blame. If we got a dollar for every spec-violating PDF file/generator we have seen, we might have been rich by now...

amit777 commented 4 years ago

Testing this PDF with below pdfJS version still seems to crash. The PDF initially loads and looks good, but then crashes a couple seconds later.

const pdfjsVersion = '2.7.90';
const pdfjsBuild = '6728c8fa6';
calixteman commented 3 years ago

Fixed by https://github.com/mozilla/pdf.js/pull/13257.