modesty / pdf2json

converts binary PDF to JSON and text, for server-side PDF processing and command-line use.
https://github.com/modesty/pdf2json

Silent error on parse semi-transparent content #157

Open alexlcddd opened 6 years ago

alexlcddd commented 6 years ago

Hello, I recently found a bug when parsing some .pdf files. If a file has symbols colored in a semi-transparent color, the program just stops without any error message. Here's the code that stops silently (Declaration.pdf is the "unparsable" document):

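let fs = require('fs'),
    PDFParser = require("pdf2json");

// The second argument (1) asks the parser to keep raw text for getRawTextContent().
let pdfParser = new PDFParser(this, 1);

pdfParser.on("pdfParser_dataError", errData => console.error(errData.parserError) );
pdfParser.on("pdfParser_dataReady", pdfData => {
    // fs.writeFile needs a callback; without one, current Node.js versions throw.
    fs.writeFile("./content.txt", pdfParser.getRawTextContent(), err => {
        if (err) console.error(err);
    });
});

pdfParser.loadPDF("./Declaration.pdf");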
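
Since neither event fires for the affected file, one way to make the stall visible is a watchdog timer around the two events used above. This is only a minimal sketch: `watchParser` is a hypothetical helper, not part of pdf2json, and it assumes only that the parser exposes `.on(...)` as in the snippet above.

```javascript
// Hypothetical helper (not part of pdf2json): reports exactly one outcome,
// "ready", "error", or "timeout", for a parser that emits the two events above.
function watchParser(parser, timeoutMs, onOutcome) {
  let settled = false;

  const finish = (status, detail) => {
    if (settled) return; // ignore anything after the first outcome
    settled = true;
    clearTimeout(timer);
    onOutcome(status, detail);
  };

  // If no event arrives in time, report the silence instead of stopping quietly.
  const timer = setTimeout(
    () => finish("timeout", new Error(`No parser event within ${timeoutMs} ms`)),
    timeoutMs
  );

  parser.on("pdfParser_dataReady", data => finish("ready", data));
  parser.on("pdfParser_dataError", err => finish("error", err));
}

// Usage with the parser from the snippet above (call before loadPDF):
//   watchParser(pdfParser, 10000, (status, detail) => console.log(status, detail));
//   pdfParser.loadPDF("./Declaration.pdf");
```

The pending timer also keeps the Node.js process alive until an outcome is reported, so a silent stop shows up as an explicit "timeout" rather than a clean exit.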

After that, I tried to parse the document directly with pdf2json: node pdf2json.js "../../documents/whitepaper/47/WP_En.pdf" "../../" and got these errors:

Warning: Unhandled rejection: Error: JPEG error: Unsupported color mode (4 components)
Error: JPEG error: Unsupported color mode (4 components)
    at error (eval at <anonymous> (app\node_modules\pdf2json\lib\pdf.js:64:1), <anonymous>:195:9)
    at JpegStream_ensureBuffer [as ensureBuffer] (eval at <anonymous> (app\node_modules\pdf2json\lib\pdf.js:64:1), <anonymous>:25571:7)
    at JpegStream.DecodeStream_getBytes [as getBytes] (eval at <anonymous> (app\node_modules\pdf2json\lib\pdf.js:64:1), <anonymous>:24875:14)
    at PDFImage_getImageBytes [as getImageBytes] (eval at <anonymous> (app\node_modules\pdf2json\lib\pdf.js:64:1), <anonymous>:21009:25)
    at PDFImage_fillRgbaBuffer [as fillRgbaBuffer] (eval at <anonymous> (app\node_modules\pdf2json\lib\pdf.js:64:1), <anonymous>:20941:27)
    at PDFImage_getImageData [as getImageData] (eval at <anonymous> (app\node_modules\pdf2json\lib\pdf.js:64:1), <anonymous>:21004:12)
    at eval (eval at <anonymous> (app\node_modules\pdf2json\lib\pdf.js:64:1), <anonymous>:7125:34)
    at Object.eval [as onResolve] (eval at <anonymous> (app\node_modules\pdf2json\lib\pdf.js:64:1), <anonymous>:20657:7)
    at Object.runHandlers (eval at <anonymous> (app\node_modules\pdf2json\lib\pdf.js:64:1), <anonymous>:864:35)
    at ontimeout (timers.js:466:11)

In a nutshell, my workaround was to comment out two lines (832-833) in /base/core/jpg.js:

if (!this.adobe)
    throw 'Unsupported color mode (4 components)';

After that I successfully parsed my document.
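
The error message points at a JPEG image with four color components, and the check above fires when `this.adobe` is unset, presumably because the image carries no Adobe APP14 marker. To see whether a given image falls into that case, the JPEG header can be inspected directly. The following is a rough sketch, assuming the raw JPEG bytes have already been extracted from the PDF; `inspectJpeg` is a hypothetical helper and not part of pdf2json.

```javascript
// Hypothetical helper (not part of pdf2json): walks the JPEG header segments
// and reports the component count from the frame header (SOF) and whether
// an Adobe APP14 segment is present.
function inspectJpeg(bytes) {
  if (bytes.length < 4 || bytes[0] !== 0xff || bytes[1] !== 0xd8) {
    throw new Error("Not a JPEG stream (missing SOI marker)");
  }

  let components = null;
  let hasAdobeMarker = false;
  let pos = 2; // first segment starts right after SOI

  while (pos + 4 <= bytes.length) {
    if (bytes[pos] !== 0xff) break; // lost sync, stop scanning
    const marker = bytes[pos + 1];

    if (marker === 0xff) { pos += 1; continue; } // fill byte
    if (marker === 0x01 || (marker >= 0xd0 && marker <= 0xd7)) {
      pos += 2; // standalone marker without a length field
      continue;
    }
    if (marker === 0xda || marker === 0xd9) break; // SOS or EOI: header is over

    // Segment length is big-endian and includes its own two bytes.
    const length = (bytes[pos + 2] << 8) | bytes[pos + 3];

    // SOF markers are 0xC0..0xCF except DHT (0xC4), JPG (0xC8), DAC (0xCC).
    const isSof = marker >= 0xc0 && marker <= 0xcf &&
      marker !== 0xc4 && marker !== 0xc8 && marker !== 0xcc;
    if (isSof && pos + 9 < bytes.length) {
      // Layout after the length: precision (1), height (2), width (2), components (1).
      components = bytes[pos + 9];
    }

    // APP14 (0xEE) written by Adobe tools starts with the ASCII tag "Adobe".
    if (marker === 0xee) {
      const tag = String.fromCharCode(...bytes.slice(pos + 4, pos + 9));
      if (tag === "Adobe") hasAdobeMarker = true;
    }

    pos += 2 + length;
  }

  return { components, hasAdobeMarker };
}
```

A result of `{ components: 4, hasAdobeMarker: false }` would match the situation the commented-out check rejects.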

So, @modesty, can you fix this or just remove the check?

containerman17 commented 4 years ago

++

bernatbombi commented 2 years ago

++

modesty commented 2 years ago

Can you upload the PDF?

fawmi commented 2 years ago

This error still occurs in the latest version. Which non-public channel do you prefer, so that I can send you a copy of the PDF file?

PiratesKing13 commented 6 months ago

Any update on this issue? I have the same problem.