unhandled Error: XRefParseException "Error: Invalid XRef stream header"

cyril23 commented 3 years ago

Information

Node: v10.23.3
npm: '6.14.13'
pdf2json@1.2.3
Test files:
- If you need it, I can email you my client's original PDF file.
  - The file can be opened in Adobe Acrobat, "optimized" to a compressed PDF and then the file can be read without problems, with both Adobe Reader and pdf2json.
  - I understand my client's PDF has a problem, and it's the fault of his PDF creator program.
  - But my problem is that pdf2json does not fire the "pdfParser_dataError" event when handling invalid PDFs (see below). This can be reproduced with my manually edited PDF file, too.
- To avoid publishing my client's data here, I manually edited the PDF. The edited file contains no usable data, but it seems to produce a similar error: problem_file_anon.pdf

Expected behavior

The "pdfParser_dataError" event is fired, so I can handle the error and inform the user about the problematic PDF file

Actual behavior

An Exception is thrown, and I cannot handle it in my app:

(while reading XRef): Error: Invalid XRef stream header
<eval>/VM46947590:5682
XRefParseException
at XRefParseExceptionClosure (eval at <anonymous> (c:\Source\Apps\Projects\PdfApp\node_modules\pdf2json\lib\pdf.js:64:1), <anonymous>:379:34)
at eval (eval at <anonymous> (c:\Source\Apps\Projects\PdfApp\node_modules\pdf2json\lib\pdf.js:64:1), <anonymous>:384:3)
at Object.<anonymous> (c:\Source\Apps\Projects\PdfApp\node_modules\pdf2json\lib\pdf.js:64:1)
at Module._compile (internal/modules/cjs/loader.js:778:30)
at Object.Module._extensions..js (internal/modules/cjs/loader.js:789:10)
at Module.load (internal/modules/cjs/loader.js:653:32)
at tryModuleLoad (internal/modules/cjs/loader.js:593:12)
at Function.Module._load (internal/modules/cjs/loader.js:585:3)
at Module.require (internal/modules/cjs/loader.js:692:17)
at require (internal/modules/cjs/helpers.js:25:18)
<eval>/VM46947590:32512
Error: Illegal character: 41
at error (eval at <anonymous> (c:\Source\Apps\Projects\PdfApp\node_modules\pdf2json\lib\pdf.js:64:1), <anonymous>:195:9)
at Lexer_getObj [as getObj] (eval at <anonymous> (c:\Source\Apps\Projects\PdfApp\node_modules\pdf2json\lib\pdf.js:64:1), <anonymous>:24616:11)
at Parser_shift [as shift] (eval at <anonymous> (c:\Source\Apps\Projects\PdfApp\node_modules\pdf2json\lib\pdf.js:64:1), <anonymous>:24038:32)
at Parser_makeStream [as makeStream] (eval at <anonymous> (c:\Source\Apps\Projects\PdfApp\node_modules\pdf2json\lib\pdf.js:64:1), <anonymous>:24195:12)
at Parser_getObj [as getObj] (eval at <anonymous> (c:\Source\Apps\Projects\PdfApp\node_modules\pdf2json\lib\pdf.js:64:1), <anonymous>:24079:18)
at XRef_fetch [as fetch] (eval at <anonymous> (c:\Source\Apps\Projects\PdfApp\node_modules\pdf2json\lib\pdf.js:64:1), <anonymous>:5753:22)
at XRef_fetchIfRef [as fetchIfRef] (eval at <anonymous> (c:\Source\Apps\Projects\PdfApp\node_modules\pdf2json\lib\pdf.js:64:1), <anonymous>:5699:19)
at Dict_get [as get] (eval at <anonymous> (c:\Source\Apps\Projects\PdfApp\node_modules\pdf2json\lib\pdf.js:64:1), <anonymous>:4759:28)
at Page_getPageProp [as getPageProp] (eval at <anonymous> (c:\Source\Apps\Projects\PdfApp\node_modules\pdf2json\lib\pdf.js:64:1), <anonymous>:4213:28)
at Page.get content [as content] (eval at <anonymous> (c:\Source\Apps\Projects\PdfApp\node_modules\pdf2json\lib\pdf.js:64:1), <anonymous>:4227:19)
at Page_getContentStream [as getContentStream] (eval at <anonymous> (c:\Source\Apps\Projects\PdfApp\node_modules\pdf2json\lib\pdf.js:64:1), <anonymous>:4273:26)
at LocalPdfManager_ensure [as ensure] (eval at <anonymous> (c:\Source\Apps\Projects\PdfApp\node_modules\pdf2json\lib\pdf.js:64:1), <anonymous>:32506:24)
at Page_getOperatorList [as getOperatorList] (eval at <anonymous> (c:\Source\Apps\Projects\PdfApp\node_modules\pdf2json\lib\pdf.js:64:1), <anonymous>:4318:45)
at Object.eval [as onResolve] (eval at <anonymous> (c:\Source\Apps\Projects\PdfApp\node_modules\pdf2json\lib\pdf.js:64:1), <anonymous>:27397:14)
at Object.runHandlers (eval at <anonymous> (c:\Source\Apps\Projects\PdfApp\node_modules\pdf2json\lib\pdf.js:64:1), <anonymous>:864:35)
at ontimeout (timers.js:436:11)

Info: I get a similar error with my manually edited PDF, which I uploaded here:

(while reading XRef): Error: Invalid XRef stream header
<eval>/VM46947590:5682
XRefParseException
    at XRefParseExceptionClosure (eval at <anonymous> (c:\Source\Apps\Projects\PdfApp\node_modules\pdf2json\lib\pdf.js:64:1), <anonymous>:379:34)
    at eval (eval at <anonymous> (c:\Source\Apps\Projects\PdfApp\node_modules\pdf2json\lib\pdf.js:64:1), <anonymous>:384:3)
    at Object.<anonymous> (c:\Source\Apps\Projects\PdfApp\node_modules\pdf2json\lib\pdf.js:64:1)
    at Module._compile (internal/modules/cjs/loader.js:778:30)
    at Object.Module._extensions..js (internal/modules/cjs/loader.js:789:10)
    at Module.load (internal/modules/cjs/loader.js:653:32)
    at tryModuleLoad (internal/modules/cjs/loader.js:593:12)
    at Function.Module._load (internal/modules/cjs/loader.js:585:3)
    at Module.require (internal/modules/cjs/loader.js:692:17)
    at require (internal/modules/cjs/helpers.js:25:18)

Example Code

let PDFParser = require('pdf2json');
let pdfParser = new PDFParser();
pdfParser.on('pdfParser_dataError', errData => { // unfortunately, this is NOT fired in this case
    console.log(errData.parserError);
});
pdfParser.on('pdfParser_dataReady', pdfData => {
    console.log(pdfData.formImage);
});
pdfParser.loadPDF("./problem_file_anon.pdf"); // edit: fixed path
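
For callers who prefer a single place to handle both outcomes, the two events can be wrapped in a Promise. This is only a sketch: `parseToPromise` is a hypothetical helper, not part of pdf2json, and it relies solely on the event names and the `loadPDF` call shown above. It cannot help with the exception reported in this issue on 1.2.3, because that exception is thrown from a timer callback rather than emitted as an event; it only becomes useful once the library emits `pdfParser_dataError` for such files.

```javascript
const { EventEmitter } = require('events');

// Hypothetical helper (not part of pdf2json): settles once with whichever
// of the two events arrives first, then detaches both listeners.
function parseToPromise(parser, filePath) {
  return new Promise((resolve, reject) => {
    const onReady = (pdfData) => {
      cleanup();
      resolve(pdfData);
    };
    const onError = (errData) => {
      cleanup();
      // errData.parserError is the field read in the example above;
      // fall back to the raw payload if it is missing.
      reject(errData && errData.parserError ? errData.parserError : errData);
    };
    const cleanup = () => {
      parser.removeListener('pdfParser_dataReady', onReady);
      parser.removeListener('pdfParser_dataError', onError);
    };

    parser.on('pdfParser_dataReady', onReady);
    parser.on('pdfParser_dataError', onError);

    try {
      parser.loadPDF(filePath);
    } catch (err) {
      // Covers synchronous throws only; errors thrown later from a timer
      // callback still bypass this block.
      cleanup();
      reject(err);
    }
  });
}

// Usage (assumes pdf2json is installed):
// const PDFParser = require('pdf2json');
// parseToPromise(new PDFParser(), './problem_file_anon.pdf')
//   .then((pdfData) => console.log(pdfData.formImage))
//   .catch((err) => console.error('Could not parse PDF:', err));
```

Any object that emits the two events and exposes `loadPDF` works here, which also makes the wrapper easy to exercise with a plain `EventEmitter` stand-in.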

cyril23 commented 3 years ago

This looks like a corrupted XRef table plus incorrect stream lengths. Could you fire pdfParser_dataError in that case, too, please?

edit: using qpdf on one of the problem files (original ones): qpdf

using GhostScript:

GPL Ghostscript 9.26 (2018-11-20)
Copyright (C) 2018 Artifex Software, Inc.  All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
   **** Error:  An error occurred while reading an XREF table.
   **** The file has been damaged.  This may have been caused
   **** by a problem while converting or transfering the file.
   **** Ghostscript will attempt to recover the data.
   **** However, the output may be incorrect.
Processing pages 1 through 3.
Page 1
   **** Error: stream Length incorrect.
               Output may be incorrect.
   **** Error: stream Length incorrect.
               Output may be incorrect.
   **** Error: stream Length incorrect.
               Output may be incorrect.
Page 2
   **** Error: stream Length incorrect.
               Output may be incorrect.
   **** Error: stream Length incorrect.
               Output may be incorrect.
   **** Error: stream Length incorrect.
               Output may be incorrect.
Page 3
   **** Error: stream Length incorrect.
               Output may be incorrect.
   **** Error: stream Length incorrect.
               Output may be incorrect.
   **** Error: stream Length incorrect.
               Output may be incorrect.

   **** This file had errors that were repaired or ignored.
   **** Please notify the author of the software that produced this
   **** file that it does not conform to Adobe's published PDF
   **** specification.

   **** The rendered output from this file may be incorrect.
GS>

edit: using GhostScript with -dPDFDEBUG param: output-detail.txt

edit: ... or even better, try to fix the file errors or ignore them where possible

Tsopic commented 3 years ago

Having the same error. Is there any fix or workaround for this?

Tsopic commented 3 years ago

I validated my PDF against multiple online PDF validators, and indeed it does not conform to the format rules. I added this check to my pre-processing validation:

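export const isPDF = (buf: Buffer) => {
  return (
    Buffer.isBuffer(buf) &&
    // The header must sit at the very start. indexOf avoids a false negative
    // when "%PDF-" also appears later in the file (e.g. in an embedded PDF).
    buf.indexOf("%PDF-") === 0 &&
    // The end-of-file marker must be present somewhere in the buffer.
    buf.lastIndexOf("%%EOF") > -1
  );
};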
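
The marker check above only inspects raw bytes, so it is a cheap pre-filter and not a real validation: a file with a damaged XRef table, like the one in this issue, may still contain both markers. A slightly more lenient variant is sketched below in plain JavaScript. It assumes that the header may be preceded by a little junk and that `%%EOF` may be followed by trailing bytes, so it searches only a window at each end; `looksLikePdf` and the 1024-byte `windowSize` default are hypothetical choices, not something pdf2json requires.

```javascript
// Sketch of a lenient marker check (hypothetical helper, not part of pdf2json).
function looksLikePdf(buf, windowSize = 1024) {
  if (!Buffer.isBuffer(buf)) return false;

  // Look for the header only near the start of the file.
  const head = buf.subarray(0, windowSize);
  // Look for the end-of-file marker only near the end of the file.
  const tail = buf.subarray(Math.max(0, buf.length - windowSize));

  return head.includes('%PDF-') && tail.includes('%%EOF');
}

// Usage:
// const fs = require('fs');
// if (!looksLikePdf(fs.readFileSync('./upload.pdf'))) {
//   // reject the upload before handing it to the parser
// }
```

Restricting the search to both ends keeps the check fast on large files and avoids matching marker strings that merely occur inside page content.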

modesty commented 3 years ago

Fix pushed. Please test with `git pull && rm -rf node_modules/ && npm i && npm run test-misc`.

cyril23 commented 3 years ago

fix pushed

thanks a lot!

npm run test-misc

First I had to manually create the output directories via mkdir -p test/target/misc, but then the tests could be run:

$ npm run test-misc
# lots of output that I skipped here, see output.log below
Additional streams OK: 
 [
  'test/target/misc/i64_schedule_generator.content.txt',
  'test/target/misc/i64_schedule_generator.merged.json'
]

6 input files   4 success   2 fail  0 warning.
[
  '✓ Parse Success - i221_tianjin_invoice.pdf',
  '✗ Parse Exception: An error occurred while parsing the PDF: Error: Invalid XRef stream header - i243_problem_file_anon.pdf',
  '✓ Parse Success - i26_crash_18277.pdf',
  '✓ Parse Success - i28_line_break_210.pdf',
  '✗ Parse Exception: An error occurred while parsing the PDF: unsupported encryption algorithm - i43_encrypted.pdf',
  '✓ Parse Success - i64_schedule_generator.pdf'
]
pdf2json@1.2.5 [https://github.com/modesty/pdf2json]: 7.188s

Expected: 4 success, 2 failure
$

Full log: output.log

Tested on Ubuntu 16 LTS with Node 14

modesty commented 3 years ago

fixed in 1.2.5
