modesty / pdf2json

converts binary PDF to JSON and text, for server-side PDF processing and command-line use.
https://github.com/modesty/pdf2json

unhandled Error: XRefParseException "Error: Invalid XRef stream header" #243

Closed: cyril23 closed this issue 3 years ago

cyril23 commented 3 years ago

Information

Expected behavior

Actual behavior

Info: I get a similar error with my manually edited PDF, which I uploaded here:

(while reading XRef): Error: Invalid XRef stream header
<eval>/VM46947590:5682
XRefParseException
    at XRefParseExceptionClosure (eval at <anonymous> (c:\Source\Apps\Projects\PdfApp\node_modules\pdf2json\lib\pdf.js:64:1), <anonymous>:379:34)
    at eval (eval at <anonymous> (c:\Source\Apps\Projects\PdfApp\node_modules\pdf2json\lib\pdf.js:64:1), <anonymous>:384:3)
    at Object.<anonymous> (c:\Source\Apps\Projects\PdfApp\node_modules\pdf2json\lib\pdf.js:64:1)
    at Module._compile (internal/modules/cjs/loader.js:778:30)
    at Object.Module._extensions..js (internal/modules/cjs/loader.js:789:10)
    at Module.load (internal/modules/cjs/loader.js:653:32)
    at tryModuleLoad (internal/modules/cjs/loader.js:593:12)
    at Function.Module._load (internal/modules/cjs/loader.js:585:3)
    at Module.require (internal/modules/cjs/loader.js:692:17)
    at require (internal/modules/cjs/helpers.js:25:18)

Example Code

let PDFParser = require('pdf2json');
let pdfParser = new PDFParser();
pdfParser.on('pdfParser_dataError', errData => { // unfortunately, this is NOT fired in this case
    console.log(errData.parserError);
});
pdfParser.on('pdfParser_dataReady', pdfData => {
    console.log(pdfData.formImage);
});
pdfParser.loadPDF("./problem_file_anon.pdf"); // edit: fixed path
cyril23 commented 3 years ago

Seems like the XRef table and the stream lengths are corrupted. Can you fire pdfParser_dataError in that case, too, please? edit: running qpdf on one of the (original) problem files: qpdf; running Ghostscript:

GPL Ghostscript 9.26 (2018-11-20)
Copyright (C) 2018 Artifex Software, Inc.  All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
   **** Error:  An error occurred while reading an XREF table.
   **** The file has been damaged.  This may have been caused
   **** by a problem while converting or transfering the file.
   **** Ghostscript will attempt to recover the data.
   **** However, the output may be incorrect.
Processing pages 1 through 3.
Page 1
   **** Error: stream Length incorrect.
               Output may be incorrect.
   **** Error: stream Length incorrect.
               Output may be incorrect.
   **** Error: stream Length incorrect.
               Output may be incorrect.
Page 2
   **** Error: stream Length incorrect.
               Output may be incorrect.
   **** Error: stream Length incorrect.
               Output may be incorrect.
   **** Error: stream Length incorrect.
               Output may be incorrect.
Page 3
   **** Error: stream Length incorrect.
               Output may be incorrect.
   **** Error: stream Length incorrect.
               Output may be incorrect.
   **** Error: stream Length incorrect.
               Output may be incorrect.

   **** This file had errors that were repaired or ignored.
   **** Please notify the author of the software that produced this
   **** file that it does not conform to Adobe's published PDF
   **** specification.

   **** The rendered output from this file may be incorrect.
GS>

edit: using GhostScript with -dPDFDEBUG param: output-detail.txt

edit: ... or even better, try to fix the file errors or ignore them where possible
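
As a stopgap on the caller's side, both outcomes can be funneled into one error-first callback. The following is only a sketch: `loadPdfSafely` is a hypothetical helper, not part of pdf2json, and it assumes nothing beyond the `on(...)` and `loadPDF(...)` calls and the two event names used in the example code above. The `try`/`catch` only helps if the exception is thrown synchronously from `loadPDF`; the thread does not say whether that is the case for this XRef error.

```javascript
// Hypothetical wrapper, not part of pdf2json. `parser` is any object that
// offers on(event, handler) and loadPDF(path), e.g. `new PDFParser()`.
function loadPdfSafely(parser, path, callback) {
  let settled = false;
  const finish = (err, data) => {
    if (settled) return; // report only the first outcome
    settled = true;
    callback(err, data);
  };

  // Errors the parser reports through its own event.
  parser.on("pdfParser_dataError", errData =>
    finish((errData && errData.parserError) || errData));

  // Successful parse.
  parser.on("pdfParser_dataReady", pdfData => finish(null, pdfData));

  try {
    parser.loadPDF(path);
  } catch (err) {
    // Only reached if loadPDF throws synchronously.
    finish(err);
  }
}
```

With the real parser this would be called as `loadPdfSafely(new PDFParser(), "./problem_file_anon.pdf", (err, pdfData) => { ... })`. An exception thrown later from inside the parser's own asynchronous code would still escape this wrapper.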

Tsopic commented 3 years ago

Having same error, any fix/solution for this?

Tsopic commented 3 years ago

I validated my PDF with several online PDF validators, and indeed it does not conform to the PDF formatting rules. I added this check to my pre-processing validation:

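export const isPDF = (buf: Buffer) => {
  return (
    Buffer.isBuffer(buf) &&
    // The header must be at offset 0; indexOf still works if "%PDF-" occurs again later in the file.
    buf.indexOf("%PDF-") === 0 &&
    // The end-of-file marker must be present somewhere.
    buf.lastIndexOf("%%EOF") > -1
  );
};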
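
A slightly stricter variant of this pre-check, as a sketch only: `looksLikePdf` is a hypothetical helper, and the 1024-byte tail window is an assumption, not something pdf2json or this thread prescribes. It additionally requires a `startxref` keyword before the final `%%EOF`. It only rejects truncated or non-PDF input; a file whose `startxref` offset points at the wrong place would still pass.

```javascript
// Hypothetical helper, not part of pdf2json: cheap structural pre-check.
function looksLikePdf(buf) {
  if (!Buffer.isBuffer(buf)) return false;

  // Assumption: the trailer keywords sit within the last 1024 bytes.
  const tail = buf.subarray(Math.max(0, buf.length - 1024));
  const eof = tail.lastIndexOf("%%EOF");
  const startxref = tail.lastIndexOf("startxref");

  return (
    buf.indexOf("%PDF-") === 0 && // header at the very start
    eof > -1 &&                   // end-of-file marker present
    startxref > -1 &&             // pointer to the XRef section present
    startxref < eof               // and located before the final %%EOF
  );
}
```

A caller could read the file with `fs.readFileSync(path)` and skip `pdfParser.loadPDF(path)` when the check returns `false`.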
modesty commented 3 years ago

Fix pushed. Please test with `git pull && rm -rf node_modules/ && npm i && npm run test-misc`.

cyril23 commented 3 years ago

fix pushed

thanks a lot!

npm run test-misc

First I had to manually create the output directories via mkdir -p test/target/misc, but then the tests could be run:

$ npm run test-misc
# lots of output that I skipped here, see output.log below
Additional streams OK: 
 [
  'test/target/misc/i64_schedule_generator.content.txt',
  'test/target/misc/i64_schedule_generator.merged.json'
]

6 input files   4 success   2 fail  0 warning.
[
  '✓ Parse Success - i221_tianjin_invoice.pdf',
  '✗ Parse Exception: An error occurred while parsing the PDF: Error: Invalid XRef stream header - i243_problem_file_anon.pdf',
  '✓ Parse Success - i26_crash_18277.pdf',
  '✓ Parse Success - i28_line_break_210.pdf',
  '✗ Parse Exception: An error occurred while parsing the PDF: unsupported encryption algorithm - i43_encrypted.pdf',
  '✓ Parse Success - i64_schedule_generator.pdf'
]
pdf2json@1.2.5 [https://github.com/modesty/pdf2json]: 7.188s

Expected: 4 success, 2 failure
$ 

Full log: output.log

Tested on Ubuntu 16 LTS with Node 14

modesty commented 3 years ago

fixed in 1.2.5