Closed cyril23 closed 3 years ago
Seems like a corrupted XREF table and stream lengths. Can you fire the pdfParser_dataError
in that case, too, please?
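For context, the error handling this request is after could be sketched as below. This is only an illustration, assuming the parser object exposes an `on` method and emits `pdfParser_dataReady` and `pdfParser_dataError`, as pdf2json's documentation describes. The names `ParserLike`, `ParseOutcome` and `watchParser` are hypothetical, and the `parserError` field on the error payload is an assumption.

```typescript
// Minimal shape of an event-emitting parser; pdf2json's PDFParser is assumed to fit it.
interface ParserLike {
  on(event: string, listener: (payload: unknown) => void): unknown;
}

type ParseOutcome =
  | { ok: true; data: unknown }
  | { ok: false; reason: string };

// Hypothetical helper: funnels both parser events into a single callback,
// so a corrupted file ends up as a reportable failure instead of an unhandled exception.
function watchParser(
  parser: ParserLike,
  report: (outcome: ParseOutcome) => void
): void {
  parser.on("pdfParser_dataReady", (data) => report({ ok: true, data }));
  parser.on("pdfParser_dataError", (err) => {
    // Assumption: the payload may carry a `parserError` field; otherwise use the payload itself.
    const detail = (err as { parserError?: unknown } | null)?.parserError ?? err;
    report({ ok: false, reason: String(detail) });
  });
}
```

With pdf2json this would presumably be wired up as `watchParser(new PDFParser(), handler)` before calling `loadPDF`. It only helps if the library actually emits the error event for damaged XREF tables, which is what this issue asks for.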
edit: using qpdf on one of the problem files (original ones):
using GhostScript:
GPL Ghostscript 9.26 (2018-11-20)
Copyright (C) 2018 Artifex Software, Inc. All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
**** Error: An error occurred while reading an XREF table.
**** The file has been damaged. This may have been caused
**** by a problem while converting or transfering the file.
**** Ghostscript will attempt to recover the data.
**** However, the output may be incorrect.
Processing pages 1 through 3.
Page 1
**** Error: stream Length incorrect.
Output may be incorrect.
**** Error: stream Length incorrect.
Output may be incorrect.
**** Error: stream Length incorrect.
Output may be incorrect.
Page 2
**** Error: stream Length incorrect.
Output may be incorrect.
**** Error: stream Length incorrect.
Output may be incorrect.
**** Error: stream Length incorrect.
Output may be incorrect.
Page 3
**** Error: stream Length incorrect.
Output may be incorrect.
**** Error: stream Length incorrect.
Output may be incorrect.
**** Error: stream Length incorrect.
Output may be incorrect.
**** This file had errors that were repaired or ignored.
**** Please notify the author of the software that produced this
**** file that it does not conform to Adobe's published PDF
**** specification.
**** The rendered output from this file may be incorrect.
GS>
edit: using GhostScript with the -dPDFDEBUG param: output-detail.txt
edit: ... or even better, try to fix the file errors or ignore them where possible
Having same error, any fix/solution for this?
I validated my PDF against multiple online PDF validators, and indeed it does not conform to the format rules. I added a rule to my pre-processing validation:
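export const isPDF = (buf: Buffer): boolean => {
  return (
    Buffer.isBuffer(buf) &&
    // The header must be the first thing in the file, so search from the start.
    buf.indexOf("%PDF-") === 0 &&
    // The end-of-file marker can repeat after incremental updates; any occurrence passes.
    buf.lastIndexOf("%%EOF") > -1
  );
};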
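A slightly more talkative variant of the same pre-processing check might report why a buffer was rejected. This is a sketch only: `pdfEnvelopeProblems` is a hypothetical helper, not part of pdf2json, and the 1024-byte window for the end-of-file marker is an assumed tolerance. A check like this looks only at the header and trailer markers, so a file with a damaged XREF table or wrong stream lengths can still pass it.

```typescript
// Hypothetical helper (not part of pdf2json): lists the reasons a buffer
// fails a basic header/trailer check. An empty array means the check passed.
function pdfEnvelopeProblems(buf: Buffer): string[] {
  const problems: string[] = [];

  // The header marker is expected at the very start of the file.
  if (buf.indexOf("%PDF-") !== 0) {
    problems.push("missing %PDF- header at byte 0");
  }

  // The end-of-file marker is expected near the end of the file;
  // the 1024-byte window is an assumption for this sketch.
  const tail = buf.subarray(Math.max(0, buf.length - 1024));
  if (tail.lastIndexOf("%%EOF") === -1) {
    problems.push("missing %%EOF marker in the last 1024 bytes");
  }

  return problems;
}
```

Returning the list of problems instead of a boolean makes it easier to tell the user what is wrong with the uploaded file.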
fix pushed. test with `git pull && rm -rf node_modules/ && npm i && npm run test-misc` please.
fix pushed
thanks a lot!
npm run test-misc
First I had to manually create the output directories via mkdir -p test/target/misc, but then the tests could be run:
$ npm run test-misc
# lots of output that I skipped here, see output.log below
Additional streams OK:
[
'test/target/misc/i64_schedule_generator.content.txt',
'test/target/misc/i64_schedule_generator.merged.json'
]
6 input files 4 success 2 fail 0 warning.
[
'✓ Parse Success - i221_tianjin_invoice.pdf',
'✗ Parse Exception: An error occurred while parsing the PDF: Error: Invalid XRef stream header - i243_problem_file_anon.pdf',
'✓ Parse Success - i26_crash_18277.pdf',
'✓ Parse Success - i28_line_break_210.pdf',
'✗ Parse Exception: An error occurred while parsing the PDF: unsupported encryption algorithm - i43_encrypted.pdf',
'✓ Parse Success - i64_schedule_generator.pdf'
]
pdf2json@1.2.5 [https://github.com/modesty/pdf2json]: 7.188s
Expected: 4 success, 2 failure
$
Full log: output.log
Tested on Ubuntu 16 LTS with Node 14
fixed in 1.2.5
Information
"pdfParser_dataError"
when handling invalid PDFs, see below. This can be replicated using my "manually edited" PDF file, too, see below.
Expected behavior
"pdfParser_dataError"
event is fired, so I can handle the error and inform the user about the problematic PDF file.
Actual behavior
Info: similar error with my manually edited PDF, which I uploaded here:
Example Code