Add support for rebuilding the xref table for damaged PDF files

michaelrsweet / pdfio

PDFio is a simple C library for reading and writing PDF files.

https://www.msweet.org/pdfio

Apache License 2.0

198 stars 44 forks

Add support for rebuilding the xref table for damaged PDF files #45

Open kleuter opened 1 year ago

kleuter commented 1 year ago

The pdfiototext tool fails to parse the file: https://www.dropbox.com/scl/fi/1nhivpa3sbjejza8l53rz/NTFS.pdf?rlkey=zvphkczuy71b0vil8zvmrz95v&dl=0

System Information:

OS: Windows 10, Visual Studio 2019

michaelrsweet commented 1 year ago

So for this file the "startxref" value is wrong, as are all of the xref table offsets. More than likely the original file was edited on Windows with a plain text editor (Notepad or similar), which changed the line endings from LF only to CR LF. Each added CR shifts everything after it by one byte, so the recorded offsets no longer point at the objects.

Some PDF viewers will attempt to rebuild the xref table themselves for files like this, but I have not done so for PDFio because of the chance of errors and the likelihood that such corruption will also damage the binary streams in the file, leaving it unreadable anyway... I will keep this issue open for now but it will not be "fixed" any time soon...
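For illustration, the kind of recovery described above usually amounts to ignoring the stored offsets and scanning the file for "N G obj" markers. The following is a minimal sketch of that scan, not PDFio code: `found_obj_t` and `scan_objects` are hypothetical names, the whole file is assumed to be in memory, and only markers at the start of a line are recognized. It does nothing about the caveat raised in the comment: stream data that happens to look like an object header will be misreported, and damaged streams stay damaged.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

// Hypothetical record type for this sketch; not part of the PDFio API.
typedef struct
{
  unsigned long number;      // Object number
  unsigned      generation;  // Generation number
  size_t        offset;      // Byte offset of "N G obj" in the buffer
} found_obj_t;

// Scan a PDF held in memory for "N G obj" markers that begin a line and
// record where each one starts.  Returns the number of markers recorded,
// which is never more than max_objs.
size_t
scan_objects(const char *buf, size_t len, found_obj_t *objs, size_t max_objs)
{
  size_t count = 0;  // Number of objects recorded
  size_t pos   = 0;  // Start of the current line

  while (pos < len && count < max_objs)
  {
    size_t        p = pos;
    size_t        digits;
    unsigned long number = 0;
    unsigned      generation = 0;

    // Object number
    for (digits = 0; p < len && isdigit((unsigned char)buf[p]); p ++, digits ++)
      number = number * 10 + (unsigned long)(buf[p] - '0');

    if (digits > 0 && p < len && buf[p] == ' ')
    {
      // Generation number
      p ++;
      for (digits = 0; p < len && isdigit((unsigned char)buf[p]); p ++, digits ++)
        generation = generation * 10 + (unsigned)(buf[p] - '0');

      // " obj" keyword that is not the prefix of a longer word
      if (digits > 0 && p + 4 <= len && !memcmp(buf + p, " obj", 4) &&
          (p + 4 == len || !isalnum((unsigned char)buf[p + 4])))
      {
        objs[count].number     = number;
        objs[count].generation = generation;
        objs[count].offset     = pos;
        count ++;
      }
    }

    // Advance to the start of the next line (LF, CR, or CR LF endings)
    while (pos < len && buf[pos] != '\n' && buf[pos] != '\r')
      pos ++;
    while (pos < len && (buf[pos] == '\n' || buf[pos] == '\r'))
      pos ++;
  }

  return (count);
}
```

Because the offsets come from the bytes actually present, the scan is indifferent to whether line endings were rewritten; a real implementation would still need to locate the trailer dictionary and handle object streams, which this sketch leaves out.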

kleuter commented 1 year ago

Here's another PDF, newly generated, so it is unlikely to be damaged: https://www.dropbox.com/scl/fi/ecfzyrskea5nhl8phhdsb/eFFF_BE0445890588_202300011.pdf?rlkey=kwx7cb2msd06bonedzdslt6sj&dl=0

kleuter commented 1 year ago

Bad xref table header 'xref '.

michaelrsweet commented 1 year ago

That file isn't damaged in the same way. The issue is that there is trailing whitespace after the "xref" keyword, which the current parser won't allow: the PDF specs all say the xref table starts with a line consisting of a single "xref" keyword and don't talk about extra whitespace, etc.

So I will update the xref loading code to allow for this, but it won't fix the problem with the first file you linked to...
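As a rough illustration of the relaxed check described above, the sketch below accepts the "xref" keyword followed by any amount of trailing whitespace and rejects everything else. `is_xref_header` is a hypothetical helper written for this example; it is not the function PDFio uses, and the actual change in the library may look different.

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

// Hypothetical helper for this sketch; not part of the PDFio API.
// Returns true when `line` is the "xref" keyword optionally followed by
// whitespace (spaces, tabs, CR, LF), and false for anything else.
bool
is_xref_header(const char *line)
{
  // The line must begin with the keyword itself
  if (strncmp(line, "xref", 4) != 0)
    return (false);

  // Anything after the keyword must be whitespace
  for (line += 4; *line; line ++)
  {
    if (!isspace((unsigned char)*line))
      return (false);
  }

  return (true);
}
```

With this check, the header from the reported error message (`'xref '`) is accepted, while a line such as `xrefs` or `xref 0 5` is still rejected.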

michaelrsweet commented 1 year ago

[master b0a66ee] Fix reading of PDF files from Crystal Reports (Issue #45)

michaelrsweet commented 1 year ago

If you find other files with issues, please report them as separate issues; otherwise it is harder for me to track when a problem is actually fixed... Thanks!

kleuter commented 1 year ago

Will do, thanks a lot, Michael. Though the fix doesn't seem to work 😢

michaelrsweet commented 11 months ago

PDFBOX-2250-0.pdf