michaelrsweet / pdfio

PDFio is a simple C library for reading and writing PDF files.
https://www.msweet.org/pdfio
Apache License 2.0

Handle per-object file identifiers for encryption #42

Open kyakuno opened 9 months ago

kyakuno commented 9 months ago

Describe the bug: inflate fails with "Unable to decompress stream data: Data error." The return code from inflate is Z_DATA_ERROR.

To Reproduce Steps to reproduce the behavior:

  1. Download the pdf (https://www.axell.co.jp/business/pdf/AX51903_DS06P_hpdl202110xx.pdf)
  2. Run pdfiototext
./pdfiototext AX51903_DS06P_hpdl202110xx.pdf

Expected behavior: The text is extracted successfully.

System Information:

Additional context

st->predictor is 12 = _PDFIO_PREDICTOR_PNG_UP. The error seems to occur with PDFs that contain images.

stream st->filter 6 st->predictor 12
stream st->filter 6 st->predictor 1
stream st->filter 6 st->predictor 1
stream st->filter 6 st->predictor 1
stream st->filter 6 st->predictor 1
stream st->filter 6 st->predictor 1
stream st->filter 6 st->predictor 1
stream st->filter 6 st->predictor 12
AX51903_DS06P_hpdl202110xx.pdf: Unable to decompress stream data: Data error.
AX51903_DS06P_hpdl202110xx.pdf: Unable to find pages object.
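For readers unfamiliar with predictor 12: after inflate succeeds, each row of the decompressed data carries a one-byte PNG filter tag followed by the row bytes, and the "Up" filter (tag 2) adds the byte directly above. The following is a minimal sketch of that decoding step only, not PDFio's actual code; the function name `png_predictor_decode` and its signature are hypothetical, and only the None (0) and Up (2) tags are handled.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: undo PNG row filtering for a predictor-encoded buffer.
 * Each input row is 1 filter tag byte followed by "rowbytes" data bytes.
 * Only tag 0 (None) and tag 2 (Up) are supported in this sketch.
 * Returns the number of decoded bytes written to "out", or -1 on error.
 */
static long
png_predictor_decode(const unsigned char *in, size_t inlen,
                     unsigned char *out, size_t rowbytes)
{
  size_t stride = rowbytes + 1;   /* tag byte + row data */
  size_t nrows, r, i;

  assert(in != NULL && out != NULL);

  if (rowbytes == 0 || inlen % stride != 0)
    return (-1);

  nrows = inlen / stride;

  for (r = 0; r < nrows; r ++)
  {
    const unsigned char *src  = in + r * stride;
    unsigned char       *dst  = out + r * rowbytes;
    /* The row above the first row is treated as all zeros */
    const unsigned char *prev = r > 0 ? dst - rowbytes : NULL;
    unsigned char       tag   = src[0];

    if (tag != 0 && tag != 2)
      return (-1);                /* Sub/Average/Paeth not handled here */

    for (i = 0; i < rowbytes; i ++)
    {
      unsigned char above = (tag == 2 && prev) ? prev[i] : 0;

      /* Addition is modulo 256 */
      dst[i] = (unsigned char)(src[i + 1] + above);
    }
  }

  return ((long)(nrows * rowbytes));
}
```

Note that Z_DATA_ERROR comes from inflate itself, before any predictor is applied, so a sketch like this only describes what would happen to the data once decompression works.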

kyakuno commented 9 months ago

The issue occurred on both HEAD (87ca4db 2023/10/02 18:27) and v1.1.1.

michaelrsweet commented 9 months ago

OK, so this is an encrypted PDF generated by what looks like an old MacOS 9 version of Acrobat. The object that isn't loading is a secondary xref stream, which is odd because the primary stream loaded just fine...

Investigating...

kyakuno commented 9 months ago

Thank you very much for the investigation. I would be very happy if this file could be read.

michaelrsweet commented 8 months ago

It looks like there is a broken object reference. Need to do a little digging but I might need to allow for this and throw an error when you try to actually load the broken reference.
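One way to picture "allow for this and throw an error when you try to actually load the broken reference" is a lazily loaded object table. This is only a sketch under assumptions: the types `obj_entry_t` and `obj_state_t` and the functions `obj_register`, `obj_load`, and `parse_object_at` are hypothetical and are not part of the PDFio API. The stand-in parser simply treats any offset outside the file as broken.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { OBJ_UNLOADED, OBJ_LOADED, OBJ_BROKEN } obj_state_t;

typedef struct
{
  size_t      number;   /* Object number */
  long        offset;   /* File offset taken from the xref table */
  obj_state_t state;    /* Load state */
} obj_entry_t;

/* Hypothetical stand-in for a real parser: fails for offsets outside the file */
static bool
parse_object_at(long offset, long filesize)
{
  return (offset >= 0 && offset < filesize);
}

/* Record the reference while reading the xref table, without validating it */
static void
obj_register(obj_entry_t *e, size_t number, long offset)
{
  assert(e != NULL);

  e->number = number;
  e->offset = offset;
  e->state  = OBJ_UNLOADED;
}

/* Load on first use; a broken reference is only reported here */
static bool
obj_load(obj_entry_t *e, long filesize, const char **error)
{
  assert(e != NULL && error != NULL);

  *error = NULL;

  if (e->state == OBJ_UNLOADED)
    e->state = parse_object_at(e->offset, filesize) ? OBJ_LOADED : OBJ_BROKEN;

  if (e->state == OBJ_BROKEN)
  {
    *error = "Unable to load object: broken reference.";
    return (false);
  }

  return (true);
}
```

With this shape, opening the file succeeds even if some entries are bad, and a caller that never touches the broken object never sees an error.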

michaelrsweet commented 8 months ago

Looking back, the first error is the "unable to decompress" error, caused by a bogus xref stream in object 451.

michaelrsweet commented 8 months ago

and this object has a different file key than the rest of the file...

michaelrsweet commented 8 months ago

Deferring this to "future" since it will require a re-implementation of the crypto handler and I have never seen a PDF file containing two different file IDs.

michaelrsweet commented 7 months ago

The current code has an issue: it tries to decrypt the object dictionary while the dictionary is still being loaded. I need to split the code that decrypts string values from the code that loads the object dictionary.
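The split described above could look roughly like the following two-pass shape: loading stores raw bytes only, and a separate pass decrypts string values once the whole dictionary is available. This is a sketch, not PDFio's implementation; `dict_t`, `dict_add_number`, `dict_add_string`, `dict_decrypt_strings`, and the callback type are all hypothetical, and `toy_xor` is a placeholder cipher used purely for illustration (real PDF encryption uses RC4 or AES with a per-object key).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_PAIRS 8
#define MAX_STR   32

typedef enum { VAL_NUMBER, VAL_STRING } val_type_t;

typedef struct
{
  const char    *key;           /* Dictionary key */
  val_type_t    type;           /* Value type */
  double        number;         /* Number value */
  unsigned char str[MAX_STR];   /* String bytes (encrypted until pass 2) */
  size_t        len;            /* String length */
} dict_pair_t;

typedef struct
{
  dict_pair_t pairs[MAX_PAIRS];
  size_t      count;
} dict_t;

/* Hypothetical decryption callback: transforms "buf" in place */
typedef void (*decrypt_cb_t)(unsigned char *buf, size_t len, void *ctx);

/* Pass 1: loading only stores values; no decryption happens here */
static int
dict_add_number(dict_t *d, const char *key, double number)
{
  dict_pair_t *p;

  assert(d != NULL && key != NULL);

  if (d->count >= MAX_PAIRS)
    return (0);

  p = &d->pairs[d->count ++];
  p->key = key; p->type = VAL_NUMBER; p->number = number; p->len = 0;
  return (1);
}

static int
dict_add_string(dict_t *d, const char *key, const unsigned char *raw, size_t len)
{
  dict_pair_t *p;

  assert(d != NULL && key != NULL && raw != NULL);

  if (d->count >= MAX_PAIRS || len > MAX_STR)
    return (0);

  p = &d->pairs[d->count ++];
  p->key = key; p->type = VAL_STRING; p->number = 0.0; p->len = len;
  memcpy(p->str, raw, len);
  return (1);
}

/* Pass 2: decrypt string values after the dictionary is fully loaded */
static size_t
dict_decrypt_strings(dict_t *d, decrypt_cb_t cb, void *ctx)
{
  size_t i, n = 0;

  assert(d != NULL && cb != NULL);

  for (i = 0; i < d->count; i ++)
  {
    if (d->pairs[i].type == VAL_STRING)
    {
      cb(d->pairs[i].str, d->pairs[i].len, ctx);
      n ++;
    }
  }

  return (n);   /* Number of strings decrypted */
}

/* Placeholder cipher for illustration only: XOR with a single key byte */
static void
toy_xor(unsigned char *buf, size_t len, void *ctx)
{
  unsigned char k = *(unsigned char *)ctx;
  size_t        i;

  for (i = 0; i < len; i ++)
    buf[i] ^= k;
}
```

Because the second pass runs after loading, it can inspect the complete dictionary before choosing a key or deciding that the strings should not be decrypted at all.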

michaelrsweet commented 6 months ago

OK, so for this file it actually looks like the per-object ID is the same as the main file ID, but the object itself is actually damaged. Xpdf doesn't ever try to load it so maybe it is an object that doesn't need to be loaded to use the file? Will be looking at that tomorrow...