maxpmaxp / pdfreader

Python API for PDF documents
MIT License

DCT Decoding Error #101

Closed: admercs closed this issue 7 months ago

admercs commented 2 years ago

I'm getting the following error:

$ document.pdf
ERROR:root:Partially decoded. Filters applied: []
Traceback (most recent call last):
  File "/HOME/quicksand/lib/python3.6/site-packages/pdfreader/types/native.py", line 55, in apply_filter_multi
    binary = apply_filter(fname, binary, params)
  File "/HOME/quicksand/lib/python3.6/site-packages/pdfreader/filters/__init__.py", line 14, in apply_filter
    return decoder.decode(binary, params or {})
  File "/HOME/quicksand/lib/python3.6/site-packages/pdfreader/filters/dct.py", line 5, in decode
    raise NotImplementedError('DCTDecode')
NotImplementedError: DCTDecode

Any idea how to resolve it?

maxpmaxp commented 2 years ago

The DCT decoder is not supported at this point. Feel free to contribute.

maxpmaxp commented 2 years ago

@admercs can you share the file please? I can try to add the decoder.

admercs commented 2 years ago

I cannot, sorry.


canbolukbas commented 7 months ago

What would you think about not raising an error, but returning the same bytes back to the caller?

The pypdf library does this: https://github.com/py-pdf/pypdf/blob/e92b20e0b35e4feb5a2a7f347de7a4c3f713011a/pypdf/filters.py#L510

LMK if you want me to create the MR, I'd be happy to contribute.
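A minimal sketch of the pass-through idea, for illustration only. It assumes the `decode(binary, params)` call shape visible in the traceback above; the function body is hypothetical and is not taken from pdfreader or pypdf.

```python
def decode(binary, params=None):
    """Hypothetical pass-through for DCTDecode.

    DCT-encoded stream data is JPEG data, so instead of raising
    NotImplementedError the bytes are handed back unchanged.
    `params` is accepted only to match the call shape in the traceback.
    """
    return binary
```

The caller then receives JPEG bytes it can write to a file or pass to an image library, rather than an error.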

maxpmaxp commented 7 months ago

@canbolukbas

Raw stream data can be accessed directly for any Stream object: use obj.stream instead of obj.filtered. See https://github.com/maxpmaxp/pdfreader/blob/fb8189a879ada76b970ee2c409fe169e7d79a92b/pdfreader/types/native.py#L88-L89

This should work for any Image object, as technically it's a descendant of Stream.

As for the suggestion to return raw data with unimplemented filters - I see pros and cons. Ideally we need to have this decoder implemented. Feel free to create a PR and contribute.
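A rough sketch of the obj.stream workaround, assuming that `obj.stream` holds the undecoded bytes as described above. The helper name `save_raw_jpeg` is made up for this example, and the check relies on the fact that JPEG data starts with the bytes FF D8.

```python
from pathlib import Path

JPEG_SOI = b"\xff\xd8"  # JPEG "Start Of Image" marker


def save_raw_jpeg(obj, path):
    """Write obj.stream to `path` if it looks like JPEG data.

    `obj` is any object with a `.stream` attribute holding raw bytes
    (an assumption about the Stream/Image objects discussed above).
    Returns True if the file was written, False otherwise.
    """
    raw = obj.stream
    if not raw.startswith(JPEG_SOI):
        return False  # not JPEG data, leave it to the caller
    Path(path).write_bytes(raw)
    return True
```

Since a DCT stream is already a JPEG, the written file should open in an ordinary image viewer without any further decoding.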

maxpmaxp commented 7 months ago

@canbolukbas can you also attach your file please? I don't have PDFs with DCT streams. Thanks!

maxpmaxp commented 7 months ago

Just realized that it's a very trivial patch. It's on master. Support was added in #132.

mara004 commented 7 months ago

> can you also attach your file please? I don't have PDFs with DCT streams.

For what it's worth, DCT corresponds to JPEG, so it should be trivial to create a sample. Just run img2pdf on an arbitrary JPEG image from the web, or drag one into LibreOffice and export to PDF. If you have some PDFs on your disk, it's quite likely that one of them contains DCT, as it's basically the most common PDF image encoding.
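To find such a file among PDFs already on disk, something like the following could work. It is only a heuristic sketch: it scans the raw file bytes for the `/DCTDecode` filter name, and the function name `pdfs_with_dct` is invented for this example.

```python
from pathlib import Path


def pdfs_with_dct(root):
    """Yield paths of .pdf files under `root` that mention /DCTDecode.

    Heuristic only: it looks for the filter name in the raw bytes and
    does not parse the PDF structure.
    """
    for path in sorted(Path(root).rglob("*.pdf")):
        try:
            data = path.read_bytes()
        except OSError:
            continue  # unreadable file, skip it
        if b"/DCTDecode" in data:
            yield path
```

Calling `list(pdfs_with_dct("some/folder"))` returns the candidate files, any of which could serve as a test sample.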