smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0

PNG Images with FlateDecode are corrupt #496

Open brannow opened 2 years ago

brannow commented 2 years ago

I am trying to extract all XObjects (images) from the test PDF, test.pdf.

Only the JPEG images that do not use FlateDecode are decoded correctly (raw JPEG data).

The other images are just 0x00... or 0xFF... byte garbage. I think the plain gzuncompress call is not enough and that the DecodeParms

/DecodeParms << /Predictor 15 /Colors 1 /Columns 1200 /BitsPerComponent 8>>

must be respected too.

I found this old piece of code https://github.com/frapi/frapi/blob/ef50192b6cf336ef2c4c0fc3ad122194e3d0ecde/src/frapi/library/Zend/Pdf/Filter/Compression.php

but without any success.
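
For reference, /Predictor 15 means each row of the inflated stream starts with a PNG filter-type byte that has to be undone after gzuncompress. A minimal sketch of that step follows; the function names undoPngPredictor and paethPredictor are hypothetical and not part of pdfparser or frapi, and the sketch assumes byte-aligned rows and ignores the TIFF predictor (/Predictor 2).

```php
<?php
// Paeth predictor as defined by the PNG specification.
function paethPredictor(int $a, int $b, int $c): int
{
    $p = $a + $b - $c;
    $pa = abs($p - $a);
    $pb = abs($p - $b);
    $pc = abs($p - $c);
    if ($pa <= $pb && $pa <= $pc) {
        return $a;
    }
    return $pb <= $pc ? $b : $c;
}

// Undo PNG row filters (PDF /Predictor 10..15) on already-inflated data.
// Hypothetical helper: the parameters mirror the /DecodeParms keys.
function undoPngPredictor(string $data, int $columns, int $colors = 1, int $bitsPerComponent = 8): string
{
    $bytesPerPixel = max(1, intdiv($colors * $bitsPerComponent + 7, 8));
    $rowLength = intdiv($columns * $colors * $bitsPerComponent + 7, 8);
    $previous = array_fill(0, $rowLength, 0);
    $output = '';

    // Each encoded row is one filter-type byte followed by $rowLength bytes.
    foreach (str_split($data, $rowLength + 1) as $encodedRow) {
        if (strlen($encodedRow) < $rowLength + 1) {
            break; // ignore a truncated trailing row
        }
        $filter = ord($encodedRow[0]);
        $row = array_values(unpack('C*', substr($encodedRow, 1)));

        for ($i = 0; $i < $rowLength; $i++) {
            $left = $i >= $bytesPerPixel ? $row[$i - $bytesPerPixel] : 0;
            $up = $previous[$i];
            $upLeft = $i >= $bytesPerPixel ? $previous[$i - $bytesPerPixel] : 0;
            switch ($filter) {
                case 0: $predicted = 0; break;                                      // None
                case 1: $predicted = $left; break;                                  // Sub
                case 2: $predicted = $up; break;                                    // Up
                case 3: $predicted = intdiv($left + $up, 2); break;                 // Average
                case 4: $predicted = paethPredictor($left, $up, $upLeft); break;    // Paeth
                default:
                    throw new UnexpectedValueException("Unknown PNG filter type $filter");
            }
            $row[$i] = ($row[$i] + $predicted) & 0xFF;
        }

        $output .= pack('C*', ...$row);
        $previous = $row;
    }

    return $output;
}
```

With the DecodeParms quoted above, the call would look like undoPngPredictor(gzuncompress($stream), 1200, 1, 8), where $stream is the raw content of the image object.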

ajira86 commented 1 year ago

Thanks for your analysis @k00ni, you were right (except for gzuncompress), and I found a way to decode the images using the frapi library source code. I'll try to write a PR when I have time. If you still need help, don't hesitate to ask.

ajira86 commented 1 year ago

the other images are just 0x00... or 0xFF... byte garbage

This was not garbage but image data in the Netpbm format (http://davis.lbl.gov/Manuals/NETPBM/doc/index.html)

r4zielrc commented 1 year ago

Thanks for your analysis @k00ni, you were right (except for gzuncompress), and I found a way to decode the images using the frapi library source code. I'll try to write a PR when I have time. If you still need help, don't hesitate to ask.

Can you share the code for decoding with the frapi library?

ajira86 commented 1 year ago

@r4zielrc in short:

  1. Take the frapi Compression.php file mentioned by @brannow
  2. In the class definition, remove the abstract keyword and the implements Zend_Pdf_Filter_Interface clause
  3. Replace every Zend_Pdf_Exception with the standard Exception
  4. Change _applyDecodeParams from protected to public
  5. Wherever a $params value is read in each method (lines 73, 97, 119, 142), append ->getContent() and cast the result to int
  6. Call Zend_Pdf_Filter_Compression::_applyDecodeParams with
    • $imageObject->getContent() as $data
    • $imageObject->getHeaders()['DecodeParms']->getElements() as $params

Do this only if DecodeParms exists; otherwise it may be a plain JPEG image, which does not need this transformation.

aheissenberger commented 1 year ago

@ajira86 I got Compression.php working and I convert the raw PPM/PGM data to a GdImage, but do you know how to detect PPM vs. PGM and pick the right format? For now I map the input to the same output formats that the Linux command line tool pdfimages from the xpdf package creates:

if ($bitsPerComponent === 8) {
    if ($colors === 3) {
        // 8-bit RGB: binary PPM
        $magic = 'P6';
        $extension = 'ppm';
    } elseif ($colors === 1) {
        // 8-bit grayscale: binary PGM
        $magic = 'P5';
        $extension = 'pgm';
    }
}

For P5, I think the relevant part could also be /DeviceGray:

<</Type /XObject
/Subtype /Image
/Width 200
/Height 200
/ColorSpace /DeviceGray
/BitsPerComponent 8
/Filter /FlateDecode
/DecodeParms <</Predictor 15 /Colors 1 /BitsPerComponent 8 /Columns 200>>
/Length 242>>

this is my P6 data:

<</Type /XObject
/Subtype /Image
/Width 200
/Height 200
/SMask 28 0 R
/ColorSpace /DeviceRGB
/BitsPerComponent 8
/Filter /FlateDecode
/DecodeParms <</Predictor 15 /Colors 3 /BitsPerComponent 8 /Columns 200>>
/Length 1090>>
stream

The P5 image is a mask (/SMask 28 0 R) for the P6 image, but masks are not supported by pdfimages either and are currently not relevant for my use case.

If the detection of the PPM format is confirmed I can provide a patch for the library. Currently I only have a hack which is fixing the output of the library.
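
One possible way to do the detection from the image dictionary itself rather than from DecodeParms is sketched below. The helper name netpbmFormat is hypothetical, it only covers the colour spaces seen in this thread, and the 1-bit case is an assumption based on the Netpbm formats (P4 is binary PBM); Indexed, CMYK or ICCBased images would need separate handling.

```php
<?php
// Hypothetical helper: choose a Netpbm magic number and file extension
// from /ColorSpace and /BitsPerComponent. Returns null when the
// combination is not one of the simple cases handled here.
function netpbmFormat(string $colorSpace, int $bitsPerComponent): ?array
{
    if ($colorSpace === 'DeviceRGB' && $bitsPerComponent === 8) {
        return ['magic' => 'P6', 'extension' => 'ppm']; // 8-bit RGB
    }
    if ($colorSpace === 'DeviceGray' && $bitsPerComponent === 8) {
        return ['magic' => 'P5', 'extension' => 'pgm']; // 8-bit grayscale
    }
    if ($colorSpace === 'DeviceGray' && $bitsPerComponent === 1) {
        return ['magic' => 'P4', 'extension' => 'pbm']; // 1-bit bitmap (assumption)
    }
    return null;
}
```

Keying on /ColorSpace instead of /Colors has the advantage that it still works when DecodeParms is absent.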

ajira86 commented 1 year ago

@aheissenberger in my case I was first using PPM (P6), which didn't need to be decoded. BitsPerComponent was present, but only in the general header. The DecodeParms object didn't exist in my case, so the Colors attribute was not present in the PDF data.

So it seems that DecodeParms is optional. If it is not present, you don't need any decoding operation and only have to prepend the missing header to the raw data.
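
Prepending the header could look roughly like this. It is a sketch, not library code: wrapAsNetpbm is a hypothetical name, $pixels is assumed to be the already inflated (and, where needed, predictor-decoded) image data, and the bit inversion for P4 assumes the default PDF /Decode array, where a 0 sample is black while PBM treats 1 as black.

```php
<?php
// Hypothetical helper: prepend a binary Netpbm header to raw pixel data.
function wrapAsNetpbm(string $magic, int $width, int $height, string $pixels): string
{
    if ($magic === 'P4') {
        // PBM has no max-value field; invert the bits because PBM uses 1 for
        // black, while 1-bit DeviceGray in PDF uses 0 for black (assumption:
        // no /Decode array reverses this).
        return sprintf("P4\n%d %d\n", $width, $height) . ~$pixels;
    }
    if ($magic === 'P5' || $magic === 'P6') {
        // 255 is the maximum sample value for 8 bits per component.
        return sprintf("%s\n%d %d\n255\n", $magic, $width, $height) . $pixels;
    }
    throw new InvalidArgumentException("Unsupported Netpbm magic number: $magic");
}
```

The result can be written to a .pbm, .pgm or .ppm file and opened with any Netpbm-aware viewer.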

My P6 data:

<</Type /XObject
/Subtype /Image
/Width 1090
/Height 1090
/ColorSpace /DeviceGray
/BitsPerComponent 8
/Filter /FlateDecode
/Length 1174>>

My P4 data:

<</Type /XObject
/Subtype /Image
/Width 1090
/Height 1090
/ColorSpace /DeviceGray
/BitsPerComponent 1
/Filter /FlateDecode
/DecodeParms <</Columns 1090 /Colors 1 /Predictor 15 /BitsPerComponent 1>>
/Length 2028>>

ajira86 commented 1 year ago

@aheissenberger do you need any help with your PR?

aheissenberger commented 1 year ago

@aheissenberger do you need any help with your PR?

@ajira86 I need to find the time ;-) and will ask for help if I have a problem - Thanks :-)