smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0

How can I parse images in a PDF? #65

Open deepakkumar365 opened 9 years ago

deepakkumar365 commented 9 years ago

Hi, I need some help extracting images from a PDF file. Text parsing works great, and hats off to you for the job. Can you help me extract the images as well? Thanks in advance.

aik099 commented 9 years ago

You can use TCPDF for that. PdfParser uses TCPDF but only extracts the text nodes.

deepakkumar365 commented 9 years ago

Thanks, aik099. That's right, PdfParser only returns the text. Can you give an example of how to use TCPDF to read images?

aik099 commented 9 years ago

I have no idea. The TCPDF documentation is probably the best place to look.

andreiciobotar commented 8 years ago

A bit late to the party, but for future reference:

$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile('/your/pdf/file');
$pdf->getObjectsByType('XObject', 'Image');
foreach($images as $image) {
    /** @var \Smalot\PdfParser\Object $image */
    $content = $image->getContent();
}

chikaldirick commented 8 years ago

As I understand it, we should at least save the result of getObjectsByType(), because it returns an array of objects. That is probably why the variable $images was undefined in the foreach loop. Would it be correct to write $images = $pdf->getObjectsByType('XObject', 'Image'); instead? I assumed that on each iteration $image would then hold one image from the PDF file, which we could output somehow. I don't know exactly how, though, because $content contains a lot of bytes that cannot be rendered as an image with imagecreatefromstring() or a similar function.

cangcool commented 8 years ago

@chikaldirick @smalot @andreiciobotar Have you found a way to decode the $content bytes so that the image can be displayed? I've tried many decoders, filters, and functions, but all of them failed. Thanks.

jetonr commented 8 years ago

This is how I did it:

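$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile('/your/pdf/file');

// Collect every image XObject found in the document.
$images = $pdf->getObjectsByType('XObject', 'Image');

foreach ($images as $image) {
    // Embed the raw image bytes as a base64 data URI (image/jpeg is the registered MIME type for JPEG).
    echo '<img src="data:image/jpeg;base64,' . base64_encode($image->getContent()) . '" />';
}

cangcool commented 8 years ago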

@jetonr Thank you, it works. However, some images fail to display; I suspect $image->getContent() returns incorrect data for them. (Screenshot from 2016-08-10 05:58:59 attached.)
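
One plausible explanation for the images that fail, which the thread does not confirm: a PDF image XObject holds real JPEG data only when it is stored with the DCTDecode filter. Images stored with other filters such as FlateDecode hold raw pixel samples, which a browser cannot render from a JPEG data URI. The sketch below checks the leading bytes before building the data URI, so unsupported images can be skipped or reported instead of producing a broken img tag. The function names sniffImageMime and imageDataUri are hypothetical helpers, not part of PdfParser.

```php
<?php
declare(strict_types=1);

/**
 * Guess the MIME type of an encoded image from its leading bytes.
 * Returns null when no known signature matches, which is what raw
 * pixel samples would normally produce.
 */
function sniffImageMime(string $bytes): ?string
{
    // Signature => MIME type; each key is the fixed prefix of that format.
    $signatures = [
        "\xFF\xD8\xFF"      => 'image/jpeg',
        "\x89PNG\r\n\x1A\n" => 'image/png',
        'GIF87a'            => 'image/gif',
        'GIF89a'            => 'image/gif',
    ];

    foreach ($signatures as $magic => $mime) {
        if (strncmp($bytes, $magic, strlen($magic)) === 0) {
            return $mime;
        }
    }

    return null;
}

/**
 * Build a base64 data URI for the image, or return null when the
 * bytes are not in a format recognised by sniffImageMime().
 */
function imageDataUri(string $bytes): ?string
{
    $mime = sniffImageMime($bytes);
    if ($mime === null) {
        return null;
    }

    return 'data:' . $mime . ';base64,' . base64_encode($bytes);
}

// Assumed usage inside the loop from the thread:
// $uri = imageDataUri($image->getContent());
// if ($uri !== null) { echo '<img src="' . $uri . '" />'; }
```

Images for which imageDataUri() returns null would need a different path, for example rebuilding a bitmap from the raw samples using the width, height, and colour space recorded in the image dictionary.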