smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0

How to get images and text in order as in PDF? #705

Open salmanulfaris opened 2 months ago

salmanulfaris commented 2 months ago

Description:

I want to extract the contents of a PDF, then save the text to a database and the images to storage. The order matters: on page 1, for example, when I reach an image, I need to get the text that comes after it.

PDF input

A PDF containing some text followed by images on each page.

Expected output & actual output

I need to extract the images and text in the same order as they appear in the PDF. How can I do that?

Code

This is the code I'm using to extract the images, but the text is not available here:

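use Smalot\PdfParser\Parser;
use Smalot\PdfParser\XObject\Image;

$parser = new Parser();
$pdf = $parser->parseFile(public_path('paper.pdf'));
// getObjects() returns every PDF object, not only images, so filter by type.
foreach ($pdf->getObjects() as $object) {
    if ($object instanceof Image) {
        echo '<img src="data:image/jpeg;base64,' . base64_encode($object->getContent()) . '" />';
    }
}

k00ni commented 1 month ago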

Without further investigation I don't think that is possible.

azwhale commented 1 month ago

You can use something like the code below:

$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile('./test.pdf');
$objects = $pdf->getObjects();
$html = "<html><body>";

foreach ($objects as $key => $object) {
    if ($object instanceof \Smalot\PdfParser\XObject\Image) {
        $image = $object->getContent();
        $html .= "<img src='data:image/jpeg;base64," . base64_encode($image) . "' />";
    } else {
        $text = $object->getText();
        $html .= "<div>{$text}</div>";
    }
}
$html .= "</body></html>";
file_put_contents('./test_to_html.html', $html);

k00ni commented 1 month ago

Careful here. There are objects of other types as well, so your else-part is likely to run into an error. Also, Document::getObjects might not return an ordered list. You shouldn't rely on the fact that PDFParser added objects in the same order as they appear while parsing the PDF.

Instead, you could iterate over all pages ($pdf->getPages()) and see if you can get images and text from them (check Page::getText and Page::getXObjects). Might be worth a try.
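
A minimal sketch of that page-by-page idea, in case it helps. The helper name collectPageContent and the shape of the returned array are hypothetical (not part of PdfParser); it only relies on Document::getPages, Page::getText and Page::getXObjects. Note the limitation: this groups text and images per page, so page order is preserved, but it does not tell you where an image sits relative to the text within a page.

```php
<?php
use Smalot\PdfParser\XObject\Image;

/**
 * Hypothetical helper (not part of PdfParser): groups text and images per page.
 *
 * @param \Smalot\PdfParser\Document $pdf Result of Parser::parseFile().
 * @return array One entry per page, in page order.
 */
function collectPageContent($pdf): array
{
    $result = [];
    $pageNumber = 0;

    foreach ($pdf->getPages() as $page) {
        $pageNumber++;

        // Keep only XObjects that are images; forms and other types are skipped.
        $images = [];
        foreach ($page->getXObjects() as $name => $xObject) {
            if ($xObject instanceof Image) {
                $images[$name] = $xObject->getContent();
            }
        }

        $result[] = [
            'page'   => $pageNumber,      // 1-based page number
            'text'   => $page->getText(), // all text of the page as one string
            'images' => $images,          // raw image bytes keyed by XObject name
        ];
    }

    return $result;
}
```

Assuming the library is installed via Composer and autoloaded, calling collectPageContent((new \Smalot\PdfParser\Parser())->parseFile('./test.pdf')) should return one entry per page, which you could then store page by page.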

salmanulfaris commented 1 month ago

We can handle those errors, but the order of the objects is very important for me. I'm scraping a PDF that is the answer key of an exam, and I want to fetch the questions and answers from it and store them in the DB. Questions and options may be either text or images, so I need to identify each question and its answers from the sequence of objects.

I'm attaching a sample document here: Example Document.pdf