Open salmanulfaris opened 6 months ago
Without further investigation I don't think that is possible.
You can use it as below:
$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile('./test.pdf');
$objects = $pdf->getObjects();
$html = "<html><body>";
foreach ($objects as $key => $object) {
    if ($object instanceof Smalot\PdfParser\XObject\Image) {
        $image = $object->getContent();
        $html .= "<img src='data:image/jpeg;base64," . base64_encode($image) . "' />";
    } else {
        $text = $object->getText();
        $html .= "<div>{$text}</div>";
    }
}
$html .= "</body></html>";
file_put_contents('./test_to_html.html', $html);
Careful here. There are objects of other types as well, so your else part is likely to run into an error. Also, Document::getObjects might not return an ordered list. You shouldn't rely on PDFParser having added the objects in the same order as they appear while parsing the PDF.
Instead, you could iterate over all pages ($pdf->getPages()) and see if you can get images and text from them (check Page::getText and Page::getXObjects). Might be worth a try.
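The per-page approach suggested above could look roughly like the sketch below. It is untested against the attached document and assumes smalot/pdfparser is installed via Composer and autoloaded by the caller. The function name pagesToHtml, the file names in the usage comment, and the JPEG MIME type are illustrative assumptions. Note that this only preserves page order: within a page it emits all the text first and then the images, so it does not by itself recover how text and images interleave.

```php
<?php
// Sketch only: assumes smalot/pdfparser is autoloaded by the caller.

use Smalot\PdfParser\XObject\Image;

/**
 * Builds one HTML section per page: the page text first, then its images.
 * $pages is expected to be the array returned by Document::getPages().
 * pagesToHtml is a hypothetical helper name, not part of the library.
 */
function pagesToHtml(iterable $pages): string
{
    $html = '<html><body>';
    foreach ($pages as $number => $page) {
        $html .= '<section data-page="' . ((int) $number + 1) . '">';

        // Page::getText() returns the text of this page only.
        $html .= '<div>' . nl2br(htmlspecialchars($page->getText())) . '</div>';

        // Page::getXObjects() lists the XObjects referenced by the page;
        // keep only images and skip forms or other object types.
        foreach ($page->getXObjects() as $xObject) {
            if ($xObject instanceof Image) {
                // Assumption: the image stream is JPEG-encoded.
                $html .= '<img src="data:image/jpeg;base64,'
                    . base64_encode($xObject->getContent()) . '" />';
            }
        }

        $html .= '</section>';
    }

    return $html . '</body></html>';
}

// Usage (hypothetical file names):
// $pdf = (new \Smalot\PdfParser\Parser())->parseFile('./test.pdf');
// file_put_contents('./test_to_html.html', pagesToHtml($pdf->getPages()));
```

Filtering with instanceof avoids calling methods on object types that do not support them, and grouping by page gives a stable order that does not depend on how Document::getObjects happens to be populated.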
We can handle those errors, but the order of the objects is very important for me. I'm scraping a PDF that is the answer key of an exam: I want to fetch the questions and answers from the PDF and store them in a DB. Questions and options may be either text or images, so I need to identify each question and its answers from the sequence of objects.
Here I'm attaching a sample document: Example Document.pdf
Same question.
Description:
I want to extract the PDF, then save the text to a DB and the images to storage, but the order matters: on page 1, for example, when I get an image, I need to get the text that comes after it.
PDF input
A PDF containing some text, then images, on each page.
Expected output & actual output
I need to extract the images and text in the same order as they appear in the PDF. How can I do that?
Code
This is the code I'm using to extract the images, but the text is not available here.