Open aarjiontech opened 7 months ago
PdfParser should output UTF-8 encoded text by default, so I'm not sure what all your mb_converts and utf8_decodes after the `getText()` are doing.
What's the value of `$text` right after the `getText()`?
`$text` right after `getText()` has the same value; the output is unchanged.
We need to see at least some of the original output of `getText()`. It has the same value as what?
```
$pdf = $parser->parseFile($pdfFilePath);
$pages = $pdf->getPages();
$text = $pdf->getPages()[2]->getText();
print_r($text);
```
The printed output I got in the browser is below:
ZZZFDOFKRLFHFRP 5HJXODWRU\B6WDWXVBB %HQHILWV3ODQVQRWHGDV³3HQGLQJ$SSURYDO´KDYHEHHQILOHGZLWKWKH&DOLIRUQLDUHJXODWLQJVWDWHDJHQFLHVDQGDUHSHQGLQJ DSSURYDOZLWKWKRVHVWDWHDJHQFLHV PLATINUM TIER %HQHILW3ODQ+HDOWK3ODQ1HWZRUN1DPH 5HJXODWRU\6WDWXV +02$ $QWKHP%OXH&URVV6HOHFW+02 $33529(' +02& +HDOWK1HW:KROH&DUH 3(1',1*$33529$/ +02' +HDOWK1HW6DOXG+02\0DV 3(1',1*$33529$/ +02( +HDOWK1HW)XOO 3(1',1*$33529$/ +02) +HDOWK1HW :KROH&DUH 3(1',1*$33529$/ +02* +HDOWK1HW 6DOXG+02\0DV 3(1',1*$33529$/ +02+ +HDOWK1HW )XOO 3(1',1*$33529$/ +02, +HDOWK1HW 6PDUW&DUH 3(1',1*$33529$/ +02- +HDOWK1HW 6PDUW&DUH 3(1',1*$33529$/ +02$ .DLVHU3HUPDQHQWH)XOO $33529(' +02% .DLVHU3HUPDQHQWH)XOO $33529(' HMO C* Kaiser Permanente Full±6PDOO*URXS$33529(' * New Plan Regulatory Status Status as of December 5, 2023 ZZZFDOFKRLFHFRP 5HJXODWRU\B6WDWXVBB %HQHILWV3ODQVQRWHGDV³3HQGLQJ$SSURYDO´KDYHEHHQILOHGZLWKWKH&DOLIRUQLDUHJXODWLQJVWDWHDJHQFLHVDQGDUHSHQGLQJ DSSURYDOZLWKWKRVHVWDWHDJHQFLHV PLATINUM TIER %HQHILW3ODQ+HDOWK3ODQ1HWZRUN1DPH 5HJXODWRU\6WDWXV +02$ $QWKHP%OXH&URVV6HOHFW+02 $33529(' +02& +HDOWK1HW:KROH&DUH 3(1',1*$33529$/ +02' +HDOWK1HW6DOXG+02\0DV 3(1',1*$33529$/ +02( +HDOWK1HW)XOO 3(1',1*$33529$/ +02) +HDOWK1HW :KROH&DUH 3(1',1*$33529$/ +02* +HDOWK1HW 6DOXG+02\0DV 3(1',1*$33529$/ +02+ +HDOWK1HW )XOO 3(1',1*$33529$/ +02, +HDOWK1HW 6PDUW&DUH 3(1',1*$33529$/ +02- +HDOWK1HW 6PDUW&DUH 3(1',1*$33529$/ +02$ .DLVHU3HUPDQHQWH)XOO $33529(' +02% .DLVHU3HUPDQHQWH)XOO $33529(' HMO C* Kaiser Permanente Full APPROVED +02$ 6KDUS+HDOWK3ODQ3UHPLHU $33529(' +02% 6KDUS+HDOWK3ODQ3HUIRUPDQFH $33529(' +02& 6KDUS+HDOWK3ODQ3UHPLHU $33529(' +02$ 6XWWHU+HDOWK3OXV6XWWHU+HDOWK3OXV $33529(' +02% 6XWWHU+HDOWK3OXV6XWWHU+HDOWK3OXV $33529(' +02$ 8QLWHG+HDOWKFDUH6LJQDWXUH9DOXH 3(1',1*$33529$/ +02% 8QLWHG+HDOWKFDUH6LJQDWXUH9DOXH 3(1',1*$33529$/ +02& 8QLWHG+HDOWKFDUH$OOLDQFH 3(1',1*$33529$/ +02( 8QLWHG+HDOWKFDUH6LJQDWXUH9DOXH 3(1',1*$33529$/ +02* 8QLWHG+HDOWKFDUH$OOLDQFH 3(1',1*$33529$/ +02+ 8QLWHG+HDOWKFDUH+DUPRQ\ 3(1',1*$33529$/ +02, 
8QLWHG+HDOWKFDUH+DUPRQ\ 3(1',1*$33529$/ +02- 8QLWHG+HDOWKFDUH$OOLDQFH 3(1',1*$33529$/ +02. 8QLWHG+HDOWKFDUH+DUPRQ\ 3(1',1*$33529$/ +02/ 8QLWHG+HDOWKFDUH6LJQDWXUH9DOXH 3(1',1*$33529$/ +020 8QLWHG+HDOWKFDUH+DUPRQ\ 3(1',1*$33529$/ +021 8QLWHG+HDOWKFDUH$OOLDQFH 3(1',1*$33529$/ +02$ :HVWHUQ+HDOWK$GYDQWDJH)XOO $33529(' +02% :HVWHUQ+HDOWK$GYDQWDJH)XOO $33529(' +02& :HVWHUQ+HDOWK$GYDQWDJH)XOO $33529(' (32& &LJQD2VFDU/RFDO3OXV 3(1',1*$33529$/ (32( &LJQD2VFDU/RFDO3OXV 3(1',1*$33529$/ (32) &LJQD2VFDU2SHQ$FFHVV3OXV 3(1',1*$33529$/ (32* &LJQD2VFDU2SHQ$FFHVV3OXV 3(1',1*$33529$/ 332$ $QWKHP%OXH&URVV3UXGHQW%X\HU±6PDOO*URXS$33529(' * New Plan Regulatory Status Status as of December 5, 2023
Hope this helps.. thanks
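One observation on the sample above: most of the unreadable runs look like ordinary text with every character code lowered by 29 (for example `Z` is 90 and `w` is 119). The sketch below only illustrates that pattern on short excerpts copied from the output. `shiftAscii` is a hypothetical helper, not part of PdfParser, and the offset of 29 is inferred from this one sample. The likely cause (a font without a usable character mapping) is an assumption. Parts of the output that are already readable, such as `PLATINUM TIER`, must not be shifted.

```php
<?php
// Hypothetical diagnostic helper (not a PdfParser API): shift each
// printable ASCII byte up by a fixed offset and leave everything else alone.
function shiftAscii(string $garbled, int $offset = 29): string
{
    $out = '';
    foreach (str_split($garbled) as $ch) {
        $code = ord($ch);
        // Only shift visible ASCII that stays inside the printable range;
        // spaces and non-ASCII bytes are copied unchanged.
        if ($code > 0x20 && $code < 0x7F && $code + $offset < 0x7F) {
            $out .= chr($code + $offset);
        } else {
            $out .= $ch;
        }
    }
    return $out;
}

// Excerpts taken verbatim from the getText() output above.
echo shiftAscii('ZZZFDOFKRLFHFRP'), "\n";    // wwwcalchoicecom
echo shiftAscii('5HJXODWRU\B6WDWXV'), "\n";  // Regulatory_Status
echo shiftAscii('3(1',1*$33529$/'), "\n";   // PENDINGAPPROVAL
```

If the excerpts decode this way, the encoding conversions applied after `getText()` cannot help, because the bytes are valid text that maps to the wrong characters.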
test1.pdf: for this PDF, `getText()` returns an empty string. Please help me. If possible, I would like to get this in a table format so I can generate a CSV/Excel file.
> test1.pdf for the above pdf.. getText() has return empty
There is no readable / editable text in the document; it is a scanned image. OCR would be required to extract the text.
Is there any way to distinguish pages with readable text from scanned images, and run OCR extraction only on the latter? Does the smalot library have any option for OCR extraction? If not, please recommend an OCR library for that.
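One simple way to separate the two cases, sketched under assumptions: treat any page whose extracted text is empty (or only whitespace) as a candidate for OCR. `pagesNeedingOcr` and the `$minChars` threshold are hypothetical names introduced for illustration, and the sample strings stand in for real page text. The OCR step itself is out of scope here and would need a separate tool.

```php
<?php
// Hypothetical helper: given the text extracted from each page, return the
// indexes of pages that produced no usable text and may be scanned images.
function pagesNeedingOcr(array $pageTexts, int $minChars = 1): array
{
    $needOcr = [];
    foreach ($pageTexts as $index => $text) {
        // Strip whitespace so a page of blank lines still counts as empty.
        $visible = preg_replace('/\s+/', '', (string) $text);
        if (strlen($visible) < $minChars) {
            $needOcr[] = $index;
        }
    }
    return $needOcr;
}

// Stand-in data: page 0 has text, pages 1 and 2 are effectively empty.
print_r(pagesNeedingOcr(['Regulatory Status', " \n", '']));
```

With PdfParser, the input array could be built by calling `getText()` on each element of `$pdf->getPages()`. A page that returns garbled but non-empty text would not be flagged by this check.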
Description:
PDF input
Cannot provide the PDF since it's confidential.
Expected output & actual output
Need to extract a table from it.
Code
$parser = new \Smalot\PdfParser\Parser();
// Source PDF file to extract text
$file = "tables 2024.pdf";
// Parse pdf file using Parser library
//$pdf = $parser->parseFile($file);
$pdf = $parser->parseContent(file_get_contents($file));
// Extract text from PDF
//$text = $pdf->getText();
$text = $pdf->getPages()[2]->getText();
// Add line break
$pdfText = nl2br($text);
/*
$ascii_decoded = mb_convert_encoding($pdfText, 'UTF-8', 'ASCII');
$ansi_decoded = mb_convert_encoding($ascii_decoded, 'UTF-8', 'ISO-8859-1');
$decode1252 = mb_convert_encoding($ansi_decoded, 'UTF-8', 'Windows-1252');
$utf8_decode = utf8_decode($decode1252);
*/
$encodings = ['UTF-8', 'ISO-8859-1', 'Windows-1252', 'Windows-1251', 'ISO-8859-15'];
$decodedText = $pdfText;
foreach ($encodings as $encoding) {
    $decodedText = mb_convert_encoding($decodedText, 'UTF-8', $encoding);
    if ($decodedText) {
        // If decoding is successful, break the loop
        //break;
    }
}
$utf8_decode = utf8_decode($decodedText);
print_r($utf8_decode);
The output is not what I expected:
ZZZFDOFKRLFHFRP 5HJXODWRU\B6WDWXVBB
%HQHILWV3ODQVQRWHGDV³3HQGLQJ$SSURYDO´KDYHEHHQILOHGZLWKWKH&DOLIRUQLDUHJXODWLQJVWDWHDJHQFLHVDQGDUHSHQGLQJ DSSURYDOZLWKWKRVHVWDWHDJHQFLHV PLATINUM TIER %HQHILW3ODQ+HDOWK3ODQ1HWZRUN1DPH 5HJXODWRU\6WDWXV