smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0

Parsing with unknown text. Help me resolve #697

Open aarjiontech opened 3 months ago

aarjiontech commented 3 months ago

Description:

PDF input

I cannot provide the PDF because it is confidential.

Expected output & actual output

I need to extract a table from the PDF.

Code

$parser = new \Smalot\PdfParser\Parser();

// Source PDF file to extract text
$file = "tables 2024.pdf";

// Parse pdf file using Parser library
//$pdf = $parser->parseFile($file);
$pdf = $parser->parseContent(file_get_contents($file));

// Extract text from PDF
//$text = $pdf->getText();
$text = $pdf->getPages()[2]->getText();
// Add line break
$pdfText = nl2br($text);

/*
$ascii_decoded = mb_convert_encoding($pdfText, 'UTF-8', 'ASCII');
$ansi_decoded = mb_convert_encoding($ascii_decoded, 'UTF-8', 'ISO-8859-1');
$decode1252 = mb_convert_encoding($ansi_decoded, 'UTF-8', 'Windows-1252');
$utf8_decode = utf8_decode($decode1252);
*/

$encodings = ['UTF-8', 'ISO-8859-1', 'Windows-1252', 'Windows-1251', 'ISO-8859-15'];
$decodedText = $pdfText;
foreach ($encodings as $encoding) {
    $decodedText = mb_convert_encoding($decodedText, 'UTF-8', $encoding);
    if ($decodedText) {
        // If decoding is successful, break the loop
        //break;
    }
}
$utf8_decode = utf8_decode($decodedText);

print_r($utf8_decode);

The output is not readable:

 ZZZFDOFKRLFHFRP  5HJXODWRU\B6WDWXVBB

%HQHILWV3ODQVQRWHGDV³3HQGLQJ$SSURYDO´KDYHEHHQILOHGZLWKWKH&DOLIRUQLDUHJXODWLQJVWDWHDJHQFLHVDQGDUHSHQGLQJ DSSURYDOZLWKWKRVHVWDWHDJHQFLHV  PLATINUM TIER %HQHILW3ODQ+HDOWK3ODQ1HWZRUN1DPH 5HJXODWRU\6WDWXV

GreyWyvern commented 3 months ago

PdfParser should output UTF-8 encoded text by default, so I'm not sure what all your mb_converts and utf8_decodes after the getText() are doing.

What's the value of $text right after the getText() ?
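
To answer this question without any conversion getting in the way, the raw string can be inspected directly. The following is a minimal sketch, not part of PdfParser: `inspectText` is a hypothetical helper name, and it assumes the `mbstring` extension is loaded (the code in the issue already relies on it).

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: summarise a string exactly as it was returned,
 * with no mb_convert_encoding() / utf8_decode() applied to it.
 */
function inspectText(string $text, int $previewBytes = 16): array
{
    return [
        // Length in bytes, not characters.
        'bytes' => strlen($text),
        // True when the bytes form a valid UTF-8 sequence.
        'valid_utf8' => mb_check_encoding($text, 'UTF-8'),
        // Hex dump of the first bytes, safe to paste into an issue.
        'hex_preview' => bin2hex(substr($text, 0, $previewBytes)),
    ];
}

print_r(inspectText('Regulatory Status'));
```

In the code from this issue, the argument would be `$pdf->getPages()[2]->getText()`. If `valid_utf8` is already true for garbled text, further encoding conversions are unlikely to help, because the bytes are well-formed and the problem lies earlier in the extraction.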

aarjiontech commented 2 months ago

> PdfParser should output UTF-8 encoded text by default, so I'm not sure what all your mb_converts and utf8_decodes after the getText() are doing.
>
> What's the value of $text right after the getText() ?

The value of $text right after the getText() is the same; the output has not changed.

k00ni commented 2 months ago

We need to see at least some of the original output of getText(). It has the same value of what?

aarjiontech commented 2 months ago

> We need to see at least some of the original output of getText(). It has the same value of what?

```
$pdf = $parser->parseFile($pdfFilePath);
$pages = $pdf->getPages();
$text = $pdf->getPages()[2]->getText();
print_r($text);
```

The output printed in the browser is as follows:
 ZZZFDOFKRLFHFRP  5HJXODWRU\B6WDWXVBB %HQHILWV3ODQVQRWHGDV³3HQGLQJ$SSURYDO´KDYHEHHQILOHGZLWKWKH&DOLIRUQLDUHJXODWLQJVWDWHDJHQFLHVDQGDUHSHQGLQJ DSSURYDOZLWKWKRVHVWDWHDJHQFLHV  PLATINUM TIER %HQHILW3ODQ+HDOWK3ODQ1HWZRUN1DPH 5HJXODWRU\6WDWXV +02$ $QWKHP%OXH&URVV6HOHFW+02 $33529(' +02& +HDOWK1HW:KROH&DUH 3(1',1*$33529$/ +02' +HDOWK1HW6DOXG+02\0DV 3(1',1*$33529$/ +02( +HDOWK1HW)XOO 3(1',1*$33529$/ +02) +HDOWK1HW :KROH&DUH 3(1',1*$33529$/ +02* +HDOWK1HW 6DOXG+02\0DV 3(1',1*$33529$/ +02+ +HDOWK1HW )XOO 3(1',1*$33529$/ +02, +HDOWK1HW 6PDUW&DUH 3(1',1*$33529$/ +02- +HDOWK1HW 6PDUW&DUH 3(1',1*$33529$/ +02$ .DLVHU3HUPDQHQWH)XOO $33529(' +02% .DLVHU3HUPDQHQWH)XOO $33529(' HMO C* Kaiser Permanente Full APPROVED +02$ 6KDUS+HDOWK3ODQ3UHPLHU $33529(' +02% 6KDUS+HDOWK3ODQ3HUIRUPDQFH $33529(' +02& 6KDUS+HDOWK3ODQ3UHPLHU $33529(' +02$ 6XWWHU+HDOWK3OXV6XWWHU+HDOWK3OXV $33529(' +02% 6XWWHU+HDOWK3OXV6XWWHU+HDOWK3OXV $33529(' +02$ 8QLWHG+HDOWKFDUH6LJQDWXUH9DOXH 3(1',1*$33529$/ +02% 8QLWHG+HDOWKFDUH6LJQDWXUH9DOXH 3(1',1*$33529$/ +02& 8QLWHG+HDOWKFDUH$OOLDQFH 3(1',1*$33529$/ +02( 8QLWHG+HDOWKFDUH6LJQDWXUH9DOXH 3(1',1*$33529$/ +02* 8QLWHG+HDOWKFDUH$OOLDQFH 3(1',1*$33529$/ +02+ 8QLWHG+HDOWKFDUH+DUPRQ\ 3(1',1*$33529$/ +02, 8QLWHG+HDOWKFDUH+DUPRQ\ 3(1',1*$33529$/ +02- 8QLWHG+HDOWKFDUH$OOLDQFH 3(1',1*$33529$/ +02. 
8QLWHG+HDOWKFDUH+DUPRQ\ 3(1',1*$33529$/ +02/ 8QLWHG+HDOWKFDUH6LJQDWXUH9DOXH 3(1',1*$33529$/ +020 8QLWHG+HDOWKFDUH+DUPRQ\ 3(1',1*$33529$/ +021 8QLWHG+HDOWKFDUH$OOLDQFH 3(1',1*$33529$/ +02$ :HVWHUQ+HDOWK$GYDQWDJH)XOO $33529(' +02% :HVWHUQ+HDOWK$GYDQWDJH)XOO $33529(' +02& :HVWHUQ+HDOWK$GYDQWDJH)XOO $33529(' (32& &LJQD2VFDU/RFDO3OXV 3(1',1*$33529$/ (32( &LJQD2VFDU/RFDO3OXV 3(1',1*$33529$/ (32) &LJQD2VFDU2SHQ$FFHVV3OXV 3(1',1*$33529$/ (32* &LJQD2VFDU2SHQ$FFHVV3OXV 3(1',1*$33529$/ 332$ $QWKHP%OXH&URVV3UXGHQW%X\HU±6PDOO*URXS$33529(' * New Plan  Regulatory Status Status as of December 5, 2023  ZZZFDOFKRLFHFRP  5HJXODWRU\B6WDWXVBB %HQHILWV3ODQVQRWHGDV³3HQGLQJ$SSURYDO´KDYHEHHQILOHGZLWKWKH&DOLIRUQLDUHJXODWLQJVWDWHDJHQFLHVDQGDUHSHQGLQJ DSSURYDOZLWKWKRVHVWDWHDJHQFLHV  PLATINUM TIER %HQHILW3ODQ+HDOWK3ODQ1HWZRUN1DPH 5HJXODWRU\6WDWXV +02$ $QWKHP%OXH&URVV6HOHFW+02 $33529(' +02& +HDOWK1HW:KROH&DUH 3(1',1*$33529$/ +02' +HDOWK1HW6DOXG+02\0DV 3(1',1*$33529$/ +02( +HDOWK1HW)XOO 3(1',1*$33529$/ +02) +HDOWK1HW :KROH&DUH 3(1',1*$33529$/ +02* +HDOWK1HW 6DOXG+02\0DV 3(1',1*$33529$/ +02+ +HDOWK1HW )XOO 3(1',1*$33529$/ +02, +HDOWK1HW 6PDUW&DUH 3(1',1*$33529$/ +02- +HDOWK1HW 6PDUW&DUH 3(1',1*$33529$/ +02$ .DLVHU3HUPDQHQWH)XOO $33529(' +02% .DLVHU3HUPDQHQWH)XOO $33529(' HMO C* Kaiser Permanente Full APPROVED +02$ 6KDUS+HDOWK3ODQ3UHPLHU $33529(' +02% 6KDUS+HDOWK3ODQ3HUIRUPDQFH $33529(' +02& 6KDUS+HDOWK3ODQ3UHPLHU $33529(' +02$ 6XWWHU+HDOWK3OXV6XWWHU+HDOWK3OXV $33529(' +02% 6XWWHU+HDOWK3OXV6XWWHU+HDOWK3OXV $33529(' +02$ 8QLWHG+HDOWKFDUH6LJQDWXUH9DOXH 3(1',1*$33529$/ +02% 8QLWHG+HDOWKFDUH6LJQDWXUH9DOXH 3(1',1*$33529$/ +02& 8QLWHG+HDOWKFDUH$OOLDQFH 3(1',1*$33529$/ +02( 8QLWHG+HDOWKFDUH6LJQDWXUH9DOXH 3(1',1*$33529$/ +02* 8QLWHG+HDOWKFDUH$OOLDQFH 3(1',1*$33529$/ +02+ 8QLWHG+HDOWKFDUH+DUPRQ\ 3(1',1*$33529$/ +02, 8QLWHG+HDOWKFDUH+DUPRQ\ 3(1',1*$33529$/ +02- 8QLWHG+HDOWKFDUH$OOLDQFH 3(1',1*$33529$/ +02. 
8QLWHG+HDOWKFDUH+DUPRQ\ 3(1',1*$33529$/ +02/ 8QLWHG+HDOWKFDUH6LJQDWXUH9DOXH 3(1',1*$33529$/ +020 8QLWHG+HDOWKFDUH+DUPRQ\ 3(1',1*$33529$/ +021 8QLWHG+HDOWKFDUH$OOLDQFH 3(1',1*$33529$/ +02$ :HVWHUQ+HDOWK$GYDQWDJH)XOO $33529(' +02% :HVWHUQ+HDOWK$GYDQWDJH)XOO $33529(' +02& :HVWHUQ+HDOWK$GYDQWDJH)XOO $33529(' (32& &LJQD2VFDU/RFDO3OXV 3(1',1*$33529$/ (32( &LJQD2VFDU/RFDO3OXV 3(1',1*$33529$/ (32) &LJQD2VFDU2SHQ$FFHVV3OXV 3(1',1*$33529$/ (32* &LJQD2VFDU2SHQ$FFHVV3OXV 3(1',1*$33529$/ 332$ $QWKHP%OXH&URVV3UXGHQW%X\HU±6PDOO*URXS$33529(' * New Plan  Regulatory Status Status as of December 5, 2023

Hope this helps. Thanks.

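
One observation on the sample above, offered as a diagnostic sketch rather than a fix: several of the garbled tokens become readable when 29 is added to each byte value (for example `+02` becomes `HMO`). The offset of 29 is inferred only from the strings pasted in this thread, and `shiftAscii` is a hypothetical helper, not a PdfParser function. Such a constant offset is often a sign that a font's character codes were emitted without being mapped to Unicode, but whether that is the cause here is an assumption.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical diagnostic helper: add a fixed offset to every byte from
 * 0x21 upwards whose shifted value is still printable ASCII (<= 0x7E).
 * Spaces, control characters and multi-byte UTF-8 bytes are left untouched.
 */
function shiftAscii(string $garbled, int $offset = 29): string
{
    $out = '';
    $length = strlen($garbled);
    for ($i = 0; $i < $length; $i++) {
        $byte = ord($garbled[$i]);
        if ($byte >= 0x21 && $byte + $offset <= 0x7E) {
            $byte += $offset;
        }
        $out .= chr($byte);
    }
    return $out;
}

// Tokens copied from the output in this thread (single quotes avoid
// PHP variable interpolation of the "$" characters).
echo shiftAscii('ZZZFDOFKRLFHFRP'), "\n";
echo shiftAscii('3HQGLQJ$SSURYDO'), "\n";
```

This should only be applied to the garbled runs: parts of the same output, such as `PLATINUM TIER`, are already readable and would be corrupted by the shift, so it is a way to confirm the pattern, not a general decoder.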
aarjiontech commented 2 months ago

test1.pdf

For the above PDF, getText() returns an empty string. Please help me. If possible, I would like to get the result in a table format so that I can generate a CSV/Excel file.

GreyWyvern commented 2 months ago

> test1.pdf for the above PDF: getText() returns an empty string.

There is no readable / editable text in the document; it is a scanned image. OCR would be required to extract the text.
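
A rough way to spot such pages in code is to check which pages yield no text at all. This is a minimal sketch under stated assumptions: `pagesWithoutText` is a hypothetical helper, and it treats a page whose extracted text is empty or whitespace-only as a candidate for OCR. It does not detect pages that return garbled text.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: return the indexes of pages whose extracted text
 * is empty or whitespace-only, i.e. likely image-only pages.
 *
 * @param array<int, string> $pageTexts extracted text keyed by page index
 * @return array<int, int>
 */
function pagesWithoutText(array $pageTexts): array
{
    $empty = [];
    foreach ($pageTexts as $index => $text) {
        if (trim($text) === '') {
            $empty[] = $index;
        }
    }
    return $empty;
}

// Stand-in data; with PdfParser the array would be filled from
// $page->getText() for each entry of $pdf->getPages().
print_r(pagesWithoutText(['Regulatory Status', " \n", '']));
```

The pages reported by this check would then have to be rendered to images and passed to a separate OCR tool, since getText() only returns text that is actually stored in the PDF.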

aarjiontech commented 2 months ago

> > test1.pdf for the above PDF: getText() returns an empty string.
>
> There is no readable / editable text in the document; it is a scanned image. OCR would be required to extract the text.

Is there a way to distinguish pages with readable text from pages that are only images, and to run OCR extraction on the latter? Does the smalot library offer any option for OCR extraction? If not, please suggest an OCR library for that.