smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0

Does not parse text from pdf file #564

Closed mapexpert closed 1 year ago

mapexpert commented 1 year ago
$content = Storage::get('iaa/receipt1.pdf');
$parser = new \Smalot\PdfParser\Parser;
$data = $parser->parseContent($content);
dd($data->getText());

Output: b"%\t\n\x00X\x00\x00H\x00U\x00\x03\x005\x00H\x00F\x00H\x00L\x00S\x00W\t\n\x005\t\n\n\x00H\x00F\x00H\x00L\x00S\x00W\x00\x03\x00\x06\t\n\x00\x14\t\n \x00\x1C\x00\x19\x00\x17\x00\x1A\x00\x19\x00\x17\x00\x13\t\n\x005\t\n\n\x00H\x00F\x00H\x00L\x00S\x00W\x00\x03\x00'\x00D\x00W\x00H\t\n\x00\x14\t\n \x00\x14\x00\x12\x00\x15\x00\x16\x00\x12\x00\x15\x00\x13\x00\x15\x00\x15\t\n\x005\t\n\n\x00H\x00F\x00H\x00L\x00Y\x00H\x00G\x00\x03\x00%\x00\t\n\x00$\t\n \x006\x00$\x003\x00\x03\x006\t\n\x006\t\n\n\x00R\x00O\x00G\x00\x03\x00$\x00W\x00\x03\x00%\x00U\x00D\x00Q\x00F\x00K\t\n\x00\x19\t\n

Attached file: receipt1.pdf

IMHO the parser does not parse the font correctly and does not load the translate tables.

Uplink03 commented 1 year ago

I've been having a similar problem with a PDF, but it turns out to be different from yours, @mapexpert. In your case the problem is how Font::loadTranslateTable tries to work out the Unicode table: it ends up with an empty table, so Font::decodeContentByToUnicodeCMapOrDescendantFonts fails as well. That method is not guaranteed to work anyway, as it carries this comment: "@todo Seems this is invalid algorithm that do not follow pdf-format specification. Must be rewritten."
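To illustrate what an empty translate table means for the output, here is a minimal sketch. It is a toy model, not PdfParser's code: translateCodes() is a hypothetical helper, and the table entries are an assumption, guessed from the fact that the 2-byte codes in the dump above appear to sit a constant 29 below the ASCII values of readable text. That offset is an observation about this one output, not a general decoding method.

```php
<?php
// Toy model of a font translate table: 2-byte character code => Unicode string.
// NOT PdfParser's implementation; translateCodes() is a hypothetical helper.
function translateCodes(string $raw, array $table): string
{
    $out = '';
    // Walk the string two bytes at a time (big-endian 16-bit codes).
    foreach (str_split($raw, 2) as $pair) {
        if (strlen($pair) < 2) {
            $out .= $pair; // trailing odd byte: pass it through
            continue;
        }
        $code = unpack('n', $pair)[1];
        // When the table has no entry, the raw bytes leak into the output.
        $out .= $table[$code] ?? $pair;
    }
    return $out;
}

// Fragment copied from the output in the original report.
$raw = "\x005\x00H\x00F\x00H\x00L\x00S\x00W";

// Empty table: the "text" is just the raw, NUL-interleaved bytes.
var_dump(translateCodes($raw, []) === $raw);

// Hypothetical table (assumption: code + 29 = ASCII for this font).
$table = [];
foreach ([0x35, 0x48, 0x46, 0x4C, 0x53, 0x57] as $code) {
    $table[$code] = chr($code + 29);
}
echo translateCodes($raw, $table), "\n";
```

With the empty table the function returns its input unchanged, which matches the kind of output shown in the report. With the guessed table the same bytes come out as readable text.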

My similar problem was a lot easier to fix, as it was caused by Encoding::init. If I replace this:

if ($this->has('BaseEncoding')) {
    $this->encoding = EncodingLocator::getEncoding($this->getEncodingClass())->getTranslations();

    // the code that loads Differences
}

with

if ($this->has('BaseEncoding')) {
    $this->encoding = EncodingLocator::getEncoding($this->getEncodingClass())->getTranslations();
}

// the code that loads Differences

In other words, moving // the code that loads Differences out of the big if block fixes my problem. This seems to be the problem described in #462.
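A self-contained sketch of why that move matters, under stated assumptions: buildEncoding() is a hypothetical stand-in, not the library's Encoding::init, and the glyph names are made up for illustration. It follows the PDF convention for a Differences array, where a number sets the current character code and each following name is assigned to successive codes.

```php
<?php
// Hypothetical stand-in for the logic discussed above; not PdfParser's code.
// $base: code => glyph name from a base encoding, or [] when the Encoding
//        dictionary has no BaseEncoding entry.
// $differences: PDF-style Differences array, e.g. [39, 'quotesingle'].
function buildEncoding(array $base, array $differences): array
{
    $encoding = $base;
    $code = 0;
    foreach ($differences as $item) {
        if (is_int($item)) {
            $code = $item;              // a number resets the current code
        } else {
            $encoding[$code++] = $item; // a name fills the code, then advances
        }
    }
    return $encoding;
}

// No BaseEncoding at all: the Differences entries still take effect,
// because they are applied outside any BaseEncoding check.
$encoding = buildEncoding([], [65, 'Alpha', 'Beta', 97, 'alpha']);
print_r($encoding);

// With a base encoding, Differences override only the listed codes.
$overridden = buildEncoding([65 => 'A', 67 => 'C'], [65, 'Alpha']);
print_r($overridden);
```

If the Differences loop only ran when a base encoding was present, the first call would yield an empty table, which is the symptom described above.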

I don't yet understand the code well enough to fix the problem you're having, though.

I gave your PDF to pdf2txt.py (from the pdfminer Python project) and to pdftotext (from the poppler-utils package on Ubuntu 22.04), and they both barfed at it, while both decoded my own PDF's text just fine.

I'm writing this hoping it will give someone a hint of where to look when they make an attempt to fix this.

NazarSolovei commented 1 year ago

@mapexpert, have you solved the issue?

mapexpert commented 1 year ago

> @mapexpert, have you solved the issue?

No, I have not. I still have the issue.

k00ni commented 1 year ago

@mapexpert, can you try @Uplink03's suggestion and let us know whether it works? https://github.com/smalot/pdfparser/issues/564#issuecomment-1386440468

mapexpert commented 1 year ago

I tried @Uplink03's suggestion and it does not work in my case.

GreyWyvern commented 1 year ago

Definitely something weird going on here with fonts. If I save the file as a reduced-size PDF in Acrobat, the text issue remains. If I select all the text in the PDF in Adobe Acrobat, convert the font to Arial, and then save the file, PdfParser parses the text properly.

Edit: The translate table used by translateChar() in Font.php is indeed empty. In loadTranslateTable(), the call that fetches $content ($this->get('ToUnicode')->getContent()) returns an undecoded binary string instead of the plain text that the rest of the function clearly expects, judging by its preg_match_all() calls. Any idea what this encoding might be?

x�\�ϊ�@����a�=��Y�����}�ĞCc�{����ת�����L�'��U}��n�Wߧ�9�9�t};���>5)?�k�g��ۮ��>���4f����q����_�l��W?���yz�/�v8�O���Ԧ���˯��<���w��~΋L$o�e��/�����{=���n~�.g���cL����4C���Iө��lS�"ג����:�:_>|�B��t�R�y���tk�j1]��^LWk�,�t;�J1�g���n��6       ��^L*Qb1ɡ�Q !P�Q !P���JL
(mŤ����&Q� a�NLڢ�c���8�@%L�y1=2:M=2:�K���.�#�[���Ubzdt�zdt����^L���c�T������%��Zb1)���HQLFFZ���H��������j1   �RK�^̀ux3�U�h�G1Z�X�Ъ��h�cj@�^l0�'�k1�_�m�7��b,�1�ƺ.�4vb,�Ï�1�1    bFL����`�K��WbFL��bFL�u0b�
���{�߅����]���4-7���z�>/׮O�`��|��������
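
One possibility, offered as a guess from the dump rather than anything the thread confirms: the leading `x` is consistent with the first byte of a zlib header (0x78), which would mean the ToUnicode stream is still Flate-compressed. A minimal sketch of checking for that follows. maybeInflate() is a hypothetical helper, the CMap fragment is made up, and PHP's zlib extension is assumed to be available.

```php
<?php
// Hypothetical helper, not part of PdfParser: return the inflated stream
// if $raw looks like zlib data and decompresses cleanly, else return $raw.
function maybeInflate(string $raw): string
{
    // zlib streams with the usual window size start with byte 0x78 ('x').
    if ($raw === '' || $raw[0] !== "\x78") {
        return $raw;
    }
    $inflated = @gzuncompress($raw); // false when $raw is not valid zlib data
    return $inflated === false ? $raw : $inflated;
}

// Made-up CMap fragment in the plain-text form the regex-based code expects.
$cmap = "2 beginbfchar\n<0035> <0052>\n<0048> <0065>\nendbfchar\n";
$compressed = gzcompress($cmap); // simulate an undecoded stream

$plain = maybeInflate($compressed);

// After inflating, hex pairs can be matched with a regular expression.
preg_match_all('/<([0-9A-F]{4})>\s*<([0-9A-F]{4})>/i', $plain, $m);
print_r(array_combine($m[1], $m[2]));
```

Data that does not start with 0x78, or that fails to decompress, is returned untouched, so the helper is safe to call on streams that are already plain text.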

GreyWyvern commented 1 year ago

This appears to be fixed in the latest release, v2.7.0. Although a lot of spacing issues remain, the text is extracted successfully.

ephrin commented 1 year ago

@GreyWyvern @mapexpert close this then?

mapexpert commented 1 year ago

> @GreyWyvern @mapexpert close this then?

Yes