smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.

GNU Lesser General Public License v3.0


Call to undefined method Smalot\PdfParser\Encoding::__toString() #364

Closed rubas closed 3 years ago

rubas commented 4 years ago

We are seeing a lot of uncaught errors when we try to extract the content of some PDFs.

Encoding::__toString()

Call to undefined method Smalot\PdfParser\Encoding::__toString()

You can find the complete stack trace here. The character being decoded is \.

if (\strlen($char) < 2 && $this->has('Encoding') && 'WinAnsiEncoding' === $this->get('Encoding')->__toString()) {
    $fallbackDecoded = self::uchr($dec);
}

https://github.com/smalot/pdfparser/blob/master/src/Smalot/PdfParser/Font.php#L104

Header::__toString()

Call to undefined method Smalot\PdfParser\Header::__toString()

You can find the complete stack trace here. The character being decoded is !.

Code

Our code is simple.

use Smalot\PdfParser\Parser;

$content = file_get_contents($url);
...
$parser = new Parser();
$pdf    = $parser->parseContent($content);

return $pdf->getText();
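
A note for anyone hitting this in production: in PHP 7 and later, "Call to undefined method" is thrown as an \Error, not an \Exception, so a plain catch (\Exception $e) around the parser call will not stop it. The sketch below shows one way to contain the failure by catching \Throwable. It is only an illustration: StubEncoding and extractTextSafely() are hypothetical names, not part of pdfparser, and the closures stand in for the real $parser->parseContent($content)->getText() call.

```php
<?php
// Hypothetical stand-in for an object that lacks __toString().
// This is NOT a pdfparser class; it only reproduces the error type.
final class StubEncoding
{
}

/**
 * Hypothetical helper: runs an extraction callback and returns null
 * instead of crashing when the callback throws.
 */
function extractTextSafely(callable $extract): ?string
{
    try {
        return $extract();
    } catch (\Throwable $e) {
        // \Throwable covers both \Exception and \Error, so the
        // "Call to undefined method" error is caught here too.
        error_log('PDF text extraction failed: ' . $e->getMessage());

        return null;
    }
}

// Simulates the failure from this report: calling a method that does not exist.
$failing = function (): string {
    return (new StubEncoding())->__toString();
};

// Simulates a PDF that parses without problems.
$working = function (): string {
    return 'extracted text';
};

$failed = extractTextSafely($failing);
$ok = extractTextSafely($working);
```

In real code the callback would wrap the parseContent() and getText() calls shown above. This only keeps the application running; the affected PDFs still yield no text.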

Testfiles

10-12.pdf 12-14.pdf 28-32-2.pdf

k00ni commented 4 years ago

Thank you for your detailed bug report.

clicksistema commented 4 years ago

The __toString method is missing from the Encoding class. As a test, I added one that returns an implode of the object's data, and the error stopped.

k00ni commented 4 years ago

Can you paste your fix here?

clicksistema commented 4 years ago

I inserted this function into the Encoding class:

    public function __toString()
    {
        return implode(',',$this->encoding);
    }

Just to be clear: I didn't check what this class is used for. I just created a function that works and that did not exist before. I believe that most of the time this class is not returned as an object inside the Header class, but when a Header holds an object of this class, the error occurs. Maybe the real error lies deeper in the context. Why is this class sometimes part of the Header class?
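
An alternative to adding __toString() is to guard the call site, so the comparison is skipped when the element cannot be converted to a string. The following is a minimal sketch under assumptions, not the library's fix: StubName, StubEncodingObject, stringValueOrNull() and isWinAnsi() are hypothetical names made up for illustration. Like the workaround above, it only avoids the crash and does not explain why an Encoding object ends up in that position.

```php
<?php
// Hypothetical stand-ins; these are NOT pdfparser classes.
final class StubName
{
    private $value;

    public function __construct(string $value)
    {
        $this->value = $value;
    }

    public function __toString(): string
    {
        return $this->value;
    }
}

// An object without __toString(), like Encoding in this report.
final class StubEncodingObject
{
}

/**
 * Returns the string form of $element, or null when the element
 * is not an object that implements __toString().
 */
function stringValueOrNull($element): ?string
{
    if (is_object($element) && method_exists($element, '__toString')) {
        return (string) $element;
    }

    return null;
}

/**
 * Guarded version of the 'WinAnsiEncoding' comparison: it never calls
 * __toString() on an object that does not define it.
 */
function isWinAnsi($element): bool
{
    return 'WinAnsiEncoding' === stringValueOrNull($element);
}
```

With this guard, an element without __toString() simply counts as "not WinAnsiEncoding", so the fallback decoding branch is skipped for it.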

johnyboom commented 4 years ago

Hi, I have the same issue. But sadly the fix removes the error, not the problem. If you have one rogue character in a file, it's no big deal, but some of the files I need to parse are almost entirely unreadable.

pd120320.pdf

TISKOVÁ ZPRÁVA Centrum pro výzkum veejného mínní Sociologický ústav AV R, v.v.i. !"# $%&'%()&&&* etc.

Despite this, the majority of files are parsed nicely so great work.

k00ni commented 4 years ago

@johnyboom: Is the PDF you posted free to use and without obligations? We may add it to our test environment to test potential fixes.

johnyboom commented 4 years ago

Well, it is a public document but to be sure I'll ask for consent.

https://cvvm.soc.cas.cz/media/com_form2content/documents/c2/a47/f9/pd120320.pdf

johnyboom commented 4 years ago

@johnyboom: Is the PDF you posted free to use and without obligations? We may add it to our test environment to test potential fixes.

Ok, we have consent to use it for tests. I've forwarded the details to your email.

k00ni commented 4 years ago

The following consent was given for the mentioned PDF file:

We are giving consent to https://www.pdfparser.org to freely use pdf file https://cvvm.soc.cas.cz/media/com_form2content/documents/c2/a47/f9/pd120320.pdf for testing purposes. Content of the file is still intellectual property of "CENTRUM PRO VÝZKUM VEŘEJNÉHO MÍNĚNÍ Sociologický ústav AV ČR, v.v.i." and should be handled according to https://cvvm.soc.cas.cz/cz/cvvm/dokumenty/13-pravni-ujednani.

If someone wants to provide a fix and use this file to verify it, please include my quoted consent verbatim and add it to the code part (with the test code) that uses the PDF.