smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0

Font Fallback Issue #657

Open paytah232 opened 6 months ago

paytah232 commented 6 months ago

Description:

PDF input

Personal payslip, so unable to provide, but will do what I can

Expected output & actual output

getText() seems to work, although there is some odd encoding here and there. When I run getDataTm(), it fails, seemingly because of a font issue.

Fatal error: Uncaught TypeError: Smalot\PdfParser\PDFObject::getTJUsingFontFallback(): Argument #1 ($font) must be of type Smalot\PdfParser\Font, null given, called in /volume1/web/devel/includes/vendor/smalot/pdfparser/src/Smalot/PdfParser/PDFObject.php on line 531 and defined in /volume1/web/devel/includes/vendor/smalot/pdfparser/src/Smalot/PdfParser/PDFObject.php on line 252

TypeError: Smalot\PdfParser\PDFObject::getTJUsingFontFallback(): Argument #1 ($font) must be of type Smalot\PdfParser\Font, null given, called in /volume1/web/devel/includes/vendor/smalot/pdfparser/src/Smalot/PdfParser/PDFObject.php on line 531 in /volume1/web/devel/includes/vendor/smalot/pdfparser/src/Smalot/PdfParser/PDFObject.php on line 252

Call Stack:

0.0019 370824 1. {main}() /volume1/web/devel/scripts/testing/pdf.php:0

0.1753 1337680 2. Smalot\PdfParser\Page->getDataTm($dataCommands = ???) /volume1/web/devel/scripts/testing/pdf.php:25

0.1861 1510200 3. Smalot\PdfParser\Page->getTextArray($page = ???) /volume1/web/devel/includes/vendor/smalot/pdfparser/src/Smalot/PdfParser/Page.php:701

0.1861 1547256 4. Smalot\PdfParser\PDFObject->getTextArray($page = class Smalot\PdfParser\Page { protected $document = class Smalot\PdfParser\Document { protected $objects = [...]; protected $dictionary = [...]; protected $trailer = class Smalot\PdfParser\Header { ... }; protected $metadata = [...]; protected $details = [...] }; protected $header = class Smalot\PdfParser\Header { protected $document = class Smalot\PdfParser\Document { ... }; protected $elements = [...] }; protected $content = ''; protected $config = class Smalot\PdfParser\Config { private $fontSpaceLimit = -50; private $horizontalOffset = ' '; private $pdfWhitespaces = '\000\t\n\f\r '; private $pdfWhitespacesRegex = '[\\0\\t\\n\\f\\r ]'; private $retainImageContent = TRUE; private $decodeMemoryLimit = 0; private $dataTmFontInfoHasToBeIncluded = TRUE }; protected $fonts = []; protected $xobjects = NULL; protected $dataTm = NULL }) /volume1/web/devel/includes/vendor/smalot/pdfparser/src/Smalot/PdfParser/Page.php:365

0.1900 1578944 5. Smalot\PdfParser\PDFObject->getTJUsingFontFallback($font = NULL, $command = [0 => ['t' => '(', 'o' => '\'', 'c' => '\000,']], $page = class Smalot\PdfParser\Page { protected $document = class Smalot\PdfParser\Document { protected $objects = [...]; protected $dictionary = [...]; protected $trailer = class Smalot\PdfParser\Header { ... }; protected $metadata = [...]; protected $details = [...] }; protected $header = class Smalot\PdfParser\Header { protected $document = class Smalot\PdfParser\Document { ... }; protected $elements = [...] }; protected $content = ''; protected $config = class Smalot\PdfParser\Config { private $fontSpaceLimit = -50; private $horizontalOffset = ' '; private $pdfWhitespaces = '\000\t\n\f\r '; private $pdfWhitespacesRegex = '[\\0\\t\\n\\f\\r ]'; private $retainImageContent = TRUE; private $decodeMemoryLimit = 0; private $dataTmFontInfoHasToBeIncluded = TRUE }; protected $fonts = []; protected $xobjects = NULL; protected $dataTm = NULL }) /volume1/web/devel/includes/vendor/smalot/pdfparser/src/Smalot/PdfParser/PDFObject.php:531

It does work on another invoice I have, just not this payslip.

Code

```php
use Smalot\PdfParser\Parser;
use Smalot\PdfParser\Config;

// $debugger is presumably the poster's own debug-output helper; its definition is not shown.

$config = new Config();
$config->setDataTmFontInfoHasToBeIncluded(true);
$parser = new Parser([], $config);

$pdf = $parser->parseFile('paySlip.pdf');
//$pdf = $parser->parseFile('Invoice INV-0007.pdf');

$text = $pdf->getText();

$debugger->force_out($text, 'Text');

$metaData = $pdf->getDetails();

$debugger->force_out($metaData, 'Meta');

$pages = $pdf->getPages();
$debugger->force_out($pages);

// This is the call that throws the TypeError on the payslip.
$pos = $pdf->getPages()[0]->getDataTm();

$debugger->force_out($pos, 'Data');
```

k00ni commented 6 months ago

@GreyWyvern this one may interest you.

I was just thinking of making the mentioned parameter of getTJUsingFontFallback also accept null. But further research might be needed here.

GreyWyvern commented 6 months ago

It would be useful to see the data from the PDF in question. Any of a number of things might be happening. The document might be trying to define a font that PdfParser doesn't accept, or a mismatched set of q and Q commands might be leading to a null value for the current font, or... it could be a lot of things.

I would definitely want to see what was happening before allowing getTJUsingFontFallback to accept a null value. It should always be a valid font in the current context when it's called. Allowing null might fix the issue, but it would be akin to putting a band-aid on the problem instead of fixing it at the source.
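To illustrate the q/Q possibility mentioned above, here is a minimal sketch of how an unbalanced restore could leave the current font null. It is a simplified, hypothetical model of a content-stream interpreter, not PdfParser's actual code; the function name `trackFont` and the operator/operand pair format are assumptions made for the example.

```php
<?php
// Hypothetical, simplified model of graphics-state handling; NOT PdfParser's code.
// Each operation is an [operator, operand] pair. Only q, Q and Tf are modelled.
function trackFont(array $ops): ?string
{
    $stack = [];   // saved graphics states (here: only the font name)
    $font = null;  // current font, null until a Tf operator sets one

    foreach ($ops as [$operator, $operand]) {
        switch ($operator) {
            case 'q':  // save the current state
                $stack[] = $font;
                break;
            case 'Q':  // restore the last saved state
                // array_pop() returns null on an empty array, so a stray Q
                // silently wipes the current font in this model.
                $font = array_pop($stack);
                break;
            case 'Tf': // select a font
                $font = $operand;
                break;
        }
    }

    return $font;
}

$balanced = [['Tf', 'F1'], ['q', null], ['Tf', 'F2'], ['Q', null]];
$unbalanced = array_merge($balanced, [['Q', null]]); // one Q too many

var_dump(trackFont($balanced));   // string(2) "F1"
var_dump(trackFont($unbalanced)); // NULL
```

In this model, any text-showing command that follows the extra Q would be handed a null font, which matches the shape of the TypeError in the report, although the real cause in the affected PDFs is still unknown.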

paytah232 commented 6 months ago

@GreyWyvern - I understand, but as stated, the PDF in question is my payslip, and I wouldn't be comfortable sharing that document. Perhaps I can try and edit some key values and see if the issue still exists, then I would be happy to share. I'll try and come back to you.

bleigh-gemnisw commented 5 months ago

EDIT: And of course now it's working, so no clue what was wrong before. But it does happen on other documents, which I also cannot share.

Perhaps I can help. I have the same issue with the attached, very simple PDF file: output.pdf

GreyWyvern commented 5 months ago

> EDIT: And of course now it's working, so no clue what was wrong before. But it does happen on other documents, which I also cannot share.

Yep, your file is working for me too in 2.8.0-RC2. :( If you can figure out how to get it to display the error using a PDF you can post, please share!

thomasage commented 3 months ago

Hi! I have the same issue. After re-opening the file in Adobe and saving it again, the error was gone. I can provide the 2 files (with error and without error). I hope it helps. file-error.pdf file-success.pdf

GreyWyvern commented 3 months ago

> Hi! I have the same issue. After re-opening the file in Adobe and saving it again, the error was gone. I can provide the 2 files (with error and without error). I hope it helps.

Running getDataTm() on both files gives output without any errors for me in 2.9.0.

thomasage commented 3 months ago

I just tried it and you're right. I don't know what happened. I'll post a new comment with more details if it happens again.

k00ni commented 3 months ago

Is this issue solved now? @bleigh-gemnisw and @paytah232, please give us a short ping.

bleigh-gemnisw commented 3 months ago

@k00ni I still have files that it occurs in but unfortunately cannot share them for troubleshooting.

I'm of the opinion that your previous suggestion:

"I was just thinking of making the mentioned parameter of getTJUsingFontFallback also accept null. But further research might be needed here."

is the solution. It lets files with the problem be parsed without erroring out, without having to know what's wrong with their font, and it shouldn't interfere with anything else as long as downstream code is made to handle the same condition.

Then I can deal with those files as needed on the backend analyzing the produced json (i.e. giving it a default or replacing whatever bad font is causing it). As it stands I can't process those files at all.
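Until the library itself changes, one possible caller-side stopgap is to catch the TypeError per page, so that a single bad page does not abort the whole run. The sketch below only assumes that each page object exposes `getDataTm()`, as in the code at the top of this issue; `collectDataTm` is a hypothetical helper name, not part of PdfParser.

```php
<?php
// Caller-side stopgap; NOT a fix inside PdfParser.
// collectDataTm() is a hypothetical helper. It only assumes that every
// element of $pages has a getDataTm() method.
function collectDataTm(array $pages): array
{
    $result = ['data' => [], 'failed' => []];

    foreach ($pages as $index => $page) {
        try {
            $result['data'][$index] = $page->getDataTm();
        } catch (\TypeError $e) {
            // Record the failure so the page can be handled separately later.
            $result['failed'][$index] = $e->getMessage();
        }
    }

    return $result;
}
```

With a parsed document this would be called as `collectDataTm($pdf->getPages())`. Note that catching TypeError this broadly can also hide unrelated type errors, so the recorded messages are worth checking.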

GreyWyvern commented 3 months ago

I suspect this might be another inline image issue, the same as #691, where binary image data containing 'q' or 'Q' is unbalancing the stored state of the document, which includes fonts.
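A rough sketch of the suspected mechanism: if the bytes of an inline image (between the ID and EI operators) are tokenized like ordinary operators, a byte sequence that happens to look like a standalone Q gets counted as a state restore. This is a hypothetical illustration, not PdfParser's tokenizer; `stateBalance` and the sample stream are made up for the example.

```php
<?php
// Hypothetical illustration; NOT PdfParser's tokenizer.
// Returns (number of q tokens) - (number of Q tokens) in a content stream.
function stateBalance(string $stream, bool $skipInlineImages): int
{
    if ($skipInlineImages) {
        // Drop the raw image bytes between ID and EI. Real-world detection of
        // EI is harder, because the image bytes may themselves contain "EI".
        $stream = preg_replace('/\bID\s.*?\sEI\b/s', 'ID EI', $stream);
    }

    $balance = 0;
    foreach (preg_split('/\s+/', $stream, -1, PREG_SPLIT_NO_EMPTY) as $token) {
        if ($token === 'q') {
            $balance++;
        } elseif ($token === 'Q') {
            $balance--;
        }
    }

    return $balance;
}

// One q/Q pair around an inline image whose data contains a stray " Q ".
$stream = "q BI /W 1 /H 1 ID \x00 Q \x01 EI Q";

var_dump(stateBalance($stream, false)); // int(-1): image bytes miscounted as a restore
var_dump(stateBalance($stream, true));  // int(0): balanced once image data is skipped
```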

@bleigh-gemnisw if it is at all possible to send the affected PDFs to bhuisman at greywyvern dot com so I can verify this privately, I'd appreciate it.