paytah232 commented 11 months ago

PHP Version: 8.2.5
PDFParser Version: 2.7.0

Description:

PDF input

Personal payslip, so unable to provide, but will do what I can

Expected output & actual output

getText() seems to work, although there is some odd encoding here and there. Running getDataTm() fails, apparently because of a font issue.

Fatal error: Uncaught TypeError: Smalot\PdfParser\PDFObject::getTJUsingFontFallback(): Argument #1 ($font) must be of type Smalot\PdfParser\Font, null given, called in /volume1/web/devel/includes/vendor/smalot/pdfparser/src/Smalot/PdfParser/PDFObject.php on line 531 and defined in /volume1/web/devel/includes/vendor/smalot/pdfparser/src/Smalot/PdfParser/PDFObject.php on line 252

TypeError: Smalot\PdfParser\PDFObject::getTJUsingFontFallback(): Argument #1 ($font) must be of type Smalot\PdfParser\Font, null given, called in /volume1/web/devel/includes/vendor/smalot/pdfparser/src/Smalot/PdfParser/PDFObject.php on line 531 in /volume1/web/devel/includes/vendor/smalot/pdfparser/src/Smalot/PdfParser/PDFObject.php on line 252

Call Stack:
0.0019 370824 1. {main}() /volume1/web/devel/scripts/testing/pdf.php:0
0.1753 1337680 2. Smalot\PdfParser\Page->getDataTm($dataCommands = ???) /volume1/web/devel/scripts/testing/pdf.php:25
0.1861 1510200 3. Smalot\PdfParser\Page->getTextArray($page = ???) /volume1/web/devel/includes/vendor/smalot/pdfparser/src/Smalot/PdfParser/Page.php:701
0.1861 1547256 4. Smalot\PdfParser\PDFObject->getTextArray($page = class Smalot\PdfParser\Page { protected $document = class Smalot\PdfParser\Document { protected $objects = [...]; protected $dictionary = [...]; protected $trailer = class Smalot\PdfParser\Header { ... }; protected $metadata = [...]; protected $details = [...] }; protected $header = class Smalot\PdfParser\Header { protected $document = class Smalot\PdfParser\Document { ... }; protected $elements = [...] }; protected $content = ''; protected $config = class Smalot\PdfParser\Config { private $fontSpaceLimit = -50; private $horizontalOffset = ' '; private $pdfWhitespaces = '\000\t\n\f\r '; private $pdfWhitespacesRegex = '[\\0\\t\\n\\f\\r ]'; private $retainImageContent = TRUE; private $decodeMemoryLimit = 0; private $dataTmFontInfoHasToBeIncluded = TRUE }; protected $fonts = []; protected $xobjects = NULL; protected $dataTm = NULL }) /volume1/web/devel/includes/vendor/smalot/pdfparser/src/Smalot/PdfParser/Page.php:365
0.1900 1578944 5. Smalot\PdfParser\PDFObject->getTJUsingFontFallback($font = NULL, $command = [0 => ['t' => '(', 'o' => '\'', 'c' => '\000,']], $page = class Smalot\PdfParser\Page { protected $document = class Smalot\PdfParser\Document { protected $objects = [...]; protected $dictionary = [...]; protected $trailer = class Smalot\PdfParser\Header { ... }; protected $metadata = [...]; protected $details = [...] }; protected $header = class Smalot\PdfParser\Header { protected $document = class Smalot\PdfParser\Document { ... }; protected $elements = [...] }; protected $content = ''; protected $config = class Smalot\PdfParser\Config { private $fontSpaceLimit = -50; private $horizontalOffset = ' '; private $pdfWhitespaces = '\000\t\n\f\r '; private $pdfWhitespacesRegex = '[\\0\\t\\n\\f\\r ]'; private $retainImageContent = TRUE; private $decodeMemoryLimit = 0; private $dataTmFontInfoHasToBeIncluded = TRUE }; protected $fonts = []; protected $xobjects = NULL; protected $dataTm = NULL }) /volume1/web/devel/includes/vendor/smalot/pdfparser/src/Smalot/PdfParser/PDFObject.php:531

It does work on another invoice I have, just not this payslip.

Code

```php
use Smalot\PdfParser\Parser;
use Smalot\PdfParser\Config;

// Ask getDataTm() to include font information in its output
$config = new Config();
$config->setDataTmFontInfoHasToBeIncluded(true);
$parser = new Parser([], $config);

$pdf = $parser->parseFile('paySlip.pdf');
//$pdf = $parser->parseFile('Invoice INV-0007.pdf'); // this file works

// $debugger is a local output helper, not part of PdfParser
$text = $pdf->getText();
$debugger->force_out($text, 'Text');

$metaData = $pdf->getDetails();
$debugger->force_out($metaData, 'Meta');

$pages = $pdf->getPages();
$debugger->force_out($pages);

// The TypeError above is thrown by this call
$pos = $pdf->getPages()[0]->getDataTm();
$debugger->force_out($pos, 'Data');
```

k00ni commented 11 months ago

@GreyWyvern this one may interest you.

I was just thinking of making the mentioned parameter of getTJUsingFontFallback also accept null. But further research might be needed here.

GreyWyvern commented 11 months ago

It would be useful to see the data from the PDF in question. Any of a number of things might be happening. The document might be trying to define a font that PdfParser doesn't accept, or a mismatched set of q and Q commands is leading to a null value for the current font, or... it could be a lot of things.

I would definitely want to see what was happening before allowing getTJUsingFontFallback to accept a null value. It should always be a valid font in the current context when it's called. Allowing null might fix the issue, but it would be akin to putting a band-aid on the problem instead of fixing it at the source.

paytah232 commented 11 months ago

@GreyWyvern - I understand, but as stated, the PDF in question is my payslip, and I wouldn't be comfortable sharing that document. Perhaps I can try and edit some key values and see if the issue still exists, then I would be happy to share. I'll try and come back to you.

bleigh-gemnisw commented 10 months ago

EDIT: And of course now it's working, so no clue what was wrong before. But it does happen on other documents, which I also cannot share.

Perhaps I can help. I have the same issue with the attached output.pdf, a very simple PDF file.

GreyWyvern commented 10 months ago

> EDIT: And of course now it's working, so no clue what was wrong before. But it does happen on other documents, which I also cannot share.

Yep, your file is working for me too in 2.8.0-RC2. :( If you can figure out how to get it to display the error using a PDF you can post, please share!

thomasage commented 8 months ago

Hi! I have the same issue. After re-opening the file in Adobe and saving it again, the error is gone. I can provide the two files (with error and without error). I hope it helps. file-error.pdf file-success.pdf

GreyWyvern commented 8 months ago

> Hi! I have the same issue. After re-opening the file in Adobe and saving it again, the error is gone. I can provide the two files (with error and without error). I hope it helps.

Running getDataTm() on both files gives output without any errors for me in 2.9.0.

thomasage commented 8 months ago

I just tried it and you're right. I don't know what happened. I'll post a new comment with more details if it happens again.

k00ni commented 8 months ago

Is this issue solved now? @bleigh-gemnisw and @paytah232, please give us a short ping.

bleigh-gemnisw commented 8 months ago

@k00ni I still have files that it occurs in but unfortunately cannot share them for troubleshooting.

I'm of the opinion that your previous suggestion:

"I was just thinking of making the mentioned parameter of getTJUsingFontFallback also accept null. But further research might be needed here."

is the solution. It lets files with the problem be processed without erroring out and without having to know what's wrong with their font, and it shouldn't interfere with anything else as long as downstream code is made to handle the same condition.

Then I can deal with those files as needed on the backend by analyzing the produced JSON (i.e. giving the font a default or replacing whatever bad font is causing it). As it stands I can't process those files at all.
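Until the root cause is known, one stopgap on the calling side is to catch the TypeError per page so that one bad page does not abort the whole run. The following is only a sketch under assumptions: `collectDataTm` is a hypothetical helper, not part of PdfParser, and it relies only on each page object exposing `getDataTm()`. The two anonymous classes are stand-ins so the example runs without the library.

```php
<?php
/**
 * Hypothetical helper (not part of PdfParser): call getDataTm() on every
 * page and record the pages that throw instead of aborting the whole run.
 */
function collectDataTm(iterable $pages): array
{
    $result = ['data' => [], 'failed' => []];

    foreach ($pages as $index => $page) {
        try {
            $result['data'][$index] = $page->getDataTm();
        } catch (\TypeError $e) {
            // Keep the message so the page can be inspected later
            $result['failed'][$index] = $e->getMessage();
        }
    }

    return $result;
}

// Stand-in pages (assumptions for illustration only)
$okPage = new class {
    public function getDataTm(): array
    {
        return [['matrix', 'text']];
    }
};
$badPage = new class {
    public function getDataTm(): array
    {
        throw new \TypeError('null font');
    }
};

$result = collectDataTm([$okPage, $badPage]);
```

With the real library this would be called as `collectDataTm($parser->parseFile('paySlip.pdf')->getPages())`. Catching TypeError this broadly can also hide unrelated bugs, so it is a way to keep working, not a fix for the underlying problem.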

GreyWyvern commented 8 months ago

I suspect this might be another inline image issue, the same as #691, where binary image data containing 'q' or 'Q' is unbalancing the stored state of the document, which includes fonts.

@bleigh-gemnisw if it is at all possible to send the affected PDFs to bhuisman at greywyvern dot com so I can verify this privately, I'd appreciate it.
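To illustrate the q/Q hypothesis: in a PDF content stream, `q` saves the graphics state (which includes the current font) and `Q` restores it. The toy model below is a sketch of that idea only and is not PdfParser's actual code; `replayFontState` and the operator lists are hypothetical. It shows how a single stray `Q`, for example a byte of inline image data misread as an operator, can restore a state saved before any font was set, leaving the next text operator with a null font.

```php
<?php
/**
 * Toy model (hypothetical, not PdfParser code): replay operators and
 * return the font that is in effect at each Tj (show text) operator.
 */
function replayFontState(array $operators): array
{
    $stack = [];      // saved fonts, one entry per 'q'
    $current = null;  // current font, null until a Tf sets one
    $seen = [];

    foreach ($operators as [$op, $arg]) {
        switch ($op) {
            case 'Tf': // set font
                $current = $arg;
                break;
            case 'q':  // save state
                $stack[] = $current;
                break;
            case 'Q':  // restore state; array_pop() gives null on an empty stack
                $current = array_pop($stack);
                break;
            case 'Tj': // show text with the current font
                $seen[] = $current;
                break;
        }
    }

    return $seen;
}

// Well-formed stream: the font is set inside a q ... Q pair
$balanced = [['q', null], ['Tf', 'F1'], ['Tj', null], ['Q', null]];

// Same stream with one stray 'Q' before the text is shown
$unbalanced = [['q', null], ['Tf', 'F1'], ['Q', null], ['Tj', null], ['Q', null]];

$balancedFonts = replayFontState($balanced);     // ['F1']
$unbalancedFonts = replayFontState($unbalanced); // [null]
```

In the unbalanced case the stray `Q` pops the state that was saved before `Tf` ran, so the font is already null when `Tj` is reached, which matches the shape of the reported error.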

paytah232 commented 2 months ago

@k00ni @GreyWyvern - Sorry for being absent from this for so long, but whatever was causing my files not to work now seems to be resolved when running on v2.11.

Both of the examples I have still produce very odd-looking text output (i.e. the encoding seems off: mostly legible, but with characters swapped, missing or just wrong), but getDataTm() now at least outputs the data without erroring out.

In its current state, this is now usable for me on those original documents, but I understand others like @bleigh-gemnisw may still be having other issues.

I also tried it on a graphics-heavy NRMA insurance certificate, and it died reporting an infinite loop. I'm assuming this is due to the complexity rather than the content, but I do not know. I have a small snippet in case it is at all helpful:

Fatal error: Uncaught Error: Xdebug has detected a possible infinite loop, and aborted your script with a stack depth of '256' frames in /volume1/web/devel/includes/database.php on line 60

Error: Xdebug has detected a possible infinite loop, and aborted your script with a stack depth of '256' frames in /volume1/web/devel/includes/database.php on line 60

Call Stack:
0.0002 371400 1. {main}() /volume1/web/devel/scripts/testing/pdf.php:0
0.0227 1657608 2. Smalot\PdfParser\Parser->parseFile($filename = 'nrma.pdf') /volume1/web/devel/scripts/testing/pdf.php:13
0.0228 1727240 3. Smalot\PdfParser\Parser->parseContent($content = '%PDF-1.4\n%��\n1 0 obj\n<<\n/Creator <2800EFAC7483BAB7AF48191E3A90BA50354B84CD9B75A7C2665FAE>\n/Producer <2800EFAC7483BAB7AF48191E3A90BA50354B84CD9B75A7C2665FAE>\n/CreationDate <3F57BCEE3297C9E1F3145D4767C6E9167715DA998B68A0>\n>>\nendobj\n2 0 obj\n<<\n /N 3\n /Length 3 0 R\n /Filter

This seems to come from the data and dies in FilterHelper.php (according to my log):

```
/volume1/web/devel/includes/vendor/smalot/pdfparser/src/Smalot/PdfParser/RawData/FilterHelper.php:239
0.0235 1748480 10. {closure:/volume1/web/devel/includes/load.php:94-107}($errno = 2, $errstr = 'gzuncompress(): data error', $errfile = '/volume1/web/devel/includes/vendor/smalot/pdfparser/src/Smalot/PdfParser/RawData/FilterHelper.php', $errline = 239) /volume1/web/devel/includes/vendor/smalot/pdfparser/src/Smalot/PdfParser/RawData/FilterHelper.php:239
0.0235 1749216 11. logFailure($action = 'Error:

: 2

Message: gzuncompress(): data error File: /volume1/web/devel/includes/vendor/smalot/pdfparser/src/Smalot/PdfParser/RawData/FilterHelper.php Line: 239 ', $backtrace_error = ???
```

smalot / pdfparser

Font Fallback Issue #657
