Open parpalak opened 2 months ago
I also had this problem, and made a workaround for myself in this if: https://github.com/smalot/pdfparser/blob/ac8e6678b0940e4b2ccd5caadd3fb18e68093be6/src/Smalot/PdfParser/PDFObject.php#L783
I changed it from
if (\is_object($xobject) && $xobject instanceof self && !\in_array($xobject->getUniqueId(), self::$recursionStack, true)) {
// Not a circular reference.
$text[] = $xobject->getText($page);
}
to
if (\is_object($xobject) && $xobject instanceof self && !\in_array($xobject->getUniqueId(), self::$recursionStack, true)) {
// Not a circular reference.
//Only add to text if there was any Text to begin with, else the count of texts and TJ/Tj commands dont match and the last Texts will be ignored
$newText = $xobject->getText($page);
if($newText === ' ') {
break;
}
$text[] = $newText;
}
I didnt create a PR because i wasnt 100% sure if this is the correct fix, or just a dirty workaround. But maybe this can help someone with the problem.
I found an issue with the getDataTm() method in version 2.11. In some cases, the result contains text from a neighboring block instead of the block specified by the coordinates. The reason is that the PDFObject::getTextArray() method returns some text from a "Do" command at the location of certain xobjects: https://github.com/smalot/pdfparser/blob/ac8e6678b0940e4b2ccd5caadd3fb18e68093be6/src/Smalot/PdfParser/PDFObject.php#L785
Then, inside the getDataTm() method, strings from PDFObject::getTextArray() are matched with commands returned by the Page::getDataCommands() method: https://github.com/smalot/pdfparser/blob/ac8e6678b0940e4b2ccd5caadd3fb18e68093be6/src/Smalot/PdfParser/Page.php#L730 https://github.com/smalot/pdfparser/blob/ac8e6678b0940e4b2ccd5caadd3fb18e68093be6/src/Smalot/PdfParser/Page.php#L685
However, the latter does not return the "Do" command, so there are more elements in PDFObject::getTextArray() than in Page::getDataCommands(), leading to a mismatch.
Unfortunately, I cannot provide a minimal PDF example. The files I have to parse are too large, and I don't know how they were generated. In my case, commenting out
$text[] = $xobject->getText($page);
helped. Since I'm not sure what the original intent of handling "Do" was, I cannot suggest a pull request that would fix this issue.