smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0
2.41k stars 536 forks source link

getDataTm() provides wrong coordinates for text blocks #733

Open parpalak opened 2 months ago

parpalak commented 2 months ago

I found an issue with the getDataTm() method in version 2.11. In some cases, the result contains text from a neighboring block instead of the block specified by the coordinates. The reason is that the PDFObject::getTextArray() method returns some text from a "Do" command at the location of certain xobjects: https://github.com/smalot/pdfparser/blob/ac8e6678b0940e4b2ccd5caadd3fb18e68093be6/src/Smalot/PdfParser/PDFObject.php#L785

Then, inside the getDataTm() method, strings from PDFObject::getTextArray() are matched with commands returned by the Page::getDataCommands() method: https://github.com/smalot/pdfparser/blob/ac8e6678b0940e4b2ccd5caadd3fb18e68093be6/src/Smalot/PdfParser/Page.php#L730 https://github.com/smalot/pdfparser/blob/ac8e6678b0940e4b2ccd5caadd3fb18e68093be6/src/Smalot/PdfParser/Page.php#L685

However, the latter does not return the "Do" command, so there are more elements in PDFObject::getTextArray() than in Page::getDataCommands(), leading to a mismatch.

Unfortunately, I cannot provide a minimal PDF example. The files I have to parse are too large, and I don't know how they were generated. In my case, commenting out $text[] = $xobject->getText($page); helped. Since I'm not sure what the original intent of handling "Do" was, I cannot suggest a pull request that would fix this issue.

DominikDostal commented 2 months ago

I also had this problem, and made a workaround for myself in this if: https://github.com/smalot/pdfparser/blob/ac8e6678b0940e4b2ccd5caadd3fb18e68093be6/src/Smalot/PdfParser/PDFObject.php#L783

I changed it from

if (\is_object($xobject) && $xobject instanceof self && !\in_array($xobject->getUniqueId(), self::$recursionStack, true)) {
    // Not a circular reference.
    $text[] = $xobject->getText($page);
}

to

if (\is_object($xobject) && $xobject instanceof self && !\in_array($xobject->getUniqueId(), self::$recursionStack, true)) {
    // Not a circular reference.

    //Only add to text if there was any Text to begin with, else the count of texts and TJ/Tj commands dont match and the last Texts will be ignored
    $newText = $xobject->getText($page);
    if($newText === ' ') {
        break;
    }
    $text[] = $newText;
}

I didnt create a PR because i wasnt 100% sure if this is the correct fix, or just a dirty workaround. But maybe this can help someone with the problem.