smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0

HorizontalOffset is not supported anymore #736

Open luigif opened 1 month ago

luigif commented 1 month ago

The config property HorizontalOffset, which was useful for dealing with formatting issues (https://github.com/smalot/pdfparser/blob/v2.11.0/doc/CustomConfig.md), is no longer read. It can still be set as described in the docs, but setting it has no effect.

The last version that read and used its value was 2.7.0; every later version ignores the setting.
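
For reference, the setting is applied roughly as follows, going by the linked CustomConfig.md. This is only a sketch: the file name `document.pdf` is a placeholder, the tab value is just one example, and it assumes the library is installed through Composer.

```php
<?php
require 'vendor/autoload.php';

$config = new \Smalot\PdfParser\Config();
// Example value only: a tab as the horizontal offset, which could
// help keep table columns apart in versions that honor the setting.
$config->setHorizontalOffset("\t");

$parser = new \Smalot\PdfParser\Parser([], $config);
$pdf = $parser->parseFile('document.pdf'); // placeholder path
echo $pdf->getText();
```

Comparing the output of this script between 2.7.0 and a later version, with two different offset values, is one way to confirm whether the setting is being ignored.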

GreyWyvern commented 1 month ago

Yeah, my rewrite of the document stream parsing code dropped this config variable off the table. The unit tests only check that the config returns the value properly; they don't test its effect on actual document text, so my changes sailed through without errors.

One place where this config value definitely could be inserted back is in Font.php near the bottom of the decodeText() function:

// Cut down on the number of unnecessary internal spaces by
// imploding the string on the null byte, and checking if the
// text includes extra spaces on either side. If so, merge
// where appropriate.
$words = implode("\x00\x00", $words);
$hOffset = $this->config->getHorizontalOffset();
$words = str_replace(
    [" \x00\x00 ", "\x00\x00 ", " \x00\x00", "\x00\x00"],
    [' '.$hOffset.' ', $hOffset.' ', ' '.$hOffset, $hOffset],
    $words
);

... but this is probably not going to affect as many places in the generated text as the previous algorithm did. If you can check whether inserting this code solves your particular issue @luigif, we could add this back in as at least a partial fix.

Note: I'm not sure the above is the final fix; I'll have to run it on more test documents.
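
To see what the replacement above does in isolation, here is a minimal sketch in plain PHP. The function name `mergeWithOffset` is hypothetical and not part of pdfparser; it wraps the same `implode()` and `str_replace()` calls, with the offset passed as an argument instead of being read from the config.

```php
<?php

// Hypothetical stand-alone helper mirroring the snippet above.
// $words are the decoded text fragments, $hOffset the configured offset.
function mergeWithOffset(array $words, string $hOffset): string
{
    // Join fragments on a double null byte so the joins can be found again.
    $joined = implode("\x00\x00", $words);

    // Replace each join with the offset, keeping any space that already
    // sits on either side of it. Patterns are tried in the order listed.
    return str_replace(
        [" \x00\x00 ", "\x00\x00 ", " \x00\x00", "\x00\x00"],
        [' '.$hOffset.' ', $hOffset.' ', ' '.$hOffset, $hOffset],
        $joined
    );
}

// A bare join (no surrounding spaces) becomes the offset string itself.
echo json_encode(mergeWithOffset(['Hello', 'World'], "\t")), PHP_EOL; // "Hello\tWorld"

// An empty offset glues the fragments together.
echo json_encode(mergeWithOffset(['Hello', 'World'], '')), PHP_EOL; // "HelloWorld"

// Spaces already present around the join are kept.
echo json_encode(mergeWithOffset(['Hello ', ' World'], '')), PHP_EOL; // "Hello  World"
```

In this sketch the offset is only applied at the joins between the `$words` fragments, so text that never passes through this merge step is left untouched.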

luigif commented 1 month ago

The patch in Font.php does not solve my problem. With previous library versions I was able to fix issues in tables with some HorizontalOffset tweaking.

If you need an example PDF, you can check the tables in the following document: https://www.figc-sardegna.it/wp-content/plugins/download-attachments/includes/download.php?id=19995

In the converted text, spaces are added or removed seemingly at random, breaking the table formatting.

If you have any idea where to look or which parameters are relevant to the issue, I can do more tests.