smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0

Possibility to improve parsing performance by not using uniqid() and adding an early return? #712

Open: bernemann opened this issue 1 month ago

bernemann commented 1 month ago

I parse PDFs where strings are encoded in the bracket notation like

[(Xxxx)1.7(a)1.2(tt )-10.1(\374)1.1(be)1.4(r )-8.5(Xxxx)-7.8(xx)-9.5(xxx)2.2(xxx)-1.1(x)-9(xxx)2( )]TJ

I have an idea for improving parsing performance; in my example it reduced parsing time from 1.65 seconds to 0.54 seconds:

First, I observed a significant performance increase when calling Page::getTextArray() on documents with strings like this, provided PDFObject::formatContent() uses a simpler placeholder such as

$id = "S_$i"; // where $i is a simple counting integer

instead of

$id = uniqid('STRING_', true);

The reason is the numerous subsequent calls to str_replace() that swap the placeholders back for the original content. The complex placeholder generated by uniqid() seems to slow down those calls.
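
Roughly, the round-trip looks like this (a simplified sketch to show where the time goes, not the actual PDFObject code):

<?php
// Simplified sketch of the placeholder round-trip, not the real formatContent().
$content = '[(Xxxx)1.7(a)1.2(tt )-10.1(\374)1.1(be)1.4(r )]TJ';

$pdfstrings = [];
// Swap every (...) string for a placeholder while other processing runs.
$content = preg_replace_callback('/\((?:[^()\\\\]|\\\\.)*\)/s', function ($m) use (&$pdfstrings) {
    $id = uniqid('STRING_', true);   // long, complex placeholder
    $pdfstrings[$id] = $m[0];
    return '@@@'.$id.'@@@';
}, $content);

// ... other processing on $content ...

// Restore: one str_replace() scan over the whole content per placeholder.
// With many strings per page, these scans are what add up.
foreach ($pdfstrings as $id => $text) {
    $content = str_replace('@@@'.$id.'@@@', $text, $content);
}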

Is there any reason for using this complex placeholder or can it be replaced by a much simpler one?

Second, I noticed two (more or less) subsequent calls to PDFObject->getTextArray($this) in Page::getTextArray(), where the first one might return immediately:

https://github.com/smalot/pdfparser/blob/a19d5550531796aa1d2d351a00066e422c848123/src/Smalot/PdfParser/Page.php#L346-L350

Is there a reason for not returning in the try case and instead returning the result of the same call later

https://github.com/smalot/pdfparser/blob/a19d5550531796aa1d2d351a00066e422c848123/src/Smalot/PdfParser/Page.php#L365

?
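
What I mean is roughly this pattern (just a sketch of the idea, not the exact method body; $contents stands for whatever object the first call is made on):

try {
    // If the first call succeeds, its result could be returned immediately ...
    return $contents->getTextArray($this);
} catch (\Throwable $e) {
    // ... and the fallback below would only run when it throws.
}

// fallback: the same call again, later in the method
return $this->getContents()->getTextArray($this);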

Thank you for your time!

k00ni commented 1 month ago

Sounds reasonable and interesting, so please create a pull request and let's discuss there.

Is there any reason for using this complex placeholder or can it be replaced by a much simpler one?

Without a deeper look into the code, it doesn't seem necessary to use an expensive function such as uniqid() here. Instead, an incremented number seems to be enough.
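
A quick way to sanity-check the generation cost alone would be something like this (rough micro-benchmark, numbers differ per machine; the restore step is a separate question):

<?php
$n = 100000;

$t = microtime(true);
for ($i = 0; $i < $n; $i++) {
    $id = uniqid('STRING_', true);
}
printf("uniqid():  %.3f s\n", microtime(true) - $t);

$t = microtime(true);
for ($i = 0; $i < $n; $i++) {
    $id = 'S_'.$i;
}
printf("counter:   %.3f s\n", microtime(true) - $t);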

Second, I recognized 2 (more or less) subsequent calls to ...

Because I don't know where we go with the first point, please use a separate PR for the second one. At least for now.

GreyWyvern commented 1 month ago

The reason for the Unique ID is that the simpler the replacement string, the higher the chance it may appear elsewhere in the document and cause the wrong text to be restored later. For example, @@@S_2@@@ has a higher chance of appearing as regular document stream text than @@@STRING_b64b8a7ad6d4c@@@ etc. Maybe not all that much higher, but still higher.

That being said, if the uniqid() calls can be replaced with a system that uses less CPU, but still keeps the complexity of the replacement string, I'm all for that.

Edit: Wait, you're saying the str_replace() call later is what's causing the slowdown, because the search string is so long? Not uniqid()?

bernemann commented 1 month ago

The reason for the Unique ID is that the simpler the replacement string, the higher the chance it may appear elsewhere in the document and cause the wrong text to be restored later. For example, @@@S_2@@@ has a higher chance of appearing as regular document stream text than @@@STRING_b64b8a7ad6d4c@@@ etc. Maybe not all that much higher, but still higher.

That being said, if the uniqid() calls can be replaced with a system that uses less CPU, but still keeps the complexity of the replacement string, I'm all for that.

_Edit: Wait, you're saying the str_replace() call later is what's causing the slowdown, because the search string is so long? Not uniqid()?_

Exactly, str_replace() is taking up to 75% of the parsing time in my profiling example because of the complexity of the search pattern.

bernemann commented 1 month ago

Understood the increased likelihood of unintended text replacements. What do you think about using preg_replace_callback() like

$content = preg_replace_callback("/@@@STRING_.*?@@@/",
    fn ($match) => $pdfstrings[substr($match[0], 3, -3)],
    $content
);

paired with more_entropy = false to slightly reduce the placeholder's complexity in

$id = uniqid('STRING_', false);

Seems to result in comparable performance gains.
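
For completeness, the same single-pass restore could cover the other placeholder types too. Untested sketch, assuming $pdfstrings, $dictionaries and $images are the arrays that map placeholder ids to the original text:

// One pass over $content instead of one str_replace() per placeholder.
$replacements = $pdfstrings + $dictionaries + $images;

$content = preg_replace_callback(
    '/@@@(?:STRING|DICT|IMAGE)_[^@]+@@@/',
    fn ($match) => $replacements[substr($match[0], 3, -3)] ?? $match[0],
    $content
);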

k00ni commented 1 month ago

uniqid is only used in 3 places (all inside PDFObject).

$id = uniqid('IMAGE_', true);
$id = uniqid('STRING_', true);
$dictid = uniqid('DICT_', true);

The following code should produce a random-enough string with less CPU usage (not benchmarked though!):

$hash = hash('md2', rand());
$id = 'IMAGE_'.substr($hash, 0, 7);

md2 is an old hash function with a couple of disadvantages (e.g. collisions), but it should be "secure" enough for our use case.

The code could be even simpler if an incremented integer value is used instead of rand(). If the counter value is already unique, rand() only adds further CPU usage without more uniqueness, doesn't it?

So we would end up with something like:

private function formatContent(?string $content): string
{
    $elementId = 0;

    // ...

    $hash = hash('md2', ++$elementId);
    // maybe length (5) can be even shorter?
    $id = 'IMAGE_'.substr($hash, 0, 5);

    // ...

}

Btw. uniqid has its problems too, as described here: https://www.php.net/manual/de/function.uniqid.php#120123

<?php
for($i=0;$i<20;$i++) {
    echo uniqid(), PHP_EOL;
}
?>

Output:

5819f3ad1c0ce
5819f3ad1c0ce
5819f3ad1c0ce
5819f3ad1c0ce
5819f3ad1c0ce
5819f3ad1c0ce
...

Besides replacing uniqid, would it help to add a suffix, so the hash can be shorter?

$hash = hash('md2', rand());
$id = 'IMAGE_'.substr($hash, 0, 7).'_IMAGE_END';

What do you think about using preg_replace_callback() like ... paired with more_entropy = false to slightly reduce complexity in

I assume the reduction is because of more_entropy = false? That could be an option too if we don't find a replacement for uniqid() at all. Please explain the preg_replace_callback() part; I don't know what its benefits are here.

Chandlr commented 1 month ago

Why use a big function like a Message Digest hash when there's ELFHash, which is faster than CRC32 from what I have heard? Example:

    public function ELFHash($string, $len = null)
    {
        $hash = 0;
        $len || $len = strlen($string);
        for ($i = 0; $i < $len; $i++) {
            $hash = ($hash << 4) + ord($string[$i]);
            $x = $hash & 0xF0000000;
            if ($x != 0) {
                $hash ^= ($x >> 24);
            }
            $hash &= ~$x;
        }
        return ($hash % 0xFFFFFFFF) & 0xFFFFFFFF;
    }
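
For the placeholder it might be used like this (hypothetical sketch, assuming the ELFHash() method above; $rawString would be the matched PDF string that gets swapped out):

    // Hypothetical usage: hash the matched string to build the placeholder id.
    // Note that collisions between different strings would still need handling.
    $id = 'STRING_'.dechex($this->ELFHash($rawString));
    $pdfstrings[$id] = $rawString;
    $content = str_replace($rawString, '@@@'.$id.'@@@', $content);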

I'm new to PHP and I found this:

<?php
$h = hash("xxh3", $data, options: ["seed" => 42]);
echo $h, "\n";
?>

bernemann commented 1 month ago

The performance loss is not caused by calling uniqid() (or any other hash-like function) but by the subsequent replacements using str_replace() on the whole content:

https://github.com/smalot/pdfparser/blob/a19d5550531796aa1d2d351a00066e422c848123/src/Smalot/PdfParser/PDFObject.php#L384-L398

The call to str_replace('@@@'.$id.'@@@', $text, $content) inside the loop scans the whole $content once per replacement, searching for a rather complex placeholder like @@@STRING_blablablablabla.blablablablabl@@@ produced by the earlier calls to uniqid().

My first idea was to simplify the placeholder, which led to a huge performance gain in my case, with the downside of an increased likelihood of unintended replacements (@@@S_1@@@ has a much higher chance of being real PDF content than @@@STRING_blablablablabla.blablablablabl@@@).

The second idea is to slightly reduce the placeholder's complexity with more_entropy = false, combined with the above-mentioned preg_replace_callback().

I will do a few tests, compare performance and make a suggestion.
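
For the comparison I will probably use something along these lines (synthetic data, rough harness, numbers will vary):

<?php
// Build synthetic content with many placeholders, then time both restore strategies.
$pdfstrings = [];
$content = '';
for ($i = 0; $i < 5000; $i++) {
    $id = uniqid('STRING_', true);
    $pdfstrings[$id] = "(text $i)";
    $content .= '@@@'.$id.'@@@ Tj'."\n";
}

$t = microtime(true);
$restored = $content;
foreach ($pdfstrings as $id => $text) {
    $restored = str_replace('@@@'.$id.'@@@', $text, $restored);
}
printf("str_replace loop:      %.3f s\n", microtime(true) - $t);

$t = microtime(true);
$restored2 = preg_replace_callback(
    '/@@@STRING_[^@]+@@@/',
    fn ($m) => $pdfstrings[substr($m[0], 3, -3)],
    $content
);
printf("preg_replace_callback: %.3f s\n", microtime(true) - $t);

var_dump($restored === $restored2); // sanity check: both must produce the same output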

GreyWyvern commented 1 month ago

You might want to benchmark using preg_replace() with a limit of 1 instead of str_replace(), which can't be limited. The preg_replace() call would quit right after the first match, while str_replace() would continue to search through the whole document. That might save some processing?

Maybe even strtr() would be faster...
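
Something like this, if all replacement pairs are collected into one array first (untested sketch; $pdfstrings is assumed to map placeholder ids to the original text, as in the current code):

// Build one replacement map and let strtr() handle everything in a single pass.
$pairs = [];
foreach ($pdfstrings as $id => $text) {
    $pairs['@@@'.$id.'@@@'] = $text;
}
$content = strtr($content, $pairs);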