Open bernemann opened 1 month ago
Sounds reasonable and interesting, so please create a pull request and let's discuss there.
Is there any reason for using this complex placeholder, or can it be replaced by a much simpler one?
Without a deeper look into the code, it doesn't seem necessary to use an expensive function such as uniqid() here. Instead, an incremented number seems to be enough.
Second, I recognized 2 (more or less) subsequent calls to ...
Because I don't know where we go with the first point, please use a separate PR for the second one. At least for now.
The reason for the Unique ID is that the simpler the replacement string, the higher the chance it may appear elsewhere in the document and cause the wrong text to be restored later. For example, @@@S_2@@@ has a higher chance of appearing as regular document stream text than @@@STRING_b64b8a7ad6d4c@@@ etc. Maybe not all that much higher, but still higher.
That being said, if the uniqid() calls can be replaced with a system that uses less CPU, but still keeps the complexity of the replacement string, I'm all for that.
Edit: Wait, you're saying the str_replace() call later is what's causing the slowdown, because the search string is so long? Not uniqid()?
Exactly, str_replace() is taking up to 75% of the parsing time in my profiling example because of the complexity of the search pattern.
Understood the increased likelihood of unintended text replacements. What do you think about using preg_replace_callback() like

$content = preg_replace_callback(
    "/@@@STRING_.*?@@@/",
    fn ($match) => $pdfstrings[substr($match[0], 3, -3)],
    $content
);

paired with more_entropy = false to slightly reduce complexity in

$dictid = uniqid('STRING_', false);

Seems to result in comparable performance gains.
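For reference, here is a minimal self-contained sketch of that callback-based restore. The $pdfstrings map and the sample $content are simplified assumptions for illustration, not the actual PDFObject internals:

```php
<?php
// Hypothetical placeholder map as it might be built during formatContent():
// placeholder id => original PDF string content.
$pdfstrings = [
    'STRING_1' => '(Hello)',
    'STRING_2' => '(World)',
];

$content = 'BT @@@STRING_1@@@ Tj @@@STRING_2@@@ Tj ET';

// Restore all placeholders in a single pass over $content.
$content = preg_replace_callback(
    '/@@@STRING_.*?@@@/',
    // Strip the surrounding @@@ markers to recover the map key.
    fn ($match) => $pdfstrings[substr($match[0], 3, -3)],
    $content
);

echo $content, PHP_EOL; // BT (Hello) Tj (World) Tj ET
```

The point of the callback is that the regex engine walks $content once and looks each match up in the map, instead of one full str_replace() scan per placeholder.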
uniqid is only used in 3 places (all inside PDFObject):
$id = uniqid('IMAGE_', true);
$id = uniqid('STRING_', true);
$dictid = uniqid('DICT_', true);
The following code should produce a random-enough string with less CPU usage (not benchmarked, though!):

$hash = hash('md2', (string) rand());
$id = 'IMAGE_'.substr($hash, 0, 7);

md2 is an old hash function with a couple of disadvantages (e.g. collisions), but it should be "secure" enough for our use case.
The code could be even simpler if an incremented integer value is used instead of rand(). If the value is globally unique, rand() only adds further CPU usage without more uniqueness, doesn't it?
So we would end up with something like:

private function formatContent(?string $content): string
{
    // static, so the counter keeps incrementing across calls
    // and the ids stay globally unique
    static $elementId = 0;
    // ...
    $hash = hash('md2', (string) ++$elementId);
    // maybe length (5) can be even shorter?
    $id = 'IMAGE_'.substr($hash, 0, 5);
    // ...
}
Btw. uniqid has its problems too, as described here: https://www.php.net/manual/de/function.uniqid.php#120123
<?php
for ($i = 0; $i < 20; $i++) {
    echo uniqid(), PHP_EOL;
}
?>
Output:
5819f3ad1c0ce
5819f3ad1c0ce
5819f3ad1c0ce
5819f3ad1c0ce
5819f3ad1c0ce
5819f3ad1c0ce
...
Besides replacing uniqid, would it help to add a suffix, so the hash can be shorter?
$hash = hash('md2', rand());
$id = 'IMAGE_'.substr($hash, 0, 7).'_IMAGE_END';
What do you think about using preg_replace_callback() like ... paired with more_entropy = false to slightly reduce complexity in
I assume the reduction is because of more_entropy = false? That could be an option too, if we don't find a replacement at all. Please explain the preg_replace_callback part, I don't know what its benefits are here.
Why use a big function like a Message Digest hash when there's ELFHash, which is reportedly faster than CRC32? Example:

public function ELFHash($string, $len = null)
{
    $hash = 0;
    $len || $len = strlen($string);
    for ($i = 0; $i < $len; $i++) {
        $hash = ($hash << 4) + ord($string[$i]);
        $x = $hash & 0xF0000000;
        if ($x != 0) {
            $hash ^= ($x >> 24);
        }
        $hash &= ~$x;
    }
    return ($hash % 0xFFFFFFFF) & 0xFFFFFFFF;
}
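As a usage sketch (the standalone function name, the counter, and the 8-hex-digit width are my own assumptions for illustration), the 32-bit hash could be formatted into a fixed-width placeholder id like this:

```php
<?php
// ELF hash: a fast 32-bit string hash (same algorithm as above,
// written as a standalone function with explicit 32-bit masking,
// since PHP integers are 64-bit).
function elfHash(string $string): int
{
    $hash = 0;
    $len = strlen($string);
    for ($i = 0; $i < $len; $i++) {
        $hash = (($hash << 4) + ord($string[$i])) & 0xFFFFFFFF;
        $x = $hash & 0xF0000000;
        if ($x != 0) {
            $hash ^= ($x >> 24);
        }
        $hash = $hash & ~$x & 0xFFFFFFFF;
    }
    return $hash;
}

// Hypothetical: derive a fixed-width hex id from a counter.
$counter = 42;
$id = 'IMAGE_'.sprintf('%08x', elfHash((string) $counter));
echo $id, PHP_EOL; // IMAGE_00000372
```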
I'm new to PHP and I found this (xxh3 is available since PHP 8.1):

<?php
$data = 'STRING_1'; // example input
$h = hash('xxh3', $data, options: ['seed' => 42]);
echo $h, "\n";
?>
The performance loss is not caused by calling uniqid() (or any other hash-like function) but by the subsequent replacements using str_replace() on the whole content:

The call to str_replace('@@@'.$id.'@@@', $text, $content) inside the loop scans $content once for every replacement, searching for a rather complex placeholder like @@@STRING_blablablablabla.blablablablabl@@@ produced by earlier calls to uniqid().

My first idea was to simplify the placeholder, which led to a huge performance gain in my case, with the downside of an increased likelihood of unintended replacements (@@@S_1@@@ has a much higher chance of being real PDF content than @@@STRING_blablablablabla.blablablablabl@@@).

The second idea now is to slightly reduce the complexity of the placeholder with more_entropy = false combined with the above-mentioned regular expression callback preg_replace_callback().
I will do a few tests, compare performance and make a suggestion.
You might want to benchmark using preg_replace() with a limit of 1 instead of str_replace(), which can't be limited. The preg_replace() call would quit right after the first match, while str_replace() would continue to search through the whole document. That might save some processing?
Maybe even strtr() would be faster...
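For what it's worth, strtr() with an array argument walks the subject string once, left to right, replacing the longest matching key at each position, so the whole restore step could hypothetically collapse into a single call (the map below is a made-up example, not the real PDFObject data):

```php
<?php
// Hypothetical placeholder map: search string => original text.
$map = [
    '@@@STRING_1@@@' => '(foo)',
    '@@@STRING_2@@@' => '(bar)',
];

$content = '@@@STRING_1@@@ Tj @@@STRING_2@@@ Tj';

// One pass over $content instead of one str_replace() scan per entry.
$content = strtr($content, $map);

echo $content, PHP_EOL; // (foo) Tj (bar) Tj
```

A nice side effect of the single pass is that a replacement's output is never re-scanned, so a restored PDF string can't accidentally match another placeholder.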
I parse PDFs where strings are encoded in the bracket notation like
[(Xxxx)1.7(a)1.2(tt )-10.1(\374)1.1(be)1.4(r )-8.5(Xxxx)-7.8(xx)-9.5(xxx)2.2(xxx)-1.1(x)-9(xxx)2( )]TJ
I have an idea to improve parsing performance which reduced parsing time from 1.65 seconds to 0.54 seconds in my example:

First, I observed a significant performance increase when calling Page::getTextArray() on documents with strings like this when PDFObject::formatContent() uses a simpler placeholder like $id = "S_$i"; where $i is a simple counting integer, instead of $id = uniqid('STRING_', true);.

The reason for this is the numerous later calls to str_replace() that replace the placeholders back with the original content. The complex placeholder generated by uniqid() seems to slow down these calls. Is there any reason for using this complex placeholder, or can it be replaced by a much simpler one?
Second, I recognized 2 (more or less) subsequent calls to PDFObject->getTextArray($this) in Page::getTextArray(), where the first one might immediately return: https://github.com/smalot/pdfparser/blob/a19d5550531796aa1d2d351a00066e422c848123/src/Smalot/PdfParser/Page.php#L346-L350

Is there a reason for not returning in the try case and instead returning with the same call later? https://github.com/smalot/pdfparser/blob/a19d5550531796aa1d2d351a00066e422c848123/src/Smalot/PdfParser/Page.php#L365
Thank you for your time!