Open UnnitMetaliya opened 1 month ago
Ultimately this is happening because the sample document has Font bfrange
values such as:
<037F> <0389> [<FEDFFEE0FBAB> <FE80> <0622> <FE82> <0623> <FE84> <0624> <FE86> <0625> <FE88> <0626>]
... whereas PdfParser only extracts rows that match a regexp targeting the following format:
<037C> <037D> <FD3E>
I think the solution is to treat rows that have square brackets as a direct string replacement rather than a numerical offset? But I am entirely unsure right now and will need to read up on the correct behaviour here.
Edit: If I alter the regexp in Font.php on line 220:
$regexp = '/<(?P<from>[0-9A-F]+)> *<(?P<to>[0-9A-F]+)> *<(?P<offset>[0-9A-F]+)>?[ \r\n]+/is';
... like so:
$regexp = '/<(?P<from>[0-9A-F]+)> *<(?P<to>[0-9A-F]+)> *<(?P<offset>[0-9A-F]+)>? *[\r\n]+/is';
... this makes PdfParser ignore all bfrange
lines that have square brackets and the document text will be output successfully. However, the content of the missing rows not being parsed probably means the document output text is missing some content.
@k00ni any idea how such a small thing like [
can cause memory leak issue?
any idea how such a small thing like [ can cause memory leak issue?
No.
@k00ni any idea how such a small thing like
[
can cause memory leak issue?
It's giving the wrong numerical start and end values to a for
loop and it doesn't end. Simple as that.
It's not the [
specifically, it's that the regexp is matching values inside the square brackets and using character values incorrectly as offsets.
Note that the "fix" I provided in my previous post does NOT completely solve this issue; it should not be used as a workaround as it will cause errors in other files in the test suite.
I noticed, after my posts, that the current bfrange
section in Font.php does account for the <0000> <0000> [<0000>]
line format, but does so after it searches for regular <0000> <0000> <0000>
style. The solution is to move the square-bracket format loop first, and then delete those lines from the section before allowing the regular-format loop to go at it. In this way, all items in the font table are properly filled and there is no infinite loop.
The proper fix is below, replacing this entire bfrange
section:
// Support for multiple bfrange sections
if (preg_match_all('/beginbfrange(?P<sections>.*?)endbfrange/s', $content, $matches)) {
foreach ($matches['sections'] as $section) {
// Support for : <srcCode1> <srcCodeN> [<dstString1> <dstString2> ... <dstStringN>]
// Some PDF file has 2-byte Unicode values on new lines > added \r\n
$regexp = '/<(?P<from>[0-9A-F]+)> *<(?P<to>[0-9A-F]+)> *\[(?P<strings>[\r\n<>0-9A-F ]+)\][ \r\n]+/is';
preg_match_all($regexp, $section, $matches);
foreach ($matches['from'] as $key => $from) {
$char_from = hexdec($from);
$strings = [];
preg_match_all('/<(?P<string>[0-9A-F]+)> */is', $matches['strings'][$key], $strings);
foreach ($strings['string'] as $position => $string) {
$parts = preg_split(
'/([0-9A-F]{4})/i',
$string,
0,
\PREG_SPLIT_NO_EMPTY | \PREG_SPLIT_DELIM_CAPTURE
);
$text = '';
foreach ($parts as $part) {
$text .= self::uchr(hexdec($part));
}
$this->table[$char_from + $position] = $text;
}
// Remove these found matches from the bfrange section
// This prevents the regexp below from finding false matches
$section = str_replace($matches[0][$key], '', $section);
}
// Support for : <srcCode1> <srcCode2> <dstString>
$regexp = '/<(?P<from>[0-9A-F]+)> *<(?P<to>[0-9A-F]+)> *<(?P<offset>[0-9A-F]+)>[ \r\n]+/is';
preg_match_all($regexp, $section, $matches);
foreach ($matches['from'] as $key => $from) {
$char_from = hexdec($from);
$char_to = hexdec($matches['to'][$key]);
$offset = hexdec($matches['offset'][$key]);
for ($char = $char_from; $char <= $char_to; ++$char) {
$this->table[$char] = self::uchr($char - $char_from + $offset);
}
}
}
}
The example PDF is a bit too large (466kB) to include in the test-suite I think. @UnnitMetaliya if you remember how you created this file (if you in fact created it) and are able to make a smaller version we can include in the test suite, would you be willing to do that for us?
Otherwise, may we include the file you posted directly in the test suite? It's too complex of an issue to replicate with a mock document. Correctly declared font-tables are required.
@GreyWyvern I did not create the test file but that pdf is publicly available. So, you can use it for sure.
@k00ni do you think the fix @GreyWyvern can go in as permanent fix, if it can be considered as one?
@k00ni do you think the fix @GreyWyvern can go in as permanent fix, if it can be considered as one?
@GreyWyvern thank you for taking the time here.
His changes look reasonable at first glance. I suggest one creates a PR and we see how it goes. My time is currently very limited but I try to help when I can.
I suggest one creates a PR and we see how it goes. My time is currently very limited but I try to help when I can.
I'm on vacation starting tomorrow and will be back on Monday next week. I can create the PR then.
@GreyWyvern I can create now if you want?
@GreyWyvern I can create now if you want?
You can do that, sure. Thanks! But I won't be around to review it until next week. :)
Are you working on a PR for this, @UnnitMetaliya? I don't want to step on any toes.
@GreyWyvern yes. I should be able to put it out there soon. Just busy with work.
PHP 8.3.10 (cli)
smalot/pdfparser": "^2.11"
Description:
PDF input
unparseble.pdf
Expected output & actual output
Getting error:
PHP Fatal error: Allowed memory size of 3221225472 bytes exhausted (tried to allocate 1342177280 bytes) in ../smalot/pdfparser/src/Smalot/PdfParser/Font.php on line 150
Symfony\Component\ErrorHandler\Error\FatalError
`Allowed memory size of 3221225472 bytes exhausted (tried to allocate 1342177280 bytes)
at vendor/smalot/pdfparser/src/Smalot/PdfParser/Font.php:150 146▕ 147▕ if (!isset(self::$uchrCache[$code])) { 148▕ // html_entity_decode() will not work with UTF-16 or UTF-32 char entities, 149▕ // therefore, we use mb_convert_encoding() instead ➜ 150▕ self::$uchrCache[$code] = mb_convert_encoding("&#{$code};", 'UTF-8', 'HTML-ENTITIES'); 151▕ } 152▕ 153▕ return self::$uchrCache[$code]; 154▕ }
`Whoops\Exception\ErrorException
Allowed memory size of 3221225472 bytes exhausted (tried to allocate 1342177280 bytes)
at vendor/smalot/pdfparser/src/Smalot/PdfParser/Font.php:150 146▕ 147▕ if (!isset(self::$uchrCache[$code])) { 148▕ // html_entity_decode() will not work with UTF-16 or UTF-32 char entities, 149▕ // therefore, we use mb_convert_encoding() instead ➜ 150▕ self::$uchrCache[$code] = mb_convert_encoding("&#{$code};", 'UTF-8', 'HTML-ENTITIES'); 151▕ } 152▕ 153▕ return self::$uchrCache[$code]; 154▕ }
2 [internal]:0 Whoops\Run::handleShutdown()
Code
$this->pdfParser = new Parser();
$pdf = $this->pdfParser->parseFile($file);