Closed lonelyrider44 closed 2 months ago
I get this issue too with v 2.8.0
This happens with v2.8.0. 2.7.0 works fine
What? Where? Versions? I can't see where the problem is and why. Please provide information about your setup, a stack trace of the error (or at least file line) and a PDF/example string which triggers the exception.
CC @GreyWyvern might be relevant to you.
Likely the result of binary content slipping through to the formatContent() function. The regexp looks for balanced parentheses in document content (where balance is required), but a binary stream can turn it into a huge, futile regexp.
Any chance you could post the offending PDFs, @lonelyrider44 or @JohnMirro ?
@GreyWyvern sorry, I lost those PDFs.
InvoicesMN0052-2320230303093445.pdf
Stack trace:
[2024-01-24 21:30:54] local.ERROR: preg_match(): Compilation failed: regular expression is too large at offset 143690 {"exception":"[object] (ErrorException(code: 0): preg_match(): Compilation failed: regular expression is too large at offset 143690 at C:\\dev\\vendor\\smalot\\pdfparser\\src\\Smalot\\PdfParser\\PDFObject.php:221)
[stacktrace]
#0 C:\\dev\\vendor\\laravel\\framework\\src\\Illuminate\\Foundation\\Bootstrap\\HandleExceptions.php(270): Illuminate\\Foundation\\Bootstrap\\HandleExceptions->handleError(2, 'preg_match(): C...', 'C:\\\\dev\\\\bs\\\\city-...', 221)
#1 [internal function]: Illuminate\\Foundation\\Bootstrap\\HandleExceptions->Illuminate\\Foundation\\Bootstrap\\{closure}(2, 'preg_match(): C...', 'C:\\\\dev\\\\bs\\\\city-...', 221)
#2 C:\\dev\\vendor\\smalot\\pdfparser\\src\\Smalot\\PdfParser\\PDFObject.php(221): preg_match('/\\\\(\\\\(\\x0E\\xD2\\xD5`\\\\000\\\\0...', '\\x01\\xFF\\xFF\\xFF\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00...', Array)
#3 C:\\dev\\vendor\\smalot\\pdfparser\\src\\Smalot\\PdfParser\\PDFObject.php(345): Smalot\\PdfParser\\PDFObject->formatContent('\\x01\\xFF\\xFF\\xFF\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00...')
#4 C:\\dev\\vendor\\smalot\\pdfparser\\src\\Smalot\\PdfParser\\PDFObject.php(598): Smalot\\PdfParser\\PDFObject->getSectionsText('\\x01\\xFF\\xFF\\xFF\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00...')
#5 C:\\dev\\vendor\\smalot\\pdfparser\\src\\Smalot\\PdfParser\\PDFObject.php(577): Smalot\\PdfParser\\PDFObject->getTextArray(Object(Smalot\\PdfParser\\Page))
#6 C:\\dev\\vendor\\smalot\\pdfparser\\src\\Smalot\\PdfParser\\PDFObject.php(695): Smalot\\PdfParser\\PDFObject->getText(Object(Smalot\\PdfParser\\Page))
#7 C:\\dev\\vendor\\smalot\\pdfparser\\src\\Smalot\\PdfParser\\PDFObject.php(577): Smalot\\PdfParser\\PDFObject->getTextArray(Object(Smalot\\PdfParser\\Page))
#8 C:\\dev\\vendor\\smalot\\pdfparser\\src\\Smalot\\PdfParser\\Page.php(220): Smalot\\PdfParser\\PDFObject->getText(Object(Smalot\\PdfParser\\Page))
#9 C:\\dev\\vendor\\smalot\\pdfparser\\src\\Smalot\\PdfParser\\Document.php(438): Smalot\\PdfParser\\Page->getText()
#10 C:\\dev\\app\\Services\\MoneyOrderPdfService.php(209): Smalot\\PdfParser\\Document->getText()
#11 C:\\dev\\app\\Services\\MoneyOrderPdfService.php(186): App\\Services\\MoneyOrderPdfService->get_pdf_text('2023-02/Invoice...')
#12 C:\\dev\\app\\Services\\MoneyOrderPdfService.php(55): App\\Services\\MoneyOrderPdfService->get_pdf_data('2023-02', '2023-02/Invoice...')
#13 [internal function]: App\\Services\\MoneyOrderPdfService->App\\Services\\{closure}('2023-02/Invoice...', 0)
#14 C:\\dev\\vendor\\laravel\\framework\\src\\Illuminate\\Collections\\Arr.php(560): array_map(Object(Closure), Array, Array)
#15 C:\\dev\\vendor\\laravel\\framework\\src\\Illuminate\\Collections\\Collection.php(768): Illuminate\\Support\\Arr::map(Array, Object(Closure))
#16 C:\\dev\\app\\Services\\MoneyOrderPdfService.php(53): Illuminate\\Support\\Collection->map(Object(Closure))
#17 C:\\dev\\vendor\\laravel\\framework\\src\\Illuminate\\Collections\\Traits\\EnumeratesValues.php(235): App\\Services\\MoneyOrderPdfService->App\\Services\\{closure}(Object(Illuminate\\Support\\Collection), 0)
#18 C:\\dev\\app\\Services\\MoneyOrderPdfService.php(40): Illuminate\\Support\\Collection->each(Object(Closure))
#19 C:\\dev\\app\\Console\\Commands\\SyncPdfFilesCommand.php(50): App\\Services\\MoneyOrderPdfService->process('2023-02', '2023-02/Invoice...')
#20 C:\\dev\\vendor\\laravel\\framework\\src\\Illuminate\\Container\\BoundMethod.php(36): App\\Console\\Commands\\SyncPdfFilesCommand->handle()
#21 C:\\dev\\vendor\\laravel\\framework\\src\\Illuminate\\Container\\Util.php(41): Illuminate\\Container\\BoundMethod::Illuminate\\Container\\{closure}()
#22 C:\\dev\\vendor\\laravel\\framework\\src\\Illuminate\\Container\\BoundMethod.php(93): Illuminate\\Container\\Util::unwrapIfClosure(Object(Closure))
#23 C:\\dev\\vendor\\laravel\\framework\\src\\Illuminate\\Container\\BoundMethod.php(35): Illuminate\\Container\\BoundMethod::callBoundMethod(Object(Illuminate\\Foundation\\Application), Array, Object(Closure))
#24 C:\\dev\\vendor\\laravel\\framework\\src\\Illuminate\\Container\\Container.php(661): Illuminate\\Container\\BoundMethod::call(Object(Illuminate\\Foundation\\Application), Array, Array, NULL)
#25 C:\\dev\\vendor\\laravel\\framework\\src\\Illuminate\\Console\\Command.php(183): Illuminate\\Container\\Container->call(Array)
#26 C:\\dev\\vendor\\symfony\\console\\Command\\Command.php(326): Illuminate\\Console\\Command->execute(Object(Symfony\\Component\\Console\\Input\\ArgvInput), Object(Illuminate\\Console\\OutputStyle))
#27 C:\\dev\\vendor\\laravel\\framework\\src\\Illuminate\\Console\\Command.php(152): Symfony\\Component\\Console\\Command\\Command->run(Object(Symfony\\Component\\Console\\Input\\ArgvInput), Object(Illuminate\\Console\\OutputStyle))
#28 C:\\dev\\vendor\\symfony\\console\\Application.php(1078): Illuminate\\Console\\Command->run(Object(Symfony\\Component\\Console\\Input\\ArgvInput), Object(Symfony\\Component\\Console\\Output\\ConsoleOutput))
#29 C:\\dev\\vendor\\symfony\\console\\Application.php(324): Symfony\\Component\\Console\\Application->doRunCommand(Object(App\\Console\\Commands\\SyncPdfFilesCommand), Object(Symfony\\Component\\Console\\Input\\ArgvInput), Object(Symfony\\Component\\Console\\Output\\ConsoleOutput))
#30 C:\\dev\\vendor\\symfony\\console\\Application.php(175): Symfony\\Component\\Console\\Application->doRun(Object(Symfony\\Component\\Console\\Input\\ArgvInput), Object(Symfony\\Component\\Console\\Output\\ConsoleOutput))
#31 C:\\dev\\vendor\\laravel\\framework\\src\\Illuminate\\Console\\Application.php(102): Symfony\\Component\\Console\\Application->run(Object(Symfony\\Component\\Console\\Input\\ArgvInput), Object(Symfony\\Component\\Console\\Output\\ConsoleOutput))
#32 C:\\dev\\vendor\\laravel\\framework\\src\\Illuminate\\Foundation\\Console\\Kernel.php(155): Illuminate\\Console\\Application->run(Object(Symfony\\Component\\Console\\Input\\ArgvInput), Object(Symfony\\Component\\Console\\Output\\ConsoleOutput))
#33 C:\\dev\\artisan(35): Illuminate\\Foundation\\Console\\Kernel->handle(Object(Symfony\\Component\\Console\\Input\\ArgvInput), Object(Symfony\\Component\\Console\\Output\\ConsoleOutput))
#34 {main}
"}
Yup, exactly the same issue here when uploading payrolls as PDF. Unfortunately I cannot provide an example PDF, as it contains personal data.
Thanks for this!
So yes, this is a matter of binary content (an image, or something else) being passed to formatContent(). formatContent() does a check for binary content, but only after the problematic regexp. Unfortunately it's required to be this way because the check for binary content may flag string content as binary, even though that string content is perfectly valid. This is the whole reason why strings are removed, then put back.
Either there needs to be a more thorough check for binary before formatContent() is called, or we need an alternate way to detect if a content stream is binary before the regexp is called.
One way may be to bulk remove everything from the content stream from the first parenthesis (start of the first supposed string) to the end of the stream, which would require only a minimal regexp. Everything left would have to be plain text abiding by a predictable regexp to show validity. If any binary bytes exist in what's left, the stream could safely be discarded.
If anyone has any other ideas, let me know.
Something like this, I think, would cover all the bases:
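// Drop everything from the first parenthesis to the end of the stream.
$testBinary = preg_replace('/\(.*$/s', '', $content);
// Whatever is left must consist only of whitelisted plain-text characters.
if (!preg_match('/^[a-zA-Z0-9 \r\n\/\[\].-]*$/', $testBinary)) {
    return '';
}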
This code resolves the error for this file, but other files may bring up other characters that need to be included in the regexp.
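For anyone following along, here is a minimal, self-contained sketch of that check so it can be tried outside the library. The function name looksLikeTextStream() and both sample strings are made up for illustration and are not part of PdfParser; the two regexps are the ones from the snippet above.

```php
<?php
// Hypothetical helper (not part of PdfParser): returns true when the part of
// a content stream before the first "(" contains only whitelisted characters.
function looksLikeTextStream(string $content): bool
{
    // Drop everything from the first parenthesis to the end of the stream.
    $testBinary = preg_replace('/\(.*$/s', '', $content);

    // Same whitelist as in the snippet above.
    return 1 === preg_match('/^[a-zA-Z0-9 \r\n\/\[\].-]*$/', $testBinary);
}

// Made-up sample inputs, for illustration only.
$text   = "BT\n/F1 12 Tf\n[(Hello) -250 (World)] TJ\nET";
$binary = "\x01\xFF\xFF\xFF\x00\x00(\x0E\xD2\xD5";

var_dump(looksLikeTextStream($text));   // bool(true)
var_dump(looksLikeTextStream($binary)); // bool(false)
```

Only the bytes before the first "(" are inspected, so the size of the stream after that point does not affect the size of the regexp.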
Does it make sense to establish a switch using our Config class to enable/disable this check? It could be extended so that the developer can not only enable this check but also provide a custom regex pattern (which overrides the standard one).
WDYT?
> Does it make sense to establish a switch using our Config class to enable/disable this check?
No, I don't think so. Right now formatContent() succeeds on probably all but 0.5% (totally a guess!) of PDFs that for some reason send binary content to it. There are already PDFs in PdfParser's test suite that do this, but their binary streams aren't long enough to trigger the "regular expression is too large" error. AFAIK there is no benefit to allowing any binary through since it will just be a case of GIGO.
If anything, we should probably try to figure out why $this->content is being filled with binary in the first place and nip that in the bud before it's passed to the functions in PDFObject. PDFObject handles nothing but text, so the underlying problem is that a different part of the code is creating a PDFObject out of a binary chunk in error.
> It could be extended so that the developer can not only enable this check but also provide a custom regex pattern (which overrides the standard one).
The PDF Reference determines what is allowed in a valid document stream, so I also don't think it makes sense to allow a custom regexp here. There will be a definitive regexp that only allows characters from a valid stream as defined by the PDF Reference.
Here's the updated regexp that passes all test PDFs in the test suite:
$testBinary = preg_replace('/\(.*$/s', '', $content);
if (!preg_match('/^[a-zA-Z0-9 \r\n\/*#<>\[\].\'"_-]*$/', $testBinary)) {
    return '';
}
Added characters: *, #, <, >, _, ', and ". And with this, the second test for binary content later in the function can be removed.
I think the only other issue might be additional allowed characters in a /Name command. I'll review that in a bit.
Here is a pdf that fails when I test. https://seniorpolitikk.no/wp-content/uploads/2022/11/Rapport-yrkesaktiv-befolkning-2022_Endeleg.pdf
This file works with the $testBinary and regexp code above inserted.
However, after reviewing the PDF Reference regarding /Name Objects, it is possible for them to be UTF-8 and contain lots of characters outside of the regexp whitelist above. So probably a better test would be for a valid UTF-8 string, like so:
$testBinary = preg_replace('/\(.*$/s', '', $content);
if (!mb_check_encoding($testBinary, 'UTF-8')) {
    return '';
}
Both sample documents in this thread work with the code above added, and all unit tests pass. I think this is probably ready for a PR.
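To make the UTF-8 variant easy to try in isolation, here is a small sketch under a few assumptions: prefixIsValidUtf8() is a hypothetical name (not a PdfParser function), the sample strings are invented, and the mbstring extension is assumed to be loaded.

```php
<?php
// Hypothetical helper (not part of PdfParser): returns true when everything
// before the first "(" in a content stream is valid UTF-8.
function prefixIsValidUtf8(string $content): bool
{
    // Drop everything from the first parenthesis to the end of the stream.
    $testBinary = preg_replace('/\(.*$/s', '', $content);

    return mb_check_encoding($testBinary, 'UTF-8');
}

// Made-up samples: a /Name containing a non-ASCII character, and raw bytes.
$utf8Name = "/Caf\u{00E9} Do\n(text) Tj";
$binary   = "\x01\xFF\xFF\xFF\x00\x00(\x0E\xD2\xD5";

var_dump(prefixIsValidUtf8($utf8Name)); // bool(true)
var_dump(prefixIsValidUtf8($binary));   // bool(false), \xFF is never valid in UTF-8
```

The first sample would be rejected by the character whitelist from earlier in the thread, which is the motivation for switching to an encoding check.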
@lonelyrider44 and @oeholmen, can either of you allow us to use one of your test documents in the PdfParser test suite?
Yes, sure, you can add the PDF I gave you to your test suite.
It's a public pdf, so it should be ok to use for your test suite.
This is still a problem in 2.9.0
Edit: Here's a pdf with which I was able to reproduce it reliably
Which PHP version(s) did you use?
@k00ni I tested this on PHP 8.3, specifically my kolaente/laravel:8.3-octane-frankenphp docker image.
Yes, some binary content in the PDF is passing the mb_check_encoding(..., 'UTF-8') check. :| I'll have to upgrade it to a more strict check.
This regexp comes from the W3C and returns false (0) where mb_check_encoding() returns true for the same string.
$utf8Check = preg_match('/^(?:
      [\x09\x0A\x0D\x20-\x7E]            # ASCII
    | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
    | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
    | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
    | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
    | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
    | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
)*$/xs', preg_replace('/\(.*$/s', '', $content));
if (false == $utf8Check) {
    return '';
}
Reference: https://www.w3.org/International/questions/qa-forms-utf-8.en
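A quick way to see the difference between the two checks is to run both on the same bytes. The sketch below wraps the W3C pattern in a hypothetical helper, isStrictUtf8Text() (a made-up name, not part of PdfParser), and uses an invented sample made of control bytes; it assumes the mbstring extension is available.

```php
<?php
// Hypothetical helper (not part of PdfParser) wrapping the W3C pattern.
function isStrictUtf8Text(string $s): bool
{
    return 1 === preg_match('/^(?:
          [\x09\x0A\x0D\x20-\x7E]            # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )*$/xs', $s);
}

// Made-up sample: control bytes are valid UTF-8 code points, but they are
// outside the ASCII range (tab, LF, CR, 0x20-0x7E) the pattern accepts.
$sample = "\x01\x00\x00\x02";

var_dump(mb_check_encoding($sample, 'UTF-8')); // bool(true)
var_dump(isStrictUtf8Text($sample));           // bool(false)
var_dump(isStrictUtf8Text("BT /F1 12 Tf ET")); // bool(true)
```

So the stricter check rejects low control bytes that mb_check_encoding() lets through, which is one plausible way binary data could slip past the earlier test.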
1713749804_6625bf2ca6d328b88eb65fba29e6f485153b20ca029192169517132a844ec.pdf
PHP 8.0, pdfparser 2.9.0, exception message: "preg_match(): Compilation failed: regular expression is too large at offset 39738"