smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0

preg_match(): compilation failed: regular expression is too large to offset 143690 #668

Closed lonelyrider44 closed 2 months ago

lonelyrider44 commented 5 months ago

This happens with v2.8.0. 2.7.0 works fine

JohnMirro commented 5 months ago

I get this issue too with v2.8.0.

k00ni commented 5 months ago

This happens with v2.8.0. 2.7.0 works fine

What? Where? Versions? I can't see where the problem is and why. Please provide information about your setup, a stack trace of the error (or at least file line) and a PDF/example string which triggers the exception.

CC @GreyWyvern might be relevant to you.

GreyWyvern commented 5 months ago

Likely the result of binary content slipping through to the formatContent() function. The regexp looks for balanced parentheses in document content (where balance is required), but on a binary stream that search can keep looping, growing the regexp until it is too large to compile.

Any chance you could post the offending PDFs, @lonelyrider44 or @JohnMirro ?
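
A minimal sketch of the suspected failure mode, assuming (as frame #2 of the stack trace later in this thread suggests) that stream bytes end up embedded in the pattern itself. `patternCompiles()` and `$fakeBinary` are hypothetical names for illustration, not PdfParser code. On PHP builds that use the default PCRE2 settings the second call is expected to report a compile failure, but the exact size limit depends on how PCRE was built.

```php
<?php
// Hypothetical helper (not part of PdfParser): reports whether PCRE can
// compile a pattern, without letting the compile warning escape.
function patternCompiles(string $pattern): bool
{
    // preg_match() returns false (and emits a warning) when compilation
    // fails; the @ operator suppresses that warning for this probe only.
    return false !== @preg_match($pattern, '');
}

// Stand-in for a binary stream mistaken for the start of a PDF string.
$fakeBinary = str_repeat("\x0E\xD2\xD5\x00", 50000); // 200,000 bytes

// Embedding the escaped bytes makes the pattern itself very large.
$pattern = '/' . preg_quote('(' . $fakeBinary, '/') . '.*?\)/s';

var_dump(patternCompiles('/\(.*?\)/s')); // small pattern
var_dump(patternCompiles($pattern));     // very large pattern
```

The point of the sketch is that the limit applies to the size of the pattern, not of the subject string, which is why only long binary streams trigger the error.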

JohnMirro commented 5 months ago

@GreyWyvern sorry, I lost those PDFs.

lonelyrider44 commented 5 months ago

InvoicesMN0052-2320230303093445.pdf

Stack trace:

[2024-01-24 21:30:54] local.ERROR: preg_match(): Compilation failed: regular expression is too large at offset 143690 {"exception":"[object] (ErrorException(code: 0): preg_match(): Compilation failed: regular expression is too large at offset 143690 at C:\\dev\\vendor\\smalot\\pdfparser\\src\\Smalot\\PdfParser\\PDFObject.php:221)
[stacktrace]
#0 C:\\dev\\vendor\\laravel\\framework\\src\\Illuminate\\Foundation\\Bootstrap\\HandleExceptions.php(270): Illuminate\\Foundation\\Bootstrap\\HandleExceptions->handleError(2, 'preg_match(): C...', 'C:\\\\dev\\\\bs\\\\city-...', 221)
#1 [internal function]: Illuminate\\Foundation\\Bootstrap\\HandleExceptions->Illuminate\\Foundation\\Bootstrap\\{closure}(2, 'preg_match(): C...', 'C:\\\\dev\\\\bs\\\\city-...', 221)
#2 C:\\dev\\vendor\\smalot\\pdfparser\\src\\Smalot\\PdfParser\\PDFObject.php(221): preg_match('/\\\\(\\\\(\\x0E\\xD2\\xD5`\\\\000\\\\0...', '\\x01\\xFF\\xFF\\xFF\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00...', Array)
#3 C:\\dev\\vendor\\smalot\\pdfparser\\src\\Smalot\\PdfParser\\PDFObject.php(345): Smalot\\PdfParser\\PDFObject->formatContent('\\x01\\xFF\\xFF\\xFF\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00...')
#4 C:\\dev\\vendor\\smalot\\pdfparser\\src\\Smalot\\PdfParser\\PDFObject.php(598): Smalot\\PdfParser\\PDFObject->getSectionsText('\\x01\\xFF\\xFF\\xFF\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00...')
#5 C:\\dev\\vendor\\smalot\\pdfparser\\src\\Smalot\\PdfParser\\PDFObject.php(577): Smalot\\PdfParser\\PDFObject->getTextArray(Object(Smalot\\PdfParser\\Page))
#6 C:\\dev\\vendor\\smalot\\pdfparser\\src\\Smalot\\PdfParser\\PDFObject.php(695): Smalot\\PdfParser\\PDFObject->getText(Object(Smalot\\PdfParser\\Page))
#7 C:\\dev\\vendor\\smalot\\pdfparser\\src\\Smalot\\PdfParser\\PDFObject.php(577): Smalot\\PdfParser\\PDFObject->getTextArray(Object(Smalot\\PdfParser\\Page))
#8 C:\\dev\\vendor\\smalot\\pdfparser\\src\\Smalot\\PdfParser\\Page.php(220): Smalot\\PdfParser\\PDFObject->getText(Object(Smalot\\PdfParser\\Page))
#9 C:\\dev\\vendor\\smalot\\pdfparser\\src\\Smalot\\PdfParser\\Document.php(438): Smalot\\PdfParser\\Page->getText()
#10 C:\\dev\\app\\Services\\MoneyOrderPdfService.php(209): Smalot\\PdfParser\\Document->getText()
#11 C:\\dev\\app\\Services\\MoneyOrderPdfService.php(186): App\\Services\\MoneyOrderPdfService->get_pdf_text('2023-02/Invoice...')
#12 C:\\dev\\app\\Services\\MoneyOrderPdfService.php(55): App\\Services\\MoneyOrderPdfService->get_pdf_data('2023-02', '2023-02/Invoice...')
#13 [internal function]: App\\Services\\MoneyOrderPdfService->App\\Services\\{closure}('2023-02/Invoice...', 0)
#14 C:\\dev\\vendor\\laravel\\framework\\src\\Illuminate\\Collections\\Arr.php(560): array_map(Object(Closure), Array, Array)
#15 C:\\dev\\vendor\\laravel\\framework\\src\\Illuminate\\Collections\\Collection.php(768): Illuminate\\Support\\Arr::map(Array, Object(Closure))
#16 C:\\dev\\app\\Services\\MoneyOrderPdfService.php(53): Illuminate\\Support\\Collection->map(Object(Closure))
#17 C:\\dev\\vendor\\laravel\\framework\\src\\Illuminate\\Collections\\Traits\\EnumeratesValues.php(235): App\\Services\\MoneyOrderPdfService->App\\Services\\{closure}(Object(Illuminate\\Support\\Collection), 0)
#18 C:\\dev\\app\\Services\\MoneyOrderPdfService.php(40): Illuminate\\Support\\Collection->each(Object(Closure))
#19 C:\\dev\\app\\Console\\Commands\\SyncPdfFilesCommand.php(50): App\\Services\\MoneyOrderPdfService->process('2023-02', '2023-02/Invoice...')
#20 C:\\dev\\vendor\\laravel\\framework\\src\\Illuminate\\Container\\BoundMethod.php(36): App\\Console\\Commands\\SyncPdfFilesCommand->handle()
#21 C:\\dev\\vendor\\laravel\\framework\\src\\Illuminate\\Container\\Util.php(41): Illuminate\\Container\\BoundMethod::Illuminate\\Container\\{closure}()
#22 C:\\dev\\vendor\\laravel\\framework\\src\\Illuminate\\Container\\BoundMethod.php(93): Illuminate\\Container\\Util::unwrapIfClosure(Object(Closure))
#23 C:\\dev\\vendor\\laravel\\framework\\src\\Illuminate\\Container\\BoundMethod.php(35): Illuminate\\Container\\BoundMethod::callBoundMethod(Object(Illuminate\\Foundation\\Application), Array, Object(Closure))
#24 C:\\dev\\vendor\\laravel\\framework\\src\\Illuminate\\Container\\Container.php(661): Illuminate\\Container\\BoundMethod::call(Object(Illuminate\\Foundation\\Application), Array, Array, NULL)
#25 C:\\dev\\vendor\\laravel\\framework\\src\\Illuminate\\Console\\Command.php(183): Illuminate\\Container\\Container->call(Array)
#26 C:\\dev\\vendor\\symfony\\console\\Command\\Command.php(326): Illuminate\\Console\\Command->execute(Object(Symfony\\Component\\Console\\Input\\ArgvInput), Object(Illuminate\\Console\\OutputStyle))
#27 C:\\dev\\vendor\\laravel\\framework\\src\\Illuminate\\Console\\Command.php(152): Symfony\\Component\\Console\\Command\\Command->run(Object(Symfony\\Component\\Console\\Input\\ArgvInput), Object(Illuminate\\Console\\OutputStyle))
#28 C:\\dev\\vendor\\symfony\\console\\Application.php(1078): Illuminate\\Console\\Command->run(Object(Symfony\\Component\\Console\\Input\\ArgvInput), Object(Symfony\\Component\\Console\\Output\\ConsoleOutput))
#29 C:\\dev\\vendor\\symfony\\console\\Application.php(324): Symfony\\Component\\Console\\Application->doRunCommand(Object(App\\Console\\Commands\\SyncPdfFilesCommand), Object(Symfony\\Component\\Console\\Input\\ArgvInput), Object(Symfony\\Component\\Console\\Output\\ConsoleOutput))
#30 C:\\dev\\vendor\\symfony\\console\\Application.php(175): Symfony\\Component\\Console\\Application->doRun(Object(Symfony\\Component\\Console\\Input\\ArgvInput), Object(Symfony\\Component\\Console\\Output\\ConsoleOutput))
#31 C:\\dev\\vendor\\laravel\\framework\\src\\Illuminate\\Console\\Application.php(102): Symfony\\Component\\Console\\Application->run(Object(Symfony\\Component\\Console\\Input\\ArgvInput), Object(Symfony\\Component\\Console\\Output\\ConsoleOutput))
#32 C:\\dev\\vendor\\laravel\\framework\\src\\Illuminate\\Foundation\\Console\\Kernel.php(155): Illuminate\\Console\\Application->run(Object(Symfony\\Component\\Console\\Input\\ArgvInput), Object(Symfony\\Component\\Console\\Output\\ConsoleOutput))
#33 C:\\dev\\artisan(35): Illuminate\\Foundation\\Console\\Kernel->handle(Object(Symfony\\Component\\Console\\Input\\ArgvInput), Object(Symfony\\Component\\Console\\Output\\ConsoleOutput))
#34 {main}
"}

cxammar commented 5 months ago

Yup, exactly the same issue here when uploading payrolls as PDFs. Unfortunately, I cannot provide an example PDF as it contains personal data.

GreyWyvern commented 5 months ago

InvoicesMN0052-2320230303093445.pdf

Thanks for this!

So yes, this is a matter of binary content (an image, or something else) being passed to formatContent(). formatContent() does check for binary content, but only after the problematic regexp. Unfortunately it has to be in this order: strings may legitimately contain bytes that the binary check would flag, so the check can only run once the strings have been taken out. This is the whole reason why strings are removed, then put back.

Either there needs to be a more thorough check for binary before formatContent() is called, or we need an alternate way to detect if a content stream is binary before the regexp is called.

One way may be to bulk remove everything from the content stream from the first parenthesis (start of the first supposed string) to the end of the stream, which would require only a minimal regexp. Everything left would have to be plain text abiding by a predictable regexp to show validity. If any binary bytes exist in what's left, the stream could safely be discarded.

If anyone has any other ideas, let me know.

GreyWyvern commented 5 months ago

One way may be to bulk remove everything from the content stream from the first parenthesis (start of the first supposed string) to the end of the stream, which would require only a minimal regexp. Everything left would have to be plain text abiding by a predictable regexp to show validity. If any binary bytes exist in what's left, the stream could safely be discarded.

Something like this, I think, would cover all the bases:

        $testBinary = preg_replace('/\(.*$/s', '', $content);
        if (!preg_match('/^[a-zA-Z0-9 \r\n\/\[\].-]*$/', $testBinary)) {
            return '';
        }

This code resolves the error for this file, but other files may bring up other characters that need to be included in the regexp.
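
As a rough illustration of how this check behaves, the sketch below wraps it in a function and runs it on three made-up inputs. `passesFirstWhitelist()` and the sample streams are assumptions for demonstration, not PdfParser code or data from the attached PDF.

```php
<?php
// Hypothetical wrapper around the check proposed above; the function name
// is an assumption and not part of PdfParser.
function passesFirstWhitelist(string $content): bool
{
    // Drop everything from the first "(" to the end of the stream.
    $testBinary = preg_replace('/\(.*$/s', '', $content);

    // What is left must consist only of the whitelisted characters.
    return 1 === preg_match('/^[a-zA-Z0-9 \r\n\/\[\].-]*$/', $testBinary);
}

$text   = "BT\n/F1 12 Tf\n(Hello (nested) world) Tj\nET";
$binary = "\x01\xFF\xFF\xFF\x00\x00\x00\x00";
$hex    = "BT\n/F1 12 Tf\n<0041> Tj\nET"; // hex string, no parenthesis

var_dump(passesFirstWhitelist($text));   // bool(true)
var_dump(passesFirstWhitelist($binary)); // bool(false)
var_dump(passesFirstWhitelist($hex));    // bool(false): "<" and ">" are not whitelisted
```

The third case is the kind of false rejection the caveat above refers to: the stream is plain text, but it uses characters the whitelist does not cover.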

k00ni commented 5 months ago

This code resolves the error for this file, but other files may bring up other characters that need to be included in the regexp.

Does it make sense to establish a switch using our Config class to enable/disable this check? It could be extended so that the developer can not only enable this check but also provide a custom regex pattern (which overrides the standard one).

WDYT?

GreyWyvern commented 5 months ago

Does it make sense to establish a switch using our Config class to enable/disable this check?

No, I don't think so. Right now formatContent() succeeds on probably all but 0.5% (totally a guess!) of PDFs that for some reason send binary content to it. There are already PDFs in PdfParser's test suite that do this, but their binary streams aren't long enough to trigger the "regular expression is too large" error. AFAIK there is no benefit to allowing any binary through since it will just be a case of GIGO.

If anything, we should probably try to figure out why $this->content is being filled with binary in the first place and nip that in the bud before it's passed to the functions in PDFObject. PDFObject handles nothing but text, so the underlying problem is that a different part of the code is creating a PDFObject out of a binary chunk in error.

It could be extended so that the developer can not only enable this check but also provides a custom regex-pattern (which overrides the standard one).

The PDF Reference determines what is allowed in a valid document stream, so I also don't think it makes sense to allow a custom regexp here. There will be a definitive regexp that only allows characters from a valid stream as defined by the PDF Reference.

GreyWyvern commented 5 months ago

This code resolves the error for this file, but other files may bring up other characters that need to be included in the regexp.

Here's the updated regexp that passes all test PDFs in the test suite:

        $testBinary = preg_replace('/\(.*$/s', '', $content);
        if (!preg_match('/^[a-zA-Z0-9 \r\n\/*#<>\[\].\'"_-]*$/', $testBinary)) {
            return '';
        }

Added characters: *, #, <, >, _, ', and ". And with this, the second test for binary content later in the function can be removed.

I think the only other issue might be additional allowed characters in a /Name command. I'll review that in a bit.

oeholmen commented 4 months ago

Here is a pdf that fails when I test. https://seniorpolitikk.no/wp-content/uploads/2022/11/Rapport-yrkesaktiv-befolkning-2022_Endeleg.pdf

GreyWyvern commented 4 months ago

Here is a pdf that fails when I test.

This file works with the $testBinary and regexp code above inserted.

However, after reviewing the PDF Reference regarding /Name Objects, it is possible for them to be UTF-8 and contain lots of characters outside of the regexp whitelist above. So probably a better test would be for a valid UTF-8 string, like so:

$testBinary = preg_replace('/\(.*$/s', '', $content);
if (!mb_check_encoding($testBinary, 'UTF-8')) {
    return '';
}

Both sample documents in this thread work with the code above added, and it passes all unit tests. I think this is probably ready for a PR.
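
The difference between the whitelist and the UTF-8 test can be sketched as follows. The helper names and the sample stream are hypothetical, and the example assumes a name object written with raw UTF-8 bytes; it also needs the mbstring extension.

```php
<?php
// Hypothetical helper names; only the two checks come from this thread.
function prefixBeforeFirstString(string $content): string
{
    // Keep only what precedes the first "(" in the stream.
    return preg_replace('/\(.*$/s', '', $content);
}

function passesWhitelist(string $content): bool
{
    $prefix = prefixBeforeFirstString($content);

    return 1 === preg_match('/^[a-zA-Z0-9 \r\n\/*#<>\[\].\'"_-]*$/', $prefix);
}

function passesUtf8Check(string $content): bool
{
    return mb_check_encoding(prefixBeforeFirstString($content), 'UTF-8');
}

// Made-up stream whose first operand is a name containing "é" (bytes C3 A9).
$stream = "/Caf\xC3\xA9 gs\nBT\n(Hello) Tj\nET";

var_dump(passesWhitelist($stream)); // bool(false)
var_dump(passesUtf8Check($stream)); // bool(true)
var_dump(passesUtf8Check("\x01\xFF\xFF\xFF\x00")); // bool(false): 0xFF never occurs in UTF-8
```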

@lonelyrider44 and @oeholmen, can either of you allow us to use one of your test documents in the PdfParser test suite?

lonelyrider44 commented 4 months ago

Yes, sure, you can add the PDF I gave you to your test suite.

oeholmen commented 4 months ago

It's a public pdf, so it should be ok to use for your test suite.

kolaente commented 3 months ago

This is still a problem in 2.9.0

Edit: Here's a pdf with which I was able to reproduce it reliably

k00ni commented 3 months ago

Which PHP version(s) did you use?

kolaente commented 3 months ago

@k00ni I tested this on PHP 8.3, specifically my kolaente/laravel:8.3-octane-frankenphp docker image.

GreyWyvern commented 3 months ago

Yes, some binary content in the PDF is passing the mb_check_encoding(..., 'UTF-8') check. :| I'll have to upgrade it to a more strict check.

This regexp comes from the W3C and returns 0 (no match, which is loosely equal to false) where mb_check_encoding() returns true for the same string.

$utf8Check = preg_match('/^(?:
      [\x09\x0A\x0D\x20-\x7E]            # ASCII
    | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
    | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
    | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
    | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
    | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
    | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
)*$/xs', preg_replace('/\(.*$/s', '', $content));

if (false == $utf8Check) {
    return '';
}

Reference: https://www.w3.org/International/questions/qa-forms-utf-8.en
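
One way to see why the two checks disagree: control characters such as NUL are valid UTF-8 code points, so mb_check_encoding() accepts them, while the W3C pattern limits single-byte characters to tab, LF, CR and printable ASCII. The sketch below uses a hypothetical wrapper name and a made-up byte string; it needs the mbstring extension.

```php
<?php
// Hypothetical wrapper name; the pattern is the W3C one quoted above.
function passesStrictUtf8(string $bytes): bool
{
    $result = preg_match('/^(?:
          [\x09\x0A\x0D\x20-\x7E]            # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )*$/xs', $bytes);

    // 1 = match, 0 = no match, false = PCRE error; only 1 counts as text.
    return 1 === $result;
}

$control = "\x01\x00\x00\x00\x02"; // control characters only

var_dump(mb_check_encoding($control, 'UTF-8')); // bool(true)
var_dump(passesStrictUtf8($control));           // bool(false)
var_dump(passesStrictUtf8("BT\n/F1 12 Tf\n"));  // bool(true)
```

Comparing against 1 also treats a PCRE runtime error (preg_match() returning false) as "not text", which matches the effect of the loose `false ==` comparison above.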

huihuangjiuai commented 2 months ago

1713749804_6625bf2ca6d328b88eb65fba29e6f485153b20ca029192169517132a844ec.pdf

PHP 8.0, pdfparser 2.9.0, exception message: "preg_match(): Compilation failed: regular expression is too large at offset 39738"