smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0
2.3k stars, 534 forks

preg_match(): Compilation failed: regular expression is too large at offset 38605 #709

Closed huihuangjiuai closed 4 weeks ago

huihuangjiuai commented 1 month ago

pdfparser version: 2.10.0

I have about 600,000 PDF files, all of which go through pdfparser for text extraction. This kind of problem was reported as solved in #704 and is probably related to Chinese character encoding, but it has now appeared again. Please help resolve it, thank you.

1710747436_65f7ef2ccfac97cc01c0803eda73f23a732316a07e2ab5f2c43ec0e162ac4.pdf

1710812373_65f8ecd57a825b030437578010bf1e1aa0ae31669f11f5fae857a58f22bbf.pdf

1710898905_65fa3ed9d61a7e6d996a0d361fbfe8efa9620742d2e0560f19c51edefd6f2.pdf

1710124871_65ee6f473b7eddbd7437d2803d8f95dff1747721dc88d4035d46343d9766e.pdf

GreyWyvern commented 1 month ago

Will we ever be rid of this one? 😆 😭

This one is caused by a specific character sequence in strings, where an escaped backslash sits immediately before an escaped parenthesis: (Sample \\\(string) The script only checks two characters behind the parenthesis, so it sees what looks like an escaped backslash and concludes the parenthesis is "real". It should be checking more characters. That way it would find out that both the backslash and the parenthesis are escaped and shouldn't be counted.

Should be a simple fix, and something I should have done when accounting for pretty much this same issue in the Inline Image replacement area.
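
To illustrate the lookbehind problem described above, here is a minimal sketch in Python (not the library's PHP code, and not the actual fix). The function names `naive_is_escaped` and `is_escaped` are hypothetical and exist only for this example. The idea it assumes is that a character is escaped exactly when it is preceded by an odd number of consecutive backslashes, so the check has to count the whole run rather than look back a fixed two characters.

```python
def naive_is_escaped(text: str, index: int) -> bool:
    """Two-character lookbehind (hypothetical stand-in for the buggy check).

    Treats the character as escaped only if it is preceded by a backslash
    that is not itself preceded by another backslash.
    """
    if index < 1 or text[index - 1] != "\\":
        return False
    return index < 2 or text[index - 2] != "\\"


def is_escaped(text: str, index: int) -> bool:
    """Count the full run of backslashes immediately before `index`.

    An odd count means the last backslash escapes the character at `index`;
    an even count means the backslashes pair up and escape each other.
    """
    count = 0
    pos = index - 1
    while pos >= 0 and text[pos] == "\\":
        count += 1
        pos -= 1
    return count % 2 == 1


# The sample from the comment: an escaped backslash, then an escaped parenthesis.
sample = r"(Sample \\\(string)"
inner_paren = sample.index("(", 1)  # position of the parenthesis after the backslashes

print(naive_is_escaped(sample, inner_paren))  # two-character lookbehind
print(is_escaped(sample, inner_paren))        # full backslash-run count
```

With three backslashes before the parenthesis, the two-character check sees a pair of backslashes and reports the parenthesis as unescaped, while the run-counting check reports it as escaped, which matches the behavior described in the comment.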