michelf / php-markdown

Parser for Markdown and Markdown Extra derived from the original Markdown.pl by John Gruber.
http://michelf.ca/projects/php-markdown/
Other
3.42k stars 530 forks source link

Catastrophic backtracking with large inline image #399

Open herndlm opened 7 months ago

herndlm commented 7 months ago

Hi,

with both Markdown and MarkdownExtra there are issues with very long content like inline images. https://github.com/michelf/php-markdown/blob/51613168d71787b0fe8472166ccbfa8d285c02cd/Michelf/MarkdownExtra.php#L1065-L1071 and also the simpler https://github.com/michelf/php-markdown/blob/51613168d71787b0fe8472166ccbfa8d285c02cd/Michelf/Markdown.php#L903-L904 lead to "Backtrack limit exhausted" errors and empty content with the default pcre.backtrack_limit of 1000000. I was about to create a re-producer, but I basically see the same issue on regex101.com and it might be simpler to play around with there: https://regex101.com/r/BcWE82/1

Do you think this is something that can be improved or is it suggested to increase pcre.backtrack_limit to deal with this? This could be the same as #149

michelf commented 7 months ago

In general, excessive backtracking is solved by fixing the regular expressions to avoid doing unnecessary things. Backtracking is prone to result in combinatory explosions where there's no reasonable limit that would work… at best you'll just hang for a few minutes/hours/days depending on the input.

But in this case I believe the problem is that it backtracks a fixed number of time per character but since you have about 720K characters it reaches the limit quite fast, despite being constant-time. So maybe increasing the limit would work fine in this case.

Ideally we'd eliminate this unnecessary backtracking. I played with the regex you put in the regex101 link and came up with this reasonable replacement:

^((?>.+))\n(=+|-+)[ ]*\n+

or this equivalent one:

^([^\n]+)\n(=+|-+)[ ]*\n+

Neither will cause backtracking. Either one can be used as a replacement. More work is needed to improve on the one in the MarkdownExtra parser which is a bit more complex.

It's likely others regular expressions could cause excessive backtracking like this. I don't test with 720K character lines so I don't know.

herndlm commented 7 months ago

you're right, I could make it work by increasing pcre.backtrack_limit and it seems to scale linear with the line length basically. your regexes here seem to work fine too, feel free to decide what to do. Adapting that regex sounds like a good idea, but you're right MarkdownExtra should be adapted/improved too most likely. if possible.