thephpleague / commonmark

Highly-extensible PHP Markdown parser which fully supports the CommonMark and GFM specs.
https://commonmark.thephpleague.com
BSD 3-Clause "New" or "Revised" License

Optimize regular expressions #674

Open colinodell opened 3 years ago

colinodell commented 3 years ago

This library makes heavy use of regular expressions. Most of them should be fairly performant, but there is probably still room to make the library faster. Possible improvements include:

  1. Replacing non-regex parsing logic with regular expressions (if that's quicker)
  2. Replacing regex-based parsing with logic that doesn't use regular expressions (if that's quicker)
  3. Combining multiple regexes into one (if that's quicker)
  4. Fixing excessive backtracking in expressions
  5. Other improvements to existing expressions
  6. ???
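Since items 1 to 3 each depend on "if that's quicker", any candidate change needs to be measured rather than assumed. A minimal sketch of such a comparison follows. It is not taken from this library: `bench()`, `$isBlankRegex`, and `$isBlankPlain` are hypothetical names, and the blank-line check is only an illustrative stand-in for a real parsing step. It first confirms that both variants agree on every input, then times them.

```php
<?php
declare(strict_types=1);

/**
 * Run $fn over every input $iterations times and return elapsed milliseconds.
 * Hypothetical helper for illustration; not part of league/commonmark.
 */
function bench(callable $fn, array $inputs, int $iterations = 10000): float
{
    $start = hrtime(true); // monotonic clock, nanoseconds
    for ($i = 0; $i < $iterations; $i++) {
        foreach ($inputs as $input) {
            $fn($input);
        }
    }
    return (hrtime(true) - $start) / 1e6;
}

// Two ways to ask "does this line contain only spaces and tabs?"
// \z anchors at the very end of the string (unlike $, which also
// matches just before a trailing newline).
$isBlankRegex = static fn (string $line): bool => preg_match('/^[ \t]*\z/', $line) === 1;
$isBlankPlain = static fn (string $line): bool => trim($line, " \t") === '';

$inputs = ['', '   ', "\t \t", 'foo', '  bar  ', " \n", '# Heading'];

// Correctness first: a faster variant is useless if it behaves differently.
foreach ($inputs as $input) {
    if ($isBlankRegex($input) !== $isBlankPlain($input)) {
        throw new RuntimeException('Variants disagree on: ' . var_export($input, true));
    }
}

printf(
    "regex: %.2f ms, plain: %.2f ms\n",
    bench($isBlankRegex, $inputs),
    bench($isBlankPlain, $inputs)
);
```

Timings from a loop like this vary between PHP versions, PCRE builds, and JIT settings, so results are best treated as a hint and confirmed against a realistic Markdown document.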

Tools that could help here include:

A partial list of areas where regex is used in this library includes:

I will accept (almost) any PR that aims to improve performance, though I would ask that you keep the following in mind:

colinodell commented 2 years ago

I'm removing the v2.1 milestone as I've already tested a number of expressions and am fairly happy with the current state of things. However, I'll keep this open in case any regex experts want to dig deeper and maybe find something that I missed.

live627 commented 1 year ago

Regexes with lots of alternations, like the one linked below, could be optimized:

https://github.com/thephpleague/commonmark/blob/42781fde669f255b7e2ca12ffdcd7ac8d95ee64f/src/Util/RegexHelper.php#L44

Several alternations could be reduced by combining similar ones into optional atomic groups, but readability and maintainability go down the toilet and break the sewers. However, I cannot find where that specific regex is used.
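To make the trade-off concrete, here is a rough sketch of one common way to combine similar alternatives: factoring out shared prefixes and turning the leftovers into optional groups. The tag list is a made-up example, not the expression from `RegexHelper.php`, and whether this is the rewrite meant above, or whether it is actually faster under PCRE, is an assumption that would need benchmarking.

```php
<?php
declare(strict_types=1);

// Hypothetical tag list for illustration only; NOT the pattern linked above.
$flat     = '/^(?:table|tbody|td|tfoot|th|thead|title|tr|track)\z/i';

// Same set of words with the shared "t" prefix factored out and
// "th"/"thead" and "tr"/"track" merged via optional groups.
$factored = '/^t(?:able|body|d|foot|h(?:ead)?|itle|r(?:ack)?)\z/i';

$samples = [
    'table', 'tbody', 'td', 'tfoot', 'th', 'thead', 'title', 'tr', 'track',
    'TABLE', 'tab', 'theadx', 'tra', 't', '', 'div',
];

// The rewrite is only valid if both patterns accept exactly the same strings.
foreach ($samples as $sample) {
    if (preg_match($flat, $sample) !== preg_match($factored, $sample)) {
        throw new RuntimeException("Patterns disagree on '$sample'");
    }
}

echo "Both patterns agree on all samples\n";
```

Both patterns are fully anchored here, so the order of alternatives does not affect the result. In an unanchored pattern, moving `th` and `thead` into `h(?:ead)?` can change which alternative wins, which is part of why this kind of rewrite hurts maintainability.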