Generate smaller regular expressions by using a different tool

JoshyPHP commented 2 years ago

This PR implements an alternative way to build this project's regular expressions. The new expressions are smaller (see the table below) but should be functionally equivalent. npm run test passes.

I am the author of s9e\RegexpBuilder, a PHP library similar in purpose to regexgen. I'm working on a command line tool to allow non-PHP projects to make use of the library, and I chose emoji-test-regex-pattern as a practical use case to test the design.

The tool is named build-regexp. The library can generate regular expressions using PCRE or JavaScript syntax, in Unicode mode or not. The CLI tool supports a number of presets, including re2 and java that are provided for parity; Their output is identical to the pcre preset with a u flag.

While fully functional, this PR is meant as a proof of concept. I modified the build script minimally and as unobtrusively as possible. In addition to the normal files, it generates a new sequences.txt file (for each version) that contains all of the emoji sequences. Then a new script rebuild uses the sequences.txt files to replace the existing regular expression files. That means both scripts must be run in succession to fully update all of the regular expressions. The rebuild script was written in Bash for convenience.

To install:

Get Composer if it's not already installed.
Run composer install at the root of the project.

To update:

npm run build
npm run rebuild

The result is smaller, more efficient regexps. Below is a table that compares the size of the files for the latest version, after each script.

	build	rebuild
cpp-re2.txt	18896	13651
javascript.txt	16569	11635
javascript-u.txt	17716	12913
java.txt	18896	13651

Let me know if you have any questions.

mathiasbynens commented 2 years ago

Thanks for the patch! This sounds really exciting. I'll take a look soon.

Let me know if you have any questions.

What is the trick that allows the regular expressions to be smaller?

Do the generated patterns still put the longest patterns first, similar to https://github.com/tc39/proposal-regexp-set-notation#whats-the-match-order-for-character-classes-containing-strings?

JoshyPHP commented 2 years ago

The main focus for the library is to minimize backtracking. Producing smaller regexps is nice and it can make them faster to parse/compile but modern regexp engines tend to cache the JIT/compiled version anyway so it doesn't make a big difference. Fortunately, eliminating backtracking usually makes the expression smaller too so we get that for free.

The library applies roughly three different optimizations:

Move common prefixes out of alternations. That is, instead of (?:axx|ayy) where a can be matched in two different branches, you get a(?:xx|yy) where a can only be matched once. Technically, regexgen does that too but it seems to be doing it less systematically. This is by far the most impactful optimization.
Move common suffixes out of alternations if it doesn't cause backtracking and doesn't make the regexp more complex. Instead of (?:ax|bx) you get [ab]x.
Move optional strings out of optional alternations. Instead of (?:ab?|b)? it will generate a?b?. That pattern is prevalent in emoji sequences with optional skin tones and optional man/woman signs.

Here's a practical example that you can run at the root of the project. In Unicode 14.0, there are 35 different sequences for a person lifting weights. Here's how you can generate a regexp that matches them all:

$ sequences=$(echo -e "\U1F3CB\UFE0F \U1F3CB \U1F3CB\U1F3FB \U1F3CB\U1F3FC \U1F3CB\U1F3FD \U1F3CB\U1F3FE \U1F3CB\U1F3FF \U1F3CB\UFE0F\U200D\U2642\UFE0F \U1F3CB\U200D\U2642\UFE0F \U1F3CB\UFE0F\U200D\U2642 \U1F3CB\U200D\U2642 \U1F3CB\U1F3FB\U200D\U2642\UFE0F \U1F3CB\U1F3FB\U200D\U2642 \U1F3CB\U1F3FC\U200D\U2642\UFE0F \U1F3CB\U1F3FC\U200D\U2642 \U1F3CB\U1F3FD\U200D\U2642\UFE0F \U1F3CB\U1F3FD\U200D\U2642 \U1F3CB\U1F3FE\U200D\U2642\UFE0F \U1F3CB\U1F3FE\U200D\U2642 \U1F3CB\U1F3FF\U200D\U2642\UFE0F \U1F3CB\U1F3FF\U200D\U2642 \U1F3CB\UFE0F\U200D\U2640\UFE0F \U1F3CB\U200D\U2640\UFE0F \U1F3CB\UFE0F\U200D\U2640 \U1F3CB\U200D\U2640 \U1F3CB\U1F3FB\U200D\U2640\UFE0F \U1F3CB\U1F3FB\U200D\U2640 \U1F3CB\U1F3FC\U200D\U2640\UFE0F \U1F3CB\U1F3FC\U200D\U2640 \U1F3CB\U1F3FD\U200D\U2640\UFE0F \U1F3CB\U1F3FD\U200D\U2640 \U1F3CB\U1F3FE\U200D\U2640\UFE0F \U1F3CB\U1F3FE\U200D\U2640 \U1F3CB\U1F3FF\U200D\U2640\UFE0F \U1F3CB\U1F3FF\U200D\U2640")
$ ./node_modules/.bin/regexgen $sequences
/\uD83C\uDFCB(?:(?:\uFE0F|\uD83C[\uDFFB-\uDFFF])(?:\u200D[\u2640\u2642]\uFE0F?)?|\u200D[\u2640\u2642]\uFE0F?)?/
$ ./vendor/bin/build-regexp --preset javascript $sequences
\uD83C\uDFCB(?:\uFE0F|\uD83C[\uDFFB-\uDFFF])?(?:\u200D[\u2640\u2642]\uFE0F?)?

It will match the longest string possible. Input strings are sorted in byte order but it doesn't matter here; You can't generate (?:xy|xyz) as it will be optimized as xyz?.

$ ./vendor/bin/build-regexp 'a' 'b' 'c' 'W' 'xy' 'xyz'
(?:[Wa-c]|xyz?)

mathiasbynens commented 2 years ago

Thanks for your detailed comments!

This is somewhat off-topic, but:

$ ./vendor/bin/build-regexp 'a' 'b' 'c' 'W' 'xy' 'xyz'
(?:[Wa-c]|xyz?)

Why is the output wrapped in (?: and ) in this case?

Either way, it looks like build-regexp's output is superior to regexgen, so I'm really excited about this. To get the patch landed, I do have a few minor concerns but I'm sure we can address them in some way:

In the short term I'm fine with the additional dependency on Composer, but I'd love to avoid it longer-term.
- For now, we should document the setup instructions in the README (since we'd be deviating from the usual git clone + npm install + npm run build for JS projects).
- For the future, is there a universal build-regexp binary that you could publish to npm? Could there be? If not, are you planning a JS port of your tool?
Instead of requiring folks to run npm run build + npm run rebuild in the right order, let's invoke the new script from npm run build itself.
I'm trying to think of a better name than sequences.txt, since not all of them are emoji sequences. We're already using index.txt, though. Maybe just emoji.txt? Any other ideas? Naming things is hard :)
- Whatever the name ends up being, let's document the new file in the README as well.
- Alternatively, write this file to a different, temporary, folder that is .gitignored. Maybe this is better, since we don't really need to ship/publish this new file.

WDYT?

JoshyPHP commented 2 years ago

The (?:) is just a quirk from the library: all the alternations are constructed as non-capturing groups. Incidentally, it allows the generated expression to be used to compose a bigger regexp. For example, create a regexp made of a list of words, followed by \b. If parentheses were omitted at the root of the expression, one may end up with something like xx|yy\b which would be different from (?:xx|yy)\b. A future version may have a standalone option to indicate that the expression will be used whole and it would also output the regexp delimiters and flags where applicable (JS and PHP) but it's low priority at the moment.

With regards to depending on Composer, PHP has a way to distribute applications as a single "phar" file. It still requires a PHP interpreter though. I haven't had time to set it up yet but I'd like to have a phar version as a release artifact. Here's a link to the phar for the 0.4.0 release. I am not aware of a tool that creates a true executable from PHP sources.

I have no plans for a JS port in the foreseeable future. The codebase is about 1,000 lines of code with a test suite of about 1,500 LOC. It's relatively small but it relies on some PHP idioms that don't translate to JavaScript (such as [1] === [1]) so it wouldn't be a straight port.

I don't really have an opinion about the build process. I implemented it as a separate script for my own convenience and to allow you to more easily switch between the two regexp generators. I think that emoji.txt is fine. The file's content is similar to the emoji-test.txt file distributed by Unicode.

mathiasbynens commented 2 years ago

With regards to depending on Composer, PHP has a way to distribute applications as a single "phar" file. It still requires a PHP interpreter though. I haven't had time to set it up yet but I'd like to have a phar version as a release artifact. Here's a link to the phar for the 0.4.0 release. I am not aware of a tool that creates a true executable from PHP sources.

Filed a tracking issue for that: https://github.com/s9e/RegexpBuilderCommand/issues/1 Again, this is not a blocker, but it would be a nice future simplification from our side.

Would you like to address my comments in https://github.com/mathiasbynens/emoji-test-regex-pattern/pull/18#issuecomment-1052152977, or do you want me to give it a go (I'd be happy to)?

JoshyPHP commented 2 years ago

Did I miss something? If you're talking about creating a JS port yourself then you have my blessing and any support I can provide. If you're not in a rush to start, I can write an overview of how the library operates. I think you'd be better off reimplementing it than doing an exact port.

mathiasbynens commented 2 years ago

Did I miss something?

I meant the other bullet points I mentioned in my earlier comment. Don't worry, I can take care of addressing them. Thanks for the patch, I'll take it from here!

mathiasbynens commented 2 years ago

I’ve addressed all my feedback and integrated the script with the existing build script rather than keeping the two separate. I’ve also made use of the generated *.phar file that you’re now publishing — thanks again for that!

Publishing the phar file to npm would make it even nicer but this is definitely good enough for now. Merging…

mathiasbynens commented 2 years ago

Here are the file size savings:

$ git file-size-diff HEAD~ HEAD # and manually filtered afterwards
-5059   dist/emoji-13.0/cpp-re2.txt
-5059   dist/emoji-13.0/java.txt
-4651   dist/emoji-13.0/javascript-u.txt
-4529   dist/emoji-13.0/javascript.txt

-5276   dist/emoji-13.1/cpp-re2.txt
-5276   dist/emoji-13.1/java.txt
-4832   dist/emoji-13.1/javascript-u.txt
-4910   dist/emoji-13.1/javascript.txt

-5253   dist/emoji-14.0/cpp-re2.txt
-5253   dist/emoji-14.0/java.txt
-4809   dist/emoji-14.0/javascript-u.txt
-4940   dist/emoji-14.0/javascript.txt

-5253   dist/latest/cpp-re2.txt
-5253   dist/latest/java.txt
-4809   dist/latest/javascript-u.txt
-4940   dist/latest/javascript.txt

mathiasbynens / emoji-test-regex-pattern

Generate smaller regular expressions by using a different tool #18