pomsky-lang / pomsky

A new, portable, regular expression language
https://pomsky-lang.org
Apache License 2.0

Add ascii-only mode #72

Closed Aloso closed 1 year ago

Aloso commented 1 year ago

Is your feature request related to a problem?

There are two reasons for this proposal: First, matching [a-zA-Z0-9_] is significantly faster than \w, so it is a worthwhile optimization when the input is known to be ASCII-only. While it is possible to match [a-zA-Z0-9_] with the [ascii_word] shorthand, being able to use [word] (or [w]), [space] and [digit] at no performance cost would be nice.

The other, arguably more important reason is JavaScript, where \w and \d are not Unicode-aware, and neither is \b/\B. Polyfilling \w is quite expensive, generating [\p{Alphabetic}\p{M}\p{Nd}\p{Pc}]. Furthermore, % isn't currently polyfilled, which would be even more expensive:

let correct_boundary = word_start | word_end;
let word_start = (!<< [w]) (>> [w]);
let word_end = (<< [w]) (!>> [w]);

correct_boundary
(?<![\p{Alphabetic}\p{M}\p{Nd}\p{Pc}])(?=[\p{Alphabetic}\p{M}\p{Nd}\p{Pc}])|(?<=[\p{Alphabetic}\p{M}\p{Nd}\p{Pc}])(?![\p{Alphabetic}\p{M}\p{Nd}\p{Pc}])

In addition to being very long, this doesn't work in Safari, where lookbehind isn't supported. So we're stuck with using \b, which is incorrect (Pomsky promises Unicode support, but % isn't Unicode-aware).
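To make the difference concrete, here is a small TypeScript sketch comparing JavaScript's built-in \b with the lookaround-based boundary shown above. The character class is copied from the generated regex in this issue; the wholeWord helper and the test strings are made up for illustration and are not part of Pomsky. It assumes an engine with lookbehind support (so not the Safari versions mentioned above).

```typescript
// Character class that the issue says is generated for a Unicode-aware [w] in JavaScript.
const W = "[\\p{Alphabetic}\\p{M}\\p{Nd}\\p{Pc}]";

// Unicode-aware word boundary built from lookarounds, mirroring `correct_boundary`.
// The non-capturing group keeps the alternation from swallowing neighbouring tokens.
const BOUNDARY = `(?:(?<!${W})(?=${W})|(?<=${W})(?!${W}))`;

// Hypothetical helper (not a Pomsky API): wraps a plain word in either kind of boundary.
// `word` is assumed to contain no regex metacharacters.
function wholeWord(word: string, unicodeAware: boolean): RegExp {
  const b = unicodeAware ? BOUNDARY : "\\b";
  return new RegExp(b + word + b, "u");
}

const builtin = wholeWord("test", false);
const polyfilled = wholeWord("test", true);

// On ASCII input both versions agree.
console.log(builtin.test("a test b"));    // true
console.log(polyfilled.test("a test b")); // true

// Around non-ASCII letters they disagree: \b treats ä and ö as non-word characters,
// so it reports a boundary inside what is really a single word.
console.log(builtin.test("ätestö"));    // true
console.log(polyfilled.test("ätestö")); // false
```

The built-in \b is only correct as long as the input really is ASCII, which is the assumption the proposal wants users to state explicitly.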

Describe the solution you'd like

Make it possible to opt out of Unicode support. When Unicode is disabled, Pomsky assumes that the input is ASCII, and is allowed to optimize accordingly. % is forbidden in JavaScript, unless Unicode is explicitly disabled like so:

disable unicode;
% 'test' %

Unicode properties like [Greek] or [Alphabetic] are not allowed in ASCII-only mode. This is to prevent confusion about what [Alphabetic] compiles to in ASCII-only mode. Would it be just [a-zA-Z], or would it include all alphabetic Unicode code points? By disallowing this, we side-step this problem entirely. However, Unicode string literals are allowed in ASCII-only mode:

disable unicode;
'Gänsefüßchen'   # works :)
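The ambiguity around [Alphabetic] can be illustrated with plain JavaScript regexes. The TypeScript sketch below is not Pomsky output; the two regexes are just the two readings mentioned above, and the sample strings are made up. It also shows why a string literal is unproblematic: it has only one possible meaning.

```typescript
// Two plausible readings of [Alphabetic] in ASCII-only mode (the proposal adopts neither).
const asciiReading = new RegExp("^[a-zA-Z]+$");
const unicodeReading = new RegExp("^\\p{Alphabetic}+$", "u");

const asciiWord = "Gaensefuesschen";
const nonAsciiWord = "Gänsefüßchen";

// Both readings agree on ASCII input...
console.log(asciiReading.test(asciiWord), unicodeReading.test(asciiWord));       // true true
// ...but disagree as soon as the input contains non-ASCII letters.
console.log(asciiReading.test(nonAsciiWord), unicodeReading.test(nonAsciiWord)); // false true

// A string literal matches exactly its own code points, so nothing is ambiguous.
const literal = new RegExp("Gänsefüßchen");
console.log(literal.test("viele Gänsefüßchen")); // true
```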

Describe alternatives you've considered

Instead of disable unicode;, we could use the syntax enable ascii; to make it clearer that we use the ASCII character set (rather than Latin-1 or Shift-JIS, for example). However, since ASCII is a subset of Unicode, the word enable (implying added functionality) seems wrong. On the other hand, ASCII-only mode does add functionality (optimizations, and % in JavaScript).

One alternative is to use an --ascii CLI flag rather than extending the language itself. However, I believe this would be a bad idea: the flag affects semantics, so forgetting to add it would be a footgun. Furthermore, the enable/disable machinery is more versatile, since it can be applied to individual parts of an expression.

Another way to solve only the second problem is to just emit a warning when % is used in JS. This feels unsatisfying. If the warning isn't disabled, users will see it all the time and are conditioned to ignore any warnings they see, which we do not want. If they get annoyed and disable it, the potential bug is still present, but invisible. Furthermore, warnings currently can't be disabled on a case-by-case basis, so disabling this one means that other important warnings are missed as well.

Aloso commented 1 year ago

This will land in Pomsky 0.10.