pomsky-lang / pomsky

A new, portable, regular expression language
https://pomsky-lang.org
Apache License 2.0

.NET: `\w` (and by extension `\b` and `\B`) don't conform to Unicode #88

Open Aloso opened 1 year ago

Aloso commented 1 year ago

In .NET, `\w` is equivalent to `[\p{L}\p{Mn}\p{Nd}\p{Pc}]` instead of `[\p{Alpha}\p{M}\p{Nd}\p{Pc}\p{Join_Control}]`:

  1. It incorrectly uses GC=Letter instead of Alphabetic=Yes; the latter includes more code points!
  2. It doesn't match all of GC=Mark, only GC=Nonspacing_Mark
  3. It doesn't match Join_Control=Yes
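
To make the three gaps concrete, here is a rough Python sketch that emulates the .NET `\w` class by general category and checks one sample code point per gap. It does not run .NET's regex engine; `dotnet_w_matches` is a hypothetical helper, and the sample code points are my own picks. Note that Python's `unicodedata` does not expose the Alphabetic property, so the Alphabetic=Yes status of U+2160 is taken from the Unicode definition (GC=Nl is part of Alphabetic), not computed here.

```python
import unicodedata

# General categories covered by .NET's \w according to this issue:
# all of L (Lu, Ll, Lt, Lm, Lo), plus Mn, Nd and Pc.
DOTNET_W_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Mn", "Nd", "Pc"}


def dotnet_w_matches(ch: str) -> bool:
    """Approximate whether .NET's \\w would match ch (hypothetical helper)."""
    return unicodedata.category(ch) in DOTNET_W_CATEGORIES


# One sample code point per gap listed above.
samples = {
    "\u2160": "ROMAN NUMERAL ONE: GC=Nl, Alphabetic=Yes, but not GC=Letter",
    "\u0903": "DEVANAGARI SIGN VISARGA: GC=Mc, a mark but not nonspacing",
    "\u200d": "ZERO WIDTH JOINER: GC=Cf, Join_Control=Yes",
}

for ch, note in samples.items():
    category = unicodedata.category(ch)
    print(f"U+{ord(ch):04X} GC={category} matched={dotnet_w_matches(ch)} ({note})")
```

All three samples fall outside the emulated class, while an ordinary letter, a combining acute accent (GC=Mn) and an underscore (GC=Pc) fall inside it.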

AFAIK there's nothing we can do other than emit a warning: `\p{Alpha}` doesn't work in .NET, so we can't polyfill it. But a warning adds noise and doesn't help much when there's no straightforward fix.