Open bpj opened 3 years ago
The logic of Lua patterns is one letter per function, so supporting multi-letter escapes may be difficult; another matching library may be needed. Pattern matching in this library exists only for compatibility with Lua's.
So maybe it's worth considering whether there are any alternatives for pattern matching that fully support Unicode?
I guess I could use lrexlib, but then patterns would be entirely incompatible with Lua pattern syntax. My idea is that programs using my (MoonScript) class can supply functions with similar semantics, to be used by the library's methods instead of `string.match` etc., with a pure Lua "mode" still possible. I guess I could write (regex-based) code to translate a superset of Lua pattern syntax into PCRE, but that could easily get bigger than the host library itself.
Maybe you could generate tables (using the scripts from this project) and write a new module for checking Unicode categories.
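To make the table idea concrete, here is a minimal sketch of what such a category module could look like in plain Lua 5.3+. Everything in it is an assumption for illustration: the names `RANGES`, `general_category` and `has_category` are hypothetical, and the table holds only four hand-written ranges, where a real one would be generated from UnicodeData.txt.

```lua
-- Hypothetical, hand-written excerpt; a real table would be generated
-- from UnicodeData.txt and contain many more ranges.
-- Each entry is { first code point, last code point, General Category },
-- sorted by first code point and non-overlapping.
local RANGES = {
  { 0x0030, 0x0039, "Nd" }, -- ASCII digits
  { 0x0041, 0x005A, "Lu" }, -- ASCII uppercase letters
  { 0x0061, 0x007A, "Ll" }, -- ASCII lowercase letters
  { 0x0300, 0x036F, "Mn" }, -- Combining Diacritical Marks block
}

-- Binary search for the range containing cp.
-- Returns the two-letter category, or nil if cp is not in the table.
local function general_category(cp)
  local lo, hi = 1, #RANGES
  while lo <= hi do
    local mid = (lo + hi) // 2
    local r = RANGES[mid]
    if cp < r[1] then
      hi = mid - 1
    elseif cp > r[2] then
      lo = mid + 1
    else
      return r[3]
    end
  end
  return nil
end

-- gc may be a one-letter class ("M") or a two-letter category ("Mn");
-- a one-letter gc matches every category that starts with that letter.
local function has_category(cp, gc)
  local cat = general_category(cp)
  return cat ~= nil and cat:sub(1, #gc) == gc
end
```

With this excerpt, `has_category(0x0301, "M")` and `has_category(0x0041, "Lu")` are true, while `has_category(0x0041, "Ll")` is false.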
Just an idea. I need to match a sequence of letters and non-spacing marks, which can't be expressed in Lua patterns even with this module's extended meaning of escapes like `%a`. Now it occurs to me that a possible solution would be for this module to support some extended character class escapes. I would love to do a PR, but I don't do C.

Perhaps the most straightforward would be if `%x{hhh}` and the other escapes from `utf8.escape` could be used in patterns, including inside character classes, so that `[%a%x{300}-%x{36f}]+` would match a run of letters and characters from the Combining Diacritical Marks block (although there are many non-spacing marks outside that block!).

A perhaps somewhat more keyhole-surgery solution would be a character class escape `%m` which matches any character with General Category M, and its complement `%M`.

Somewhat more generally, perhaps an escape pattern `%g{Gc}` (and its complement `%G{Gc}`), where `Gc` is a one- or two-letter General Category abbreviation like L, Lu, Lo, M, Mn, P, Ps, Pe, matching any character which does/doesn't belong to that General Category. The curlies would of course have to be required so that one can still use the regular character class `%g`, including `%g%{` with a following curly. Or perhaps `%k{Gc}`, as if "Kategory"!

The use case is a function for titlecasing words.
That `[^%s%d%p%c]*` has worked so far for my data, but it's ugly, it works by accident, and there may be things it matches which it shouldn't, although it seems this module includes GC S in `%p`.
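For comparison, the matching I am after can be approximated without any pattern support by scanning code points by hand. This is only a rough sketch in plain Lua 5.3+ using the standard `utf8` library. The names `MARK_RANGES`, `is_letter`, `is_mark` and `match_letters_and_marks` are hypothetical, the letter test is an ASCII-only stand-in for `%a`, and the mark table covers only the Combining Diacritical Marks block.

```lua
-- Hypothetical mark table: only U+0300..U+036F. Many non-spacing marks
-- live outside this block, so this is an illustration, not a full list.
local MARK_RANGES = { { 0x0300, 0x036F } }

-- ASCII-only stand-in for a Unicode-aware %a.
local function is_letter(cp)
  return (cp >= 0x41 and cp <= 0x5A) or (cp >= 0x61 and cp <= 0x7A)
end

local function is_mark(cp)
  for _, r in ipairs(MARK_RANGES) do
    if cp >= r[1] and cp <= r[2] then return true end
  end
  return false
end

-- Returns the longest run of letters and marks in s that starts at byte
-- index init (default 1), or "" if the first character is neither.
-- utf8.codepoint raises an error if s is not valid UTF-8 at that position.
local function match_letters_and_marks(s, init)
  init = init or 1
  local pos = init
  while pos <= #s do
    local cp = utf8.codepoint(s, pos)
    if not (is_letter(cp) or is_mark(cp)) then break end
    pos = utf8.offset(s, 2, pos) -- byte index of the next character
  end
  return s:sub(init, pos - 1)
end
```

For example, `match_letters_and_marks("ne\u{301}e 1")` returns the first word including its combining acute accent and stops at the space. A real version would need generated tables for both letters and marks, which is why an escape like `%m` in the module itself would be so much nicer.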