starwing / luautf8

a utf-8 support module for Lua and LuaJIT.
MIT License

FR: support for extended character class escapes in patterns #33

Open bpj opened 3 years ago

bpj commented 3 years ago

Just an idea. I need to match a sequence of letters and non-spacing marks, which can't be expressed in Lua patterns even with the extension of the meaning of escapes like %a of this module. Now it occurs to me that a possible solution would be if this module supported some extended character class escapes. I would love to do a PR but I don't do C.

Perhaps the most straightforward solution would be to allow %x{hhh} and the other escapes from utf8.escape in patterns, including inside character classes, so that [%a%x{300}-%x{36f}]+ would match runs of letters and characters from the Combining Diacritical Marks block (although there are many non-spacing marks outside that block!).
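As a point of comparison, the same set can be approximated today by splicing the literal code points into the pattern instead of writing %x{...} escapes. The following is a minimal sketch, assuming luautf8 is installed as 'lua-utf8' and that its character sets accept multi-byte range endpoints; the name WORD_RUN is made up for illustration.

```lua
-- Sketch only: build the set from literal code points rather than %x{...}.
-- Assumption: luautf8 is installed as 'lua-utf8'. The pcall guard lets the
-- pattern be built on stock Lua 5.3+ too, where only utf8.char exists.
local ok, lutf8 = pcall(require, 'lua-utf8')
local char = ok and lutf8.char or utf8.char

-- Letters plus U+0300..U+036F (Combining Diacritical Marks) as a range.
-- WORD_RUN is a hypothetical name, not part of luautf8.
local WORD_RUN = '[%a' .. char(0x300) .. '-' .. char(0x36F) .. ']+'

if ok then
  -- 'e' followed by U+0301 (combining acute accent), then a space and 'x'.
  -- With a Unicode-aware matcher the run should stop at the space.
  print(lutf8.match('e' .. char(0x301) .. ' x', WORD_RUN))
end
```

This still covers only one block, so it shares the limitation noted above: non-spacing marks outside U+0300..U+036F are not matched.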

A more targeted, keyhole-surgery solution would be a character class escape %m that matches any character with General Category M, and its complement %M.

More generally, there could be an escape %g{Gc} (with complement %G{Gc}), where Gc is a one- or two-letter General Category abbreviation such as L, Lu, Lo, M, Mn, P, Ps or Pe, matching any character that does (or doesn't) belong to that General Category. The curly braces would of course have to be mandatory so that the regular character class %g stays usable, with %g%{ still meaning %g followed by a literal curly brace. Alternatively the escape could be %k{Gc}, as if "Kategory"!

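The use case is a function for title-casing words:

-- Load luautf8 (the module is published under the name 'lua-utf8')
local utf8 = require 'lua-utf8'

-- Helper function: uppercase the first letter, lowercase the rest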
local function ul (u, l)
  return utf8.upper(u) .. utf8.lower(l)
end

local function title_case (s)
  -- Add flanking non-word chars so frontier assertion works at start/end
  s = '(' .. s .. ')'
  s = utf8.gsub(s,'%f[%w](%a)([^%s%d%p%c]*)%f[%W]', ul)
  -- Remove dummy parens
  return utf8.sub(s, 2, -2)
end

That [^%s%d%p%c]* has worked so far for my data, but it's ugly and it works by accident. There may be things it matches which it shouldn't, although it seems this module includes General Category S in %p.

starwing commented 3 years ago

The logic of Lua patterns is one letter per function, so supporting multi-letter escapes may be difficult; maybe another matching library is needed. Pattern matching in this library exists only for compatibility with Lua's.

So maybe it's worth considering whether there are any alternatives for pattern matching that support Unicode fully?

bpj commented 3 years ago

I guess I could use lrexlib, but then patterns would be entirely incompatible with the Lua pattern syntax. My idea is that programs using my (MoonScript) class can supply functions with similar semantics for the library's methods to use instead of string.match etc., with a pure Lua "mode" still possible. I guess I could write (regex-based) code to translate a superset of Lua pattern syntax into PCRE, but that may easily get bigger than the host library itself.

starwing commented 3 years ago

Maybe you could generate tables (using the scripts from this project) and make a new module for checking the Unicode categories.
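The table-based approach could look roughly like the sketch below. Everything in it is hypothetical: the names MARK_RANGES, is_mark and count_marks are made up, and the ranges are a small hand-written subset of Unicode blocks that only approximates General Category M. A real module would generate the full table from the Unicode Character Database.

```lua
-- Sketch of a range-table category check (hypothetical module, not luautf8 API).
-- Works with the stock utf8 library of Lua 5.3+, or with luautf8 elsewhere.
local utf8 = utf8 or require 'lua-utf8'

-- Sorted, non-overlapping {first, last} code point ranges.
-- Illustrative block ranges only; NOT the complete General Category M.
local MARK_RANGES = {
  { 0x0300, 0x036F }, -- Combining Diacritical Marks
  { 0x1AB0, 0x1AFF }, -- Combining Diacritical Marks Extended
  { 0x1DC0, 0x1DFF }, -- Combining Diacritical Marks Supplement
  { 0x20D0, 0x20FF }, -- Combining Diacritical Marks for Symbols
  { 0xFE20, 0xFE2F }, -- Combining Half Marks
}

-- Binary search: is code point cp inside any range of the sorted table?
local function in_ranges(cp, ranges)
  local lo, hi = 1, #ranges
  while lo <= hi do
    local mid = math.floor((lo + hi) / 2)
    local r = ranges[mid]
    if cp < r[1] then
      hi = mid - 1
    elseif cp > r[2] then
      lo = mid + 1
    else
      return true
    end
  end
  return false
end

local function is_mark(cp)
  return in_ranges(cp, MARK_RANGES)
end

-- Count the code points of s that fall in the mark ranges.
local function count_marks(s)
  local n = 0
  for _, cp in utf8.codes(s) do
    if is_mark(cp) then n = n + 1 end
  end
  return n
end
```

A check like is_mark could then be applied code point by code point to the text a pattern has already matched, keeping the patterns themselves compatible with plain Lua syntax.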