Unicode support for word boundary `\b`

gondalez commented 6 years ago

Is it possible to extend the unicode support to the word boundary anchor?

For example, this Russian sentence cannot be split:

"hello there this is a test".split(XRegExp('\\b', 'A'))
(11) ["hello", " ", "there", " ", "this", " ", "is", " ", "a", " ", "test"]

"Сняли не первый раз изначальную и конечную сумму и начальную не вернули !!!".split(XRegExp('\\b', 'A'))
["Сняли не первый раз изначальную и конечную сумму и начальную не вернули !!!"]

^ note the split has no effect on Russian

The equivalent and desired behaviour in ruby, for example:

irb(main):001:0> "hello there this is a test".split(/\b/)
[
  "hello",
  " ",
  "there",
  " ",
  "this",
  " ",
  "is",
  " ",
  "a",
  " ",
  "test"
]
irb(main):002:0> "Сняли не первый раз изначальную и конечную сумму и начальную не вернули !!!".split(/\b/)
[
  "Сняли",
  " ",
  "не",
  " ",
  "первый",
  " ",
  "раз",
  " ",
  "изначальную",
  " ",
  "и",
  " ",
  "конечную",
  " ",
  "сумму",
  " ",
  "и",
  " ",
  "начальную",
  " ",
  "не",
  " ",
  "вернули",
  " !!!"
]

slevithan commented 6 years ago

Unfortunately, emulating Unicode word boundaries would require native lookbehind support, which is only just being added to the JS spec in ECMAScript 2018. When support spreads to all modern browsers, it will be possible to take this on.
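
A minimal sketch of how lookbehind support could be detected at runtime before relying on it. The helper name supportsLookbehind is hypothetical and not part of XRegExp; it only checks whether the engine can compile a lookbehind assertion.

```javascript
// Hypothetical helper (not part of XRegExp): returns true if the current
// JS engine can compile a regex that contains a lookbehind assertion.
function supportsLookbehind() {
  try {
    // Built from a string so that an unsupported engine throws at runtime
    // here instead of failing to parse the whole script.
    new RegExp('(?<=a)b');
    return true;
  } catch (err) {
    return false;
  }
}

console.log(supportsLookbehind());
```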

gondalez commented 6 years ago

No problem, thanks for the explanation @slevithan 👍

gausie commented 3 years ago

@slevithan can this be implemented now? Is this already available?

slevithan commented 3 years ago

Yes, this is possible now in ES2018 environments.

But first you need to define what a Unicode word character is. I'll use the rough approximation \p{L}\p{M}*, which matches any Unicode letter followed by any number of Unicode combining marks.

That leads to the following way to emulate a Unicode-aware word boundary (\b):

(?:(?<=\p{L}\p{M}*)(?!\p{L}\p{M}*)|(?<!\p{L}\p{M}*)(?=\p{L}\p{M}*))

Or breaking it down with XRegExp-style free spacing and comments to explain it:

# Either:
(?:
  # The position is preceded by a Unicode word character
  (?<= \p{L}\p{M}* )
  # And the same position is not followed by a Unicode word character
  (?!  \p{L}\p{M}* )
# Or:
|
  # The position is not preceded by a Unicode word character
  (?<! \p{L}\p{M}* )
  # And the same position is followed by a Unicode word character
  (?=  \p{L}\p{M}* )
)
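
As a rough check of the pattern above, here is a sketch that uses it as a plain native regex (no XRegExp), assuming an ES2018 environment. Flag u is required so that \p{...} is read as a Unicode property. The variable names are only for illustration, and the sample string is a shortened version of the sentence from the original question.

```javascript
// Native ES2018 regex. Flag u makes \p{L} and \p{M} Unicode property escapes.
const unicodeWordBoundary =
  /(?:(?<=\p{L}\p{M}*)(?!\p{L}\p{M}*)|(?<!\p{L}\p{M}*)(?=\p{L}\p{M}*))/u;

// Split a Cyrillic string at every emulated word boundary.
const russianParts = 'Сняли не первый раз !!!'.split(unicodeWordBoundary);

console.log(russianParts);
// → ['Сняли', ' ', 'не', ' ', 'первый', ' ', 'раз', ' !!!']
```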

And here's how to emulate a Unicode-aware non-word-boundary (\B):

(?:(?<=\p{L}\p{M}*)(?=\p{L}\p{M}*)|(?<!\p{L}\p{M}*)(?!\p{L}\p{M}*))
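
To see which positions the non-word-boundary version matches, this sketch uses it as a native regex with flags g and u (again assuming an ES2018 environment) and inserts a hyphen at every match. The variable names are illustrative only.

```javascript
// Native ES2018 regex for the emulated \B. Flag g lets replace() visit every position.
const unicodeNonWordBoundary =
  /(?:(?<=\p{L}\p{M}*)(?=\p{L}\p{M}*)|(?<!\p{L}\p{M}*)(?!\p{L}\p{M}*))/gu;

// Only the positions between two letters match in this string;
// the edges of each word are word boundaries, so they are skipped.
const marked = 'аб вг'.replace(unicodeNonWordBoundary, '-');

console.log(marked); // → 'а-б в-г'
```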

If you wanted to add support for Unicode-aware \b to XRegExp and hide it behind XRegExp's existing A (astral) flag, you could do the following:

XRegExp.addToken(
  /\\b/,
  () => String.raw`(?:(?<=\p{L}\p{M}*)(?!\p{L}\p{M}*)|(?<!\p{L}\p{M}*)(?=\p{L}\p{M}*))`,
  {flag: 'A'}
);

Or if you also wanted to support inverse Unicode word boundaries (\b and \B):

XRegExp.addToken(
  /\\([bB])/,
  (match) => {
    const inverse = match[1] === 'B';
    return inverse ?
      String.raw`(?:(?<=\p{L}\p{M}*)(?=\p{L}\p{M}*)|(?<!\p{L}\p{M}*)(?!\p{L}\p{M}*))` :
      String.raw`(?:(?<=\p{L}\p{M}*)(?!\p{L}\p{M}*)|(?<!\p{L}\p{M}*)(?=\p{L}\p{M}*))`;
  },
  {flag: 'A'}
);

Alternatively, you could avoid overloading the A flag and instead give this handling its own flag, such as b. That would just require changing {flag: 'A'} to {flag: 'b'} in the code above.

Note that by not specifying a scope for the tokens added, we're using default scope. That means that \b and \B will only be transformed when they are used outside of character classes ([...]). This is intentional, since \b has a different meaning within character classes in standard JS (it matches a backspace character), and \b or \B within character classes is an error in XRegExp.
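
The difference in meaning for \b inside and outside a character class can be seen with native regexes alone. A small sketch, with no XRegExp involved:

```javascript
// U+0008 is the backspace character.
const backspace = '\u0008';

// Inside a character class, \b matches a literal backspace.
console.log(/[\b]/.test(backspace)); // → true

// Outside a character class, \b is a zero-width word boundary, and a lone
// backspace has no ASCII word character next to it.
console.log(/\b/.test(backspace)); // → false

// The usual word-boundary use, for comparison.
console.log(/\bcat\b/.test('a cat sat')); // → true
```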

Heads up that this is untested. Also heads up that \p{...} doesn't have the intended meaning in ES2018 native regexes unless using flag u, so after adding the above XRegExp tokens you'd have to use flags A and u with your regex to make it work (e.g., XRegExp.tag('Au')`\b` or XRegExp(String.raw`\b`, 'Au')). That's fine if you always remember to use both, but there are two ways you could further improve that to avoid the problem if you forget:

Make it an error to use \b or \B with flag A unless flag u is also present (by checking for flag u within the token handler function shown above, and throwing an error if it's not present).
Use XRegExp.addToken's reparse option. This will lead to XRegExp handling/parsing the generated \p{L}\p{M} tokens in the output, rather than deferring to native syntax. That should resolve the issue since XRegExp doesn't need flag u to transform \p{...} tokens into syntax supported by native regexes (with or without flag u).
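
To illustrate the flag u caveat mentioned above, here is a small native-regex sketch (no XRegExp) of what \p{L} means with and without flag u. Without the flag, the sequence is read as the literal characters p{L} under the web-compatibility rules that browsers and Node.js follow.

```javascript
// Without flag u, \p is just an escaped "p", so the pattern is the literal text "p{L}".
console.log(/\p{L}/.test('я'));    // → false
console.log(/\p{L}/.test('p{L}')); // → true

// With flag u, \p{L} matches any Unicode letter.
console.log(/\p{L}/u.test('я'));   // → true
```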

I don't expect to add built-in support for Unicode word boundaries to XRegExp in the short term, but hopefully the details above are enough to add support within your own code.

OultimoCoder commented 1 year ago

Thanks so much for the above code! Would love to get inbuilt support for this in the future!

mgoldenbe commented 9 months ago

XRegExp.addToken(
  /\\b/,
  () => String.raw`(?:(?<=\p{L}\p{M}*)(?!\p{L}\p{M}*)|(?<!\p{L}\p{M}*)(?=\p{L}\p{M}*))`,
  {flag: 'A'}
);

This did not work for me. Here is my code:

XRegExp = require('xregexp')
base = require('xregexp/lib/addons/unicode-base')
base(XRegExp)
XRegExp.addToken(
  /\\b/,
  () => String.raw`(?:(?<=\p{L}\p{M}*)(?!\p{L}\p{M}*)|(?<!\p{L}\p{M}*)(?=\p{L}\p{M}*))`,
  {flag: 'A'}
);
console.log(XRegExp.exec("ааа бб вв", XRegExp(/\bбб\b/), "uA")) // null

What am I doing wrong?

slevithan commented 9 months ago

@mgoldenbe the code works fine but you are incorrectly passing "uA" as a third argument to XRegExp.exec rather than as the second (flags) argument to the XRegExp constructor.

However, I prepared a long reply about an additional issue based on my initial misreading of your б characters (U+0431, Cyrillic Small Letter Be) as sixes. So I'll go ahead and include it below even though you might not need it.

Your code above is working as intended. See:

XRegExp.addToken(
  /\\b/,
  () => String.raw`(?:(?<=\p{L}\p{M}*)(?!\p{L}\p{M}*)|(?<!\p{L}\p{M}*)(?=\p{L}\p{M}*))`,
  {flag: 'A'}
);
const nativeWordBoundary = /\bXX\b/;
const unicodeLetterBoundary = XRegExp.tag('Au')`\bXX\b`;

nativeWordBoundary.test('愛XX愛'); // true
unicodeLetterBoundary.test('愛XX愛'); // false
unicodeLetterBoundary.test('XX'); // true

However, it seems you missed this from my comment above:

But first you need to define what a Unicode word character is. I'll use the rough approximation \p{L}\p{M}*, which matches any Unicode letter followed by any number of Unicode combining marks.

Note that native JS regex word boundaries treat ASCII letters, ASCII numbers, and underscore as "word characters". But above I defined a word character merely as a close approximation of a complete Unicode letter. I did not include any numbers (ASCII or otherwise) or underscore.

Based on your above code where you expected the number "6" to be treated as a word character, I'm guessing this was not the definition of "word character" you were looking for. You can change it to anything you want while following the overall code in my comment.

For example, here's a slight modification of my example code that supports Unicode-aware versions of both \b and \B (behind flags 'Au') and that treats any Unicode letter, Unicode number, or underscore as a word character:

XRegExp.addToken(
  /\\([bB])/,
  (match) => {
    const inverse = match[1] === 'B';
    const unicodeLetter = String.raw`\p{L}\p{M}*`;
    const unicodeNumber = String.raw`\p{N}`;
    const other = '_';
    const w = `(?:(?:${unicodeLetter})|(?:${unicodeNumber})|(?:${other}))`;
    return inverse ?
      `(?:(?<=${w})(?=${w})|(?<!${w})(?!${w}))` :
      `(?:(?<=${w})(?!${w})|(?<!${w})(?=${w}))`;
  },
  {flag: 'A'}
);

XRegExp.exec("ааа бб вв", XRegExp.tag('u')`\bбб\b`); // null
XRegExp.exec("ааа бб вв", XRegExp.tag('Au')`\bбб\b`); // ['бб', index: 4, ...]
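
For comparison, the same broadened definition of a word character can be tried without XRegExp by building the pattern as a string and passing it to the native RegExp constructor with flag u. This is only a sketch; the variable names are illustrative and are not part of any library.

```javascript
// Word character: a letter plus its combining marks, any Unicode number, or underscore.
const wordChar = String.raw`(?:\p{L}\p{M}*|\p{N}|_)`;
// Letters-only definition from the earlier comment, kept for contrast.
const letterOnly = String.raw`(?:\p{L}\p{M}*)`;

// Builds the emulated \b for a given definition of "word character".
const boundaryFor = (w) => `(?:(?<=${w})(?!${w})|(?<!${w})(?=${w}))`;

const b = boundaryFor(wordChar);
const boundaryMatch = new RegExp(`${b}бб${b}`, 'u').exec('ааа бб вв');
console.log(boundaryMatch[0], boundaryMatch.index); // → бб 4

// Digits count as word characters only under the broadened definition.
const digitsBroad = new RegExp(`${b}66${b}`, 'u').test('aaa 66 bb');
const lb = boundaryFor(letterOnly);
const digitsLetterOnly = new RegExp(`${lb}66${lb}`, 'u').test('aaa 66 bb');
console.log(digitsBroad, digitsLetterOnly); // → true false
```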

mgoldenbe commented 9 months ago

@slevithan Thank you for the detailed reply! In the meantime, I discovered this post. I am wondering whether there is any advantage (other than the aesthetic pleasantness of \b) to using XRegExp compared to the plain JS solutions there.

slevithan commented 9 months ago

@mgoldenbe there are a couple of potential advantages to using the XRegExp addon above over the solution in that post, especially if you're already including XRegExp in your code:

You can share/reuse your regex patterns with other programming languages that also use Unicode-aware \b and \B.
You can freely use Unicode-aware word boundaries in all patterns rather than going through complicated concatenation or function calls to build each regex when you need it (i.e., aesthetic pleasantness at scale).

And you get extra polish for free like erroring when trying to use word boundaries in character classes, support for non-word-boundaries (\B), and pattern caching for better performance.
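
For context, a plain JS approach typically means wrapping each pattern through a helper function. The sketch below is a hypothetical example of that style, not the code from the linked post. The names escapeRegex and unicodeWordRegex are made up for illustration, and the letters-plus-marks definition of a word character is the same rough approximation used earlier in this thread.

```javascript
// Emulated Unicode-aware \b, using letters plus combining marks as word characters.
const W = String.raw`\p{L}\p{M}*`;
const BOUNDARY = `(?:(?<=${W})(?!${W})|(?<!${W})(?=${W}))`;

// Hypothetical helper: escapes regex syntax characters in a literal string.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: builds a regex that matches `word` as a whole word.
function unicodeWordRegex(word, flags = '') {
  // Flag u is always added so that \p{...} keeps its Unicode meaning.
  const allFlags = flags.includes('u') ? flags : flags + 'u';
  return new RegExp(BOUNDARY + escapeRegex(word) + BOUNDARY, allFlags);
}

console.log(unicodeWordRegex('бб').test('ааа бб вв')); // → true
console.log(unicodeWordRegex('бб').test('абба'));      // → false
```

Every pattern that needs a boundary has to go through a call like this, which is the concatenation overhead the XRegExp token avoids.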

For many other XRegExp addons (that don't rely on native lookbehind support like this does), XRegExp would also give you the advantage of working in all ES5+ browsers.

slevithan / xregexp

Unicode support for word boundary `\b` #228