Open gondalez opened 6 years ago
Unfortunately, emulating Unicode word boundaries would require native lookbehind support, which is only just being added to the JS spec in ECMAScript 2018. When support spreads to all modern browsers, it will be possible to take this on.
No problem, thanks for the explanation @slevithan 👍
@slevithan can this be implemented now? Is this already available?
Yes, this is possible now in ES2018 environments.
But first you need to define what a Unicode word character is. I'll use the rough approximation \p{L}\p{M}*, which matches any Unicode letter followed by any number of Unicode combining marks.
That leads to the following way to emulate a Unicode-aware word boundary (\b):
(?:(?<=\p{L}\p{M}*)(?!\p{L}\p{M}*)|(?<!\p{L}\p{M}*)(?=\p{L}\p{M}*))
Or breaking it down with XRegExp-style free spacing and comments to explain it:
# Either:
(?:
  # The position is preceded by a Unicode word character
  (?<= \p{L}\p{M}* )
  # And the same position is not followed by a Unicode word character
  (?! \p{L}\p{M}* )
# Or:
|
  # The position is not preceded by a Unicode word character
  (?<! \p{L}\p{M}* )
  # And the same position is followed by a Unicode word character
  (?= \p{L}\p{M}* )
)
And here's how to emulate a Unicode-aware non-word-boundary (\B):
(?:(?<=\p{L}\p{M}*)(?=\p{L}\p{M}*)|(?<!\p{L}\p{M}*)(?!\p{L}\p{M}*))
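For anyone who wants to try the two patterns above without XRegExp, here is a minimal sketch in plain JavaScript. It assumes an environment with native lookbehind and \p{...} support (ES2018+). The names W, UNICODE_B, UNICODE_NOT_B, and wholeWord are made up for this illustration and are not part of XRegExp or any other library.

```javascript
// Rough "word character" approximation from the comment above:
// a Unicode letter followed by any number of combining marks.
const W = String.raw`\p{L}\p{M}*`;

// Unicode-aware \b: exactly one side of the position is a word character.
const UNICODE_B = `(?:(?<=${W})(?!${W})|(?<!${W})(?=${W}))`;
// Unicode-aware \B: both sides are word characters, or neither is.
const UNICODE_NOT_B = `(?:(?<=${W})(?=${W})|(?<!${W})(?!${W}))`;

// Hypothetical helper: match a literal string only as a whole word.
function wholeWord(word) {
  // Escape regex syntax characters so the word is matched literally.
  const escaped = word.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  // Flag u is required for \p{...} to mean "Unicode property".
  return new RegExp(`${UNICODE_B}${escaped}${UNICODE_B}`, 'u');
}

const text = 'ааа бб вв';
console.log(/\bбб\b/.test(text));               // false (native \b is ASCII-only)
console.log(wholeWord('бб').exec(text).index);  // 4
console.log(wholeWord('б').test(text));         // false (no lone "б" word)
console.log(new RegExp(`б${UNICODE_NOT_B}б`, 'u').test(text)); // true
```

The boundary strings are built once and interpolated, so changing the definition of a word character only means changing W.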
If you wanted to add support for a Unicode-aware \b to XRegExp and hide it behind XRegExp's existing A (astral) flag, you could do the following:
XRegExp.addToken(
  /\\b/,
  () => String.raw`(?:(?<=\p{L}\p{M}*)(?!\p{L}\p{M}*)|(?<!\p{L}\p{M}*)(?=\p{L}\p{M}*))`,
  {flag: 'A'}
);
Or if you also wanted to support inverse Unicode word boundaries (\b and \B):
XRegExp.addToken(
  /\\([bB])/,
  (match) => {
    const inverse = match[1] === 'B';
    return inverse ?
      String.raw`(?:(?<=\p{L}\p{M}*)(?=\p{L}\p{M}*)|(?<!\p{L}\p{M}*)(?!\p{L}\p{M}*))` :
      String.raw`(?:(?<=\p{L}\p{M}*)(?!\p{L}\p{M}*)|(?<!\p{L}\p{M}*)(?=\p{L}\p{M}*))`;
  },
  {flag: 'A'}
);
Alternatively, you could avoid overloading the A flag and instead give this handling its own flag, such as b. That would just require changing {flag: 'A'} to {flag: 'b'} in the code above.
Note that by not specifying a scope for the tokens added, we're using default scope. That means that \b and \B will only be transformed when they are used outside of character classes ([...]). This is intentional, since \b has a different meaning within character classes in standard JS (it matches a backspace character), and \b or \B within character classes is an error in XRegExp.
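To make the "outside character classes only" behavior concrete, here is a standalone sketch that rewrites \b and \B in a pattern source string while leaving anything inside [...] untouched. This is not how XRegExp implements scopes; expandUnicodeBoundaries is a hypothetical helper written only for illustration, and unlike XRegExp it leaves \b inside a class alone rather than throwing.

```javascript
const W = String.raw`\p{L}\p{M}*`;
const B_LOWER = `(?:(?<=${W})(?!${W})|(?<!${W})(?=${W}))`; // Unicode-aware \b
const B_UPPER = `(?:(?<=${W})(?=${W})|(?<!${W})(?!${W}))`; // Unicode-aware \B

// Hypothetical helper: expand \b and \B only outside character classes.
function expandUnicodeBoundaries(source) {
  let out = '';
  let inClass = false;
  for (let i = 0; i < source.length; i++) {
    const ch = source[i];
    if (ch === '\\' && i + 1 < source.length) {
      const next = source[i + 1];
      if (!inClass && (next === 'b' || next === 'B')) {
        out += next === 'b' ? B_LOWER : B_UPPER;
      } else {
        // Any other escape (or \b inside a class) is copied through as-is.
        out += ch + next;
      }
      i++; // skip the escaped character
      continue;
    }
    if (ch === '[') inClass = true;
    else if (ch === ']') inClass = false;
    out += ch;
  }
  return out;
}

const re = new RegExp(expandUnicodeBoundaries(String.raw`\bвв\b`), 'u');
console.log(re.exec('ааа бб вв').index);                // 7
console.log(expandUnicodeBoundaries(String.raw`[\b]`)); // [\b] (still a backspace)
```

Tracking escapes before checking for brackets matters here: an escaped \[ or \] must not flip the in-class state.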
Heads up that this is untested. Also heads up that \p{...} doesn't have the intended meaning in ES2018 native regexes unless using flag u, so after adding the above XRegExp tokens you'd have to use flags A and u with your regex to make it work (e.g., XRegExp.tag('Au')`\b` or XRegExp(String.raw`\b`, 'Au')). That's fine if you always remember to use both, but there are two ways you could further improve that to avoid the problem if you forget:
1. Don't allow \b or \B with flag A unless flag u is also present (by checking for flag u within the token handler function shown above, and throwing an error if it's not present).
2. Use the reparse option. This will lead to XRegExp handling/parsing the generated \p{L}\p{M} tokens in the output, rather than deferring to native syntax. That should resolve the issue since XRegExp doesn't need flag u to transform \p{...} tokens into syntax supported by native regexes (with or without flag u).
I don't expect to add built-in support for Unicode word boundaries to XRegExp in the short term, but hopefully the details above are enough to add support within your own code.
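The first idea above (refusing \b or \B under flag A when flag u is missing) could look roughly like the sketch below. It assumes the token handler receives the regex's flags as its third argument, which is how XRegExp's addToken documentation describes handlers. The name unicodeBoundaryHandler is made up for this example, and the registration is guarded so the snippet also runs where XRegExp isn't loaded. Like the code above, it is untested against XRegExp itself.

```javascript
const W = String.raw`\p{L}\p{M}*`;

// Hypothetical handler: returns native syntax for a Unicode-aware \b or \B,
// and fails loudly if flag u is missing (without it, \p{...} would not mean
// "Unicode property" in the generated native regex).
function unicodeBoundaryHandler(match, scope, flags) {
  if (!flags.includes('u')) {
    throw new SyntaxError(`\\${match[1]} with flag A also requires flag u`);
  }
  return match[1] === 'B' ?
    `(?:(?<=${W})(?=${W})|(?<!${W})(?!${W}))` :
    `(?:(?<=${W})(?!${W})|(?<!${W})(?=${W}))`;
}

// Register only when XRegExp is available as a global (assumption: it may
// not be, e.g. when this snippet is run on its own).
if (typeof XRegExp !== 'undefined') {
  XRegExp.addToken(/\\([bB])/, unicodeBoundaryHandler, {flag: 'A'});
}
```

With this in place, forgetting flag u produces an error at regex construction time rather than a pattern that silently never matches.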
Thanks so much for the above code! Would love to get inbuilt support for this in the future!
XRegExp.addToken( /\\b/, () => String.raw`(?:(?<=\p{L}\p{M}*)(?!\p{L}\p{M}*)|(?<!\p{L}\p{M}*)(?=\p{L}\p{M}*))`, {flag: 'A'} );
This did not work for me. Here is my code:
XRegExp = require('xregexp')
base = require('xregexp/lib/addons/unicode-base')
base(XRegExp)
XRegExp.addToken(
/\\b/,
() => String.raw`(?:(?<=\p{L}\p{M}*)(?!\p{L}\p{M}*)|(?<!\p{L}\p{M}*)(?=\p{L}\p{M}*))`,
{flag: 'A'}
);
console.log(XRegExp.exec("ааа бб вв", XRegExp(/\bбб\b/), "uA")) // null
What am I doing wrong?
@mgoldenbe the code works fine, but you are incorrectly passing "uA" as a third argument to XRegExp.exec rather than as the second (flags) argument to the XRegExp constructor.
However, I prepared a long reply about an additional issue based on my initial misreading of your б characters (U+0431, Cyrillic Small Letter Be) as sixes. So I'll go ahead and include it below even though you might not need it.
Your code above is working as intended. See:
XRegExp.addToken(
  /\\b/,
  () => String.raw`(?:(?<=\p{L}\p{M}*)(?!\p{L}\p{M}*)|(?<!\p{L}\p{M}*)(?=\p{L}\p{M}*))`,
  {flag: 'A'}
);
const nativeWordBoundary = /\bXX\b/;
const unicodeLetterBoundary = XRegExp.tag('Au')`\bXX\b`;
nativeWordBoundary.test('愛XX愛'); // true
unicodeLetterBoundary.test('愛XX愛'); // false
unicodeLetterBoundary.test('XX'); // true
However, it seems you missed this from my comment above:
But first you need to define what a Unicode word character is. I'll use the rough approximation \p{L}\p{M}*, which matches any Unicode letter followed by any number of Unicode combining marks.
Note that native JS regex word boundaries treat ASCII letters, ASCII numbers, and underscore as "word characters". But above I defined a word character merely as a close approximation of a complete Unicode letter. I did not include any numbers (ASCII or otherwise) or underscore.
Based on your above code where you expected the number "6" to be treated as a word character, I'm guessing this was not the definition of "word character" you were looking for. You can change it to anything you want while following the overall code in my comment.
For example, here's a slight modification of my example code that supports Unicode-aware versions of both \b and \B (behind flags 'Au') and that treats any Unicode letter, Unicode number, or underscore as a word character:
XRegExp.addToken(
  /\\([bB])/,
  (match) => {
    const inverse = match[1] === 'B';
    const unicodeLetter = String.raw`\p{L}\p{M}*`;
    const unicodeNumber = String.raw`\p{N}`;
    const other = '_';
    const w = `(?:(?:${unicodeLetter})|(?:${unicodeNumber})|(?:${other}))`;
    return inverse ?
      `(?:(?<=${w})(?=${w})|(?<!${w})(?!${w}))` :
      `(?:(?<=${w})(?!${w})|(?<!${w})(?=${w}))`;
  },
  {flag: 'A'}
);
XRegExp.exec("ааа бб вв", XRegExp.tag('u')`\bбб\b`); // null
XRegExp.exec("ааа бб вв", XRegExp.tag('Au')`\bбб\b`); // ['бб', index: 4, ...]
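To see how much the choice of "word character" matters, here is a small plain-JavaScript comparison (no XRegExp needed, ES2018+ assumed). It builds the same boundary from two definitions: letters only, and letters plus numbers plus underscore. The names letter, wordChar, and boundary are illustrative only.

```javascript
// Two candidate definitions of a "word character".
const letter = String.raw`\p{L}\p{M}*`;
const wordChar = String.raw`(?:\p{L}\p{M}*|\p{N}|_)`;

// Build a Unicode-aware \b from whichever definition is passed in.
const boundary = (w) => `(?:(?<=${w})(?!${w})|(?<!${w})(?=${w}))`;

const lettersOnly = new RegExp(`${boundary(letter)}66${boundary(letter)}`, 'u');
const lettersDigits = new RegExp(`${boundary(wordChar)}66${boundary(wordChar)}`, 'u');

const sample = 'ааа 66 вв';
console.log(lettersOnly.exec(sample));         // null (digits are not word characters here)
console.log(lettersDigits.exec(sample).index); // 4
```

Under the letters-only definition there is no boundary between a space and a digit, which is why a number surrounded by spaces never matches.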
@slevithan Thank you for the detailed reply!
In the meantime, I discovered this post. I am wondering whether there is any advantage (other than the aesthetic pleasantness of \b) to using XRegExp compared to the plain JS solutions there.
@mgoldenbe there are a couple of potential advantages to using the XRegExp addon above over the solution in that post, especially if you're already including XRegExp in your code:
\b and \B.
And you get extra polish for free, like erroring when trying to use word boundaries in character classes, support for non-word-boundaries (\B), and pattern caching for better performance.
For many other XRegExp addons (that don't rely on native lookbehind support like this does), XRegExp would also give you the advantage of working in all ES5+ browsers.
Is it possible to extend the Unicode support to the word boundary anchor?
For example, the Russian sentence cannot be split:
^ note the split has no effect on Russian
The equivalent and desired behaviour in Ruby, for example: