atWordBoundary :: Match -> Bool
atWordBoundary m =
case matchOffset m of
0 -> True
n | n == B.length (matchBytes m) -> True
| otherwise ->
case lastCharOffset (matchBytes m) (matchOffset m) of
Nothing -> True
Just off ->
case U.toString (B.drop (off - 1) (matchBytes m)) of
(prev:cur:next:_) ->
(isWordChar cur /= isWordChar next) ||
(isWordChar cur /= isWordChar prev)
_ -> True
My detailed explanation
- `0 -> True`: Start of text is a word boundary
- `n | n == B.length (matchBytes m) -> True`: End of text is a word boundary
- `off` is the index of the char preceding the `\b` match (`u` above)
- the `case` after `Just off ->` matches `text[off - 1:]` (sorry for this Python notation :sweat_smile:)
- so in the example above we would have:
- `prev` is the space
- `cur` is the `u`
- `next` is the `e`
- hence the boolean expression is true (cur ≠ prev branch)
Also I'm wondering if using off - 1 instead of lastCharOffset is safe.
EDIT: I have found other issues and simplifications, see the PR
Suggested modification
:warning: absolutely untested modification
case U.toString (B.drop off (matchBytes m)) of
(cur:next:_) -> isWordChar cur /= isWordChar next
TL;DR: being fixed in #181
Hello,
Test regexp
Full
```xmltest.xml
Test code
Kate render
from
frome
u
ue
Kate is :heavy_check_mark:
Pandoc render
from
frome
u
ue
The
u
inue
is highlighted as a keyword, despite:\b
e
.Suspected source code
skylighting-core/src/Regex/KDE/Match.hs:157
I do not understand why we are looking at
prev
at all: we only need to match the characters right before and after the 0-len\b
, don't we? https://stackoverflow.com/questions/24587090/what-is-the-equivalent-of-word-boundary-meta-character-in-javaMy detailed explanation
- `0 -> True`: Start of text is a word boundary - `n | n == B.length (matchBytes m) -> True`: End of text is a word boundary - `off` is the index of the char preceding the `\b` match (`u` above) - the `case` after `Just off ->` matches `text[off - 1:]` (sorry for this Python notation :sweat_smile:) - so in the example above we would have: - `prev` is the space - `cur` is the `u` - `next` is the `e` - hence the boolean expression is true (cur ≠ prev branch)Also I'm wondering if using
off - 1
instead oflastCharOffset
is safe.EDIT: I have found other issues and simplifications, see the PR
Suggested modification
:warning: absolutely untested modification
EDIT: This is not enough, see my PR
And a test for skylighting-core/test/test-skylighting.hs (must fail before modification #TDD)
Thanks for reading! :heart: