jgm / skylighting

A Haskell syntax highlighting library with tokenizers derived from KDE syntax highlighting descriptions
189 stars 61 forks source link

atWordBoundary seems wrong #180

Closed numero-744 closed 7 months ago

numero-744 commented 7 months ago

TL;DR: being fixed in #181

Hello,

Test regexp

<RegExpr attribute="test" String="from\b" />
<RegExpr attribute="test" String="u\b" />
Full test.xml ```xml ```

Test code

from frome u ue

Kate render

from frome u ue
from frome u ue
:heavy_check_mark: :heavy_check_mark: :heavy_check_mark: :heavy_check_mark:

Kate is :heavy_check_mark:

Pandoc render

from frome u ue
from frome u ue
:heavy_check_mark: :heavy_check_mark: :heavy_check_mark: :x:

The u in ue is highlighted as a keyword, despite:

  1. in the expression it is followed by a word boundary \b
  2. in the text it is not followed by a word boundary because of the e.

Suspected source code

Disclaimer:

  1. I'm far from being a Haskell dev
  2. I have never worked on a regexp engine

skylighting-core/src/Regex/KDE/Match.hs:157

atWordBoundary :: Match -> Bool
atWordBoundary m =
  case matchOffset m of
    0 -> True
    n | n == B.length (matchBytes m) -> True
      | otherwise ->
           case lastCharOffset (matchBytes m) (matchOffset m) of
             Nothing  -> True
             Just off ->
               case U.toString (B.drop (off - 1) (matchBytes m)) of
                 (prev:cur:next:_) ->
                   (isWordChar cur /= isWordChar next) ||
                   (isWordChar cur /= isWordChar prev)
                 _ -> True

I do not understand why we are looking at prev at all: we only need to match the characters right before and after the 0-len \b, don't we? https://stackoverflow.com/questions/24587090/what-is-the-equivalent-of-word-boundary-meta-character-in-java

My detailed explanation - `0 -> True`: Start of text is a word boundary - `n | n == B.length (matchBytes m) -> True`: End of text is a word boundary - `off` is the index of the char preceding the `\b` match (`u` above) - the `case` after `Just off ->` matches `text[off - 1:]` (sorry for this Python notation :sweat_smile:) - so in the example above we would have: - `prev` is the space - `cur` is the `u` - `next` is the `e` - hence the boolean expression is true (cur ≠ prev branch)

Also I'm wondering if using off - 1 instead of lastCharOffset is safe.

EDIT: I have found other issues and simplifications, see the PR

Suggested modification

:warning: absolutely untested modification

               case U.toString (B.drop off (matchBytes m)) of
                 (cur:next:_) -> isWordChar cur /= isWordChar next

EDIT: This is not enough, see my PR

And a test for skylighting-core/test/test-skylighting.hs (must fail before modification #TDD)

  , ("u\\b", "ue", Nothing)

Thanks for reading! :heart: