jgm / skylighting

A Haskell syntax highlighting library with tokenizers derived from KDE syntax highlighting descriptions

wrong highlighting for json with unicode character ó #90

Closed SrAceves closed 4 years ago

SrAceves commented 4 years ago

Running

pandoc --standalone --highlight-style=tango example.md -o example.html

on example.md with the following fenced json code:

```json
{
    "descripción" : "it causes it to apply the error style"
}
```
applies an error style seemingly because it includes the character `ó` -- i.e. unicode `U+00F3`.

If you run it without the offending `ó`, it applies the correct style:
```json
{
    "descripcion" : "it causes it to apply the error style"
}
```

The bug seems to be in the regex engine of skylighting as it is correctly rendered in Kate -- with the same `json.xml` grammar definition.

jgm commented 4 years ago

Is the source UTF-8 encoded? We are enabling UTF-8 mode for the regex engine. [I guess if it weren't, pandoc would have thrown an error.]

SrAceves commented 4 years ago

> Is the source UTF-8 encoded? We are enabling UTF-8 mode for the regex engine.

Yes

jgm commented 4 years ago

Trace output:

Trying rule Rule {rMatcher = RegExpr (RE {reString = "\\\\(?:[\"\\\\/bfnrt]|u[0-9a-fA-f]{4})", reCaseSensitive = True}), rAttribute = CharTok, rIncludeAttribute = False, rDynamic = False, rCaseSensitive = True, rChildren = [], rLookahead = False, rFirstNonspace = False, rColumn = Nothing, rContextSwitch = []}
FALLTHROUGH Just (DataTypeTok,"descripci\243n\"")

Compare without the accent:

Trying rule Rule {rMatcher = RegExpr (RE {reString = "\\\\(?:[\"\\\\/bfnrt]|u[0-9a-fA-f]{4})", reCaseSensitive = True}), rAttribute = CharTok, rIncludeAttribute = False, rDynamic = False, rCaseSensitive = True, rChildren = [], rLookahead = False, rFirstNonspace = False, rColumn = Nothing, rContextSwitch = []}
FALLTHROUGH Just (DataTypeTok,"descripcion")
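
A note on reading the first trace: the `\243` is not corrupted input. Assuming the trace is printed with Haskell's standard `Show` for strings, non-ASCII characters appear as decimal escapes, and 243 is the decimal form of `U+00F3` (`ó`). The sketch below uses only `base`; the names `capturedKey` and `hexCodePoint` are made up for illustration.

```haskell
import Data.Char (ord)
import Numeric (showHex)

-- The text from the FALLTHROUGH line of the first trace: the key plus
-- the closing double quote that was captured along with it.
capturedKey :: String
capturedKey = "descripción\""

-- Hexadecimal code point of a character, e.g. for writing it as U+00F3.
hexCodePoint :: Char -> String
hexCodePoint c = showHex (ord c) ""

main :: IO ()
main = do
  putStrLn (show capturedKey)   -- prints "descripci\243n\""
  print (ord 'ó')               -- prints 243
  putStrLn (hexCodePoint 'ó')   -- prints f3
```

So the only real difference between the two traces is the trailing `\"` inside the token, not the escape itself.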

So when the accent is present, the fallthrough case is capturing the final `"` as well, and that's what causes the problem. No idea why.
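
To take pandoc out of the picture, the tokenizer can be called directly. This is a minimal sketch, assuming the `skylighting` and `text` packages are installed and that the bundled syntax map contains the JSON grammar under the name `json`; `sample` and `tokenizeJson` are hypothetical names, and the thread does not say how the trace above was actually produced.

```haskell
{-# LANGUAGE OverloadedStrings #-}
-- Assumes the `skylighting` and `text` packages are available.
import Data.Text (Text)
import Skylighting
  ( SourceLine, TokenizerConfig (..), defaultSyntaxMap, lookupSyntax, tokenize )

-- The JSON from the report, with the accented key.
sample :: Text
sample = "{\n    \"descripción\" : \"it causes it to apply the error style\"\n}\n"

-- Tokenize with the bundled JSON grammar. Left carries an error message.
-- With withTrace = True the tokenizer also emits its rule-by-rule trace.
tokenizeJson :: Bool -> Text -> Either String [SourceLine]
tokenizeJson withTrace input =
  case lookupSyntax "json" defaultSyntaxMap of
    Nothing  -> Left "json syntax not found"
    Just syn -> tokenize cfg syn input
  where
    cfg = TokenizerConfig { syntaxMap = defaultSyntaxMap, traceOutput = withTrace }

main :: IO ()
main = either putStrLn (mapM_ print) (tokenizeJson True sample)
```

Each printed line is a list of `(TokenType, Text)` pairs, so an `ErrorTok` entry on the key line would show the problem without going through HTML rendering.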