jgm / skylighting

A Haskell syntax highlighting library with tokenizers derived from KDE syntax highlighting descriptions
195 stars 63 forks source link

In C, floating-point literals are partially highlighted as errors #174

Closed bdesham closed 1 year ago

bdesham commented 1 year ago

Consider this C file:

double x = 0.5;

With skylighting 0.13.4, the .5 part is highlighted as an error:

$ skylighting --format=native example.c
[ [ ( DataTypeTok , "double" )
  , ( NormalTok , " x " )
  , ( OperatorTok , "=" )
  , ( NormalTok , " " )
  , ( DecValTok , "0" )
  , ( ErrorTok , ".5" )
  , ( OperatorTok , ";" )
  ]
]

If I use the C++ syntax instead, I get the expected behavior where the entire 0.5 is treated as a single number:

$ skylighting --format=native --syntax=cpp example.c
[ [ ( DataTypeTok , "double" )
  , ( NormalTok , " x " )
  , ( OperatorTok , "=" )
  , ( NormalTok , " " )
  , ( FloatTok , "0.5" )
  , ( OperatorTok , ";" )
  ]
]

Please let me know if there’s any further information I can provide that would be helpful!

jgm commented 1 year ago

Here's the relevant part of --trace output:

Trying rule Rule {rMatcher = RegExpr (RE {reString = "\\.?[0-9]", reCaseSensitive = True}), rAttribute = NormalTok, rIncludeAttribute = False, rDynamic = False, rCaseSensitive = True, rChildren = [], rLookahead = True, rFirstNonspace = False, rColumn = Nothing, rContextSwitch = [Push ("C","Number")]}
RegExpr MATCHED Nothing
CONTEXT STACK ["Number","Normal"]
Trying rule Rule {rMatcher = WordDetect "0", rAttribute = DecValTok, rIncludeAttribute = False, rDynamic = False, rCaseSensitive = True, rChildren = [], rLookahead = False, rFirstNonspace = False, rColumn = Nothing, rContextSwitch = [Push ("C","IntSuffix")]}
WordDetect MATCHED Just (DecValTok,"0")
CONTEXT STACK ["IntSuffix","Number","Normal"]
Trying rule Rule {rMatcher = DetectChar '\'', rAttribute = ErrorTok, rIncludeAttribute = False, rDynamic = False, rCaseSensitive = True, rChildren = [], rLookahead = False, rFirstNonspace = False, rColumn = Nothing, rContextSwitch = []}
Trying rule Rule {rMatcher = RegExpr (RE {reString = "([Uu](LL?|ll?|wb|WB)?|(LL?|ll?|wb|WB)[Uu]?)\\b", reCaseSensitive = True}), rAttribute = BuiltInTok, rIncludeAttribute = False, rDynamic = False, rCaseSensitive = True, rChildren = [], rLookahead = False, rFirstNonspace = False, rColumn = Nothing, rContextSwitch = [Push ("C","NumericSuffixError")]}
CONTEXT STACK ["NumericSuffixError","IntSuffix","Number","Normal"]
Trying rule Rule {rMatcher = AnyChar (fromList "'.0123456789"), rAttribute = ErrorTok, rIncludeAttribute = False, rDynamic = False, rCaseSensitive = True, rChildren = [], rLookahead = False, rFirstNonspace = False, rColumn = Nothing, rContextSwitch = []}
AnyChar MATCHED Just (ErrorTok,".")
Trying rule Rule {rMatcher = AnyChar (fromList "'.0123456789"), rAttribute = ErrorTok, rIncludeAttribute = False, rDynamic = False, rCaseSensitive = True, rChildren = [], rLookahead = False, rFirstNonspace = False, rColumn = Nothing, rContextSwitch = []}
AnyChar MATCHED Just (ErrorTok,"5")
Trying rule Rule {rMatcher = AnyChar (fromList "'.0123456789"), rAttribute = ErrorTok, rIncludeAttribute = False, rDynamic = False, rCaseSensitive = True, rChildren = [], rLookahead = False, rFirstNonspace = False, rColumn = Nothing, rContextSwitch = []}
Trying rule Rule {rMatcher = DetectIdentifier, rAttribute = ErrorTok, rIncludeAttribute = False, rDynamic = False, rCaseSensitive = True, rChildren = [], rLookahead = False, rFirstNonspace = False, rColumn = Nothing, rContextSwitch = []}
CONTEXT STACK ["IntSuffix","Number","Normal"]
CONTEXT STACK ["Number","Normal"]
CONTEXT STACK ["Normal"]
Trying rule Rule {rMatcher = DetectSpaces, rAttribute = NormalTok, rIncludeAttribute = False, rDynamic = False, rCaseSensitive = True, rChildren = [], rLookahead = False, rFirstNonspace = False, rColumn = Nothing, rContextSwitch = []}
Trying rule Rule {rMatcher = AnyChar (fromList "!&()*+,-:;=>]^|~"), rAttribute = OperatorTok, rIncludeAttribute = False, rDynamic = False, rCaseSensitive = True, rChildren = [], rLookahead = False, rFirstNonspace = False, rColumn = Nothing, rContextSwitch = []}
AnyChar MATCHED Just (OperatorTok,";")
double x = 0.5;
jgm commented 1 year ago

looks like that first match is wrong: we don't match the regex /\.[0-9]/ when we're at 0.5. never mind, there is a ? after the \. so it does match.

jgm commented 1 year ago

Just going through the syntax description manually, I can't see where the above trace is going wrong....do you?

bdesham commented 1 year ago

I’m not really familiar with how to read these XML files, but the trace seems to show that the zero is being matched separately from the rest of the numeric literal, and if I change the zero to another digit I do get the results I expect:

$ cat example2.c
double x = 1.5;

$ skylighting --format=native example2.c
[ [ ( DataTypeTok , "double" )
  , ( NormalTok , " x " )
  , ( OperatorTok , "=" )
  , ( NormalTok , " " )
  , ( FloatTok , "1.5" )
  , ( OperatorTok , ";" )
  ]
]
bdesham commented 1 year ago

Looking at the traces, both the 0.5 example and the 1.5 example share the line

Trying rule Rule {rMatcher = WordDetect "0", rAttribute = DecValTok, rIncludeAttribute = False, rDynamic = False, rCaseSensitive = True, rChildren = [], rLookahead = False, rFirstNonspace = False, rColumn = Nothing, rContextSwitch = [Push ("C","IntSuffix")]}

After that—up to but not including the parsing of the semicolon at the end of the line—the successful example has

Trying rule Rule {rMatcher = RegExpr (RE {reString = "\\.(?:[0-9](?:'?[0-9]++)*+)(?:[eE][+-]?(?:[0-9](?:'?[0-9]++)*+))?|0[xX](?:\\.(?:[0-9A-Fa-f](?:'?[0-9A-Fa-f]++)*+)(?:[pP][-+]?(?:[0-9](?:'?[0-9]++)*+))?|(?:[0-9A-Fa-f](?:'?[0-9A-Fa-f]++)*+)(?:(?:[pP][-+]?(?:[0-9](?:'?[0-9]++)*+))|\\.(?:[0-9A-Fa-f](?:'?[0-9A-Fa-f]++)*+)?(?:[pP][-+]?(?:[0-9](?:'?[0-9]++)*+))?))|(?:[0-9](?:'?[0-9]++)*+)(?:(?:[eE][+-]?(?:[0-9](?:'?[0-9]++)*+))|\\.(?:[0-9](?:'?[0-9]++)*+)?(?:[eE][+-]?(?:[0-9](?:'?[0-9]++)*+))?)", reCaseSensitive = True}), rAttribute = FloatTok, rIncludeAttribute = False, rDynamic = False, rCaseSensitive = True, rChildren = [], rLookahead = False, rFirstNonspace = False, rColumn = Nothing, rContextSwitch = [Push ("C","FloatSuffix")]}
RegExpr MATCHED Just (FloatTok,"1.5")
CONTEXT STACK ["FloatSuffix","Number","Normal"]
Trying rule Rule {rMatcher = DetectChar '\'', rAttribute = ErrorTok, rIncludeAttribute = False, rDynamic = False, rCaseSensitive = True, rChildren = [], rLookahead = False, rFirstNonspace = False, rColumn = Nothing, rContextSwitch = []}
Trying rule Rule {rMatcher = AnyChar (fromList "FLfl"), rAttribute = BuiltInTok, rIncludeAttribute = False, rDynamic = False, rCaseSensitive = True, rChildren = [], rLookahead = False, rFirstNonspace = False, rColumn = Nothing, rContextSwitch = [Push ("C","NumericSuffixError")]}
Trying rule Rule {rMatcher = Detect2Chars 'd' 'f', rAttribute = BuiltInTok, rIncludeAttribute = False, rDynamic = False, rCaseSensitive = True, rChildren = [], rLookahead = False, rFirstNonspace = False, rColumn = Nothing, rContextSwitch = [Push ("C","NumericSuffixError")]}
Trying rule Rule {rMatcher = Detect2Chars 'D' 'F', rAttribute = BuiltInTok, rIncludeAttribute = False, rDynamic = False, rCaseSensitive = True, rChildren = [], rLookahead = False, rFirstNonspace = False, rColumn = Nothing, rContextSwitch = [Push ("C","NumericSuffixError")]}
Trying rule Rule {rMatcher = Detect2Chars 'd' 'd', rAttribute = BuiltInTok, rIncludeAttribute = False, rDynamic = False, rCaseSensitive = True, rChildren = [], rLookahead = False, rFirstNonspace = False, rColumn = Nothing, rContextSwitch = [Push ("C","NumericSuffixError")]}
Trying rule Rule {rMatcher = Detect2Chars 'D' 'D', rAttribute = BuiltInTok, rIncludeAttribute = False, rDynamic = False, rCaseSensitive = True, rChildren = [], rLookahead = False, rFirstNonspace = False, rColumn = Nothing, rContextSwitch = [Push ("C","NumericSuffixError")]}
Trying rule Rule {rMatcher = Detect2Chars 'd' 'l', rAttribute = BuiltInTok, rIncludeAttribute = False, rDynamic = False, rCaseSensitive = True, rChildren = [], rLookahead = False, rFirstNonspace = False, rColumn = Nothing, rContextSwitch = [Push ("C","NumericSuffixError")]}
Trying rule Rule {rMatcher = Detect2Chars 'D' 'L', rAttribute = BuiltInTok, rIncludeAttribute = False, rDynamic = False, rCaseSensitive = True, rChildren = [], rLookahead = False, rFirstNonspace = False, rColumn = Nothing, rContextSwitch = [Push ("C","NumericSuffixError")]}
CONTEXT STACK ["NumericSuffixError","FloatSuffix","Number","Normal"]
Trying rule Rule {rMatcher = AnyChar (fromList "'.0123456789"), rAttribute = ErrorTok, rIncludeAttribute = False, rDynamic = False, rCaseSensitive = True, rChildren = [], rLookahead = False, rFirstNonspace = False, rColumn = Nothing, rContextSwitch = []}
Trying rule Rule {rMatcher = DetectIdentifier, rAttribute = ErrorTok, rIncludeAttribute = False, rDynamic = False, rCaseSensitive = True, rChildren = [], rLookahead = False, rFirstNonspace = False, rColumn = Nothing, rContextSwitch = []}
CONTEXT STACK ["FloatSuffix","Number","Normal"]
CONTEXT STACK ["Number","Normal"]
CONTEXT STACK ["Normal"]

while the buggy example has

WordDetect MATCHED Just (DecValTok,"0")
CONTEXT STACK ["IntSuffix","Number","Normal"]
Trying rule Rule {rMatcher = DetectChar '\'', rAttribute = ErrorTok, rIncludeAttribute = False, rDynamic = False, rCaseSensitive = True, rChildren = [], rLookahead = False, rFirstNonspace = False, rColumn = Nothing, rContextSwitch = []}
Trying rule Rule {rMatcher = RegExpr (RE {reString = "([Uu](LL?|ll?|wb|WB)?|(LL?|ll?|wb|WB)[Uu]?)\\b", reCaseSensitive = True}), rAttribute = BuiltInTok, rIncludeAttribute = False, rDynamic = False, rCaseSensitive = True, rChildren = [], rLookahead = False, rFirstNonspace = False, rColumn = Nothing, rContextSwitch = [Push ("C","NumericSuffixError")]}
CONTEXT STACK ["NumericSuffixError","IntSuffix","Number","Normal"]
Trying rule Rule {rMatcher = AnyChar (fromList "'.0123456789"), rAttribute = ErrorTok, rIncludeAttribute = False, rDynamic = False, rCaseSensitive = True, rChildren = [], rLookahead = False, rFirstNonspace = False, rColumn = Nothing, rContextSwitch = []}
AnyChar MATCHED Just (ErrorTok,".")
Trying rule Rule {rMatcher = AnyChar (fromList "'.0123456789"), rAttribute = ErrorTok, rIncludeAttribute = False, rDynamic = False, rCaseSensitive = True, rChildren = [], rLookahead = False, rFirstNonspace = False, rColumn = Nothing, rContextSwitch = []}
AnyChar MATCHED Just (ErrorTok,"5")
Trying rule Rule {rMatcher = AnyChar (fromList "'.0123456789"), rAttribute = ErrorTok, rIncludeAttribute = False, rDynamic = False, rCaseSensitive = True, rChildren = [], rLookahead = False, rFirstNonspace = False, rColumn = Nothing, rContextSwitch = []}
Trying rule Rule {rMatcher = DetectIdentifier, rAttribute = ErrorTok, rIncludeAttribute = False, rDynamic = False, rCaseSensitive = True, rChildren = [], rLookahead = False, rFirstNonspace = False, rColumn = Nothing, rContextSwitch = []}
CONTEXT STACK ["IntSuffix","Number","Normal"]
CONTEXT STACK ["Number","Normal"]
CONTEXT STACK ["Normal"]

It really looks like the buggy case treats “0” as an integer literal.

jgm commented 1 year ago

So the question is why WordDetect "0" is there! Presumably it shouldn't be matching 0 in 0.5; I'll have to look at the implementation of WordDetect..

jgm commented 1 year ago

I understand now. Full rule is

        <WordDetect attribute="Decimal" context="IntSuffix" String="0" weakDeliminator="."/>

Here weakDeliminator adds characters that are allowed to belong to the word and hence not treated as word separators. We're apparently ignoring that and treating . as a word separator.

bdesham commented 1 year ago

Awesome, thank you very much 😊