jgm / skylighting

A Haskell syntax highlighting library with tokenizers derived from KDE syntax highlighting descriptions
189 stars 61 forks source link

Issue with hex escaped characters #135

Closed Agnishom closed 2 years ago

Agnishom commented 2 years ago

According to regex101.com, regexes could use a \x to escape a hex character. For instance, \x3f would mean ?. This may also appear inside character classes. But the current code does not support this.

https://github.com/jgm/skylighting/blob/6ea49f458362a1084b9e3a51458eaa40def0d8f5/skylighting-core/src/Regex/KDE/Compile.hs#L191

You can add the following lines of code to the pEscaped function to add this functionality.

    'x' -> do -- \xHH matches hex HH
      ds <- A.take 2
      case readMay ("'\\x" ++ U.toString ds ++ "'") of
        Just x -> return x
        Nothing -> fail "invalid hex character escape"
jgm commented 2 years ago

Oh, yes, I see that in fact this syntax is used in the syntax definitions:

skylighting-core/xml/powershell.xml
888:        <RegExpr attribute="Variable" context="#stay" String="\$+[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*" />
Agnishom commented 2 years ago

Great. Can you take a look at whether the patch I suggested above will fix it?

jgm commented 2 years ago

Yes, the fix was just right -- but I also added code to handle \x{HH...} which also occurs in the syntax defs. Thanks!

Agnishom commented 2 years ago

According to regex101.com, it seems that \x2 is also an acceptable way of writing \x02. How can I support this style?

Agnishom commented 2 years ago

What do you think of this?

    'x' -> do -- \xhh matches hex hh, \x{h+} matches hex h+
      ds <- (satisfy (== 123) *> A.takeWhile (/= 125) <* satisfy (== 125))
             <|> ( do
                      x1 <- satisfy (inClass "a-fA-F0-9")
                      x2 <- peekWord8
                      case x2 of
                        Nothing -> return $ B.pack [x1]
                        Just x2' -> if inClass "a-fA-F0-9" x2'
                                    then return $ B.pack [x1, x2']
                                    else return $ B.pack [x1]
                  )
jgm commented 2 years ago

peekWord8 doesn't consume the character. So you'd need to be sure to consume the character if it's in the relevant class.

Why not just

x2 <- Just <$> satisfy (inClass "a-fA-F0-9")) <|> pure Nothing
jgm commented 2 years ago

But I'd want to confirm that this works with KDE's regex engine before adding anything to this library.

Agnishom commented 2 years ago

You are right, I had to make that change.

'x' -> do -- \xhh matches hex hh, \x{h+} matches hex h+
      ds <- (satisfy (== 123) *> A.takeWhile (/= 125) <* satisfy (== 125))
             <|> ( do
                      x1 <- satisfy (inClass "a-fA-F0-9")
                      x2 <- peekWord8
                      case x2 of
                        Nothing -> return $ B.pack [x1]
                        Just x2' -> if inClass "a-fA-F0-9" x2'
                                    then anyWord8 >> (return $ B.pack [x1, x2'])
                                    else return $ B.pack [x1]
                  )

also seems to work