jgm / skylighting

A Haskell syntax highlighting library with tokenizers derived from KDE syntax highlighting descriptions
194 stars 62 forks source link

skylightling-0.7.7 crashes on integer literal of length 22 or higher #81

Closed trofi closed 5 years ago

trofi commented 5 years ago

Noticed by accident when noticed that one of my old blog posts (render is hakyll/skylighting) started crashing on render. Minimal reproducer:

$ echo 'int main(){ return 111111111111111111111;}' | skylighting --syntax=c
int main(){ return 111111111111111111111;}
$ echo 'int main(){ return 1111111111111111111111;}' | skylighting --syntax=c
skylighting: RegexException "Error in Text.Regex.PCRE.Wrap: ReturnCode (-8)"

Could skylighting improve error message or skip an error?

trofi commented 5 years ago

I realized 0.7.7 is not the most recent version. Built 0.8.1.1 locally and verified it fails the same:

$ ./skylighting --version
skylighting 0.8.1.1
$ echo 'int main(){ return 1111111111111111111111;}' | ./skylighting --syntax=c
skylighting: RegexException "Error in Text.Regex.PCRE.Wrap: ReturnCode (-8)"
$ echo 'int main(){ return 111111111111111111111;}' | ./skylighting --syntax=c
int main(){ return 111111111111111111111;}
jgm commented 5 years ago

--trace is helpful:

Trying rule Rule {rMatcher = RegExpr (RE {reString = "\\.(?:[0-9](?:'?[0-9]+)*)(?:[eE][+-]?(?:[0-9](?:'?[0-9]+)*))?|(?:[0-9](?:'?[0-9]+)*)(?:(?:[eE][+-]?(?:[0-9](?:'?[0-9]+)*))|\\.(?:[0-9](?:'?[0-9]+)*)?(?:[eE][+-]?(?:[0-9](?:'?[0-9]+)*))?)|0[xX](?:\\.(?:[0-9A-Fa-f](?:'?[0-9A-Fa-f]+)*)(?:[pP][-+]?(?:[0-9](?:'?[0-9]+)*))?|(?:[0-9A-Fa-f](?:'?[0-9A-Fa-f]+)*)(?:(?:[pP][-+]?(?:[0-9](?:'?[0-9]+)*))|\\.(?:[0-9A-Fa-f](?:'?[0-9A-Fa-f]+)*)?(?:[pP][-+]?(?:[0-9](?:'?[0-9]+)*))?))", reCaseSensitive = True}), rAttribute = FloatTok, rIncludeAttribute = False, rDynamic = False, rCaseSensitive = True, rChildren = [], rLookahead = False, rFirstNonspace = False, rColumn = Nothing, rContextSwitch = [Push ("C","FloatSuffix")]}
skylighting: RegexException "Error in Text.Regex.PCRE.Wrap: ReturnCode (-8)"

This tells you the regex it crashed on.

jgm commented 5 years ago

Corresponding bit of c.xml:

        <RegExpr attribute="Float" context="FloatSuffix" String="\.&int;&exp_float;?|&int;(?:&exp_float;|\.&int;?&exp_float;?)|0[xX](?:\.&hex_int;&exp_hexfloat;?|&hex_int;(?:&exp_hexfloat;|\.&hex_int;?&exp_hexfloat;?))" />

Added in an update Feb. 2019.

jgm commented 5 years ago

In pcre.h I see error code -8:

#define PCRE_ERROR_MATCHLIMIT       (-8)
jgm commented 5 years ago

Not sure what the issue is with the regex. As you suggest, it might be reasonable to change the code so that regexes silently fail to match if an exception is raised. The only problem is that this may lead problems to escape our notice.

skvadrik commented 5 years ago

PCRE uses exhaustive backtracking matching algorithm. When presented with alternative, it tries to match the first branch; if no luck, it tries the second and so on. Worst-case time compexity of this process is exponential. PCRE_ERROR_MATCHLIMIT is a hard limit to stop in case the process takes too long, as explained here:

PCRE_ERROR_MATCHLIMIT (-8) The backtracking limit, as specified by the match_limit field in a pcre_extra structure (or defaulted) was reached. See the description above.

The problem with the above regexp is that it doesn't match simple integer numbers. It does match .123 or 123e4 or 0x.123 or 0x123p4, but not 123 or 0x123. Because simple numbers (rejected by the regexp) are prefixes of more complex numbers (accepted by the regex), PCRE has to assume that eventually it will find e4 suffix after 123 (or some such), and it goes on and on, exhausting all possible alternatives, making exponential number of steps.

I can think of the following fixes:

Option 1. Add simple integer patterns to the regexp. Some care is needed to make complex pattern match first, namely, the more complex alternative must occur before the simpler one. In our case, it amounts to adding one question mark to make exponent part optional (123e1 will still match before 123), and putting hexadecimal alternative before the rest so that 0x123 matches as a whole and not as 0 and 123. The resulting regexp:

reString = "0[xX](?:\\.(?:[0-9A-Fa-f](?:'?[0-9A-Fa-f]+)*)(?:[pP][-+]?(?:[0-9](?:'?[0-9]+)*))?|(?:[0-9A-Fa-f](?:'?[0-9A-Fa-f]+)*)(?:(?:[pP][-+]?(?:[0-9](?:'?[0-9]+)*))?|\\.(?:[0-9A-Fa-f](?:'?[0-9A-Fa-f]+)*)?(?:[pP][-+]?(?:[0-9](?:'?[0-9]+)*))?))|\\.(?:[0-9](?:'?[0-9]+)*)(?:[eE][+-]?(?:[0-9](?:'?[0-9]+)*))?|(?:[0-9](?:'?[0-9]+)*)(?:(?:[eE][+-]?(?:[0-9](?:'?[0-9]+)*))|\\.(?:[0-9](?:'?[0-9]+)*)?(?:[eE][+-]?(?:[0-9](?:'?[0-9]+)*))?)?"

GHCI session:

*Main X> "1111111111111111111111" =~ reString :: [[String]]  
[["1111111111111111111111"]]
*Main X> "1111111111111111111111e1" =~ reString :: [[String]]
[["1111111111111111111111e1"]]
*Main X> ".1111111111111111111111" =~ reString :: [[String]] 
[[".1111111111111111111111"]]
*Main X> "0x1111111111111111111111" =~ reString :: [[String]] 
[["0x1111111111111111111111"]]
*Main X> "0x1111111111111111111111p1" =~ reString :: [[String]]
[["0x1111111111111111111111p1"]]

Option 2. Don't use backtracking engines, use automata-based ones (e.g. TRE or Haskell bindings to RE2). They treat all alternatives as equal. :)

jgm commented 5 years ago

Thanks for the diagnosis. Note that the regex comes from KDE's syntax highlighting file for C; we don't control that and I would prefer not to make local changes unless absolutely necessary. Of course it could be changed upstream, too, but my guess is that it probably works fine with KDE's regex engine. (One could confirm by trying this in the Kate editor.)

I'm not sure what regex engine KDE's syntax highlighting uses. The syntax seems PCRE, but they may be using a non-backtracking engine. In any case, I chose PCRE because the regexes include syntax not supported in other Haskell regex libraries, and because via regex-pcre-builtin we have a solution that doesn't require separate installation of a C library, which is an obstacle for some users. re2 might satisfy the first criterion, but not the second.

jgm commented 5 years ago

The problem with your Option 1 is that this matcher is only supposed to match floats; Integers are matched by a separate one; this change would result in integers being tagged as floats, which we don't want.

jgm commented 5 years ago

The branch no-regex-exception uses a "just plow ahead" strategy and treats the regex failure as a failed match rather than raising an exception. This leaves us without any information about the problem, though, so I'm reluctant to use this solution.

skvadrik commented 5 years ago

The problem with your Option 1 is that this matcher is only supposed to match floats; Integers are matched by a separate one; this change would result in integers being tagged as floats, which we don't want.

There are many options (e.g. using lexer generators like alex for tokenization), but as I understand you cannot change the regexp, so this is indeed an obstacle.

re2 might satisfy the first criterion, but not the second.

Regex.RE2 supports Perl syntax (even by default), and RE2 returns no-match instantly for the original regexp and integer. Interestingly, Text.Regex.PCRE.Heavy also works and returns no-match. Not sure what is the difference with Text.Regex.PCRE. Both of them are non-builtin though.

skvadrik commented 5 years ago

Perhaps the reason Text.Regex.PCRE.Heavy and Kate work is that they increase default match limit.

In C++ PCRE API, it is possible to override match limit by setting flag PCRE_EXTRA_MATCH_LIMIT and increasing match_limit field (default is 10000000) in a structure of type pcre_extra, which is then passed to pcre_exec(). (In our case a factor of 10 is enough.) In Haskell API I don't see any corresponding option.

trofi commented 5 years ago

I looked at the kate editor: it indeed highlights long literals correctly.

It also has convenient command line tool:

$echo '11111111111111111111111111111111' | kate-syntax-highlighter -s C --stdin

<!DOCTYPE html>
<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<title>Kate Syntax Highlighter</title>
<meta name="generator" content="KF5::SyntaxHighlighting (C)"/>
</head><body style="color:#1f1c1b"><pre>
<span style="color:#b08000;">11111111111111111111111111111111</span>
</pre></body></html>

I don't know yet why the difference is there. Glancing at library trace of kate-syntax-highlighter it uses libpcre2-16.so.0 to match regexes (PCRE2 in UTF-16 mode?). I'll try to find out why it does not fail the same as PCRE1.

trofi commented 5 years ago

kate matches successfully because syntax-highlighter seems to ignore all match errors: be it no match (pcre2_match returns -1) or internal errors (pcre2_match returns -47).

In this case there is another simpler regex that matches this literal successfully: "0(?![xXbB0-9])|[1-9](?:'?[0-9]+)*". Probably comes from:

isocpp.xml: <RegExpr attribute="Decimal" context="IntSuffix" String="0(?![xXbB0-9])|[1-9](?:'?[0-9]+)*" />
jgm commented 5 years ago

Thanks for tracking this down. The branch mentioned above changes skylighting so that regex match errors are treated as "no match" rather than raising an exception, which would be exactly what kate is doing. I suppose that's the way to go -- my reluctance is only because I've fixed some errors in the past because of regex match exceptions.

jgm commented 5 years ago

In the end I left the exception for compileRegex and removed it for matchRegex.