Closed trofi closed 5 years ago
I realized 0.7.7 is not the most recent version. Built 0.8.1.1 locally and verified it fails the same:
$ ./skylighting --version
skylighting 0.8.1.1
$ echo 'int main(){ return 1111111111111111111111;}' | ./skylighting --syntax=c
skylighting: RegexException "Error in Text.Regex.PCRE.Wrap: ReturnCode (-8)"
$ echo 'int main(){ return 111111111111111111111;}' | ./skylighting --syntax=c
int main(){ return 111111111111111111111;}
--trace
is helpful:
Trying rule Rule {rMatcher = RegExpr (RE {reString = "\\.(?:[0-9](?:'?[0-9]+)*)(?:[eE][+-]?(?:[0-9](?:'?[0-9]+)*))?|(?:[0-9](?:'?[0-9]+)*)(?:(?:[eE][+-]?(?:[0-9](?:'?[0-9]+)*))|\\.(?:[0-9](?:'?[0-9]+)*)?(?:[eE][+-]?(?:[0-9](?:'?[0-9]+)*))?)|0[xX](?:\\.(?:[0-9A-Fa-f](?:'?[0-9A-Fa-f]+)*)(?:[pP][-+]?(?:[0-9](?:'?[0-9]+)*))?|(?:[0-9A-Fa-f](?:'?[0-9A-Fa-f]+)*)(?:(?:[pP][-+]?(?:[0-9](?:'?[0-9]+)*))|\\.(?:[0-9A-Fa-f](?:'?[0-9A-Fa-f]+)*)?(?:[pP][-+]?(?:[0-9](?:'?[0-9]+)*))?))", reCaseSensitive = True}), rAttribute = FloatTok, rIncludeAttribute = False, rDynamic = False, rCaseSensitive = True, rChildren = [], rLookahead = False, rFirstNonspace = False, rColumn = Nothing, rContextSwitch = [Push ("C","FloatSuffix")]}
skylighting: RegexException "Error in Text.Regex.PCRE.Wrap: ReturnCode (-8)"
This tells you the regex it crashed on.
Corresponding bit of c.xml:
<RegExpr attribute="Float" context="FloatSuffix" String="\.∫&exp_float;?|∫(?:&exp_float;|\.∫?&exp_float;?)|0[xX](?:\.&hex_int;&exp_hexfloat;?|&hex_int;(?:&exp_hexfloat;|\.&hex_int;?&exp_hexfloat;?))" />
Added in an update Feb. 2019.
In pcre.h I see error code -8:
#define PCRE_ERROR_MATCHLIMIT (-8)
Not sure what the issue is with the regex. As you suggest, it might be reasonable to change the code so that regexes silently fail to match if an exception is raised. The only problem is that this may lead problems to escape our notice.
PCRE uses exhaustive backtracking matching algorithm. When presented with alternative, it tries to match the first branch; if no luck, it tries the second and so on. Worst-case time compexity of this process is exponential. PCRE_ERROR_MATCHLIMIT
is a hard limit to stop in case the process takes too long, as explained here:
PCRE_ERROR_MATCHLIMIT (-8) The backtracking limit, as specified by the
match_limit
field in apcre_extra
structure (or defaulted) was reached. See the description above.
The problem with the above regexp is that it doesn't match simple integer numbers. It does match .123
or 123e4
or 0x.123
or 0x123p4
, but not 123
or 0x123
. Because simple numbers (rejected by the regexp) are prefixes of more complex numbers (accepted by the regex), PCRE has to assume that eventually it will find e4
suffix after 123
(or some such), and it goes on and on, exhausting all possible alternatives, making exponential number of steps.
I can think of the following fixes:
Option 1. Add simple integer patterns to the regexp. Some care is needed to make complex pattern match first, namely, the more complex alternative must occur before the simpler one. In our case, it amounts to adding one question mark to make exponent part optional (123e1
will still match before 123
), and putting hexadecimal alternative before the rest so that 0x123
matches as a whole and not as 0
and 123
. The resulting regexp:
reString = "0[xX](?:\\.(?:[0-9A-Fa-f](?:'?[0-9A-Fa-f]+)*)(?:[pP][-+]?(?:[0-9](?:'?[0-9]+)*))?|(?:[0-9A-Fa-f](?:'?[0-9A-Fa-f]+)*)(?:(?:[pP][-+]?(?:[0-9](?:'?[0-9]+)*))?|\\.(?:[0-9A-Fa-f](?:'?[0-9A-Fa-f]+)*)?(?:[pP][-+]?(?:[0-9](?:'?[0-9]+)*))?))|\\.(?:[0-9](?:'?[0-9]+)*)(?:[eE][+-]?(?:[0-9](?:'?[0-9]+)*))?|(?:[0-9](?:'?[0-9]+)*)(?:(?:[eE][+-]?(?:[0-9](?:'?[0-9]+)*))|\\.(?:[0-9](?:'?[0-9]+)*)?(?:[eE][+-]?(?:[0-9](?:'?[0-9]+)*))?)?"
GHCI session:
*Main X> "1111111111111111111111" =~ reString :: [[String]]
[["1111111111111111111111"]]
*Main X> "1111111111111111111111e1" =~ reString :: [[String]]
[["1111111111111111111111e1"]]
*Main X> ".1111111111111111111111" =~ reString :: [[String]]
[[".1111111111111111111111"]]
*Main X> "0x1111111111111111111111" =~ reString :: [[String]]
[["0x1111111111111111111111"]]
*Main X> "0x1111111111111111111111p1" =~ reString :: [[String]]
[["0x1111111111111111111111p1"]]
Option 2. Don't use backtracking engines, use automata-based ones (e.g. TRE or Haskell bindings to RE2). They treat all alternatives as equal. :)
Thanks for the diagnosis. Note that the regex comes from KDE's syntax highlighting file for C; we don't control that and I would prefer not to make local changes unless absolutely necessary. Of course it could be changed upstream, too, but my guess is that it probably works fine with KDE's regex engine. (One could confirm by trying this in the Kate editor.)
I'm not sure what regex engine KDE's syntax highlighting uses. The syntax seems PCRE, but they may be using a non-backtracking engine. In any case, I chose PCRE because the regexes include syntax not supported in other Haskell regex libraries, and because via regex-pcre-builtin we have a solution that doesn't require separate installation of a C library, which is an obstacle for some users. re2 might satisfy the first criterion, but not the second.
The problem with your Option 1 is that this matcher is only supposed to match floats; Integers are matched by a separate one; this change would result in integers being tagged as floats, which we don't want.
The branch no-regex-exception
uses a "just plow ahead" strategy and treats the regex failure as a failed match rather than raising an exception. This leaves us without any information about the problem, though, so I'm reluctant to use this solution.
The problem with your Option 1 is that this matcher is only supposed to match floats; Integers are matched by a separate one; this change would result in integers being tagged as floats, which we don't want.
There are many options (e.g. using lexer generators like alex for tokenization), but as I understand you cannot change the regexp, so this is indeed an obstacle.
re2 might satisfy the first criterion, but not the second.
Regex.RE2 supports Perl syntax (even by default), and RE2 returns no-match instantly for the original regexp and integer. Interestingly, Text.Regex.PCRE.Heavy also works and returns no-match. Not sure what is the difference with Text.Regex.PCRE. Both of them are non-builtin though.
Perhaps the reason Text.Regex.PCRE.Heavy and Kate work is that they increase default match limit.
In C++ PCRE API, it is possible to override match limit by setting flag PCRE_EXTRA_MATCH_LIMIT
and increasing match_limit
field (default is 10000000) in a structure of type pcre_extra
, which is then passed to pcre_exec()
. (In our case a factor of 10 is enough.) In Haskell API I don't see any corresponding option.
I looked at the kate
editor: it indeed highlights long literals correctly.
It also has convenient command line tool:
$echo '11111111111111111111111111111111' | kate-syntax-highlighter -s C --stdin
<!DOCTYPE html>
<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<title>Kate Syntax Highlighter</title>
<meta name="generator" content="KF5::SyntaxHighlighting (C)"/>
</head><body style="color:#1f1c1b"><pre>
<span style="color:#b08000;">11111111111111111111111111111111</span>
</pre></body></html>
I don't know yet why the difference is there. Glancing at library trace of kate-syntax-highlighter
it uses libpcre2-16.so.0
to match regexes (PCRE2 in UTF-16 mode?). I'll try to find out why it does not fail the same as PCRE1.
kate
matches successfully because syntax-highlighter
seems to ignore all match errors: be it no match (pcre2_match
returns -1
) or internal errors (pcre2_match
returns -47
).
In this case there is another simpler regex that matches this literal successfully: "0(?![xXbB0-9])|[1-9](?:'?[0-9]+)*"
. Probably comes from:
isocpp.xml: <RegExpr attribute="Decimal" context="IntSuffix" String="0(?![xXbB0-9])|[1-9](?:'?[0-9]+)*" />
Thanks for tracking this down. The branch mentioned above changes skylighting so that regex match errors are treated as "no match" rather than raising an exception, which would be exactly what kate is doing. I suppose that's the way to go -- my reluctance is only because I've fixed some errors in the past because of regex match exceptions.
In the end I left the exception for compileRegex and removed it for matchRegex.
Noticed by accident when noticed that one of my old blog posts (render is hakyll/skylighting) started crashing on render. Minimal reproducer:
Could skylighting improve error message or skip an error?