jgm / skylighting

A Haskell syntax highlighting library with tokenizers derived from KDE syntax highlighting descriptions
189 stars 61 forks source link

Support highlighting for Raku (available in KDE repos) #114

Closed notagoodidea closed 3 years ago

notagoodidea commented 3 years ago

I am not sure when and how the xml files are pushed from the KDE repo into the skylighting one but [Raku]() already have been available in the KDE repo - Raku. Is it possible to integrate it in skylighting?

If it is only to provide a pull request with the xml file from KDE, I can do it of course.

jgm commented 3 years ago

I tried adding it. But we get a timeout error in some of the tests. (This particular test checks to make sure we don't get hangs on arbitrary input, such as programs in other languages.)

To reproduce, get raku.xml and use the compiled skylighting executable like so:

skylighting -s raku -d raku.xml --trace skylighting-core/test/cases/abc.fortran

The trace output shows what the tokenizer is trying; if you inspect it you may be able to figure out what is going on.

notagoodidea commented 3 years ago

Thanks, I will look at the issue to see if I can solve it. (And talk about it in the raku irc if someone more knowledgeable/aware of the issue is available).

notagoodidea commented 3 years ago

After looking around with the fortran test, it seems to be due to the DetectChar "Q" rule that make it hang and make it loop infinitly on the line 12 at the QD part (removing the Q letter let the test succeed). After test with files with only QD written in it, it hangs skylighting in a infinite loop with the 'raku.xml`. The hanging loop has this pattern:

  1. Try a rule until find `DetectChar 'Q'
  2. Output: DetectChar MATCHED Nothing and CONTEXT STACK ["Q_String", "base","Shebang"]
  3. Even if DetectChar returned nothing, the switch of context on the stack make it test against the two regex rules for QString.
  4. Those regexes come back empty and the context stack is reverted to `["base","Shebang"]
  5. Do not move to the next element and loop back on the same set of character.

It is the same issue for the tcl test, perl test and any test which will have a Q character outside of quotes (when quoted in the test files as "... QD ...", it correctly flagged as part of a string).

I just learned about the xml format from KDE and my Haskell is read-only. I suppose that the infinite loop appears from mbtok' from the Tokenizer.

Looking in the raku.xml, the Q String part does not have a DetectChar char="Q" rules so I am not sure from where it comes in the rules used by skylighting? I was expected that the head rules for the Q String context will start with the equivalent of this bit I found the part refering to the DetectChar "Q" part

<DetectChar attribute="Operator" context="Q_String" char="Q" lookAhead="1"/>
....
<context name="Q_String" attribute="String" lineEndContext="#stay" fallthrough="1" fallthroughContext="#pop">
        <RegExpr String="Q\s+(?=['(])|Q\s*(?=[^()>」\]}'a-zA-Z0-9])" context="Q_StringOpen" attribute="String"/>
        <RegExpr String="Q\s*$" context="Q_String2" attribute="String"/>
</context>

So it seems that at one point after trying to match a Q letter and failing in the context of Q switch, instead to go to the next set of token, it loops back to the Q. I think it is due to the lookAhead attribute and the equivalent ruleahead function in mbtok'.

Documentation about Q String in Raku for context.

jgm commented 3 years ago

Good, thanks very much for looking into that. To summarize, the minimal test case is

echo "QD" | skylighting -s raku -d raku.xml --trace 2> LOG

As you note, the first match is

Trying rule Rule {rMatcher = DetectChar 'Q', rAttribute = KeywordTok, rIncludeAttribute = False, rDynamic = False, rCaseSensitive = True, rChildren = [], rLookahead = True, rFirstNonspace = False, rColumn = Nothing, rContextSwitch = [Push ("Raku","Q_String")]}
DetectChar MATCHED Nothing
CONTEXT STACK ["Q_String","base","Shebang"]

But the second match is

Trying rule Rule {rMatcher = RegExpr (RE {reString = "Q\\s*$", reCaseSensitive = True}), rAttribute = StringTok, rIncludeAttribute = False, rDynamic = False, rCaseSensitive = True, rChildren = [], rLookahead = False, rFirstNonspace = False, rColumn = Nothing, rContextSwitch = [Push ("Raku","Q_String2")]}
CONTEXT STACK ["base","Shebang"]

taking us back where we started (again, without consuming anything). And that's the loop.

jgm commented 3 years ago

in the XML definition, the first rule is

        <DetectChar attribute="Operator" context="Q_String" char="Q" lookAhead="1"/>

and the Q_String context is

      <context name="Q_String" attribute="String" lineEndContext="#stay" fallthrough="1" fallthroughContext="#pop">
        <RegExpr String="Q\s+(?=['(])|Q\s*(?=[^()>」\]}'a-zA-Z0-9])" context="Q_StringOpen" attribute="String"/>
        <RegExpr String="Q\s*$" context="Q_String2" attribute="String"/>
      </context>
jgm commented 3 years ago

I have to say, it looks to me as if this syntax definition is calling for an infinite loop on this input. It's possible that the definition is buggy, but more likely, I have some misunderstanding of how the elements are supposed to work. I'm hoping @christoph-cullmann or another KDE highlighting developer can have a look at this and enlighten me.

notagoodidea commented 3 years ago

Thanks for taking the time to summarize the findings in a digestible form.

Having looked a bit at the spec of the xml format and at Raku definition of Q string too :

Tests files where tested against in Kate 20.08.03 (from Fedora 32 repo), it works fine to highlight the test cases.

If the definition is buggy, it may be possible to enforce the rule about the Q String and delimiters restrictions :

Some delimiters are not allowed immediately after Q, q, or qq. Any characters that are allowed in identifiers are not allowed to be used, since in such a case, the quoting construct together with such characters are interpreted as an identifier. In addition, ( ) is not allowed because that is interpreted as a function call. If you still wish to use those characters as delimiters, separate them from Q, q, or qq with a space.

Source : https://docs.raku.org/syntax/Q

christoph-cullmann commented 3 years ago

Hi,

I did run it with KSyntaxHighlighting, and yes, that behaves like skylighting.

To avoid dying for such things, we have some internal infinite "recursion" prevention that here will kick in and output:

"Endless state transitions, aborting highlighting of line."

in AbstractHighlighter::highlightLine.

For the context switches, no "current item" is remembered, if you pop back to a previous context, you start again with the first item there.

christoph-cullmann commented 3 years ago

If you have some fix for this, you can ping me here and I can directly apply it to KSyntaxHighlighting, you don't need to jump through bugs.kde.org/invent.kde.org yourself.

notagoodidea commented 3 years ago

I am working on a fix for raku.xml but it is not perfect yet[1]. Changing (@line 500)

        <DetectChar attribute="Operator" context="Q_String" char="Q" lookAhead="1"/>

by

        <RegExpr attribute="Operator" context="Q_String" String="Q(?!&ident;|\(|\')" lookAhead="1" />

It pass all test cases from skylighting but if a Raku test is available with the expected output that would be nice (if not, I can create it but I will need some help to create the expected output).

[1]: Because it is an highlighter and not a linter, is it ok if the results for illegal syntax messed up the highlighting? If it is the case, i think it is good.

@christoph-cullmann do I try push the change to the main repo or can you do it?

Available as a gist here


The solution at either the parser or the implementor level would be to make the fallthroughContext attributes pops to a default context instead of the previous context because it will always results on a endless loop the way I understand it.

I don't see how coming back to the previous stack and match the item against the same rule can provide a different output. However, I don't have experience with highlighing xml format from KDE/Kate. I just read a bit to understand what's going one for this issue.

jgm commented 3 years ago

The solution at either the parser or the implementor level would be to make the fallthroughContext attributes pops to a default context instead of the previous context because it will always results on a endless loop the way I understand it.

There are a lot of contexts in the definitions that have fallthroughContext="#pop". I think you're right that if you get to one of these contexts by matching a rule and adding a context, and it falls through, then you'll be right back where you started and loop. But of course you might have gotten to the context by dropping a context.

christoph-cullmann commented 3 years ago

I opened some internal issue for the Raku file

https://invent.kde.org/frameworks/syntax-highlighting/-/issues/7

I don't see how coming back to the previous stack and match the item against the same rule can provide a different output.

Normally, you consume stuff on match.

Therefore, for a common case, you would do a context switch after some consummation, then later pop back.

Here you do a lookahead, without any consummation and then pop back without any consummation, too. That leads to this infinite loop.

That isn't the normal case for which fallthroughContext works as designed.

notagoodidea commented 3 years ago

@christoph-cullmann Thanks for opening the internal issue.

Here you do a lookahead, without any consummation and then pop back without any consummation, too. That leads to this infinite loop.

Yep, I understand. Maybe can "LookAhead" be part of the stack information either as a kind of hidden state where a stack with a lookahead attribute will be ["Q_String", "LookAhead", "base", "Shebang"] for this specific example. When no consumption appears and it pop back to the previous context if "LookAhead" is met either flag an error allowing to flag the infinite loop directly or either, pop back with a specialized function that skips rules with lookahead attribute. It is very drafty and I can already anticipate a few problems:

I am not neither really familiar with all that. Maybe there is more reasons to not try something like this or it can simply be the wrong approach to the issue.

Maybe a warning in the documentation on the danger of mixing fallthroughContext and lookahed can be useful at least to discourage this association.

notagoodidea commented 3 years ago

Thanks to @christoph-cullmann and jpoelen, a fix was provided and merged. I tried it against the abc.x tests of skylighting and no more infinite loop.

Does a test case already exist for raku or must one be provided?

jgm commented 3 years ago

I've merged the new raku definition. Thanks all!