Incremental line-by-line highlighting breaks some languages

davidben commented 4 years ago

highlight.js supports highlighting text incrementally via the continuation parameter to the highlight function. Gerrit uses this to highlight line-by-line. However, this causes some features, notably C++ raw string matching, to break. See the discussion here.

(Ironically, raw string matching partially worked in #1825 and then I broke it again for us when I tried to make it more featureful in #1897.)

I changed the tests to feed all the markup samples line-by-line to demonstrate the issue. Things mostly work, but when a begin or end element crosses a line boundary, things break. https://github.com/highlightjs/highlight.js/compare/master...davidben:line-by-line-tests

This suggests that the strategy in #1897 was wrong and probably C++ raw strings should have separate begin and end units. But endSameAsBegin doesn't seem quite flexible enough, unless there's some trick I'm missing. One thought is begin matches R"<blah>( and then we have a endFromBegin attribute which takes a callback and will compute )<blah>" for the end pattern given a begin.

But it's not entirely clear what the general API contract here is. Supposing incremental highlight is a thing highlight.js wants to support (I assume yes, given the continuation parameter), it seems there's basically no hope in allowing a begin or end unit to span an input boundary. The API needs to return something for the text that was passed in, so you can't lookahead across chunks. That means highlight.js needs to have opinions about what input boundaries need to be supported by languages (One call per character obviously cannot do anything useful. Line-by-line seems reasonable?), or we change the incremental highlight API altogether.

Thoughts?

joshgoebel commented 4 years ago

But endSameAsBegin doesn't seem quite flexible enough, unless there's some trick I'm missing. One thought is begin matches R"( and then we have a endFromBegin attribute which takes a callback and will compute )" for the end pattern given a begin.

Currently you can do this by nesting, but yes it's annoying. You might also see my work on Elixir:

https://github.com/highlightjs/highlight.js/pull/2257/files

It doesn't use endSameAsBegin, but same concept with first matching on the prefix and then the child matchers can focus on just the begin/end syntax, not the prefix.

continuation really is NOT meant as a way to do line-by-line processing. This would be better handled by a post-highlight or post-parse plugin as mentioned over in the plugin thread.

it seems there's basically no hope in allowing a begin or end unit to span an input boundary. The API needs to return something for the text that was passed in, so you can't lookahead across chunks.

I'm not sure I completely follow, I will follow your link and see if it's helpful. I've found the "outside" edge of the continuation stuff to be more robust than I expected... ie it remembers the state the parser is in re-opens those spans and continues the processing, even after handling a sublanguage in the middle... but again I'm speaking for what continuation were INTENDED for... they weren't intended for line-by-line processing.

joshgoebel commented 4 years ago

  // When marking up an input line-by-line, elements which span multiple lines
  // will be divided up into several spans. For instance, a multi-line C++
  // comment may be marked up as:
  //
  //   <span class="hljs-comment">/*</span>
  //   <span class="hljs-comment">Hello</span>
  //   <span class="hljs-comment">*/</span>

This looks like the behavior I've come to expect with continuations.

joshgoebel commented 4 years ago

Perhaps it'd be useful to show a "failing case" and what happens and what you would hope to happen, etc...

joshgoebel commented 4 years ago

I'd suggest you take a look at what's going on here:

https://github.com/highlightjs/highlight.js/issues/1086

I think doing anything line-by-line would be done by walking the tree and dealing with \n inside text nodes however you saw fit... Doing something simple like wrapping every line in it's own span would probably be pretty trivial one that foundation.

There has also has been mention of a line-by-line callback especially, but honestly I'd probably rather see this first done by a plugin... no reason plugins couldn't extend the callback system or even have their own.

davido commented 4 years ago

FWIW, the tracking issue in Gerrit Code Review project is here: [1]. The current repro is: [2].

[1] https://bugs.chromium.org/p/gerrit/issues/detail?id=7748 [2] https://boringssl-review.googlesource.com/c/boringssl/+/38665/1/util/fipstools/acvp/modulewrapper/modulewrapper.cc

joshgoebel commented 4 years ago

Commented on that thread just to clear up the fact that we do NOT have a line by line mode. In fact the original author was actually pretty against any type of "line number" feature and such things. And it's still mentioned in our docs today, etc... not that I think we're still THAT hostile to it today, but definitely sounds like a good use for a plugin.

We just need better hooks to make these type of things easier to do.

davidben commented 4 years ago

Ah, interesting. Hopefully one of the Gerrit devs will respond either here or on the Gerrit bug you commented on. (I don't actually work on Gerrit. I work on Chromium and BoringSSL, which use Gerrit, and highlight.js was easy enough to hack on, so I'd been poking at the highlighting bugs I was running into. :-) )

You mentioned in the Gerrit bug that the continuation feature is primarily for internal use. Should it be removed from the documentation? The documentation reads to me like feeding in chunks of partial data is supposed to work.

It doesn't use endSameAsBegin, but same concept with first matching on the prefix and then the child matchers can focus on just the begin/end syntax, not the prefix.

Hrm. I toyed with that a bit but it seemed there'd be issues with C++'s raw strings ending in )<delimiter>", so merely looking for <delimiter> might end the string early.

Of course, if line-by-line isn't supposed to work anyway, that's no longer a reason to change the current C++ definition. That said, hljs currently won't match R"(half a raw string, whereas it will match /* half a block comment, so perhaps there's something to be said for changing the C++ mode anyway?

This looks like the behavior I've come to expect with continuations.

It seems the issue may have since been clarified, but in case there's still confusion, yeah that behavior makes sense. The "issue" is the lineByLineExceptions list of tests that don't work even after accounting for it. (I say "issue" in quotes because it sounds like line-by-line mode isn't meant to work anyway.)

joshgoebel commented 4 years ago

You mentioned in the Gerrit bug that the continuation feature is primarily for internal use. Should it be removed from the documentation? The documentation reads to me like feeding in chunks of partial data is supposed to work.

I'd actually be for that if @egor-rogov had no objections. It really has all sorts of caveats and really it's more of a limited thing to work with sublanguages. I was not aware we were encouraging its broader use. Either way I'll still say: Not really intended for line to line.

Of course, if line-by-line isn't supposed to work anyway, that's no longer a reason to change the current C++ definition.

Continuation (in the core library) is intended to support subLanguages (pretty sure that's the only reason it's there, but I'm open to being corrected)... and personally I don't think we should REQUIRE that any languages parse in a certain way if they naturally find it beneficial to parse in larger chunks. IE, we shouldn't do things that limits peoples choices in writing grammars without VERY good reasons to do so.

joshgoebel commented 4 years ago

That said, hljs currently won't match R"(half a raw string, whereas it will match /* half a block comment, so perhaps there's something to be said for changing the C++ mode anyway?

You mean when parsing tiny bits at a time only, yes?

davidben commented 4 years ago

Nah, maybe the file is truncated or the author forgot to close the string. Usually syntax highlighters will highlight the unfinished thing.

joshgoebel commented 4 years ago

Ah, now that might be worth thinking about. That would be one case where doing the full match would not turn out so well...

egor-rogov commented 4 years ago

You mentioned in the Gerrit bug that the continuation feature is primarily for internal use. Should it be removed from the documentation? The documentation reads to me like feeding in chunks of partial data is supposed to work.

I'd actually be for that if @egor-rogov had no objections.

I'd rather not remove it ('cause it is a part of the API), but explicitly explain that this is intended for internal use and may not behave as one might expect.

davidben commented 4 years ago

Ah, now that might be worth thinking about. That would be one case where doing the full match would not turn out so well...

What are your thoughts on something like:

endCallback: function(begin) {
  // Compute ')<delimiter>"' from 'R"<delimiter>('
},

Returning a match object for the begin regex would be smoother, but probably tricky given how the regexes get all joined up.

joshgoebel commented 4 years ago

What are your thoughts on something like:

I'm receptive to the idea, though it'd probably be a more generic match callback that fires on match or perhaps start/finish... then you could do what you wanted. Ie, hook "start" to generate end on the fly, this would allow the current endSameAsBegin to be built with only a plug-in, as it's a much simpler version of what you're suggesting.

Although it's actually not that simple because the begin/end matchers are all compiled together and are not modified at run-time. endSameAsBegin currently only works because it's a MORE SPECIFIC match of the same type... IE, end (a copy of begin) will always match, but then we double check to see if the more specific end match was triggered.

Otherwise we might have to recursively recompile the regexs every time the regexes were dynamically changed... not sure what type of performance implications that would have.

Returning a match object for the begin regex would be smoother, but probably tricky given how the regexes get all joined up.

Huh? Not sure I follow? Accessing the invividual capture groups isn't that hard... but I'm not sure exactly what you're suggesting with this comment.

davidben commented 4 years ago

Although it's actually not that simple because the begin/end matchers are all compiled together and are not modified at run-time. endSameAsBegin currently only works because it's a MORE SPECIFIC match of the same type... IE, end (a copy of begin) will always match, but then we double check to see if the more specific end match was triggered.

Ah, interesting. Okay, I think I can do something with that. I'll try my hand at this and we'll see if you like the result.

Huh? Not sure I follow? Accessing the invividual capture groups isn't that hard... but I'm not sure exactly what you're suggesting with this comment.

I mean that you could imagine the function(begin) { ... } above being passed in a regex match object rather than the string. My recollection was that, since you combine all the regexes into one jumbo regex, the match groups are all shifted and such. Or maybe I'm entirely confused and misremembering. Anyway, not important, I'll poke around and see what's cleanest to do.

joshgoebel commented 4 years ago

I mean that you could imagine the function(begin) { ... } above being passed in a regex match object rather than the string.

Yeah, that'd make sense... you'd have to pass it a reference to mode also or give it some API to effect things if that was the idea. Then you also run into the issue that modes can be recursive so do the parser handle the call stack or force that problem onto the user, etc... may not be a problem for C string, but I'm always thinking of the full scope of features.

My recollection was that, since you combine all the regexes into one jumbo regex, the match groups are all shifted and such.

Yes, but we keep track of the match offsets (or else we'd have know way to know which rule matched)... so if we had to "rewrite" them back into a sequential array starting at 1 that wouldn't be that hard to do.

joshgoebel commented 4 years ago

Not to mention the security implications of something like this. :-) Right now grammars are technically code, but are only evaluated once, and what you see is what you get - reasonably easy to audit. They can't really be altered or exploited by specific input (other than bad regexes perhaps going into infinite loops)... so something like this would add a whole other potential vector for attacks.

joshgoebel commented 4 years ago

We can continue the discussion on the PR as I'm interested in more dynamic grammars in general - and if we figure that out we can then decide where it might and might not make sense to use such dynamism in existing grammars. But I just want to be clear this grammar (cpp, and also the others you flag) are NOT considered broken or buggy because your custom line-by-line parsing code fails to work with them - continuations are not intended to be used in this fashion - and this issue is the reason why. :-)

If this type of CPP string really doesn't contain any escape sequences then the simplest way of handling it is probably to just match it with a single regex and take advantage of the group matching capabilities, as was done in #1897.

Adding docs clarifying that continuation is NOT intended for be using in the fashion you're using it:

https://github.com/highlightjs/highlight.js/pull/2274/

I simply don't think it's reasonable to force grammas to always care about WHERE line breaks are at... when it's easier for a grammar to match against line breaks, we're not opposed to them doing so.

joshgoebel commented 4 years ago

Closing this issue as resolved - existing behavior is expected behavior. More than happy to continue the discussion on a BETTER and more reliable way to do line by line though, just not with continuations.

joshgoebel commented 4 years ago

@davidben Have you followed the discussion here:

https://github.com/highlightjs/highlight.js/issues/1086

Given a raw parse tree (of nodes), wouldn't writing a line-by-line outputter be a few lines of code? You merely have to track the current node "stack" (which you have to do for a normal render anyways) and each time you see a new line in text you output the closing tags... then the newline, then re-open the tags, then output the text portion that comes AFTER the new line.

Am I missing something?

This would be the "correct" way to do line-by-line stuff off the top of my head.

joshgoebel commented 4 years ago

Also, if you're target is the browser you can apply the line stuff BEFORE the rendering also... when using highlightBlock, HTML is preserved and merged seamlessly... so if you wanted to do something like:

<pre><code id="zeroed">
<span class="line1">actual code</span>
<span class="line2">goes here</span>
</code></pre>

<script>
highlightBlock(document.querySelector('pre code#zecode'))
</script>

Etc... and it should "just work".

joshgoebel commented 4 years ago

I think a lot of people make this type of thing WAY TOO complex though... I'd consider just having two entirely separately code blocks inside the pre (or a similar div grouping) and having one be the line numbers and the other be the source... no need for the parser to care about lines at all... though I'm not sure what your use case is exactly.

joshgoebel commented 4 years ago

Given a raw parse tree (of nodes), wouldn't writing a line-by-line outputter be a few lines of code? You merely have to track the current node "stack" (which you have to do for a normal render anyways) and each time you see a new line in text you output the closing tags... then the newline, then re-open the tags, then output the text portion that comes AFTER the new line.

Now that parse tree support is in 10.0 someone really should try doing it this way and see how much simpler it is. :-)

highlightjs / highlight.js

Incremental line-by-line highlighting breaks some languages #2259