Syntax highlighting - Githubissues

The lack of syntax highlighting is pretty lame.

However, it's hard to do syntax highlighting in a way that leaves us in control of the exact state of the DOM (to highlight matches and... generally work with React) and that will work well with SSR #2. Most of the classics like highlight.js or prismjs assume control of the DOM, doing to it what they please, and don't have any particular SSR support.

What looks to be most promising is this family of syntax highlighting tools:

https://github.com/wooorm/lowlight derived from highlight.js
https://github.com/wooorm/refractor derived from prismjs
https://github.com/wooorm/starry-night derived from PrettyLights, which is GitHub's syntax highlighter

These all can produce syntax highlighting as data, which is ideal for rendering with React and for us to highlight matches in addition to syntax. Indeed, https://github.com/react-syntax-highlighter/react-syntax-highlighter wraps one of the first two of those projects.

https://github.com/suin/refractor-group-by-lines and https://github.com/suin/refractor-flatten are examples of the kinds of transformations we might want to do with the output of those tools.

I consider this issue blocked on SSR #2. Syntax highlighting is going to make the already heavy CSR-only site that much more worse, especially if we go for high-fidelity with starry-night, which uses a giant wasm blob. Syntax highlighting should exclusively happen on the server; SSR and CSR should consume the same search API on the server that returns search results already "rendered" with matches and syntax highlighting.

https://github.com/yunlingz/vscode-theme-github-light Might be an appropriate highlighting theme. Theme is a whole other can of worms.

Just throwing something out something that I've been thinking about here since I've got the same problem and I was already here.

The idea I've been kicking around is having essentially two layers on the DOM. Top layer is the syntax highlighting, and the bottom layer is the highlighted text. The second layer gets the exact same text so the positions of all the characters match up, but has a color: transparent, and a background-color: highlight-color.

I checked what GitHub does, and they do basically that, at least in the fileviewer - when you open the find menu, type a word, and a match gets highlight, if you inspect the dom, a span gets inserted containing the line on which the match is in, and then that line is broken up into highlighted and non-highlighted parts, and the highlighted parts have a background-color. Screen Shot 2023-03-13 at 9 40 34 AM

As an optimization, when you could insert that backing/transparent span only on lines that contain a match.

On the search results page they actually do something much simpler, given some SSR'd dom output, like <span class="token-var">var</span>, they just surround the entire dom element in a mark tag.

Screen Shot 2023-03-13 at 9 45 44 AM

I use chroma, and actually do something like the following, which I think would be easy to map to the second solution, and just render a mark around certain ranges.

...
for idx, tokens := range lines {
    // we want to convert the line into its html equivalent
    // each line can have n tokens
    // each token is a span with (potentially) styling

    var b strings.Builder
    for _, token := range tokens {
        html := html.EscapeString(token.String())
        attr := styleAttr(css, token.Type)
        b.WriteString(fmt.Sprintf("<span%s>%s</span>", attr, html))
    }

    outLines[idx] = template.HTML(b.String())
}
...

Anyways, just two ideas I'm kicking around. If I end up with the time to do either, will update results here if you're interested.

My first attempt at this is on https://github.com/isker/neogrok/tree/syntax-highlighting-starrynight. It uses https://github.com/wooorm/starry-night to do the actual highlighting in wasm, in content-parser.ts, producing a stream of ContentTokens as we already do.

Turns out this is really CPU intensive, to the extent that a normal neogrok server under a bit of load (more than one user would typically generate, but not more than that of a few, I think) starts to have its event loop degrade. I tested a deployment of this to the demo site, and it actually fell over on some combination of CPU and RAM, because the fly.io limits are so low (1 "shared" vcpu and 256mb of ram, I think CPU contention resulted in heap size growing more than it does locally for me, due to pending requests piling up, and it was oomkilled).

I profiled the server with https://github.com/davidmarkclements/0x and all of the time is being spent inside of the wasm blob (ultimately coming from https://github.com/microsoft/vscode-oniguruma), so there's no optimization to do in any of the JS, mine or others'. Here's the flamegraph: flamegraph.zip.

So, the remaining options are to use something that's cheaper but lower fidelity (something regex-based like highlightjs/lowlight might be faster, but the CSS approach I'm taking in the starry-night branch would have to be rebuilt, more painstakingly, for highlightjs), or to do some combination of debouncing and deferring of highlighting: debouncing such that we are not highlighting on every keystroke, only after results have "settled"; deferring such that we only highlight during rendering, instead of in the search API (during rendering less code is visible, so less code needs to be highlighted).

Deferring to render would notably mean pushing the highlight deps into browsers, which would be sad. You can lazily load the relevant grammars in starry-night and other libs that wrap vscode-oniguruma (https://github.com/shikijs/shiki is an example of another lib, which emits perhaps marginally less annoying data than starry-night), but the base wasm cost is already a few hundred kb gzipped. But we might not have any other choice than to do this, if we want the neogrok server to be cheap to run.

One final thing worth noting is that, when highlighting search results which only include matching lines and context, we will get lower fidelity highlighting than if we were highlighting the full file. One big example I'm seeing is from the middle of jsdoc-style comments, like this:

/**
 * one
 * two
 * three
 */

If you only give a highlighter the two line, it will of course have no idea you're in a comment, and the highlighting will be thusly wrong. We can ask zoekt for full files along with the search results, and pull the relevant lines out of the highlighted versions of those, but that of course both bloats responses (we would not want to send them to browsers, and it might materially harm JSON parse perf in the neogrok server?), and substantially increases the cost of highlighting (we would need some kind of asynchronous highlighting approach, perhaps even with caching, and it would be fairly compute intensive). So I can't say I'm very eager to try to go down this route, as it looks like it'd involve a material expansion of architectural scope (like if we needed redis involved), and if not, a material expansion of compute requirements on the server.

I looked briefly at what sourcegraph.com does, and they have a pretty strange approach of streaming search results to the browser, only for the browser to go right back to the server to render the results (matches + syntax highlighting) to HTML, which the browser then displays. I don't really understand the point of pushing the rendering to a second round trip; maybe they just don't have a good factoring of internal service dependencies and are thus waterfalling in the browser.

Insights into how sourcegraph does syntax highlighting here.

TL;DR not well (so they say 🌞), but better than most everything else. I don't think we can aspire to that kind of approach, we need something substantially more straightforward.

isker / neogrok

Syntax highlighting #3