Experiment with replacing syntect_server with WASM syntect module

tsenart commented 3 years ago

When we initially implement syntect_server, WASM support was limited, but that has since changed. Could we do syntax highlighting purely in the frontend, still using syntect, but using WASM? Rust has excellent WASM support.

github-actions[bot] commented 3 years ago

Heads up @tsenart - the "team/cloud" label was applied to this issue.

slimsag commented 3 years ago

Thanks for filing! This has come up a few times in the past but I haven't had a place I can refer back to due to Slack message retention..

In general, its not a bad idea, definitely something I have considered multiple times (including recently), but it would be a fairly hefty amount of work to make it mostly competitive with the server-side implementation, and even then there are some fundamental limitations (like needing to transfer the whole file to the client for highlighting) I'm not sure we could overcome. Overall, I'm not sure it is worth it.

I am not sure what the bundle size of the code alone (syntect + oniguruma) would be once compiled to WASM, but I suspect fairly large. We would likely need to do async loading for the initial WASM bundle to avoid harming page load times.
The language syntax definitions (https://github.com/slimsag/Packages) and themes are quite large, we would need to progressively load them as-needed. In their packed binary form we would need around ~545.6 KiB for all the languages:

$ ls -lah ./assets/*
-rw-r--r--  1 slimsag  staff   7.5K Oct 16 17:21 ./assets/default.themedump
-rw-r--r--  1 slimsag  staff   5.1K Oct 16 17:21 ./assets/default_metadata.packdump
-rw-r--r--  1 slimsag  staff   533K Oct 16 17:21 ./assets/default_newlines.packdump

While Rusts' WebAssembly support has improved orders of magnitude, we would still need to figure out how to compile Oniguruma (C regexp library) to WebAssembly and link syntect against that, I suspect a non-trivial amount of work.
- There is a fancy-regexp mode in syntect which does not depend on Oniguruma, but because the regexps get translated it is a bit buggy and does not work even for just the syntaxes included in the default syntect repo - probably we would see worse results if including our expanded set of syntaxes which they do not test against upstream.
- Option 1 here would be to wait for upstream to support oniguruma compilation to WASM https://github.com/trishume/syntect/issues/135
- Option 2 here would be to use fancy-regexp, but add an extensive test suite for each language we support and fix the fancy-regexp bugs that come up. Additionally, fancy-regexp reduces perf by half so we'd probably need to fix that.
Performance, WASM is definitely going to perform worse than bare metal Rust, but by how much is unknown. However, we do know the following to be true:
- If we want good highlighting results, we need to pass the highlighter the entire file. In the case of search results, this would mean we would be required to transfer the entire file to the client instead of just the range of code we intend to display. (this is an issue today, we send the entire highlighted file back to the client and then only show part of it, but I plan on addressing this as soon as I get back from vacation) I think this would be the one single aspect keeping it from being competitive with the backend implementation that we could not solve
- We would definitely want to run syntect in a background webworker, so it runs on a different OS thread, and then transfer the results back to render.
Instability and page lock-ups. Syntect is a pretty complex state machine and sometimes highlighting files causes it to either (a) take a really long time (e.g. 30s+) or (b) it gets stuck entirely, consuming 100% of the CPU core. -- on the backend, we solve this with aggressive timeouts, the option to "wait for highlighting" in the frontend by clicking a button when this happens, and killing the subprocess if it gets completely hung. These are definitely bugs in syntect, timeouts and subprocess killing are just hacky workarounds for this, but if we ran this in the browser we would need some similar way of doing this or to contribute upstream some good timeout/cancellation logic - otherwise the browser window would lock up I think. It doesn't happen super frequently, but frequently enough that it would be a problem.
Caching of the parsing/compilation of syntaxes - in the backend we use thread local storage and lazy_static! to ensure we only parse theme files once, and only compile the (enormous) amount of regexps found in each syntax definition - I think we could replicate this on the frontend too by keeping the compiled forms around in memory given we are a SPA - but my TS knowledge limits me here in terms of what is possible, especially when it comes to webworkers being mixed in.
Meta-issues.. we have added a fair amount of logic to the Go side https://github.com/sourcegraph/sourcegraph/blob/main/internal/highlight/highlight.go because folks saw this as an easier path to contributing instead of doing it upstream in syntect where it would be more appropriate. In specific some things we would have to deal with:
- The frontend today doesn't consume the raw HTML syntect gives us, it expects a <table> and Syntect gives us <span>s, so we do this permutation here
- Chrome cannot render super long horizontal HTML elements, such as a highlighted minified single-line JS blob. It makes performance abysmal (see https://github.com/sourcegraph/sourcegraph/issues/6489) so we do post-processing to find ultra long lines and unhighlight them
- The way we do highlighting for diff pages would have to be updated, totally feasible but would require work as today we highlight the two files then merge their highlighted forms

tsenart commented 3 years ago

Thanks for the extensive notes! Do you think it's unsolvable to make syntax highlighting work without the whole file? Could we not find the most immediate context boundary (e.g. like comby finds delimiters) and send that over?

On another note: Could we not be well served by what VS Code uses for syntax highlighting?

slimsag commented 3 years ago

Do you think it's unsolvable to make syntax highlighting work without the whole file? Could we not find the most immediate context boundary (e.g. like comby finds delimiters) and send that over?

That'd be possible, good point - though it's not what Syntect does out of the box, we'd have to do that for each language.

On another note: Could we not be well served by what VS Code uses for syntax highlighting?

VS Code uses an interesting hodge-podge of options for syntax highlighting:

Many packages ship TextMate grammars, and VS Code implements its own highlighter similar to Syntect (but without support for the more advanced Sublime Text grammars, which are a superset of TextMate grammars). They also use Oniguruma bindings to implement this, and their bindings do not compile to WASM today. I don't know TS well enough to comment how hard this part would be to extract from VS Code (code link)
Because they don't want to support Sublime Text grammars, which support many more complex features than TextMate grammars and are required for highlighting a number of languages, they propose their own syntax highlighting scheme based on language servers knowledge of language semantics. i.e. the language server provides semantic token information which VS Code uses to highlight code. See https://code.visualstudio.com/api/language-extensions/semantic-highlight-guide#theming - it is possible we could do something similar with LSIF, or tree-sitter, but of course the number of languages supported with both is FAR less than TextMate/sublime-syntax grammars in the wild.
Worth noting explicitly that their actual language definitions and themes are discovered via editor plugins, i.e. its not a single file like a .tmLanguage grammer or .sublime-syntax file we can just import and use, they get that info, and theme info, via editor extensions. I am sure we could do this, but not sure how hairy the details are there.

If you want some info about other options I considered back in 2018, there is a write-up here: https://news.ycombinator.com/item?id=17932653

Syntax highlighting is hard, and old. Either the languages supported are minuscule, or you're falling back to TextMate grammers and something more complex (semantic highlighting, sublime syntaxes) regularly.

felixfbecker commented 3 years ago

cc @camdencheek

camdencheek commented 3 years ago

@felixfbecker I actually spent a day or so looking into this! I got as far as getting highlighting running in the browser. Ultimately, performance wasn't great with the pure-rust regex crate. It would take >10s for large files. At some point, I'd like to see if I can get onig compiling to WASM for better perf, but I'd hit my timebox for the wasm highlighting spike.

schickling commented 3 years ago

Having spent way too much time on this myself (and also after having gotten Syntect to run in the browser with not-great performance results) I eventually landed on shiki which I can highly recommend.

camdencheek commented 3 years ago

@schickling thanks for the recommendation!

sourcegraph / sourcegraph-public-snapshot

Experiment with replacing syntect_server with WASM syntect module #15346