sourcegraph / sourcegraph-public-snapshot

Code AI platform with Code Search & Cody
https://sourcegraph.com
Other
10.11k stars 1.28k forks source link

Experiment with replacing syntect_server with WASM syntect module #15346

Open tsenart opened 3 years ago

tsenart commented 3 years ago

When we initially implement syntect_server, WASM support was limited, but that has since changed. Could we do syntax highlighting purely in the frontend, still using syntect, but using WASM? Rust has excellent WASM support.

github-actions[bot] commented 3 years ago

Heads up @tsenart - the "team/cloud" label was applied to this issue.

slimsag commented 3 years ago

Thanks for filing! This has come up a few times in the past but I haven't had a place I can refer back to due to Slack message retention..

In general, its not a bad idea, definitely something I have considered multiple times (including recently), but it would be a fairly hefty amount of work to make it mostly competitive with the server-side implementation, and even then there are some fundamental limitations (like needing to transfer the whole file to the client for highlighting) I'm not sure we could overcome. Overall, I'm not sure it is worth it.

  1. I am not sure what the bundle size of the code alone (syntect + oniguruma) would be once compiled to WASM, but I suspect fairly large. We would likely need to do async loading for the initial WASM bundle to avoid harming page load times.
  2. The language syntax definitions (https://github.com/slimsag/Packages) and themes are quite large, we would need to progressively load them as-needed. In their packed binary form we would need around ~545.6 KiB for all the languages:
$ ls -lah ./assets/*
-rw-r--r--  1 slimsag  staff   7.5K Oct 16 17:21 ./assets/default.themedump
-rw-r--r--  1 slimsag  staff   5.1K Oct 16 17:21 ./assets/default_metadata.packdump
-rw-r--r--  1 slimsag  staff   533K Oct 16 17:21 ./assets/default_newlines.packdump
  1. While Rusts' WebAssembly support has improved orders of magnitude, we would still need to figure out how to compile Oniguruma (C regexp library) to WebAssembly and link syntect against that, I suspect a non-trivial amount of work.
    • There is a fancy-regexp mode in syntect which does not depend on Oniguruma, but because the regexps get translated it is a bit buggy and does not work even for just the syntaxes included in the default syntect repo - probably we would see worse results if including our expanded set of syntaxes which they do not test against upstream.
    • Option 1 here would be to wait for upstream to support oniguruma compilation to WASM https://github.com/trishume/syntect/issues/135
    • Option 2 here would be to use fancy-regexp, but add an extensive test suite for each language we support and fix the fancy-regexp bugs that come up. Additionally, fancy-regexp reduces perf by half so we'd probably need to fix that.
  2. Performance, WASM is definitely going to perform worse than bare metal Rust, but by how much is unknown. However, we do know the following to be true:
    • If we want good highlighting results, we need to pass the highlighter the entire file. In the case of search results, this would mean we would be required to transfer the entire file to the client instead of just the range of code we intend to display. (this is an issue today, we send the entire highlighted file back to the client and then only show part of it, but I plan on addressing this as soon as I get back from vacation) I think this would be the one single aspect keeping it from being competitive with the backend implementation that we could not solve
    • We would definitely want to run syntect in a background webworker, so it runs on a different OS thread, and then transfer the results back to render.
  3. Instability and page lock-ups. Syntect is a pretty complex state machine and sometimes highlighting files causes it to either (a) take a really long time (e.g. 30s+) or (b) it gets stuck entirely, consuming 100% of the CPU core. -- on the backend, we solve this with aggressive timeouts, the option to "wait for highlighting" in the frontend by clicking a button when this happens, and killing the subprocess if it gets completely hung. These are definitely bugs in syntect, timeouts and subprocess killing are just hacky workarounds for this, but if we ran this in the browser we would need some similar way of doing this or to contribute upstream some good timeout/cancellation logic - otherwise the browser window would lock up I think. It doesn't happen super frequently, but frequently enough that it would be a problem.
  4. Caching of the parsing/compilation of syntaxes - in the backend we use thread local storage and lazy_static! to ensure we only parse theme files once, and only compile the (enormous) amount of regexps found in each syntax definition - I think we could replicate this on the frontend too by keeping the compiled forms around in memory given we are a SPA - but my TS knowledge limits me here in terms of what is possible, especially when it comes to webworkers being mixed in.
  5. Meta-issues.. we have added a fair amount of logic to the Go side https://github.com/sourcegraph/sourcegraph/blob/main/internal/highlight/highlight.go because folks saw this as an easier path to contributing instead of doing it upstream in syntect where it would be more appropriate. In specific some things we would have to deal with:
tsenart commented 3 years ago

Thanks for the extensive notes! Do you think it's unsolvable to make syntax highlighting work without the whole file? Could we not find the most immediate context boundary (e.g. like comby finds delimiters) and send that over?

On another note: Could we not be well served by what VS Code uses for syntax highlighting?

slimsag commented 3 years ago

Do you think it's unsolvable to make syntax highlighting work without the whole file? Could we not find the most immediate context boundary (e.g. like comby finds delimiters) and send that over?

That'd be possible, good point - though it's not what Syntect does out of the box, we'd have to do that for each language.

On another note: Could we not be well served by what VS Code uses for syntax highlighting?

VS Code uses an interesting hodge-podge of options for syntax highlighting:

If you want some info about other options I considered back in 2018, there is a write-up here: https://news.ycombinator.com/item?id=17932653

Syntax highlighting is hard, and old. Either the languages supported are minuscule, or you're falling back to TextMate grammers and something more complex (semantic highlighting, sublime syntaxes) regularly.

felixfbecker commented 3 years ago

cc @camdencheek

camdencheek commented 3 years ago

@felixfbecker I actually spent a day or so looking into this! I got as far as getting highlighting running in the browser. Ultimately, performance wasn't great with the pure-rust regex crate. It would take >10s for large files. At some point, I'd like to see if I can get onig compiling to WASM for better perf, but I'd hit my timebox for the wasm highlighting spike.

schickling commented 3 years ago

Having spent way too much time on this myself (and also after having gotten Syntect to run in the browser with not-great performance results) I eventually landed on shiki which I can highly recommend.

camdencheek commented 3 years ago

@schickling thanks for the recommendation!