python-lsp / python-lsp-server

Fork of the python-language-server project, maintained by the Spyder IDE team and the community
MIT License

Semantic Token Support #533

Open Doekeb opened 3 months ago

Doekeb commented 3 months ago

LSP supports Semantic Tokens which editors and colorschemes can opt into in order to provide "smarter" language highlighting than pure tree-based highlighting. https://code.visualstudio.com/api/language-extensions/semantic-highlight-guide https://microsoft.github.io/language-server-protocol/specifications/lsp/3.17/specification/#textDocument_semanticTokens

Notably, Neovim now supports semantic tokens (https://github.com/neovim/neovim/pull/21100) and, more recently, semantic token modifiers (https://github.com/neovim/neovim/pull/22022).

This feature has been requested in this repo (https://github.com/python-lsp/python-lsp-server/issues/33), in the unmaintained base project (https://github.com/palantir/python-language-server/issues/933), and in another jedi-based language server (https://github.com/pappasam/jedi-language-server/issues/137). Its implementation has been attempted and abandoned twice in the latter: https://github.com/pappasam/jedi-language-server/pull/196 and https://github.com/pappasam/jedi-language-server/pull/231

There is a maintained fork of an alternative tool for Neovim here: https://github.com/wookayin/semshi, but it suffers from two major drawbacks: it is only available for Neovim, and its highlight colors are hardcoded, so they are unlikely to match the user's colorscheme.

This PR only implements the full document protocol. Performance may be improved by also implementing the full document delta protocol and the range protocol.
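For reference, the full document response is a flat list of integers, five per token, delta-encoded against the previous token (deltaLine, deltaStartChar, length, tokenType index, tokenModifiers bitmask). A minimal sketch of that encoding; the legend below is illustrative, not pylsp's actual legend:

```python
# Illustrative legend; real servers advertise theirs in the
# semanticTokensProvider capability during initialization.
TOKEN_TYPES = ["class", "function", "parameter", "property"]
TOKEN_MODIFIERS = ["declaration", "readonly"]

def encode_tokens(tokens):
    """tokens: iterable of (line, start_char, length, type_name, modifiers),
    already sorted by (line, start_char)."""
    data = []
    prev_line = prev_start = 0
    for line, start, length, type_name, modifiers in tokens:
        delta_line = line - prev_line
        # start_char is relative to the previous token only on the same line
        delta_start = start - prev_start if delta_line == 0 else start
        bitmask = 0
        for mod in modifiers:
            bitmask |= 1 << TOKEN_MODIFIERS.index(mod)
        data.extend([delta_line, delta_start, length,
                     TOKEN_TYPES.index(type_name), bitmask])
        prev_line, prev_start = line, start
    return data

# Two tokens: "Foo" at line 0 col 6, "bar" at line 2 col 4.
print(encode_tokens([
    (0, 6, 3, "class", ["declaration"]),
    (2, 4, 3, "function", []),
]))
# -> [0, 6, 3, 0, 1, 2, 4, 3, 1, 0]
```

The delta protocol then lets a server send only the changed slice of this integer array instead of regenerating and resending all of it.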

Here are some examples in two different colorschemes, with only very simple rules implemented so far. Tree-based highlighting is always on the left; the same highlighting augmented with semantic tokens is always on the right.

Functions and classes

[screenshots: classes_functions_cp / classes_functions_tn]

Imports

Tree-based highlighting can't determine what kind of thing imported names are, other than by their naming (which often breaks convention, even in standard library modules in Python).
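For example, purely syntactic rules have nothing to go on here; only name resolution (which jedi provides) can tell these apart:

```python
# Tree-based highlighting sees only the names; it cannot tell that these
# lowercase stdlib imports are classes, or that namedtuple is a function.
from datetime import datetime, date   # both classes, despite lowercase names
from collections import namedtuple    # a function, despite class-like usage

Point = namedtuple("Point", ["x", "y"])  # a function call that returns a new class
origin = Point(0, 0)
print(type(datetime), type(namedtuple))  # <class 'type'> vs <class 'function'>
```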

[screenshots: imports_cp / imports_tn]

Parameters

[screenshots: parameters_cp / parameters_tn]

Properties

Tree-based highlighting guesses whether an attribute is a property or a method based on the presence of parentheses. Semantic token highlighting knows the difference.
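A small example of why parentheses are an unreliable signal; both accesses below are syntactically identical attribute accesses, and only semantic analysis knows one resolves to a property:

```python
class File:
    def __init__(self, nbytes):
        self._nbytes = nbytes

    @property
    def size(self):          # a property: accessed without parentheses
        return self._nbytes

    def read(self):          # a method
        return b"\x00" * self._nbytes

f = File(4)
print(f.size)        # property access, no parentheses
handle = f.read      # a bound-method reference also has no parentheses,
print(handle())      # so "no parens" does not imply "property"
```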

[screenshots: properties_cp / properties_tn]

rchl commented 3 months ago

Do you have some performance data? For example, how long does it take to generate tokens in a 2000-line document? It feels like it would be very slow to trigger "goto" for each "name" like that.

Ideally such a feature would be implemented by jedi and use some form of caching to speed things up. LSP semantic tokens are designed in a way that should make adding/removing text pretty fast, but in your implementation it seems like the whole work will be done from scratch on every single change.

Doekeb commented 3 months ago

Do you have some performance data? For example, how long does it take to generate tokens in a 2000-line document? It feels like it would be very slow to trigger "goto" for each "name" like that.

I don't have performance data on a behemoth like that, but I'm happy to gather some, especially if you can point me in the direction of a big project I can try it on. Additionally, if performance ends up being an issue for huge files, it would be fairly simple to implement the range protocol, which exists exactly for this purpose. From the LSP spec:

There are two use cases where it can be beneficial to only compute semantic tokens for a visible range:

  • for faster rendering of the tokens in the user interface when a user opens a file. In this use case servers should also implement the textDocument/semanticTokens/full request as well to allow for flicker free scrolling and semantic coloring of a minimap.
  • if computing semantic tokens for a full document is too expensive servers can only provide a range call. In this case the client might not render a minimap correctly or might even decide to not show any semantic tokens at all.

Determining when to request full semantic tokens vs. a range would then be the client's responsibility.
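A hedged sketch of how a range request could reuse the same machinery (names here are hypothetical, not part of this PR): filter the computed tokens to the requested line span, then delta-encode. Because the encoding is relative, the first token in the span carries its absolute line number:

```python
def encode(tokens):
    """tokens: (line, start_char, length, type_index, modifier_mask) tuples,
    sorted by position; returns the flat LSP integer array."""
    data, prev_line, prev_start = [], 0, 0
    for line, start, length, ttype, mods in tokens:
        delta_line = line - prev_line
        delta_start = start - prev_start if delta_line == 0 else start
        data += [delta_line, delta_start, length, ttype, mods]
        prev_line, prev_start = line, start
    return data

def semantic_tokens_range(all_tokens, start_line, end_line):
    # Serve textDocument/semanticTokens/range by clipping to the visible
    # span; per-request encoding cost becomes proportional to that span.
    span = [t for t in all_tokens if start_line <= t[0] <= end_line]
    return encode(span)
```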

Ideally such a feature would be implemented by jedi and use some form of caching to speed things up. LSP semantic tokens are designed in a way that should make adding/removing text pretty fast, but in your implementation it seems like the whole work will be done from scratch on every single change.

I agree that an upstream implementation is possible and preferable, and it would be great to contribute a portion of this to jedi down the road. But hopefully this can work for the people who want it in the meantime.

If performance is a major concern (and I agree it would be good to gather more information on this front), we could begin by making this plugin opt-in, like many of the other bundled plugins.
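For illustration, an opt-in would follow pylsp's usual pylsp.plugins.&lt;name&gt;.enabled settings convention; the plugin name "semantic_tokens" below is an assumption, not a name fixed by this PR:

```python
# Hypothetical client-side settings payload enabling the plugin;
# only the pylsp.plugins.<name>.enabled shape follows existing convention.
settings = {
    "pylsp": {
        "plugins": {
            "semantic_tokens": {"enabled": True},  # default would be False
        }
    }
}
```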

rchl commented 3 months ago

I don't have performance data on a behemoth like that, but I'm happy to gather some, especially if you can point me in the direction of a big project I can try it on.

Not as big but maybe https://github.com/davidhalter/jedi/blob/master/jedi/plugins/stdlib.py

Additionally, if performance ends up being an issue for huge files, it would be fairly simple to implement the range protocol, which exists exactly for this purpose. From the LSP spec:

Would it really be that easy? It really depends on whether the API you are using for this makes that possible.