microsoft / language-server-protocol

Defines a common protocol for language servers.
https://microsoft.github.io/language-server-protocol/
Creative Commons Attribution 4.0 International

Define a protocol for syntax tokens #1063

Open floitsch opened 4 years ago

floitsch commented 4 years ago

Is my reading correct that semantic tokens only add additional information to tokens? Can we trust clients to render keywords correctly if they already have a syntax highlighter (frequently regex-based)?

If that's the case, should there be an option to "unset" a type? Just as an example, if you have #ifdefs, you could think of giving the LSP server the option of removing syntax coloring from inactive code segments by giving them the token type None.
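For context, semantic tokens travel over the wire as a flat, delta-encoded integer array with five integers per token, and every token must carry a type from the server's legend; the spec currently defines no "none"/unset type. A minimal sketch of the encoding (the `Token` shape and the helper name are illustrative, not from the spec):

```typescript
// Illustrative helper: encode absolute token positions into the
// delta-encoded flat array used by textDocument/semanticTokens/full.
interface Token {
  line: number;           // 0-based absolute line
  startChar: number;      // 0-based absolute start column
  length: number;         // token length in characters
  tokenType: number;      // index into the legend's tokenTypes
  tokenModifiers: number; // bit set over the legend's tokenModifiers
}

function encodeTokens(tokens: Token[]): number[] {
  const data: number[] = [];
  let prevLine = 0;
  let prevChar = 0;
  for (const t of tokens) {
    const deltaLine = t.line - prevLine;
    // deltaStart is relative to the previous token's start only when
    // both tokens are on the same line; otherwise it is absolute.
    const deltaStart = deltaLine === 0 ? t.startChar - prevChar : t.startChar;
    data.push(deltaLine, deltaStart, t.length, t.tokenType, t.tokenModifiers);
    prevLine = t.line;
    prevChar = t.startChar;
  }
  return data;
}
```

Because a token's type is a required index into the legend, "removing" highlighting would need either a dedicated type as proposed above or a client-side convention.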

rcjsuen commented 4 years ago

> Is my reading correct, that semantic tokens only add additional information to tokens?

That is what Visual Studio Code does. I am not sure about other LSP clients.

> Can we trust clients to render keywords correctly if they already have a syntax highlighter (frequently regex-based)?
>
> If that's the case, should there be an option to "unset" a type? Just as an example, if you have #ifdefs, you could think of giving the LSP server the option of removing syntax coloring from inactive code segments by giving them the token type None.

Merging the tokens (by applying them on top) versus replacing the tokens (by discarding the client's grammar information) has been discussed in the past (https://github.com/microsoft/language-server-protocol/issues/18#issuecomment-626688882), but it seems to have stalled a bit.

dbaeumer commented 3 years ago

All I can currently think of is to make this a client capability. I guess we will not be able to enforce consistent behavior across all clients.

radeksimko commented 3 years ago

Was it ever an ambition of semantic tokens to replace the existing (static, often regex-based) grammars? E.g. do you foresee any VS Code extension using exclusively semantic tokens and having no TM grammar?

Or is there a reason why using semantic tokens as the exclusive way of highlighting files would be a bad idea?


Personally I think it would be great if the wider community could agree on a single protocol/format for syntax highlighting, and adding this to LSIF would probably help too. There are just way too many different solutions to the same problem (TextMate is the most popular one, but really it's one of many).

I can see that being a long journey though, as I reckon editors would still want to provide some highlighting without LSP. Given that language servers today usually aren't built into the editor, having to take an extra step just to get syntax highlighting working seems like a potential source of friction in that context.

dbaeumer commented 2 years ago

Added a capability augmentsSyntaxTokens to the spec to allow clients to express the behavior. I will keep the issue open for the syntax support.
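The capability mentioned above lives under the client's `textDocument.semanticTokens` capabilities in LSP 3.17. A rough sketch of how a server might branch on it (the interface is trimmed to the relevant field, and the helper function is illustrative, not part of the spec):

```typescript
// Trimmed-down view of the client capability; the real
// SemanticTokensClientCapabilities interface has many more fields.
interface SemanticTokensClientCapabilities {
  // true => the client merges semantic tokens on top of its own
  // (e.g. TextMate-based) syntax highlighting.
  augmentsSyntaxTokens?: boolean;
}

// Illustrative server-side decision: if the client augments existing
// highlighting, the server can emit semantic-only tokens; otherwise it
// may need to cover every token itself to get complete highlighting.
function serverShouldEmitAllTokens(
  caps: SemanticTokensClientCapabilities
): boolean {
  return caps.augmentsSyntaxTokens !== true;
}
```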

sclu1034 commented 2 years ago

I'd love to see a clarification on the intended/expected scope of semanticTokens as well.

Even within just the few language servers I use regularly, there is huge variety in what highlighting tokens they provide. Some simply stick to the predefined values from the spec, some double down on the "additional color information" and provide only tokens that the client can't/doesn't know, while others go all out to provide a token for almost every character in a file.

I agree with radeksimko above that it would be great if LSP could serve as a unified provider for syntax+semantic highlighting. This would also lessen the work required of the user to configure/theme highlighting, since you wouldn't have to make two independent systems look nice together.

radeksimko commented 2 years ago

> Some simply stick to the predefined values from the spec, some double down on the "additional color information" and provide only tokens that the client can't/doesn't know, while others go all out to provide a token for almost every character in a file.

FWIW, we have recently implemented custom token types and modifiers while still keeping the predefined values from the spec as fallbacks, so with the right use of capability negotiation this doesn't need to be a "mutually exclusive" choice.

This of course assumes that a client wanting to make use of these custom token types and modifiers has to implement them, which practically couples it a bit more with a particular server. That seems okay to me, as highlighting should work just as before for clients which do not support the custom types and modifiers, as long as both sides do the capability negotiation right.
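The fallback idea described above can be sketched as follows. The custom type name and the fallback table are hypothetical; the negotiation input (the token types the client advertised in its capabilities) is the real mechanism:

```typescript
// Hypothetical custom token type mapped to a predefined spec type that
// clients are more likely to support.
const FALLBACKS: Record<string, string> = {
  "hcl-blockType": "type", // custom type -> predefined fallback
};

// Pick the custom type when the client advertised support for it,
// otherwise degrade to the predefined fallback; return undefined when
// neither is supported (the token then simply goes unstyled).
function resolveTokenType(
  wanted: string,
  clientTokenTypes: string[]
): string | undefined {
  if (clientTokenTypes.includes(wanted)) return wanted;
  const fallback = FALLBACKS[wanted];
  return fallback !== undefined && clientTokenTypes.includes(fallback)
    ? fallback
    : undefined;
}
```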


However, the whole highlighting chain in practice looks more like `language server <--> language client <--> theme`. I.e. highlighting capabilities (including augmentsSyntaxTokens) are IMO largely dependent on themes, which means that if the client is supposed to provide accurate capabilities, it would have to somehow consult the theme.

There is also ambiguity in handling of conflicts between extensions/themes which claim the same files on the client.

The VS Code extension API allows you to define a mapping of your custom types to TM scopes via semanticTokenScopes. That however only solves the (admittedly more common) problem of conflicts between generic themes working with TM scopes and token-based themes. It does not address conflicts between themes which are both token-based (where one could support predefined tokens like property and the other just customToken).
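For reference, that contribution point is declared in an extension's package.json roughly like this (the token and scope names here are placeholders, not real mappings):

```json
{
  "contributes": {
    "semanticTokenScopes": [
      {
        "scopes": {
          "customToken": ["keyword.other.customToken"]
        }
      }
    ]
  }
}
```

Themes that only understand TextMate scopes then style `customToken` through the listed scope, while token-aware themes can target the token type directly.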

Perhaps none of the above are strictly LSP problems but they are problems client maintainers will likely run into when implementing semantic token based highlighting.

~~Maybe LSP could help solve these problems if it somehow reflected the reality where token types have fallbacks and they're not entirely independent of each other?~~ I suppose ordering within the capability arrays can already do that, but the spec isn't clear on whether ordering is important (beyond serving as a legend).

rhdunn commented 1 year ago

There are some languages (e.g. XQuery) where a state-based tokenizer is necessary in order to determine the correct tokens. This is partially because it functions as a templating language, like PHP or the Liquid templating engine, where you can mix XML and XPath/XQuery code.

Additionally, XQuery can use most of the keywords as identifiers, with different versions of the language (and vendor extensions) having different sets of reserved keywords. This means that any regex-based keyword syntax highlighting will need to be able to remove keyword highlighting and specify a semantics-based highlighting (variable, function, etc.).
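A toy sketch of that context sensitivity (the keyword list and the single-character lookahead rule are simplifications for illustration, not real XQuery grammar rules):

```typescript
// Words that act as keywords in some contexts but remain usable as
// ordinary names in XQuery-like languages.
const KEYWORDS = new Set(["for", "let", "return", "where"]);

// Context-aware classification: a regex matching the word alone cannot
// distinguish these cases, because the answer depends on what follows.
function classify(word: string, nextChar: string): string {
  if (KEYWORDS.has(word)) {
    // Simplified rule: a directly following "(" signals a function
    // call, so the "keyword" is actually a function name here.
    return nextChar === "(" ? "function" : "keyword";
  }
  return "variable";
}
```

A stateful lexer generalizes this lookahead to full parser state, which is exactly the information a semantic-token-producing language server already has.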

Having to maintain two different grammars or hand-written lexers/parsers, one for syntax highlighting and one for everything else, generally defeats the point of having the LSP separate from the editor. It also means that there can be differences in the highlighting, especially for complex languages or language features like string interpolation and language injection (e.g. CSS in HTML style elements).