[semantic] proposals for new standard semantic token types

aeschli commented 4 years ago

The new semantic token provider API comes with a list of standard token types and modifiers. https://code.visualstudio.com/api/language-extensions/semantic-highlight-guide#semantic-token-classification

These types serve as a base across languages; having all or most providers use them will make it easier to write theming rules across languages.

That said, semantic token providers are not forced to stick to the standard, but can add new types/modifiers, or extend existing types as seen in the doc.
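As a rough illustration of what "extending an existing type" buys a theme author, here is a minimal sketch of super type fallback: a custom type without its own theming rule picks up the rule of the type it extends. The names `superTypeOf`, `themeColors` and `resolveColor`, and the colors, are made up for this example and are not VS Code API.

```typescript
// Hypothetical custom token types, each naming the standard type it extends.
const superTypeOf = new Map<string, string>([
  ["importKeyword", "keyword"],
  ["modifierKeyword", "keyword"],
  ["docComment", "comment"],
]);

// Hypothetical theme that only knows the standard types (colors are made up).
const themeColors = new Map<string, string>([
  ["keyword", "#569CD6"],
  ["comment", "#6A9955"],
]);

// Walk up the super type chain until a type with a theming rule is found.
function resolveColor(tokenType: string): string | undefined {
  let current: string | undefined = tokenType;
  let steps = 0; // guards against accidental cycles in the super type table
  while (current !== undefined && steps < 16) {
    const color = themeColors.get(current);
    if (color !== undefined) return color;
    current = superTypeOf.get(current);
    steps++;
  }
  return undefined;
}
```

With this table, `resolveColor("importKeyword")` yields the `keyword` color even though the theme has never heard of `importKeyword`.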

This issue is to collect proposals for new types and modifiers. When making a suggestion, please add a description and a small code sample. If it exists, name the corresponding TextMate scope.

The standard token types should be applicable across multiple languages and be useful for theming. We want to keep the set of standard tokens consistent and coherent.

Proposed types:

| Identifier (extends) | Description | Ref | Sample |
| --- | --- | --- | --- |
| importKeyword (keyword) | keywords related to imports/includes | (1) | `import * from x` |
| modifierKeyword (keyword) | keywords describing a modifier | (1) | `private void foo();` |
| docComment (comment) | documentation comments | #96712 | `/** */` |

Proposed modifiers:

| Identifier | Description | Ref | Sample |
| --- | --- | --- | --- |
| unused | annotates all unused symbols | (2) | `let unusedVariable;` |

References: (1) https://github.com/microsoft/language-server-protocol/issues/968 (2) https://github.com/microsoft/vscode-languageserver-node/issues/604
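For context on why modifiers are cheap to add on the wire: in the LSP encoding each token is five integers (line delta, start-character delta, length, type index, modifier bit set), so a modifier such as the proposed `unused` is just one more bit in the legend. The sketch below assumes that encoding; the legend contents and the helper name `encodeTokens` are illustrative, and `unused` is a proposal here, not a standard modifier.

```typescript
interface SemTokenLegend {
  tokenTypes: string[];
  tokenModifiers: string[];
}

interface SemToken {
  line: number;
  startChar: number;
  length: number;
  type: string;
  modifiers: string[];
}

// Illustrative legend; `unused` is the modifier proposed in the table above.
const legend: SemTokenLegend = {
  tokenTypes: ["keyword", "variable"],
  tokenModifiers: ["declaration", "unused"],
};

// Encode tokens (assumed sorted by position) as relative 5-integer groups.
function encodeTokens(tokens: SemToken[], lg: SemTokenLegend): number[] {
  const data: number[] = [];
  let prevLine = 0;
  let prevChar = 0;
  for (const tok of tokens) {
    const typeIndex = lg.tokenTypes.indexOf(tok.type);
    if (typeIndex < 0) throw new Error("type not in legend: " + tok.type);
    let modifierBits = 0;
    for (const mod of tok.modifiers) {
      const bit = lg.tokenModifiers.indexOf(mod);
      if (bit < 0) throw new Error("modifier not in legend: " + mod);
      modifierBits |= 1 << bit;
    }
    const deltaLine = tok.line - prevLine;
    // The start is relative to the previous token only when on the same line.
    const deltaChar = deltaLine === 0 ? tok.startChar - prevChar : tok.startChar;
    data.push(deltaLine, deltaChar, tok.length, typeIndex, modifierBits);
    prevLine = tok.line;
    prevChar = tok.startChar;
  }
  return data;
}

// Tokens for the sample `let unusedVariable;`
const encoded = encodeTokens(
  [
    { line: 0, startChar: 0, length: 3, type: "keyword", modifiers: [] },
    { line: 0, startChar: 4, length: 14, type: "variable", modifiers: ["declaration", "unused"] },
  ],
  legend,
);
```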

kjeremy commented 4 years ago

@matklad

matklad commented 4 years ago

Types

| Identifier (extends) | Description | Sample |
| --- | --- | --- |
| attribute/annotation | syntax for attributes and annotations | `#[test] fn smoke() {}`, `@Test void method() {}` |
| builtinType (type) | non-user-defined types | `i32`, `long long int` |
| typeAlias (type) | names of type aliases and typedefs | `type Unit = ();`, `typedef int int_alias;` |
| union (type) | names of C-style untagged unions | `union U { a: i32, b: f32 }`, `union u { int a; float b; };` |

Modifiers

| Identifier (extends) | Description | Sample |
| --- | --- | --- |
| unresolved | for all unresolved symbols (such symbols should also have a corresponding diagnostic) | `let my_resolved = 92; my_un_resolved;` |

All extra tokens & types defined by rust-analyzer: https://github.com/rust-analyzer/rust-analyzer/blob/9cb55966fe0fee791072f275ac55b90b8ee13e32/editors/code/package.json#L522-L572

Hm, actually, unresolved might want to be a type rather than a modifier. It feels similar to unused, but if something is unresolved you, by definition, can't say which type it is. It is a type in rust-analyzer.

ghost commented 4 years ago

While I think it makes sense to add in things like typeAlias and union since a larger subset of system level languages is for sure likely to use this, these still sound specific enough that I'm wondering if a theme author may not give a different shade to all of these or have trouble figuring out what shade to give: imagine somebody who just did some Python scripting and wants to make a new theme, they'll probably have a hard time judging if something like union needs a separate color and to what it should be close in shade. It's probably enough to color type, but I'm not sure how immediately obvious that is to a theme author...?

As a result, I'm wondering: should the resulting LSP semantic tokens formal specification also include some guidance on which token types and modifier types should be considered as important to have different colors? To sort of establish a baseline on what a theme is expected to cover. Or would the expectation really be that all theme authors pick something for a relatively specific thing like a union in particular, or that they know that it's not required?

While this doesn't directly matter to the protocol implementation on either side of course, I feel like it could probably be pretty relevant for how it all plays together in the end to give some guidance for theme authors here.

woody77 commented 4 years ago

| Identifier | Description | Ref | Sample |
| --- | --- | --- | --- |
| documentation | for tokens that are part of documentation | (1) | javadoc, rust docs, doxygen |
| disabled | for tokens that are turned off by compilation flags | | `#ifdef(foo) ....... #endif` in C/C++, `#[cfg(foo)]` in Rust |
| example/sample | sample code in comments (doc or code) | | https://doc.rust-lang.org/src/std/time.rs.html#175 |
| markdown | these types are markdown (in e.g. comments) | | see above |

Note that "documentation" exists today, but without any documentation as to when it's to be used, and what semantic meaning it has, so this is a proposal that comment.documentation would apply to:

- javadoc
- rust doc comments
- doxygen in C++
- other languages that specifically call out "doc comments" separate from "code comments"

disabled is something that I see VSCode do with C++ and #ifdefs, but doesn't seem to be via the semantic types. I see #ifdef'd out code with a muted set of colors (50% transparency?) but the type inspector says it should be the same as not-disabled code.

example or sample: sample code in doc comments is parsed, semantically highlighted, and flagged for correctness in some IDEs, but is rendered by the themes in a mix of comment and normal formatting (say normal colors, but in italics).

markdown may be best handled in other ways. Rust, for example, uses Markdown-style headings and links in its doc comments; maybe the better way to handle that is by marking them as markdown types (heading, links, etc.) and applying documentation as the modifier.

References: (1) https://code.visualstudio.com/api/language-extensions/semantic-highlight-guide#semantic-token-classification

DanTup commented 4 years ago

Should there be a token type for things like TypeScript's decorators (Dart has similar called annotations):

```
// TypeScript

function foo() {}

@foo
function bar() {}

// Dart

@mustCallSuper
void foo() {}
```

I don't think any of the existing ones fit?

aeschli commented 4 years ago

Yes, I agree that a token type annotation would be useful.

TylerLeonhardt commented 4 years ago

We have been talking about annotation in https://github.com/microsoft/language-server-protocol/issues/1067

woody77 commented 4 years ago

rust-analyzer also provides something similar (called attribute there), as both a token type and a modifier, since there can be functions within them.

The derive is a function.attribute, and the rest of the item has attribute token type. Although the Debug should maybe be marked as an interface ("trait" in Rust).

DanTup commented 4 years ago

There are types for string and number so should there also be one for boolean?

dannymcgee commented 3 years ago

Hate to resurrect a stale comment thread, but hey, how bout that decorator/attribute/annotation token. :)

I love semantic highlighting but it's killing me that the @ symbol is the only thing distinguishing my function decorators from the functions they're decorating — it makes the code quite a bit less legible.

dbaeumer commented 3 years ago

@aeschli do you have any plans to extend this in VS Code?

0dinD commented 3 years ago

What's the status on the modifierKeyword token type? Was a bit confused about it a while ago when implementing some additions to semantic highlighting in the Java language server, since a modifier token type already seems to be part of the official LSP spec. From reading microsoft/language-server-protocol#968 however, it becomes even more unclear whether or not modifier or modifierKeyword is or will be a standard token type. After some discussion, we decided to use the modifier token type as it seems to be more standardized. But I ended up having to treat it as a custom token type in the vscode-java extension anyway (declaring scope mapping etc.), since it doesn't seem to be part of the standard token types in VS Code.

I think some coordination is required between LSP and VS Code here, to make sure that standard LSP token types are also standard in VS Code, as well as agreeing on a name (modifier vs modifierKeyword). At the very least, some scope mapping for modifier would be nice, so that extensions don't have to define it themselves.

sam-mccall commented 3 years ago

annotation, builtinType, typeAlias, union mentioned above would all be useful for clangd (C++).

unresolved or maybe "unknown" too, and I think it should be a type rather than a modifier. (For those familiar with C++ templates, dependent names could be modeled as a modifier, and their tokens would be either Type+DependentName or Unknown+DependentName)

sam-mccall commented 3 years ago

What do people think of modifiers for scope? Maybe function/class/module/global

```cpp
int x; // variable+globalScope
static int x; // variable+moduleScope
class C {
  int x; // property+classScope
  static int x; // variable+classScope
};
void F() {
  int x; // variable+functionScope
}
```

These are loose, but distinguishing global variables from function-locals at a glance seems pretty useful!
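To make the theming side concrete, here is a minimal sketch of how a theme rule like `variable.globalScope` could be matched against a token carrying scope modifiers. It assumes the `type.modifier` selector form from the semantic highlight guide and ignores the optional `:language` suffix. The modifier names are the ones proposed above, while the rule table, colors and the "most modifiers wins" tie-break are assumptions of this example.

```typescript
interface ParsedSelector {
  type: string;
  modifiers: string[];
}

// Split `type.mod1.mod2` into its type and modifier parts.
function parseSelector(selector: string): ParsedSelector {
  const parts = selector.split(".");
  return { type: parts[0] ?? "*", modifiers: parts.slice(1) };
}

// A selector matches when the type agrees (or is `*`) and every modifier
// named in the selector is present on the token.
function selectorMatches(selector: string, tokenType: string, tokenModifiers: string[]): boolean {
  const parsed = parseSelector(selector);
  const typeMatches = parsed.type === "*" || parsed.type === tokenType;
  return typeMatches && parsed.modifiers.every((m) => tokenModifiers.indexOf(m) !== -1);
}

// Hypothetical theme rules using the proposed scope modifiers.
const scopeThemeRules: Record<string, string> = {
  "variable": "#9CDCFE",
  "variable.globalScope": "#FF8800",
  "*.functionScope": "#AAAAAA",
};

// Assumption: among matching rules, prefer the one naming the most modifiers.
function pickColor(tokenType: string, tokenModifiers: string[]): string | undefined {
  let best: string | undefined;
  let bestScore = -1;
  for (const selector of Object.keys(scopeThemeRules)) {
    if (!selectorMatches(selector, tokenType, tokenModifiers)) continue;
    const score = parseSelector(selector).modifiers.length;
    if (score > bestScore) {
      best = scopeThemeRules[selector];
      bestScore = score;
    }
  }
  return best;
}
```

A global `int x` tagged `variable+globalScope` would then get the highlighted color, while a function-local one falls through to the muted `*.functionScope` rule.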

woody77 commented 3 years ago

modifiers for scope would be useful. RustAnalyzer has some custom types that somewhat work along those lines:

- fields of structures
- function params
- bare stack variables
- static variables (well, constants)

Rust doesn't have global in the same way, but the same spectrum of types applies.

stamblerre commented 3 years ago

From https://github.com/microsoft/vscode/issues/125448: A token type to represent string placeholders. For example, the %s in "Hello, my name is %s" in Go. Per @aeschli's suggestion, it could be called stringPlaceholder.

DanTup commented 3 years ago

> A token type to represent string placeholders

Slightly related (though not sure if these should be types or modifiers):

- Interpolation markers. They're not strictly placeholders, but should be coloured. E.g. the $, {, } in "a $foo b ${foo}".
- Escaped characters. These exist in the textmate grammars (constant.character.escape) but not in semantic tokens so I had to make my own. This allows the \n in "foo\nfbar" to be coloured.
- A reset (again this exists in the textmate grammar as meta.embedded) to allow semantic tokens to remove colours added by the textmate grammar. For example if the textmate grammar doesn't do string interpolation and just colours an entire string but the semantic tokens then want to layer colours on top, they might want to have some "uncoloured" sections (for example the interpolated expression contains some operators that are usually uncoloured). I'm currently also handling this myself (I made a "source" type and mapped it to "meta.embedded" in package.json), but since these types/modifiers are shared with LSP and other LSP clients won't have this package.json, it would be better to support natively.

aeschli commented 2 years ago

I added a new type decorator to be used for decorators and annotations (see https://github.com/microsoft/vscode/issues/114082). The current TextMate fallback is meta.decorator, entity.name.function. If someone has a better fallback, let me know.
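A minimal sketch of what such a TextMate fallback means for a theme without semantic rules: the candidate scopes for a token type are tried in order, and the first one the theme has a rule for decides the color. The in-order lookup, the prefix matching, and the names `fallbackScopes`, `tmThemeRules` and `fallbackColor` are assumptions of this example; the `source` entry mirrors the custom mapping to meta.embedded described earlier in the thread.

```typescript
// Token type -> candidate TextMate scopes, tried in order.
const fallbackScopes: Record<string, string[]> = {
  decorator: ["meta.decorator", "entity.name.function"],
  source: ["meta.embedded"], // custom "reset" type mapped by an extension
};

// Hypothetical TextMate theme rules (selector -> color); note there is no
// rule for meta.decorator.
const tmThemeRules: Record<string, string> = {
  "entity.name.function": "#DCDCAA",
  "meta.embedded": "#D4D4D4",
};

// A TextMate selector applies to a scope when it equals the scope or is a
// dot-separated prefix of it.
function scopeMatches(themeSelector: string, scope: string): boolean {
  return scope === themeSelector || scope.indexOf(themeSelector + ".") === 0;
}

// Return the color of the first candidate scope that any theme rule covers.
// Simplification: the first covering rule is used, not the most specific one.
function fallbackColor(tokenType: string): string | undefined {
  const scopes = fallbackScopes[tokenType] ?? [];
  for (const scope of scopes) {
    for (const selector of Object.keys(tmThemeRules)) {
      if (scopeMatches(selector, scope)) return tmThemeRules[selector];
    }
  }
  return undefined;
}
```

In this theme a `decorator` token ends up with the function color, because nothing styles meta.decorator and the lookup moves on to entity.name.function.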

lnicola commented 2 years ago

@aeschli should decorator and label also be added to LSP?

dbaeumer commented 2 years ago

Added it.

DanTup commented 2 years ago

@aeschli I had a request for additional modifiers so that a theme author can customise colours of some keywords specifically:

https://github.com/Dart-Code/Dart-Code/issues/3926

It feels awkward to provide a modifier for each language keyword - are there any guidelines on how fine-grain these should be? Would it be a reasonable/feasible VS Code feature request to allow the text content of a token to be used by theme authors? (for ex. keyword['void'])

dannymcgee commented 2 years ago

> @aeschli I had a request for additional modifiers so that a theme author can customise colours of some keywords specifically:
>
> Dart-Code/Dart-Code#3926
>
> It feels awkward to provide a modifier for each language keyword - are there any guidelines on how fine-grain these should be? Would it be a reasonable/feasible VS Code feature request to allow the text content of a token to be used by theme authors? (for ex. keyword['void'])

@DanTup What I generally try to do is use the TextMate grammar to map out most of the syntax, and only use semantic tokens to give semantic meaning to identifiers (e.g., to distinguish between a class, an interface, and a type alias — something that you can't really do without parsing the source code). Keywords are trivial to catch with a regular expression, and then you can just use a back-reference to insert the matched text into the TM scope:

```json
{
  "match": "\\b(if|else|switch|case|for|while|break)\\b",
  "name": "keyword.control.$1.languageid"
}
```

Then a theme author could use, e.g., "keyword.control.for" to make that specific keyword its own color if they really wanted to.
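To spell out what the back-reference does, here is a small sketch that applies the same rule to a single word and substitutes the captured keyword into the scope name. JavaScript's `RegExp` stands in for the grammar engine here, and the helper names `scopeNameFor` and `themeSelectorApplies` are made up for the example.

```typescript
// The rule from above, as a plain object.
const keywordRule = {
  match: "\\b(if|else|switch|case|for|while|break)\\b",
  name: "keyword.control.$1.languageid",
};

// Return the scope name the rule would assign to `word`, or undefined if the
// rule does not match. `$N` in the name is replaced by capture group N.
function scopeNameFor(word: string): string | undefined {
  const captures = new RegExp(keywordRule.match).exec(word);
  if (captures === null) return undefined;
  return keywordRule.name.replace(
    /\$(\d+)/g,
    (_whole: string, index: string) => captures[Number(index)] ?? "",
  );
}

// A theme selector applies when it equals the scope name or is a
// dot-separated prefix of it.
function themeSelectorApplies(selector: string, scopeName: string): boolean {
  return scopeName === selector || scopeName.indexOf(selector + ".") === 0;
}
```

So `for` is scoped as `keyword.control.for.languageid`, which both the broad `keyword.control` selector and the narrow `keyword.control.for` selector cover.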

Marking up the entire syntax with semantic tokens is something I would try to avoid personally (or hide behind a configuration flag if you need to provide those tokens for editors other than VS Code), because VS Code treats semantic tokens a bit like an ID selector in CSS, which does really limit the flexibility of theme authors and end users to customize the syntax colors in a granular way.

DanTup commented 2 years ago

@dannymcgee I don't think adding configuration to the server to produce a reduced set of tokens would be a good fit here. It would mean the server has to have some knowledge of the specific client and its textmate grammar (which may change over time). I'd prefer to add additional modifiers than that, but I was hoping there could be a better way (themes are the sort of things people really like to make their own, so being able to customise some specific tokens without the servers needing to mark them all up individually seems like a powerful feature).

aeschli commented 2 years ago

@DanTup Currently we need all semantic token types and modifiers to be known beforehand. So yes, there's no alternative to list them.

dannymcgee commented 2 years ago

> @dannymcgee It would mean the server has to have some knowledge of the specific client and its textmate grammar

Couldn't you just use a simple toggle that either a) tokenizes everything or b) tokenizes only identifiers? (The latter is the option I would personally prefer as an end user.) It doesn't require any knowledge of the specific grammar (or even the specific client), just a general idea that certain clients may be supplementing the semantic tokens with some other tokenizer, so they only need specification of semantic (as opposed to syntactic) information.

For what it's worth, it wouldn't be without precedent — that's how the TypeScript implementation works, and Rust Analyzer has an option to skip tokenizing strings. (But no pressure, obviously, it is your project. 🙂)

HighCommander4 commented 2 years ago

> Couldn't you just use a simple toggle that either a) tokenizes everything or b) tokenizes only identifiers? (The latter is the option I would personally prefer as an end user.) It doesn't require any knowledge of the specific grammar (or even the specific client), just a general idea that certain clients may be supplementing the semantic tokens with some other tokenizer, so they only need specification of semantic (as opposed to syntactic) information.

If I'm understanding you correctly, the augmentsSyntaxTokens capability in the upcoming 3.17 version of the spec is precisely such a toggle.
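A rough sketch of how a server might act on that capability, assuming it is read from `textDocument.semanticTokens.augmentsSyntaxTokens` in the client capabilities. The interfaces below are trimmed-down stand-ins rather than the real protocol types, and the list of "purely syntactic" token types is an assumption each server would have to decide for itself.

```typescript
// Minimal stand-in for the relevant slice of the client capabilities.
interface ClientCapabilitiesSlice {
  textDocument?: {
    semanticTokens?: {
      augmentsSyntaxTokens?: boolean;
    };
  };
}

interface ClassifiedToken {
  text: string;
  type: string;
}

// Assumption: token types a client-side grammar is expected to handle.
const SYNTACTIC_TYPES = ["keyword", "comment", "string", "number", "operator"];

// Send everything unless the client says it layers semantic tokens on top of
// its own syntax tokens; in that case drop the purely syntactic ones.
function tokensToSend(tokens: ClassifiedToken[], caps: ClientCapabilitiesSlice): ClassifiedToken[] {
  const augments = caps.textDocument?.semanticTokens?.augmentsSyntaxTokens === true;
  if (!augments) return tokens;
  return tokens.filter((t) => SYNTACTIC_TYPES.indexOf(t.type) === -1);
}
```

An unset or `false` capability keeps the full token set, so clients without their own grammar still get complete highlighting.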

DanTup commented 2 years ago

> Couldn't you just use a simple toggle that either a) tokenizes everything or b) tokenizes only identifiers?

I don't think so - the semantic tokens are adding more value than just identifiers. There are a lot of things that are complicated to handle 100% accurately in the textmate grammar (expressions in string interpolation can include keywords, for example, and documentation comments can include full code blocks).

Even with a built-in toggle, it seems like assumptions would have to be made about what the client is otherwise colouring, and unless/until LSP allowed us to provide the textmate grammar to the client, that's something I'd prefer not to make assumptions about (at least, not for something minor like a small number of users wanting to customise colours of a few specific keywords).

My real question is about how fine-grained these tokens/modifiers can/should be. I can easily handle this by just adding a custom modifier for every keyword (we already have a lot of custom modifiers to help theming), and that feels better to me than producing a restricted set of tokens - but I don't think it's as good as VS Code having more flexible built-in theming (since anything I do specifically for my language will not necessarily be consistent with other languages).

Jamesernator commented 2 years ago

So I have my own custom theme that uses a mapping from semantic token -> textmate tokens so that I can write my theme entirely semantically and have it work on non-semantic languages automatically. For the most part the semantic tokens cover most things I've come across however there are a few semantic token types that would be helpful as quite a few tokens simply have no corresponding semantic token to denote them.

Of note is a lack of semantic tokens for HTML/XML like tokens (semantically I don't feel the existing tokens cover any of these even if some could be contrived like class<->tag):

tag
- Corresponding to <tag> in HTML/XML etc
text
- Corresponding to text in HTML/XML etc, NOTE this differs from string literals in that attributes would generally be colored as string literals, but text content would differ, this can be seen in a sample on github like:
```
<tag attr="value">Some text</tag>
```
  In this example the text semantic token would refer to Some text, but the existing string token would be used for "value" (in the attribute)
attribute
- Like the other two HTML/XML tokens suggested, attribute would refer to HTML attribute names (not their values)

From adding rules for JS, I found of particular help distinguishing would be:

boolean
- Other literal types like number, regexp, string exist, but not boolean which is supported by many languages
constant
- Would cover literal types without a more specific type like number/regexp/string/etc
.operator modifier for keyword
- Some keywords like new are more semantically like operators than other keywords
.expression modifier for keyword
- Some keywords are semantically more like values than "keywords", for example this
.storage modifier for keyword
- Some keywords specifically denote kinds of storage, for example const/let/var/readonly/private etc
.control modifier for keyword
- For keywords that declare control structures like if/for/while/etc
null
- For the null literal (similar to number/string/etc), very common literal in languages
.assignment modifier for operator
- Would cover operators like =, +=, etc, generally want to visually distinguish these from expression operators
.comparison modifier for operator
- Would cover operators like ==/</>/etc
.logical modifier for operator
- Would cover operators like &&/!/not/and/etc
.arithmetic modifier for operator
- Would cover operators like +/-/*/etc
punctuation
- There should be a semantic way to refer to punctuation, like ., {, (, etc etc, modifiers would probably be desirable here (though personally I just color them all grey)
.characterClass modifier for regexp
- This would target [a-z] and similar inside regexps
.escape modifier for string and regexp
- This would target \n, \u2202 and similar
.delimiter for string and regexp
- This would allow targeting the quotes and slashes for strings and regexps

aeschli commented 2 years ago

@Jamesernator Thanks a lot for sharing!

iDad5 commented 2 years ago

I do not have a clear idea of what kind of modifier to add. Something like what @DanTup suggested here seems an option to me, but I'm far from having a deep understanding. Trying to create a theme, though, I found that the scope of variable.defaultLibrary in JS and TS is very broad and overrides quite a lot; other *.defaultLibrary selectors in various languages probably do too. I'd guess that I'm not the only one who would like to give visual preference to certain built-in constructs over others.

I came upon this when I tried to give special emphasis to console, which by nature has (for me) a very different scope and use than built-in constants like Math.

MartinGC94 commented 1 month ago

How about Command Arguments (alternative names could be bare quote strings or generic tokens)?
Command line languages like PowerShell, Batch, Bash, etc. allow you to run commands like: command -parameter argument where the argument can either be a quoted or unquoted string value. Whether or not the string is quoted is important info because it affects how Bash handles wildcard characters and PowerShell includes similar logic when calling native programs.

microsoft / vscode

[semantic] proposals for new standard semantic token types #97063