Expanding semantic theming to support semantically embedded languages for colorization

NTaylorMullen commented 4 years ago

Overview

Languages like Razor (and I imagine HTML for custom attributes) typically have scenarios where portions of the document are semantically a different language.

In Razor this happens frequently through the use of TagHelpers or in Blazor:

<form asp-antiforgery="ViewBag.ShouldRenderAntiforgery">
...
</form>

In this example we'd expect ViewBag.ShouldRenderAntiforgery to be C#. TagHelpers (things that apply to HTML and change the semantic language of the right-hand side of an attribute) can be customized by users in limitless ways, so we need full control over telling the IDE what is C# and what isn't.

Ideas on how to implement

Over in the issue discussing general semantic colorization, the proposal was to have an API similar to:

interface SemanticHighlightRangeProvider {
   provideHighlightRanges(doc,...): [ Range, TokenType, TokenModifier[] ][];
}

The proposed approach can also be used to enable semantic language colorization without enabling an entire language's extensions for a subset of a document.

To do this one could expand on the ThemeDefinition and add a language parameter:

interface TokenStyle {
   foreground?: Color | ColorFunction
   style?: bold | italic | underline
   language?: string
}

This would enable LanguageServers to mark a chunk of text in an editor as a Token that associates with a specific language. This would work similarly to how tooltips/completion descriptions etc. work when specifying pieces of text that should be colorized as a certain language.

So in the first example, asp-antiforgery="ViewBag.ShouldRenderAntiforgery", Razor's language server would mark the entire ViewBag.ShouldRenderAntiforgery range with a Razor-specific token type whose TokenStyle in the theme is:

{
    "language": "csharp"
}

This would enable future scenarios where, if Razor wanted to go above and beyond and provide semantic colorization of specific tokens, it could do so in an additive fashion on top of the existing theme.
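A minimal sketch of how the two pieces above might fit together, assuming a provider that returns plain offset ranges and a theme keyed by token type. The names `HighlightRange`, `razorTagHelperExpression`, and `embeddedLanguages` are hypothetical and not part of any VS Code API, and the regex stands in for the TagHelper metadata a real Razor language server would consult.

```typescript
// Hypothetical shapes, loosely following the interfaces quoted above.
interface HighlightRange {
  start: number; // offset of the first character, inclusive
  end: number; // offset past the last character, exclusive
  tokenType: string;
}

interface TokenStyle {
  foreground?: string;
  style?: "bold" | "italic" | "underline";
  language?: string; // the proposed addition
}

// Assumed theme fragment: one Razor-specific token type mapped to C#.
const theme: Record<string, TokenStyle> = {
  razorTagHelperExpression: { language: "csharp" },
};

// Toy provider: treats every asp-* attribute value as an embedded expression.
function provideHighlightRanges(doc: string): HighlightRange[] {
  const ranges: HighlightRange[] = [];
  const attr = /\basp-[\w-]+="([^"]*)"/g;
  let m: RegExpExecArray | null;
  while ((m = attr.exec(doc)) !== null) {
    // The value sits just before the closing quote of the match.
    const valueStart = m.index + m[0].length - 1 - m[1].length;
    ranges.push({
      start: valueStart,
      end: valueStart + m[1].length,
      tokenType: "razorTagHelperExpression",
    });
  }
  return ranges;
}

// What an editor could do with the result: resolve each range to a language.
function embeddedLanguages(doc: string): { text: string; language: string | undefined }[] {
  return provideHighlightRanges(doc).map((r) => ({
    text: doc.slice(r.start, r.end),
    language: theme[r.tokenType]?.language,
  }));
}
```

The point of the indirection is that the server only names a token type; the decision that the range is colorized as C# lives in the theme entry, so more specific token styling can be layered on later.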

aeschli commented 4 years ago

Assigning to @alexandrudima who works on the new API

pgfearo commented 4 years ago

The approach of having a token covering the entire range of the embedded language won’t work universally. For example, in XSLT text-value-templates, the embedded XPath expression is terminated by }. This character can also appear unescaped within the embedded XPath language; it is the XPath syntax tree that determines when the embedded expression has terminated.

For cases like this, there needs to be a way to hand off to the embedded language server, and then for this embedded language server to hand control back to the host language.

Sufficient context also needs to be passed to the embedded language server, for example to indicate whether the text-value-template occurs within an XML CDATA section, or to differentiate it from an attribute-value-template. This context should be passed back to the host language server when returning control.

Having said this, if it were found that an embedded language-range token covered 90% of cases, this could still be valuable.
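The termination problem described above can be illustrated with a small sketch. It compares a naive "first closing brace" scan with one that skips XPath string literals and tracks nested braces. Both function names are hypothetical, and the second scanner is still only an approximation: it ignores XPath comments and is no substitute for the XPath parser the comment says is actually needed.

```typescript
// Naive approach: assume the embedded range ends at the first "}".
function naiveEnd(text: string, start: number): number {
  return text.indexOf("}", start);
}

// Slightly better: skip string literals and track nested braces.
// `start` is the offset just after the opening "{" of the template.
function scanEnd(text: string, start: number): number {
  let depth = 0;
  let i = start;
  while (i < text.length) {
    const ch = text[i];
    if (ch === '"' || ch === "'") {
      // Skip to the closing quote; a doubled quote is an escaped quote in XPath.
      i++;
      while (i < text.length) {
        if (text[i] === ch) {
          if (text[i + 1] === ch) {
            i += 2;
            continue;
          }
          break;
        }
        i++;
      }
    } else if (ch === "{") {
      depth++;
    } else if (ch === "}") {
      if (depth === 0) return i;
      depth--;
    }
    i++;
  }
  return -1; // unterminated expression
}

// Example input: the XPath string literal contains an unescaped "}".
const template = "Total: {concat('}', $n)} items";
const exprStart = template.indexOf("{") + 1;
const naiveExpr = template.slice(exprStart, naiveEnd(template, exprStart));
const scannedExpr = template.slice(exprStart, scanEnd(template, exprStart));
```

Here the naive scan cuts the expression off inside the string literal, while the scanning version returns the full `concat(...)` call. A host language server that precomputes one token for the embedded range would have to duplicate this kind of logic, which is the case for letting the embedded language report where it ends.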

arcanis commented 2 years ago

Given that the last comment was more than two years ago I'll risk a little +1 - I'm working on an extension to allow injecting pegjs syntaxes into VSCode (using the Semantic Highlight API as a kind of dynamic grammar), but the lack of embedded language support severely limits what I can do.

For example, PegJS syntaxes themselves allow snippets of JS to be registered; achieving this currently requires setting up the embedded language in the tmGrammar.json file, which prevents dynamic language embedding. I tried to work around the problem by using semanticTokenScopes to give an embedded language scope to my semantic tokens, but that didn't seem to work.
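For readers unfamiliar with the workaround mentioned above, this is roughly what such a `semanticTokenScopes` contribution looks like, written as a TypeScript object that mirrors the package.json shape. The language id `pegjs` and the token type `embeddedJs` are hypothetical; the exact values used in the extension are not given in this thread.

```typescript
// Mirrors the shape of contributes.semanticTokenScopes in an extension's package.json.
interface SemanticTokenScopes {
  language?: string;
  scopes: Record<string, string[]>;
}

const semanticTokenScopes: SemanticTokenScopes[] = [
  {
    language: "pegjs", // hypothetical language id
    scopes: {
      // hypothetical semantic token type mapped to a TextMate scope
      embeddedJs: ["meta.embedded.block.javascript"],
    },
  },
];
```

As far as I understand, this mapping is only consulted when looking up theme colors for a semantic token, so it would style the whole range uniformly rather than run a JavaScript grammar over it. That would be consistent with the result reported above.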

SteveBenz commented 2 years ago

Piling on here, as I think the overall ask is for a way to enable semantic processing of embedded languages...

What I want to add is an extension that looks at comments and tries to determine if they're worth reading or not. E.g. how many times have you seen this:

/// <summary>Gets or sets the FrobKnocker for this instance.</summary>
public FrobKnocker FrobKnocker { get; set; }

It's a comment alright, but it adds little value.

The way to implement this that seems open would be to create an injected language for comments and then have that language inspect the comment and its immediate surroundings to determine if there's any real added information there or not.

There might be other ways to implement something like that aside from language support (e.g. as a static analysis layer, perhaps?). But there are other aspects of comments that are truly language concepts, and they're effectively an injection: e.g. you can use Doxygen in any language, and there are several competitors in that space.
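As a rough illustration of the "is this comment worth reading" check described above, here is a toy heuristic, not a proposal for how an injected language would actually work. It flags a comment when every non-filler word in it already appears in the declaration next to it. The function names and the filler word list are assumptions made up for this sketch.

```typescript
// Words that carry no information in a typical property summary (assumed list).
const FILLER = new Set(["gets", "or", "sets", "the", "a", "an", "for", "this", "instance", "of"]);

// Lowercased alphanumeric words of a piece of text.
function words(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

// True when the comment only repeats words already present in the declaration.
// A comment with no informative words at all also counts as low value.
function addsLittleValue(comment: string, declaration: string): boolean {
  const declared = new Set(words(declaration));
  const informative = words(comment.replace(/<[^>]+>/g, " ")) // drop XML doc tags
    .filter((w) => !FILLER.has(w));
  return informative.every((w) => declared.has(w));
}
```

For the FrobKnocker example, the only informative word left in the summary is the property name itself, so the comment is flagged. A comment that explains when the value is null or how it is used would introduce words the declaration lacks and would pass.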

microsoft / vscode

Expanding semantic theming to support semantically embedded languages for colorization #81558
