microsoft / language-server-protocol

Defines a common protocol for language servers.
https://microsoft.github.io/language-server-protocol/

Add ability for language server to request handling part of file as different language #1252

Closed: dancojocaru2000 closed this issue 2 years ago

dancojocaru2000 commented 3 years ago

Hello! I want to propose the ability for a language server to instruct a client to handle part of a file as if it's another language.

The biggest use case would be embedding a language like HTML in a constant string in another language. That way, code completion and similar features would be provided for the embedded language inside the string. (See the examples at the end.)

The way the user informs the language server that this is desired would be language-specific. In a dynamic language, a comment could be used; in a static language, something like annotations could be used.

The client could potentially handle this by treating the part of the file with different language as a separate "virtual" file, and allowing the other language server to operate on the "virtual" file.

Optionally, arguments could be specified in order to provide useful info to the other language server. An example would be where to find the JSON Schema for a JSON string.

Example in pseudo-C#:

public class Test 
{
    [StringLanguage("html")]
    public const string TEMPLATE = "<div>Test</div>";
}

Example in JavaScript:

/** @language css */
var test = "#element { text-align: center; }"

document.getElementById("yes").innerHTML = "<span>JS lang server could know that innerHTML is an HTML string and instruct the client accordingly</span>"

Example in Ruby:

# @language json(schema: "url to schema")
query = <<-JSON
{
  "this": "json",
  "is auto": "completed based on",
  "the schema": [
    "provided",
    "as parameter"
  ] 
}
JSON
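
To make this concrete, here is a rough sketch of what the server-to-client message could look like, assuming a hypothetical embeddedLanguage notification; none of these names or fields exist in LSP today, and the ranges are illustrative only:

import { Range, TextDocumentIdentifier } from 'vscode-languageserver-types';

// Hypothetical notification payload: the server marks regions of an open
// document that the client should treat as another language.
interface EmbeddedLanguageRegion {
    range: Range;            // part of the host document to treat as embedded
    languageId: string;      // e.g. "html", "css", "json"
    options?: { [key: string]: unknown };  // extra info, e.g. a JSON Schema URL
}

interface EmbeddedLanguagesParams {
    textDocument: TextDocumentIdentifier;
    regions: EmbeddedLanguageRegion[];
}

// Example payload for the Ruby heredoc above (URI and line numbers are illustrative):
const example: EmbeddedLanguagesParams = {
    textDocument: { uri: 'file:///project/query.rb' },
    regions: [{
        range: { start: { line: 2, character: 0 }, end: { line: 10, character: 0 } },
        languageId: 'json',
        options: { schema: 'url to schema' }
    }]
};

The client could then create the "virtual" document for that range and route requests inside it to the JSON language server.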
mikesamuel commented 3 years ago

Here are some use cases and corner cases around nested language handling, along with a stalking-horse proposal to stimulate discussion. I'm sure there are others.

Maybe the following flow would work:

  1. Client receives metadata associating token types with nested languages.
    "nestedLanguages": {
       "someTokenType":  { "languageId": "nested-language-id" },
       ...
    }
  2. Client receives token stream, either by applying a contributed grammar or from a semantic tokens response.
  3. Client issues a decode-nested-language-text request:
    { "method": "decode-nested-language-text",
      "text": "foo &amp;= bar"
    }
  4. Server responds with the decoded text. If the nested language metadata did not specify a language id, the server may specify one here; otherwise language identification would fall back to first-line detection, etc.
    {
      "text": "foo &= bar",
      "decodePositionMap": { ... },
      "languageId": "different-nested-language-id"
    }
  5. An untitled document is created with the decoded text, and as long as the nested language id is associated with a server that can handle untitled documents, it is parsed as normal (see the sketch below).
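
For reference, in a VS Code-based client step 5 could be as simple as the following sketch (the decoded text and language id would come from the decode response above):

import * as vscode from 'vscode';

// Step 5 sketch: open the decoded nested-language text as an untitled document
// so whatever server is registered for that language id can parse it as usual.
async function openDecodedAsUntitled(decodedText: string, languageId: string): Promise<vscode.TextDocument> {
    return vscode.workspace.openTextDocument({ language: languageId, content: decodedText });
}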

If a set of documents contains multiple nested language tokens that decode to the same textual content, a client may cache the results of decode-nested-language-text requests so that edits to a nesting document that do not affect the textual content of a nested language token do not cause new decode requests.
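
A minimal sketch of such a cache, keyed by the encoded token text; the request name follows the strawman above and is not an existing LSP method:

import { LanguageClient } from 'vscode-languageclient/node';

interface DecodeResult {
    text: string;                  // decoded nested-language text
    decodePositionMap?: number[];  // packed position data, see the next section
    languageId?: string;
}

// Cache keyed by the encoded token text: edits elsewhere in the nesting
// document leave the key unchanged, so no new decode request is needed.
const decodeCache = new Map<string, DecodeResult>();

async function decodeNestedText(client: LanguageClient, encoded: string): Promise<DecodeResult> {
    const cached = decodeCache.get(encoded);
    if (cached) {
        return cached;
    }
    const result = await client.sendRequest<DecodeResult>('decode-nested-language-text', { text: encoded });
    decodeCache.set(encoded, result);
    return result;
}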

Decoded position remapping

Actions that highlight text or move the cursor may need to work through nested languages.

<span onclick="// line comment&#10;console.log(&quot;Hello, World!&quot;)">

The text of the onclick attribute above might decode to

// line comment
console.log("Hello, World!")

Simple operations like pairing parentheses require mapping token positions in a nested code document to actual positions in the nesting document.

One way to handle this is for decode-nested-language-text responses to include, at a minimum, a mapping from the positions of character sequences that do not decode to exactly one character to the number of characters they decode to. These could be packed into an int[] using a scheme similar to the semantic tokens data.

// line comment&#10;console.log(&quot;Hello, World!&quot;)

The example above might produce a decodePositionMap of [0, 14, 5, 1, 0, 15, 6, 1, 0, 19, 6, 1], since only three character sequences (the &#10; and the two &quot;) decode to a different number of characters than they occupy in the nesting document.
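
As an illustration only, here is how a client might unpack such an array and shift a position from decoded coordinates back to the nesting document, assuming each group of four integers is (delta line, delta start character, encoded length, decoded length); that layout is an assumption for the sketch, not something pinned down above:

// Assumed packing, by analogy with semantic tokens data: one group of four
// integers per character sequence that does not decode to exactly one character.
interface MappedSpan {
    line: number;           // line within the nested token text
    startChar: number;      // start character of the encoded sequence
    encodedLength: number;  // characters occupied in the nesting document
    decodedLength: number;  // characters produced in the decoded text
}

function unpackDecodePositionMap(data: number[]): MappedSpan[] {
    const spans: MappedSpan[] = [];
    let line = 0;
    let startChar = 0;
    for (let i = 0; i < data.length; i += 4) {
        // Deltas are relative to the previous entry, like semantic tokens.
        line += data[i];
        startChar = data[i] === 0 ? startChar + data[i + 1] : data[i + 1];
        spans.push({ line, startChar, encodedLength: data[i + 2], decodedLength: data[i + 3] });
    }
    return spans;
}

// Map a character offset in the decoded text back to an offset in the encoded
// token text (single-line case only, to keep the sketch short).
function decodedToEncodedOffset(spans: MappedSpan[], decodedOffset: number): number {
    let shift = 0;  // extra characters the encoding has used so far
    for (const span of spans) {
        const decodedStart = span.startChar - shift;
        if (decodedOffset <= decodedStart) {
            break;
        }
        shift += span.encodedLength - span.decodedLength;
    }
    return decodedOffset + shift;
}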

Reencoding substrings

Actions that edit nested text, like refactoring, may need to re-encode text.

For example, in

<button onclick='console.log("Hello")'>

a change that applies lint rules to normalize quotes in console.log("Hello") to console.log('Hello') might need to re-encode so that the HTML becomes

<button onclick='console.log(&#39;Hello&#39;)'>
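
A sketch of the re-encoding step this implies for single-quoted attribute values; which characters must become references depends on the quoting context:

// Re-encode edited script text for a single-quoted HTML attribute value.
// '&' is escaped first so existing character references are not corrupted.
function encodeForSingleQuotedAttribute(decoded: string): string {
    return decoded
        .replace(/&/g, '&amp;')
        .replace(/'/g, '&#39;')
        .replace(/</g, '&lt;');
}

// encodeForSingleQuotedAttribute("console.log('Hello')")
//   returns "console.log(&#39;Hello&#39;)"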

It is not always straightforward to re-encode program text, so re-encoding requests may fail, as when removing the parentheses around the array access below: the ]] would merge with the following > and be interpreted as ]]>, the end of the CDATA section in the embedding document, instead of as tokens in the embedded document.

<svg><script>//<![CDATA[
if ((arr[arr[i]])>0) { ... }
//]]></script></svg>

It is probably not possible to re-encode with minimal changes in all cases, as in data:image/svg+xml;base64,..., where the nested content is textual but passes through a base64 transform, so any re-encoding that changes the decoded text's length will change most of the encoded characters that follow.

Semantically significant file/position metadata

How do chunks of nested language text that use macros like cpp's __FILE__ and __LINE__ or Swift's #line, which depend on file name and position information, interact with untitled documents (step 5 above)?

Should something allow attaching the position of the nested language token to the untitled document?

Does this require lots of re-parsing on inserts into the embedding document before the nested language token?
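
One possible shape for such an attachment, purely as a sketch; nothing like it exists in LSP today:

import { Range } from 'vscode-languageserver-types';

// Hypothetical metadata attached to the untitled document from step 5, so a
// nested-language server could answer __FILE__ / #line style questions
// relative to the nesting document rather than the untitled one.
interface NestedDocumentOrigin {
    hostUri: string;         // URI of the nesting document
    hostRange: Range;        // where the nested token sits in the host
    hostLanguageId: string;  // language of the nesting document
}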

mickaelistria commented 3 years ago

FWIW, in Eclipse IDE, one can define a derived ProjectionDocument from a master document. This allows mapping subparts of the document; it is typically used for folding, but IIRC some tools used to leverage it for e.g. SQL assistance in .java files.

One possibility, instead of sending annotations, would be for the LSP to specify that a server can send "projected" documents that consist of subparts of the master document, plus a mapping and some info about the language, and that the client process such projected documents with the appropriate language server. One benefit is that the LS could decide to treat blocks independently or together (e.g. if a declaration in one of the blocks can be used by some other blocks), basically doing a 1-1 or 1-N mapping between documents and blocks.

One difficulty would be how to express derived/projected documents as URIs, since URIs are the only thing LSP understands. I imagine it could be some extension to the existing TextDocumentItem.
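
A sketch of what such a projected document might look like if it were expressed as an extension of TextDocumentItem; the URI scheme and field names are invented for illustration:

import { Range, TextDocumentItem } from 'vscode-languageserver-types';

// One contiguous block of the master document contributing to the projection.
interface ProjectedSegment {
    masterRange: Range;     // region in the master document
    projectedRange: Range;  // where that region lands in the projected text
}

// Hypothetical extension: a server-announced projection of a master document.
interface ProjectedTextDocumentItem extends TextDocumentItem {
    // e.g. uri: "lsp-projection://encoded-master-uri/sql-block-0" (invented scheme)
    masterUri: string;              // the master document this projects from
    segments: ProjectedSegment[];   // 1-1 or 1-N mapping of blocks to documents
}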

mikesamuel commented 3 years ago

@mickaelistria

One possibility, instead of sending annotations

I think the reference to annotations in the original was an example of syntactic cues an embedding language might use to indicate an embedding. I don't think it was a suggestion about how different LSP agents communicate.

mikesamuel commented 3 years ago

@mickaelistria

I think your larger point that the LS for the embedding document is the source of the embedding relationship is a good one.

It seems like the kind of thing that might not be realized until later stages. For example, a DSL might only be recognized as such after imports are resolved, as in

// javascript
import { someDomainSpecificLanguage } from '...';

let x = someDomainSpecificLanguage`
    source in domain specific language here
`;

The embedding relationship may only be apparent after the langserver has some information about the imported identifier someDomainSpecificLanguage.

workspace/semanticTokens/refresh should suffice to cause a re-request of token information where a run of whole tokens corresponds to a block.
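
For example, once the server has resolved the import and knows that someDomainSpecificLanguage tags a DSL, it can ask clients to re-pull tokens; a minimal sketch using vscode-languageserver:

import { createConnection, ProposedFeatures } from 'vscode-languageserver/node';

const connection = createConnection(ProposedFeatures.all);

// After import resolution reveals the embedded DSL, ask the client to
// re-request semantic tokens so the DSL block can be re-tokenized.
async function onImportsResolved(): Promise<void> {
    await connection.sendRequest('workspace/semanticTokens/refresh');
}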

dbaeumer commented 2 years ago

I will close the issue (but feel free to continue the discussion). I think LSP should not promote one model for how to do this; both forwarding and embedded services are valid solutions.

What we are working on in LSP is to

This will allow servers to implement a forwarding model for embedded languages. However, it will not be a general solution for embedded languages.