Can I get scope / scopeRange at a position?

seanmcbreen commented 8 years ago

From @billti on November 1, 2015 6:10

The API call document.getWordRangeAtPosition(position) appears to use its own definition of a word. For example, my tmLanguage defines attrib-name as a token/scope, yet getWordRangeAtPosition appears to break this into 2 words on the - character.

How can I get token ranges at a position based on my custom syntax? (And it would be really useful if I could get the scope name that goes along with it too).

Copied from original issue: Microsoft/vscode-extensionbuilders#76

seanmcbreen commented 8 years ago

From @vilic on November 1, 2015 15:19

:+1:

seanmcbreen commented 8 years ago

From @egamma on November 2, 2015 8:16

Exposing the scope names in the API is on the backlog, but will not make it into the November update.

seanmcbreen commented 8 years ago

From @jrieken on November 2, 2015 18:16

@billti despite the lack of access to scopes, you can define your own custom word definition so that it will be picked up by document.getWordRangeAtPosition. You can register an ITokenTypeClassificationSupport, which can contribute a regex to classify words.

seanmcbreen commented 8 years ago

From @billti on November 2, 2015 19:3

Thanks @jrieken, I spotted that, and it may be a useful interim solution. But generally for now, if I want to know the classification accurately for a position in a CFG, it seems I'll need to call document.getText() and run my own parser over it - is that right?

seanmcbreen commented 8 years ago

From @jrieken on November 3, 2015 9:59

unfortunately yes

hoovercj commented 8 years ago

@egamma on November 2, 2015 8:16 Exposing the scope names in the API is on the backlog, but will not make it into the November update.

Is there any update on if/when we can expect a way to get the scopes at a position or offset?

egamma commented 8 years ago

@hoovercj all I can currently say is that this is still on the backlog, sorry.

TimonVS commented 7 years ago

@egamma Any progress on this? Is there any way I can contribute? :)

siegebell commented 7 years ago

Would it be trivial to provide a command that returns a url to the TextMate grammar file being used for a particular document/languageId (or return the contents of the file to keep them read-only)? Then we could use vscode-textmate ourselves to get the token info at a particular location.

hoovercj commented 7 years ago

@siegebell -- As a short-term solution, I have successfully included a TextMate grammar with my extension, referenced that, and referenced the built-in vscode-textmate package to get token scopes in an extension.

It's a pain and it really should be part of the API, but it's definitely possible to do today.

I was given the advice to use: var tm = require(path.join(require.main.filename, '../../node_modules/vscode-textmate/release/main.js')); to access vscode-textmate, but since I have a language server I had to evaluate require.main.filename in the language client and pass it over with the initializationOptions to get the right value in my server.

egamma commented 7 years ago

@TimonVS exposing the scopes in the API requires that we revisit the internal representation of scopes. This requires major re-architecting, which makes it challenging to open up for contributions.

siegebell commented 7 years ago

In the meantime, I've published an extension, scope-info, that provides an API to query the scope at a particular position. It works by querying the installed extensions for language definitions and grammars, and then maintains a parse-state for each open document using vscode-textmate. Only one instance will exist per vscode instance, regardless of how many other extensions depend on it.

Example usage:

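import * as vscode from 'vscode';
import * as api from 'scope-info';
async function example(doc: vscode.TextDocument, pos: vscode.Position) {
    // getExtension returns undefined when scope-info is not installed.
    const siExt = vscode.extensions.getExtension<api.ScopeInfoAPI>('siegebell.scope-info');
    if (!siExt) { return; }
    const si = await siExt.activate();
    // Token (with its scope information) at the given position.
    const t1: api.Token = si.getScopeAt(doc, pos);
}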

Notes:

For typings, refer to scope-info.d.ts.
You can also query the vscode-textmate-IGrammar and scope name of a language.
Your extension should list 'siegebell.scope-info' as an extensionDependency.
If multiple extensions contribute to the same language, scope-info may pick the wrong one.
Scope-info might return a scope corresponding to a slightly newer or older document version than what your extension thinks is current.
Pull requests are welcome.

ramya-rao-a commented 7 years ago

exposing the scopes in the API requires that we revisit the internal representation of scopes. This requires major re-architecting, which makes it challenging to open up for contributions

@alexandrudima I believe the above was done as part of #18317

@aeschli Will #18068 be covering the feature ask in this current issue or are we suggesting extension authors to use https://marketplace.visualstudio.com/items?itemName=siegebell.scope-info?

aeschli commented 7 years ago

Alex added a developer tool that lets you see the tokens at a location. See https://github.com/Microsoft/vscode/pull/17933#issuecomment-271515251

There's still no extension API that returns TextMate scopes. There are several reasons for that; one of them is that we don't want extensions to start depending on a particular tokenizer grammar.

APerricone commented 6 years ago

I think it would be enough to get the color at a position and then associate it with an applied style: string, number, keyword...

Victorious3 commented 5 years ago

This would also be very useful for me. I'm writing an extension for a custom ebnf syntax. The textmate grammar has all the information needed to provide linting, even for renaming symbols and basic syntax validation. (For this just filter the tokens by not having any scope attached -> unexpected token & syntax error)

I currently load the 'vscode-textmate' module that comes with vscode using some dirty workaround and use that to reparse open files. It's a lot of wasted CPU time and I can't easily do incremental changes. (I assume vscode already does this internally to speed up syntax highlighting)

Here's a few functions I could use:

Get token at position
Get a list of tokens for the entire file, or in a Range
Get token text, scopes & Range
Get a list of tokens filtered by scope (this can be achieved by using the above two, but could be optimized separately)
Open additional files and get them tokenized in the background (for #include directive)
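
The "filter the tokens by not having any scope attached" idea mentioned above could look roughly like the sketch below. It is only an illustration: `LineToken` mirrors the `startIndex` / `endIndex` / `scopes` token shape produced by vscode-textmate, while `findUnscopedTokens`, the sample line, and the `.ebnf` scope names are hypothetical. It assumes every token carries at least the grammar's root scope, so a token with a single scope matched no rule.

```typescript
// Token shape modelled on vscode-textmate line tokens.
interface LineToken {
    startIndex: number; // inclusive column
    endIndex: number;   // exclusive column
    scopes: string[];   // root scope first, most specific scope last
}

interface UnscopedToken {
    text: string;
    startIndex: number;
    endIndex: number;
}

// Hypothetical helper: returns non-whitespace tokens that carry only the
// root scope, i.e. text that no grammar rule recognised.
function findUnscopedTokens(lineText: string, tokens: LineToken[]): UnscopedToken[] {
    return tokens
        .filter(t => t.scopes.length <= 1)
        .map(t => ({
            text: lineText.substring(t.startIndex, t.endIndex),
            startIndex: t.startIndex,
            endIndex: t.endIndex,
        }))
        .filter(t => t.text.trim().length > 0); // whitespace is not an error
}

// Illustrative data only; the scope names are made up for this example.
const sampleLine = 'rule = "a" ?? ;';
const sampleLineTokens: LineToken[] = [
    { startIndex: 0, endIndex: 4, scopes: ['source.ebnf', 'entity.name.rule.ebnf'] },
    { startIndex: 4, endIndex: 5, scopes: ['source.ebnf'] },
    { startIndex: 5, endIndex: 6, scopes: ['source.ebnf', 'keyword.operator.ebnf'] },
    { startIndex: 6, endIndex: 7, scopes: ['source.ebnf'] },
    { startIndex: 7, endIndex: 10, scopes: ['source.ebnf', 'string.quoted.double.ebnf'] },
    { startIndex: 10, endIndex: 11, scopes: ['source.ebnf'] },
    { startIndex: 11, endIndex: 13, scopes: ['source.ebnf'] },
    { startIndex: 13, endIndex: 14, scopes: ['source.ebnf'] },
    { startIndex: 14, endIndex: 15, scopes: ['source.ebnf', 'punctuation.terminator.ebnf'] },
];

const unexpected = findUnscopedTokens(sampleLine, sampleLineTokens);
```

Here `unexpected` contains a single entry for the `??` text at columns 11 to 13, which an extension could report as an unexpected token.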

Here's my extension for some reference on how this information can be used: https://github.com/Victorious3/vscode-TatSu/blob/635d3c1351b55048feac44f09203a95f1fc0c275/server/src/parse.ts

APerricone commented 5 years ago

I don't understand why, in my language extension, I need to re-parse the whole file to know whether a character is inside a comment or a string. Other ideas:

grammar correction only for string and comments
a separate editor for escaped strings where \r and \n are converted (like language injection in IntelliJ)
regex visualizer for regex token etc etc

msftrncs commented 4 years ago

aeschli commented on Jan 19, 2017

There's still no extension API that returns TextMate scopes. There are several reasons for that; one of them is that we don't want extensions to start depending on a particular tokenizer grammar.

That's actually very sad. The grammar already did most of the work needed for making outlines, and now we have to start all over, type all the same regexes into a TypeScript module, and repeat the work to get the same data?

Extensions covering languages that don't have servers usually bring their own grammar files too, so why can't they rely on the same grammar file for both needs?

Actually I think VS Code should build the outline from the grammar scopes (for languages that don't already have a symbol provider), as it would increase the number of languages that would benefit from the outline feature. The textmate grammar system is severely underutilized.

This is in addition to the common language extensions needs, such as knowing comments and strings.

nagq commented 4 years ago

Write like this?

document.getText(document.getWordRangeAtPosition(position, /[a-zA-Z_][\-a-zA-Z0-9_]*/));
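
To illustrate why passing a custom regex helps with the `attrib-name` case from the original question, here is a minimal, self-contained sketch. It does not call the VS Code API: `wordRangeAt` is a hypothetical helper that only mimics the idea of returning the regex match that contains a given column on a single line.

```typescript
// Hypothetical helper, not part of the VS Code API.
interface WordRange {
    start: number; // inclusive column
    end: number;   // exclusive column
    text: string;
}

// Returns the match of `pattern` on `line` that contains `column`.
function wordRangeAt(line: string, column: number, pattern: RegExp): WordRange | undefined {
    // Scan with a fresh global regex so lastIndex state never leaks.
    const flags = pattern.flags.indexOf('g') >= 0 ? pattern.flags : pattern.flags + 'g';
    const re = new RegExp(pattern.source, flags);
    let match: RegExpExecArray | null;
    while ((match = re.exec(line)) !== null) {
        if (match[0].length === 0) {
            re.lastIndex++; // avoid an infinite loop on empty matches
            continue;
        }
        const start = match.index;
        const end = start + match[0].length;
        if (column >= start && column <= end) {
            return { start, end, text: match[0] };
        }
        if (start > column) {
            break; // matches are ordered, so later ones cannot contain column
        }
    }
    return undefined;
}

const line = '<div attrib-name="x">';
// A plain identifier pattern stops at the hyphen.
const plainWord = wordRangeAt(line, 8, /[A-Za-z_][A-Za-z0-9_]*/);
// The pattern suggested above keeps hyphenated names together.
const hyphenWord = wordRangeAt(line, 8, /[a-zA-Z_][\-a-zA-Z0-9_]*/);
```

With the cursor at column 8, the plain pattern yields `attrib`, while the hyphen-aware pattern yields `attrib-name`. This only addresses word boundaries; it does not provide scope names.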

jchtt commented 3 years ago

I just ran into this issue when trying to extend auto-correct to behave in a smart way depending on the current cursor environment. Hence, I would love to see this functionality as well!

anthony-c-martin commented 3 years ago

Are there any possible workarounds to get this working with the extension test host? I'd love to be able to write an end-to-end test to validate semantic highlighting is working, but couldn't find a way.

jasonwilliams commented 3 years ago

I'd love to be able to write an end-to-end test to validate semantic highlighting is working, but couldn't find a way.

An example I've seen is here: https://github.com/styled-components/vscode-styled-components/blob/master/src/tests/suite/colorization.test.js

The idea is that it has a fixture file, then calls captureSyntaxTokens and validates the output against a predefined results file. I'm not sure if there are more efficient ways, but it works as an end-to-end test for syntax highlighting. I don't know if this changes for semantic highlighting.

anthony-c-martin commented 3 years ago

An example I've seen is here: https://github.com/styled-components/vscode-styled-components/blob/master/src/tests/suite/colorization.test.js

Thanks for the suggestion! It looks like this works well for testing TextMate grammars, but unfortunately not for semantic tokenization, as it invokes the grammar directly.

ImUrX commented 3 years ago

This feature would be so useful, and this issue is one of the oldest that's still open.

ghost commented 2 years ago

I recently wrote vscode-textmate-languageservice precisely to exploit TextMate tokens in providers such as folding, outline/TOC, etc. Unfortunately the performance leaves much to be desired because the code is tokenized again - Gimly/vscode-matlab#142

universemaster commented 2 years ago

I appreciate this issue is about a vscode API.

However, are you familiar with https://github.com/draivin/hscopes ?

A meta-extension for vscode that provides TextMate scope information. Its intended usage is as a library for other extensions to query scope information.

and

This extension provides an API by which your extension can query scope & token information.

ghost commented 2 years ago

I need a dump of all the tokens in a document tbh. The information is there, it just needs to be exposed in a sane manner.

ghost commented 2 years ago

For what it's worth I have hooked into the native module using Microsoft's getCoreNodeModule trick. It works! But is also slow and retokenizes the entire document - https://github.com/SNDST00M/vscode-textmate-languageservice/blob/v0.2.2/src/util/getCoreNodeModule.ts

savetheclocktower commented 2 years ago

There's still no extension API that returns TextMate scopes. There are several reasons for that; one of them is that we don't want extensions to start depending on a particular tokenizer grammar.

I feel silly responding to a four-year-old comment, but it is the last “official” word on this issue, so here I go.

VSCode is the third major code editor to borrow TextMate's grammar system, and I wonder if they all thought that its scope names were simply an implementation detail of its grammars. Quite the opposite — a lot of thought went into this system, since it was also used as the basis for much of TextMate's customizability.

Scope names aren't just hooks for syntax highlighting. TextMate commands are tightly woven to the semantics of scope names. You can have the same key combination perform different commands based on scope, so that your command doesn't monopolize a hotkey for something that (say) only works when the cursor is within a string. Conversely, you can define a command that behaves identically across different languages because it hooks into the presence of a generic scope name. This is how TextMate recognizes URLs across multiple contexts — inside HTML files, inside Markdown files, inside code comments regardless of language — and implements a single “open this URL in my browser” command that works identically across all of them.

Even after moving from TextMate to Atom several years ago, I was able to keep almost all of my ornery customizations because Atom allowed me to inspect scopes at the cursor. I define a command in my Atom init-file whose only purpose is to interpret Enter on my num pad and delegate it to one of three other commands based on what scope I'm in. If I migrate to VSCode in the future, it'll be a reluctant migration if this issue is still open.

TextMate's scope naming conventions are middleware. VSCode could move to tree-sitter grammars tomorrow without breaking anyone's syntax coloring themes; it'd just need to map tree-sitter token names to existing scope names. If a “get all scopes at position X” API existed in VSCode, and I relied upon it when writing an extension, that extension would keep working in a future version of VSCode that no longer supported TM-style grammars but kept TM's scope naming conventions.

You may think, “Yes, but we don't want to make these naming conventions permanent! That's the whole point!” To which I'd ask: what would you replace them with, and why? Is there something that the existing naming conventions can't do? Is there a compelling reason to invent something new that would justify the amount of community effort it would take to adopt a new scheme? Would it make a migration toward tree-sitter grammars easier or harder if syntax coloring themes had to support two different naming systems at once?

As the comments on this issue illustrate, not having this functionality doesn't remove the need for extensions to know this information; it just means those extensions have to use imperfect workarounds. And it results in tighter coupling to TextMate grammars than would otherwise be necessary, since those workarounds need to know the grammar's implementation details to reproduce the result.

I hope you'll consider this feature request sometime soon; it'd be a huge customizability win.

tjx666 commented 2 years ago

If I understand correctly, extension authors could use this to implement features like hover tips without AST parsing. AST parsing is really expensive sometimes.

sandipchitale commented 2 years ago

I have implemented an extension:

Show scopes at cursor in active editor

showing how to use API exposed by:

HyperScopes

Hope this helps someone.

m-paternostro commented 2 years ago

Another informal vote for this feature, if I may...

In our case, we want to correlate the content from the editor to an extremely rich repository of runtime-produced information. Without the minimum understanding of the code (precisely what symbols/scope/tokens would provide), such a correlation is faulty way more often than acceptable.

Zxynine commented 2 years ago

I would love this feature too. I keep ending up here trying to find a way to know for sure whether a given position is in a comment; it seems that without regexes and reading a language configuration file there is no clear way to know that. Even if I weren't just looking for a way to know whether something is within a comment, this feature is still something I want to see implemented, and it would make a world of difference to many extensions.
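
If the scope stack at a position were available (for example from one of the workarounds mentioned in this thread), the comment check itself would be small. The sketch below assumes the grammar follows the common TextMate naming conventions, where comment scopes start with `comment` and string scopes with `string`; grammars are not required to do so. The helper names are hypothetical, and the matching is deliberately simplified (no descendant selectors or exclusions).

```typescript
// Simplified selector matching: the selector "comment" matches the scope
// "comment" itself and any dotted descendant such as "comment.line.x".
function scopeMatches(scope: string, selector: string): boolean {
    return scope === selector || scope.indexOf(selector + '.') === 0;
}

// True if any scope in the stack matches the selector.
function hasScope(scopes: readonly string[], selector: string): boolean {
    return scopes.some(scope => scopeMatches(scope, selector));
}

function isInComment(scopes: readonly string[]): boolean {
    return hasScope(scopes, 'comment');
}

function isInString(scopes: readonly string[]): boolean {
    return hasScope(scopes, 'string');
}

// Illustrative scope stacks, outermost scope first.
const commentScopes = ['source.js', 'comment.line.double-slash.js'];
const stringScopes = ['source.js', 'string.quoted.double.js'];

const commentCheck = isInComment(commentScopes); // comment stack
const stringCheck = isInComment(stringScopes);   // string stack, not a comment
```

Matching on the dotted prefix rather than a plain substring avoids false positives for scope names that merely contain the word, such as a hypothetical `commentary.foo`.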

pelmers commented 1 year ago

Like @Zxynine, I am also looking for a way to tell whether a selection in the editor is a comment. Apparently we are not the only ones. I found this commit in Better Comments which defines the line comment format for many languages: https://github.com/aaron-bond/better-comments/pull/302/commits/47717e7ddcf110cb7cd2a7902ccc98ab146f97a5

So that's one way to implement this (at least for line comments), though not my favorite.

I have also tried the API exposed by Hyperscopes (https://github.com/draivin/hscopes), but I experienced multi-second freezes of the extension host even when editing very modestly sized files.

Perhaps tree-sitter would be fast enough to parse files without delay. I see the extensions https://github.com/georgewfraser/vscode-tree-sitter and https://github.com/EvgeniyPeshkov/syntax-highlighter import web-tree-sitter (wasm-compiled tree sitter modules) to provide syntax highlighting.

Of course these are all workarounds, and I think VS Code should provide access to this information. It knows it already, after all!

I agree with @savetheclocktower that the only given justification doesn't seem adequate. Can we quantify the risk? How big of a change has happened to Textmate grammars in the last 7 years?

lukstafi commented 1 year ago

If vscode.provideDocumentRangeSemanticTokens was outputting all tokens for most languages, it would satisfy many of the needs discussed here. But from my limited experiments, it only outputs "interesting" tokens, it doesn't output tokens for comments, string literals, operators.
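
For readers experimenting with that command, the sketch below shows one way to decode its result. It assumes the data uses the standard semantic-token encoding of five integers per token (delta line, delta start character, length, token type index, modifier bit set) and that a legend of type and modifier names is available; here both are passed in as plain arrays. `decodeSemanticTokens` and `DecodedToken` are hypothetical names, not VS Code API.

```typescript
interface DecodedToken {
    line: number;
    startChar: number;
    length: number;
    type: string;
    modifiers: string[];
}

// Decodes the relative 5-integer encoding into absolute positions.
function decodeSemanticTokens(
    data: ArrayLike<number>,
    tokenTypes: string[],
    tokenModifiers: string[],
): DecodedToken[] {
    const result: DecodedToken[] = [];
    let line = 0;
    let char = 0;
    for (let i = 0; i + 4 < data.length; i += 5) {
        const deltaLine = data[i];
        const deltaChar = data[i + 1];
        line += deltaLine;
        // The start character is relative only within the same line.
        char = deltaLine === 0 ? char + deltaChar : deltaChar;
        const modifierBits = data[i + 4];
        result.push({
            line,
            startChar: char,
            length: data[i + 2],
            type: tokenTypes[data[i + 3]] || 'unknown',
            modifiers: tokenModifiers.filter((_, bit) => (modifierBits & (1 << bit)) !== 0),
        });
    }
    return result;
}

// Illustrative legend and data: three tokens on lines 2, 2 and 5.
const legendTypes = ['property', 'type', 'class'];
const legendModifiers = ['private', 'static'];
const encoded = [2, 5, 3, 0, 3, 0, 5, 4, 1, 0, 3, 2, 7, 2, 0];

const decoded = decodeSemanticTokens(encoded, legendTypes, legendModifiers);
```

As the comment above notes, positions that have no semantic token (comments, string literals, operators in those experiments) simply do not appear in the decoded list, so this does not replace access to TextMate scopes.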

Fred-Vatin commented 1 year ago

Seven years this issue is open.

SEVEN YEARS !!!

I don’t expect this will be fixed soon. We’ll have to learn to live with it or move to another IDE.

PoetaKodu commented 1 year ago

Great. I cannot access information about my documents that are there, hidden behind VS Code wall. How is this a thing in 2022? Admins, do not ignore that please.

zm-cttae-archive commented 1 year ago

How to turn language & grammar contributions into IGrammar parser from context and language-id alone:
- 97af23cd/src/index.ts#L36-51
- 97af23cd/src/services/resolver.ts
How to tokenize a document in a performant, async way and then cache it: tokenizer.ts snippet

Just to be clear - if you want one line scope and scopeRange, there is HyperScopes. If you need full document use this. The code is significantly more complex because we need browser support, promise caching and cross-env resource hashing for that.

Also, I chose not to use the browser streaming compiler for onig.wasm, I used webpack instead. It's a very different approach from the monaco-tm repo.

If you use fetch you still need ${vscode.env.appRoot}/blablabla

zm-cttae-archive commented 1 year ago

I need to update my code examples to work on web because only fetch works for wasm on web

iCSawyer commented 1 year ago

EIGHT years! Eight years after eight years, do you know how I've spent the last eight years? Why is it so difficult to provide them?

zm-cttae-archive commented 1 year ago

I have solved this issue for myself and any language extension authors that can pass vscode.ExtensionContext.

EDIT: Some folk at the extension development slack want TypeScript support - I just released it this week.

There is a quite fast full-document tokenization API in vscode-textmate-languageservice - I haven't put JSDoc on it, but it's stable and I'll never want/need to change the API shape.

You need to set up your contributes to wire up a language and its grammar:

Language contribution: vscode-textmate-languageservice@e71fd80fbda0108ed4b6fda89a3450a902fa7397/package.json • line 44 to 48
Grammar contribution: vscode-textmate-languageservice@e71fd80fbda0108ed4b6fda89a3450a902fa7397/package.json • line 31 to 40

Then get our tokens:

import * as vscode from 'vscode';
import TextmateLanguageService from 'vscode-textmate-languageservice';

export async function activate(context: vscode.ExtensionContext) {
    const selector: vscode.DocumentSelector = 'matlab';
    // Language ID of the grammar wired up in the contributions above.
    const lsp = new TextmateLanguageService('matlab', context);
    const tokenService = await lsp.initTokenService();
    // Assumes an editor is open; activeTextEditor is undefined otherwise.
    const activeTextDocument = vscode.window.activeTextEditor!.document;
    const tokens = tokenService.fetch(activeTextDocument);
}

It works in the browser and can do huge files quite quickly too. File hashing + caching is built in also.

There is a compulsory configuration which serves to enhance the results and generate folding level data.
If you are lazy, make it {} at ./textmate-configuration.json and it'll still work.

You can write your own scope and scopeRange functions by using startIndex endIndex and line properties. The line property is zero-indexed FWIW (~~the way real API line numbers should be 😉~~)
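
A rough sketch of such `scope` and `scopeRange` helpers follows. It is an assumption-laden illustration rather than the library's API: the token shape uses only the `line`, `startIndex` and `endIndex` properties named above plus an assumed `scopes` array, and `SketchToken`, `tokenAt`, `scopeAt` and `scopeRangeAt` are hypothetical names. The sample scope names are illustrative.

```typescript
// Assumed token shape; `scopes` is an assumption for this sketch.
interface SketchToken {
    line: number;       // zero-indexed line
    startIndex: number; // inclusive column
    endIndex: number;   // exclusive column
    scopes: string[];   // outermost scope first
}

interface SketchPosition {
    line: number;
    character: number;
}

// The token covering `pos`, or undefined if none does.
function tokenAt(tokens: SketchToken[], pos: SketchPosition): SketchToken | undefined {
    for (const t of tokens) {
        if (t.line === pos.line && pos.character >= t.startIndex && pos.character < t.endIndex) {
            return t;
        }
    }
    return undefined;
}

// The innermost (most specific) scope at `pos`.
function scopeAt(tokens: SketchToken[], pos: SketchPosition): string | undefined {
    const token = tokenAt(tokens, pos);
    return token ? token.scopes[token.scopes.length - 1] : undefined;
}

// The column range of the token at `pos`.
function scopeRangeAt(tokens: SketchToken[], pos: SketchPosition) {
    const token = tokenAt(tokens, pos);
    return token
        ? { line: token.line, start: token.startIndex, end: token.endIndex }
        : undefined;
}

// Illustrative tokens for two lines of a MATLAB-like file.
const sampleTokens: SketchToken[] = [
    { line: 0, startIndex: 0, endIndex: 4, scopes: ['source.matlab'] },
    { line: 0, startIndex: 4, endIndex: 8, scopes: ['source.matlab', 'string.quoted.double.matlab'] },
    { line: 1, startIndex: 0, endIndex: 6, scopes: ['source.matlab', 'comment.line.percentage.matlab'] },
];

const scopeAtCursor = scopeAt(sampleTokens, { line: 0, character: 5 });
const rangeAtCursor = scopeRangeAt(sampleTokens, { line: 0, character: 5 });
```

A linear scan is enough for a sketch; for large documents, grouping tokens by line first would avoid walking the whole array on every lookup.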

Enjoy!

zm-cttae-archive commented 1 year ago

Tokenization of TypeScript and any grammar (without having to set up configuration) is now available!

I used the textmate-languageservice-contributes key to replace contributes so we don't override the existing language contribution.

vsce-toolroom/vscode-textmate-languageservice@v1.2.1/README.md #tokenization

zm-cttae commented 10 months ago

https://github.com/vsce-toolroom/vscode-textmate-languageservice/releases/tag/v2.0.0

Add getTokenInformationAtPosition method for fast positional token polyfill: vscode.TokenInformation.
Add getScopeInformationAtPosition method to get Textmate token data: TextmateToken.
Add getScopeRangeAtPosition method to get token range: vscode.Range.
Add getLanguageConfiguration method for language configuration: LanguageDefinition.
Add getGrammarConfiguration method to get language grammar wiring: GrammarLanguageDefinition.
Add getContributorExtension method to get extension source of language ID: vscode.Extension.

Please star the project on GitHub if you think there is further use you could make of it.

zm-cttae commented 10 months ago

@alexdima seeing as this has been solved by an external library and the internal proposed API, will this be closed?

microsoft / vscode

Can I get scope / scopeRange at a position? #580