microsoft / vscode

Visual Studio Code
https://code.visualstudio.com
MIT License

Please give extensions access to native AST tokens (At least partially) #177452

Open leodevbro opened 1 year ago

leodevbro commented 1 year ago

This feature request is very similar to this: https://github.com/microsoft/vscode/issues/131062

But now I want to reintroduce it with updated and more carefully thought-out reasoning.

Even though you might think this feature request is too much (too unrealistic, too hard) to implement, I think it's worth trying, for the reasons below:

1] Many VS Code extensions already need some access to the AST of the code in the editor. They have no native access, so they use third-party (or custom) libraries. When language syntaxes get updated, they need to find new libraries (if the current ones don't keep up with the updates). With so many languages, this process will sooner or later become unmanageable. VS Code will probably always have the latest tokenizers/parsers natively, so let's just give extension developers access to those native AST tokens (at least partially, if not the full AST).

2] One good example of such an extension is my own creation, Blockman, which already has 117K installs, a number that grows by 100-200 every day. Blockman (Video intro) is well loved by many people, and many have also expressed a strong desire to see this extension in other IDEs. This is a very good indicator that the visual help Blockman provides is something many developers want.

3] Extensions will be able to run in the web (browser) environment. Many extensions like Blockman can run only in a Node.js environment, because some third-party libraries do not work in the browser.

4] Yes, this feature will probably be heavy, and extensions using it will most likely slow down the entire VS Code IDE, because parsing and providing tokens is a heavy process. However, Blockman debounces its tokenization work by 1.2 seconds by default, so it does not freeze the UI, and that seems acceptable. (For extensions like Blockman, I don't think a super real-time reaction to text changes is necessary; 1.2 seconds of debouncing seems good enough.) Many people have used Blockman for many months and still love it, and they express this with amazing comments.

These comments are probably only 1% of the amazing comments I have received from many devs. [image]

5] This VS Code API feature (providing the AST of any file in any language to extensions) will open many doors for new ideas that make the development environment much easier, smoother, and more pleasurable.

6] Get file AST: Maybe VS Code can provide a smart incremental AST, so that when the user changes a little text in a big file, the host does not re-tokenize the entire file but does smart incremental tokenization and then returns the AST object to the extension, making the user experience faster and smoother. Right now Blockman has no smart incremental tokenization of any kind, and it is still usable and still loved by many people. Think about how loved it would be with smart incremental tokenization.
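
The 1.2-second debouncing mentioned in point 4 can be sketched roughly as follows. This is a minimal illustration, not Blockman's actual code: `debounce`, `retokenize`, and `retokenizeDebounced` are hypothetical names, and the wait value simply mirrors the default quoted above.

```typescript
// Hypothetical helper, not taken from Blockman or from the VS Code API.
// Returns a wrapper that delays `fn` until `waitMs` ms pass without another call.
function debounce<A extends unknown[]>(
  fn: (...args: A) => void,
  waitMs: number,
): (...args: A) => void {
  let timer: ReturnType<typeof setTimeout> | undefined;
  return (...args: A): void => {
    if (timer !== undefined) {
      clearTimeout(timer); // a newer edit arrived, so drop the pending run
    }
    timer = setTimeout(() => {
      timer = undefined;
      fn(...args);
    }, waitMs);
  };
}

// Assumed usage: `retokenize` stands in for whatever expensive parsing an extension does.
let runs = 0;
const retokenize = (_text: string): void => {
  runs += 1;
};
const retokenizeDebounced = debounce(retokenize, 1200);

// Three rapid edits schedule a single tokenization pass, 1.2 s after the last edit.
retokenizeDebounced('a');
retokenizeDebounced('ab');
retokenizeDebounced('abc');
```

In an extension, the debounced function would typically be called from a `vscode.workspace.onDidChangeTextDocument` listener, so typing bursts collapse into one pass.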

When I say "at least partially, if not full AST", I mean at least the locations (positions -> line, column) of brackets (curly/square/round). Of course not the brackets inside comments/strings.

After that, it would also be good to have the locations of template string tokens:

`

Then it would also be good to have the locations of simple string tokens:

" and '

Then HTML/XML/JSX/TSX tags. Then Python/Yaml indent/dedent tokens.
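
To make the bracket request concrete, here is a rough sketch of what an extension could do if it received scope-annotated tokens. The `ScopedToken` shape, the helper names, and the assumption that comments and strings carry `comment.*` and `string.*` scopes (the usual TextMate naming convention) are illustrative only; none of this is an existing VS Code API.

```typescript
// Assumed token shape: `startIndex` is the column where the token starts on `line`.
interface ScopedToken {
  text: string;
  line: number;
  startIndex: number;
  endIndex: number;
  scopes: string[];
}

interface BracketPosition {
  char: string;
  line: number;
  column: number;
}

const BRACKETS = new Set(['{', '}', '[', ']', '(', ')']);

// True when any scope on the token marks it as part of a comment or a string.
function isInsideCommentOrString(scopes: string[]): boolean {
  return scopes.some(
    (scope) =>
      scope === 'comment' ||
      scope.startsWith('comment.') ||
      scope === 'string' ||
      scope.startsWith('string.'),
  );
}

// Collects (line, column) for every bracket that is real code.
function findBracketPositions(tokens: ScopedToken[]): BracketPosition[] {
  const positions: BracketPosition[] = [];
  for (const token of tokens) {
    if (isInsideCommentOrString(token.scopes)) {
      continue;
    }
    // A token may hold more than one character, so check each of them.
    for (let i = 0; i < token.text.length; i++) {
      const char = token.text[i];
      if (BRACKETS.has(char)) {
        positions.push({ char, line: token.line, column: token.startIndex + i });
      }
    }
  }
  return positions;
}
```

A filter this naive would also drop brackets in code embedded in a template string if the grammar keeps a `string.*` scope on it, so a real implementation would likely need per-language rules.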

hacker-DOM commented 1 year ago

Just use Neovim - you can do exactly that - inspect that Treesitter AST of any language. Until vs code implements / starts using TreeSitter, they probably just have a rough custom parser for brackets and stuff

hacker-DOM commented 1 year ago

Vs Code is not meant for power users, and you will struggle too much to customize it like this

ccelik97 commented 1 year ago

@hacker-DOM

Just use Neovim - you can do exactly that - inspect that Treesitter AST of any language. Until vs code implements / starts using TreeSitter, they probably just have a rough custom parser for brackets and stuff

Vs Code is not meant for power users, and you will struggle too much to customize it like this

This is hardly how you get a point across to someone. He's trying to improve project A, while you're telling him to switch to project B for reasons of your own. What next, downvoting him into /dev/null on Stack Overflow? /s

Plus, the statement "VS Code is not meant for power users" is factually incorrect. I'd even argue that it's probably the tool meant for power users (after Emacs, obviously), but this isn't the place for that.

eddyg commented 1 year ago

Seems like this would also help extensions like Bracketeer which has to include and use Prism.js.

thoroc commented 1 year ago

Vs Code is not meant for power users, and you will struggle too much to customize it like this

What a weird take. I think you need to stick to Vim and let power users discuss their tool.

vscodenpa commented 12 months ago

This feature request is now a candidate for our backlog. The community has 60 days to upvote the issue. If it receives 20 upvotes we will move it to our backlog. If not, we will close it. To learn more about how we handle feature requests, please see our documentation.

Happy Coding!

vscodenpa commented 12 months ago

:slightly_smiling_face: This feature request received a sufficient number of community upvotes and we moved it to our backlog. To learn more about how we handle feature requests, please see our documentation.

Happy Coding!

zm-cttae commented 5 days ago

Incremental parsing is challenging because tokenization is expensive and concurrent passes are highly likely while editing on the fly. If there were a way to make blocking tokenization reliably sync with documents, incremental updates would be much easier for a userland library 😅

I figured out TextMate token queries and caching in a library I wrote. If this interests you, the package lives at @vsce-toolroom

leodevbro commented 4 days ago

@zm-cttae, Wow, I am testing it now, and it really does seem to tokenize documents and provide tokens like these:

import * as vscode from 'vscode';
import TextmateLanguageService from 'vscode-textmate-languageservice';
import {
  TextmateToken,
  TokenizerService,
} from 'vscode-textmate-languageservice/dist/types/services/tokenizer';

// Basic example for tokenization
export async function activate(context: vscode.ExtensionContext) {
  // Create the service for the 'typescript' language ID
  const textmateService = new TextmateLanguageService('typescript', context);
  const textmateTokenService = await textmateService.initTokenService();
  // Assumes an editor is open; activeTextEditor is undefined otherwise
  const textDocument = vscode.window.activeTextEditor!.document;
  const tokensArr: TextmateToken[] = await textmateTokenService.fetch(textDocument);
}

const exampleOfTokensArr: TextmateToken[] = [
  {
    text: '{',
    level: 0,
    line: 2,
    startIndex: 42,
    endIndex: 43,

    type: 'punctuation.definition.block.ts',

    scopes: [
      'source.ts',
      'meta.function.ts',
      'meta.block.ts',
      'punctuation.definition.block.ts',
    ],
  },
];

You are probably an angel sent from heaven.

Could you please answer these questions?

1] In the docs of the vscode-textmate-languageservice npm package, I see the chapter "Use Oniguruma WASM buffer". Do I need to learn how to use this Oniguruma thing? It seems like too advanced tech for me. I guess the vscode-textmate-languageservice package takes care of it behind the scenes, right?

2] Does your vscode-textmate-languageservice package directly use VS Code's textmate and oniguruma? Or does it use modified forks of them? I mean, when the VS Code team updates textmate and oniguruma (non-breaking changes, for example performance improvements), will your package need a manual update?

3] I don't want to implement custom grammars or custom languages and so on; I just want the tokens of a document's text, in order to give Blockman food to eat and draw blocks. So, the question is: Is textmateTokenService.fetch(textDocument) enough for me to use? Or do I need some other things too for Blockman?

4] Do you think your package provides tokens correct enough to use in production? Are there any gotchas or special edge cases?

zm-cttae commented 3 days ago

  1. No, you don't need to learn it; the library handles it as expected.
  2. Yes, it uses unmodified dependencies.
  3. Yes, no config is needed for that interface.
  4. Every gotcha around performance still exists. You would probably need to bail out of documents at 1k lines. To work as a language service, the package relies on existing optimisations that VS Code makes for showing document outlines etc.

leodevbro commented 3 days ago

@zm-cttae, cool, thanks. Just one more question about indentation-based languages, for example Python:

I have this Python code:

def myFn():
    print(1)

print(2)


and I get these tokens:

import { TextmateToken } from 'vscode-textmate-languageservice/dist/types/services/tokenizer';

const fetchedTokens: TextmateToken[] = [
  { text: 'def', level: 0, line: 0, startIndex: 0, endIndex: 3, type: 'storage.type.function.python', scopes: ['source.python', 'meta.function.python', 'storage.type.function.python'], },
  { text: ' ', level: 0, line: 0, startIndex: 3, endIndex: 4, type: 'meta.function.python', scopes: ['source.python', 'meta.function.python'], },
  { text: 'myFn', level: 0, line: 0, startIndex: 4, endIndex: 8, type: 'entity.name.function.python', scopes: ['source.python', 'meta.function.python', 'entity.name.function.python'], },
  { text: '(', level: 0, line: 0, startIndex: 8, endIndex: 9, type: 'punctuation.definition.parameters.begin.python', scopes: ['source.python', 'meta.function.python', 'meta.function.parameters.python', 'punctuation.definition.parameters.begin.python'], },
  { text: ')', level: 0, line: 0, startIndex: 9, endIndex: 10, type: 'punctuation.definition.parameters.end.python', scopes: ['source.python', 'meta.function.python', 'meta.function.parameters.python', 'punctuation.definition.parameters.end.python'], },
  { text: ':', level: 0, line: 0, startIndex: 10, endIndex: 11, type: 'punctuation.section.function.begin.python', scopes: ['source.python', 'meta.function.python', 'punctuation.section.function.begin.python'], },
  { text: '    ', level: 0, line: 1, startIndex: 0, endIndex: 4, type: 'source.python', scopes: ['source.python'], },
  { text: 'print', level: 0, line: 1, startIndex: 4, endIndex: 9, type: 'support.function.builtin.python', scopes: ['source.python', 'meta.function-call.python', 'support.function.builtin.python'], },
  { text: '(', level: 0, line: 1, startIndex: 9, endIndex: 10, type: 'punctuation.definition.arguments.begin.python', scopes: ['source.python', 'meta.function-call.python', 'punctuation.definition.arguments.begin.python'], },
  { text: '1', level: 0, line: 1, startIndex: 10, endIndex: 11, type: 'constant.numeric.dec.python', scopes: ['source.python', 'meta.function-call.python', 'meta.function-call.arguments.python', 'constant.numeric.dec.python'], },
  { text: ')', level: 0, line: 1, startIndex: 11, endIndex: 12, type: 'punctuation.definition.arguments.end.python', scopes: ['source.python', 'meta.function-call.python', 'punctuation.definition.arguments.end.python'], },
  { text: '', level: 0, line: 2, startIndex: 0, endIndex: 1, type: 'source.python', scopes: ['source.python'], },
  { text: 'print', level: 0, line: 3, startIndex: 0, endIndex: 5, type: 'support.function.builtin.python', scopes: ['source.python', 'meta.function-call.python', 'support.function.builtin.python'], },
  { text: '(', level: 0, line: 3, startIndex: 5, endIndex: 6, type: 'punctuation.definition.arguments.begin.python', scopes: ['source.python', 'meta.function-call.python', 'punctuation.definition.arguments.begin.python'], },
  { text: '2', level: 0, line: 3, startIndex: 6, endIndex: 7, type: 'constant.numeric.dec.python', scopes: ['source.python', 'meta.function-call.python', 'meta.function-call.arguments.python', 'constant.numeric.dec.python'], },
  { text: ')', level: 0, line: 3, startIndex: 7, endIndex: 8, type: 'punctuation.definition.arguments.end.python', scopes: ['source.python', 'meta.function-call.python', 'punctuation.definition.arguments.end.python'], },
  { text: '', level: 0, line: 4, startIndex: 0, endIndex: 1, type: 'source.python', scopes: ['source.python'], },
];

Can we easily determine the opening/closing positions of each nested block? This example is trivial, but with complex nesting it may become difficult to determine which text: ' ' indicates a real indent and not just a neutral space, because, as far as I can see, both indent spaces and neutral spaces have the same type prop and the same scopes prop, which is 'source.python'.

For example, this is how dt-python-parser provides indent/dedent tokens: (this works very accurately, but it is too slow)

const classPython3Parser = require('dt-python-parser').Python3Parser;

const pyParser = new classPython3Parser();

type PyToken = {
  type: number; // 93 is INDENT, 94 is DEDENT
  line: number;
  column: number;
  start: number;
  stop: number;
};

// Sample input: the same Python snippet as above
const pythonText = 'def myFn():\n    print(1)\n\nprint(2)\n';

const pyTokens: PyToken[] = pyParser.getAllTokens(pythonText);

Bonus question about TextmateToken: what does the level prop mean? Why is it always 0?


zm-cttae commented 6 hours ago

By iterating across the tokens and filtering for those with a startIndex of 0 and a type of source.python, you should be able to set up a custom folding computation. This also benefits from the indentation being simple and not dependent on the lines before.

The level prop is a precursor to folding which gets the block indentation level if those selectors are preconfigured in the extension. However, these are limited to languages with IF and END (indent and dedent). Hope this makes enough sense 🤞
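
The filtering idea in the first paragraph above could be sketched along these lines. This is only a rough illustration under stated assumptions: `IndentToken`, `IndentBlock`, and `computeIndentBlocks` are hypothetical names (not part of vscode-textmate-languageservice), tokens are assumed to arrive in document order with the fields shown earlier in the thread, and indentation is assumed to use spaces consistently.

```typescript
// Subset of the token fields shown earlier in the thread (assumed shape).
interface IndentToken {
  text: string;
  line: number;
  startIndex: number;
  type: string;
}

// A run of lines indented deeper than the surrounding code.
interface IndentBlock {
  startLine: number;
  endLine: number;
  indent: number;
}

function computeIndentBlocks(tokens: IndentToken[], rootScope: string): IndentBlock[] {
  // Step 1: per line, record the leading indent width and whether the line has content.
  const lines = new Map<number, { indent: number; hasContent: boolean }>();
  for (const token of tokens) {
    let info = lines.get(token.line);
    if (info === undefined) {
      info = { indent: 0, hasContent: false };
      lines.set(token.line, info);
    }
    const isWhitespace = token.text.trim() === '';
    if (token.startIndex === 0 && token.type === rootScope && isWhitespace) {
      info.indent = token.text.length; // leading whitespace token
    } else if (!isWhitespace) {
      info.hasContent = true;
    }
  }

  // Step 2: walk the non-blank lines, opening and closing blocks with a stack.
  const blocks: IndentBlock[] = [];
  const open: { startLine: number; indent: number }[] = [];
  let lastContentLine = -1;
  const closeDeeperThan = (indent: number): void => {
    while (open.length > 0 && open[open.length - 1].indent > indent) {
      const block = open.pop()!;
      blocks.push({ startLine: block.startLine, endLine: lastContentLine, indent: block.indent });
    }
  };
  const lineNumbers = Array.from(lines.keys()).sort((a, b) => a - b);
  for (const line of lineNumbers) {
    const info = lines.get(line)!;
    if (!info.hasContent) {
      continue; // blank lines neither open nor close a block
    }
    closeDeeperThan(info.indent);
    const currentIndent = open.length > 0 ? open[open.length - 1].indent : 0;
    if (info.indent > currentIndent) {
      open.push({ startLine: line, indent: info.indent });
    }
    lastContentLine = line;
  }
  closeDeeperThan(-1); // close whatever is still open at the end of the document
  return blocks;
}
```

Blocks are emitted in the order they close, so inner blocks come before the blocks that contain them, and each block covers only the indented body, with its header on the line before `startLine`. Continuation lines inside parentheses and multi-line strings are not handled here and would probably need extra scope checks.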