
Help TS team to assume ownership of semantic highlighting in TS tooling #92789

Closed · aeschli closed this issue 4 years ago

aeschli commented 4 years ago

We've been working on our semantic coloring support. This is currently powered by a TS Server plugin but we’d like to merge this work back into TS Server.

The plugin's readme describes its purpose and lists the implemented features.

Also see the tests for more details: https://github.com/aeschli/typescript-vscode-sh-plugin/blob/master/src/test/semanticTokens.test.ts

Along with the adoption, we also request an improvement to the encodedSemanticClassifications-full command so that it supports deltas. See https://github.com/microsoft/vscode/issues/92963#issuecomment-601208605

DanielRosenwasser commented 4 years ago

In https://github.com/microsoft/vscode/issues/92963#issuecomment-601208605, @alexdima mentioned something about a delta strategy.

I can't find any place where this delta strategy is documented. Can you please elaborate on that?

alexdima commented 4 years ago

@DanielRosenwasser Here is the relevant documentation. A DocumentSemanticTokensProvider can optionally implement a method provideDocumentSemanticTokensEdits. This method is called by the editor after a while, giving the provider a chance to return edits relative to the previous result. You can think of them as updates to the previous semantic tokens, flowing from the provider to the editor. The JSDoc explains it in greater detail.
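For reference, here is a minimal sketch of that API shape (illustrative names and values, not the actual TS plugin):

```ts
import * as vscode from 'vscode';

// Token types/modifiers are declared up front so the editor and the
// provider agree on the integer encoding. Names here are illustrative.
const legend = new vscode.SemanticTokensLegend(
  ['class', 'function', 'variable'],
  ['declaration', 'readonly']
);

const provider: vscode.DocumentSemanticTokensProvider = {
  provideDocumentSemanticTokens(document) {
    const builder = new vscode.SemanticTokensBuilder(legend);
    // A real provider walks the document and pushes one token per
    // classified range; a single hard-coded token shows the shape.
    builder.push(new vscode.Range(0, 0, 0, 3), 'class', ['declaration']);
    return builder.build('result-1'); // result ID referenced by later delta calls
  },
  // Optional: called "after a while" with the previous result ID, so the
  // provider can answer with edits relative to that result instead of
  // re-sending everything.
  provideDocumentSemanticTokensEdits(document, previousResultId) {
    // Returning undefined makes the editor fall back to a full request.
    return undefined;
  },
};

vscode.languages.registerDocumentSemanticTokensProvider(
  { language: 'typescript' },
  provider,
  legend
);
```

The legend is registered once so the integer encoding stays stable across full and delta responses.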

DanielRosenwasser commented 4 years ago

One of the things we brought up when this was first discussed is that TypeScript doesn't have a model that lets us correlate the current semantic state of the world with the last one. I don't think that's accounted for in this model.

In Visual Studio we've had APIs to provide semantic classifications which are handled by Roslyn, and we simply haven't had to worry about any issues there. This has worked for years. Given the constraints I've mentioned above, along with the fact that with this API each language server has to figure out how to achieve this on its own, I think we need to rethink the approach here.

CyrusNajmabadi commented 4 years ago

> In Visual Studio we've had APIs to provide semantic classifications which are handled by Roslyn, and we simply haven't had to worry about any issues there. This has worked for years. Given the constraints I've mentioned above, along with the fact that with this API each language server has to figure out how to achieve this on its own, I think we need to rethink the approach here.

I would agree with this. It doesn't make sense to me for languages to have to figure it out. It's non-trivial and would involve a lot of work per-language. In Roslyn we allow the languages to just be 'dumb'. They compute their syntactic or semantic classifications however they see fit, and we just update our host intelligently with the minimal change that's appropriate.

Note that this is what we do with all our taggers: classification, outlining, line separators, reference highlighting, and so on. All of these sit on systems where the language is just asked to compute simply, and the host (i.e. Roslyn) intelligently manages those and ensures minimal updates to the actual clients.

Having this complex logic centralized has been a boon: all Roslyn languages can focus on the language tasks and not on the complexities of semantic changes and diffs. I don't think it's reasonable or appropriate to now push that back out onto all the languages again.

RyanCavanaugh commented 4 years ago

Sending deltas only solves a small part of the problem. We also have to consider that the language side is going to end up spending a lot of time computing semantic information for highlights that are off-screen and remain off-screen - a user editing a large file might only ever scroll a tiny portion of it into view. For TypeScript, a question like "Is this JS identifier actually a class or a function?" involves a surprisingly large amount of work, e.g. looking for prototype assignments, looking for this. bindings, and other heuristics; these are not individually expensive, but in total can represent a lot of work if done repeatedly over thousands of occurrences.
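To illustrate the kind of heuristic work being described, a constructed example (mine, not from the thread):

```js
// Plain JS: no `class` keyword, yet tooling should color `Adder` as a class.
function Adder(start) {
  this.sum = start; // a `this.` binding in the body hints "constructor"
}
// A prototype assignment is another hint that `Adder` is used as a class.
Adder.prototype.add = function (n) {
  this.sum += n;
};
// Each such decision is cheap on its own, but repeated over thousands of
// identifiers in a large file it adds up.
```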

Worse, edits near the top of the file are going to invalidate a lot of semantic highlights as the file goes in and out of a syntactically valid state, so even under a delta-based scheme we're going to have to send back a bunch of (nontrivially computed) diffs for off-screen tokens.

Even setting all of that aside, we're looking at 5+ seconds of extra startup delay on a large file as VS Code processes the initial highlighting data, and deltas will not fix any of that.

I'd like to revisit #92963 to address the underlying concern here and ship a performant feature - VS Code has already invested a lot in making the experience shine in large files and I think we need to continue to keep these scenarios in mind.

alexdima commented 4 years ago

The purpose of DocumentSemanticTokensProvider is to provide semantic tokens for the entire document, so that once those tokens have been provided, scrolling in the editor can render all the text synchronously, without flickering, even if there is latency to the language server - as when using VS Code Remote SSH, where the UI and the language server are separated by the internet. This solution is designed for an asynchronous world, where the editor cannot make a synchronous call to the language server and ask it for the viewport's tokens while painting. I will focus my examples on the DocumentSemanticTokensProvider, but if you find this one too difficult to implement (I'm not sure why that would be the case), there is also a DocumentRangeSemanticTokensProvider, which fetches semantic tokens only for the viewport, but which flickers because the tokens must be fetched asynchronously.

I'm going to try to explain the idea behind provideDocumentSemanticTokensEdits and to suggest a simple implementation. provideDocumentSemanticTokensEdits is first and foremost about reducing data on the wire. It can also be about computing less, but that is really up to how the language server itself is implemented.

I find it best to reason about an example. So, suppose you open checker.ts. The language server receives an event that the document was opened in the editor; from now on the "truth" lies with the editor, and changes will be pushed to the language server. In the version of TypeScript I have checked out, that file contains 81,047 semantic tokens. With an encoding of 5 x 32-bit integers per token, that is 405,235 32-bit integers, or about 1.6MB "on the wire". The encoding could be reduced to 3 integers per token, and that is doable, but it is not important to this reasoning.
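For concreteness, the arithmetic behind those figures:

```ts
// A worked check of the numbers in the comment above.
const tokens = 81_047;       // semantic tokens in checker.ts
const intsPerToken = 5;      // e.g. line, startChar, length, type, modifiers
const totalInts = tokens * intsPerToken; // 405,235 integers
const bytes = totalInts * 4;             // 32-bit = 4 bytes each
console.log(totalInts, bytes);           // 405235, 1620940 (~1.6 MB)
```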

Let's suppose the 1.6MB of semantic tokens have made it to the editor; they are now rendered and all is good. But suppose a single edit is made: a new line is inserted at the top of checker.ts. The editor will adjust the semantic tokens on the UI thread so they "stick with the text" - all the tokens move down - but the editor must now consider these tokens potentially invalid. The editor cannot know that in this programming language \n does not carry semantic meaning. So, as far as the editor is concerned, inserting \n behaves very much like inserting /*, even though the two carry different semantic meaning in TS.

The small edit is communicated to the language server as a single text delta event, proportional in size to the change: it just says that at (1,1) a \n was inserted - only a few bytes. But eventually the editor needs to validate those semantic tokens (think of the /* case: it cannot just assume that sticking the tokens to the text is correct). So the editor eventually needs to ask the DocumentSemanticTokensProvider for semantic tokens again. That means that in the \n case, 1.6MB worth of data needs to be sent "on the wire" to the editor again. All the tokens will be very similar to the previous ones, but with their line number increased by one (or their offset increased by 1), depending on the chosen token encoding (line/column or offset).

That is where provideDocumentSemanticTokensEdits comes in. Instead of sending 1.6MB worth of tokens on the wire again, it is possible to send one or more edits to the previous result which describe the new result, instead of sending the new result in full.
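In VS Code's API this delta response is a SemanticTokensEdits value. A sketch with made-up numbers, assuming the delta affects only the first token's five integers:

```ts
import * as vscode from 'vscode';

// Hypothetical delta for the "\n at the top" example: only the integers
// that actually changed are replaced. `start` and `deleteCount` index into
// the previous flat integer array, not into the token list.
const edit = new vscode.SemanticTokensEdit(
  0,                               // start: first changed integer
  5,                               // deleteCount: integers to drop from X0
  new Uint32Array([2, 4, 7, 1, 0]) // replacement integers (made up)
);
const delta = new vscode.SemanticTokensEdits([edit], 'result-2');
```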

Now, for the simple implementation I have promised. On the language server side, when a provideDocumentSemanticTokens request comes in, compute all the semantic tokens in the file by walking the entire AST. Let's call this result X0 and let's store it in a map associated with the document. Only keep a single semantic tokens result per document. If there is any other previous semantic tokens result stored, delete it. If the document gets closed (the editor will let you know with an event), you can delete the result X0 from the map of open documents.

For a subsequent call to provideDocumentSemanticTokensEdits, the result ID you chose for X0 will be passed in the request. If you find X0 in the result map, proceed; otherwise throw an error. Compute all the semantic tokens in the file again by walking the entire AST, generating the result X1. Now you have N0 integers from X0 and N1 integers from X1.

You need to compute one or more edits that bring X0 to X1; let's call that diff(X0,X1). The best way is to implement an LCS algorithm (the kind typically used in a diff implementation, like Myers's), but that might be slow. A simpler way is to check the prefix and the suffix for equality and strip them, and then treat the entire range in the middle as one edit (e.g. axc => ayc, where a, x, y and c are sequences of integers, can be expressed as "replace x with y"). Delete X0 from the stored state and store X1 in full. Return the diff(X0,X1) you computed.
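A minimal sketch of that prefix/suffix diff over the flat integer arrays (illustrative types, not tied to any particular server):

```ts
interface TokensEdit { start: number; deleteCount: number; data: Uint32Array; }

// Strip the common prefix and suffix of the previous (x0) and new (x1)
// integer arrays, and report the middle as a single replace edit.
function diff(x0: Uint32Array, x1: Uint32Array): TokensEdit[] {
  let prefix = 0;
  const maxPrefix = Math.min(x0.length, x1.length);
  while (prefix < maxPrefix && x0[prefix] === x1[prefix]) prefix++;

  let suffix = 0;
  const maxSuffix = Math.min(x0.length, x1.length) - prefix;
  while (
    suffix < maxSuffix &&
    x0[x0.length - 1 - suffix] === x1[x1.length - 1 - suffix]
  ) suffix++;

  // Everything between the common prefix and suffix becomes one edit:
  // delete the middle of x0, insert the corresponding integers from x1.
  return [{
    start: prefix,
    deleteCount: x0.length - suffix - prefix,
    data: x1.slice(prefix, x1.length - suffix),
  }];
}
```

For the common case of a local edit this yields one small edit; in the worst case it degenerates to one edit replacing nearly everything, which is no worse than sending the full result.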

If you don't want to do this, that is ok - simply don't implement provideDocumentSemanticTokensEdits. But if you do implement it, 1.6MB is saved on almost every keystroke.

DanielRosenwasser commented 4 years ago

The thing is that there's a huge cost in this one step:

> Compute all the semantic tokens in the file again by walking the entire AST, generating the result X1

The minimum overhead of this is:

  1. Walk the whole AST, which is already going to have overhead on any reasonably sized file.
  2. Resolve information about every identifier in the file, which can potentially trigger a full type-check of your entire program.

We're already seeing the negative effects of this - that's the source of the 5+ second delays we're seeing locally.

On the other hand, doing that huge JSON serialization and then going across the wire - that's also unacceptable. Serializing and deserializing about 2 million integers in arrays of arrays takes about 500ms on my machine, and nobody wants that.

```js
// Build ~2 million integers as 400,000 five-element tuples
// (the shape of a semantic tokens payload).
var result = [];
for (let i = 0; i < 2000000; i += 5) {
  let tuple = [];
  for (let j = 0; j < 5; j++) {
    tuple.push(j);
  }
  result.push(tuple);
}

// Round-trip through JSON to simulate serializing it over the wire.
var start = Date.now();
var yadda = JSON.parse(JSON.stringify(result));
var end = Date.now();
console.log(end - start);
```

So both of these problems need to be fixed. Regardless of the API, the bottom line is that the editor is potentially going to be asking language services to calculate too much.

alexdima commented 4 years ago

@DanielRosenwasser I think we are in agreement. What I outlined above is a very simple implementation, which someone without intimate knowledge of the compiler can do in probably under 1 day.

So the slowness of serializing a large result over and over again, and of potentially downloading it over Wi-Fi (in the Remote SSH case) over and over again, is resolved by the proposed API, which allows an edit to be sent instead.

But to be honest, the slowness of walking the AST and generating semantic tokens is purely under the control of the TS language service, so I cannot prescribe in the VS Code API how to eliminate that slowness.

I can only ask: why are suggestions so fast? A simple implementation of suggestions - one that could be implemented in under 1 day - would be to walk the entire program and all the ASTs and collect all possible reachable symbols... Yet, perhaps you are doing something faster, since it doesn't take 5+ seconds? Perhaps you spent more than 1 day on it?

Perhaps something can be engineered by a TS developer with intimate knowledge of the compiler such that it is not necessary to type check the entire program when computing semantic tokens? Perhaps when looking at an identifier it could be decided more quickly whether it is a class type name, an enum type name, a variable, a constant or an argument, etc., without having to check all type constraints? Perhaps it is possible to cache semantic tokens with the AST and invalidate them only selectively when typing? For example, if I am typing inside a comment or in a method body, computing semantic tokens could be super fast, since most of the symbols in the file have not changed.

Also, if the 5s+ is spent type-checking the program, wouldn't those 5s+ have been spent anyway when computing diagnostics, or when hovering? How much of this is work that would have been done anyway? Another idea: you could delay answering the request until you have a type-checked program. One more thing I forgot to mention: you can throw "busy" from the language server and we'll ask again when the user next edits the file; we are trying to iron that out into a proper type in #93686.

All I'm trying to say is that I believe something better than walking the entire AST and typechecking the entire program can be built, but I think you are in the best position to build that. But I agree, if the dev budget for this feature is 1 day, then it will take 5s+ on checker.ts.

alexdima commented 4 years ago

Another idea I just had: if X0 is kept in memory, along with some kind of mapping from AST nodes to positions in X0, a fast way to compute X1 would be to reuse all the data from X0 except for the AST nodes that changed.

So if X0 covers the entire checker.ts and I type 5 letters inside a method body, all the data from X0 outside that method body could be reused to compute X1, and only the method body would need to be traversed / re-type-checked.
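A sketch of that reuse idea, with hypothetical helpers (tsserver has no such API; this is only the general shape of the proposal):

```ts
interface Token { start: number; length: number; classification: number; }

// Recompute tokens only for the region touched by the edit; everything
// else is reused from the previous result X0, with positions after the
// edit shifted by the length delta.
function computeX1(
  x0: Token[],
  editStart: number,    // offset where the edit begins (old text)
  editEndOld: number,   // offset where the replaced range ends (old text)
  lengthDelta: number,  // new length minus old length of the replaced range
  retokenize: (from: number, to: number) => Token[] // hypothetical re-walk
): Token[] {
  const before = x0.filter(t => t.start + t.length <= editStart);
  const after = x0
    .filter(t => t.start >= editEndOld)
    .map(t => ({ ...t, start: t.start + lengthDelta }));
  // In practice the recomputed range would be widened to the enclosing AST
  // node (e.g. the whole method body), so straddling tokens are covered.
  const changed = retokenize(editStart, editEndOld + lengthDelta);
  return [...before, ...changed, ...after];
}
```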

DanielRosenwasser commented 4 years ago

@alexdima it sounds like what you're asking for is some sort of eureka-style optimization that we can apply to a problem space which we've thought about for years (reusing semantic information between compiles) for a feature that's already been implemented reasonably well for other editors. I really urge you to reconsider: is this eureka-style optimization something you think every language service is going to be able to implement independently? Does it exist?

> I can only ask: why are suggestions so fast?

This question is pertinent, and I'm glad that you asked it. Apart from incremental parsing, the reason that features like completion, quick info, go-to-definition, etc. are reasonably quick isn't that they're doing super special optimizations - it's that they're doing the minimum work necessary to provide an answer for a client. If you were to ask for quick info on every identifier in the program, you would hit similar scaling issues on any reasonably-sized file.

The way semantic highlighting stays reasonably fast in Visual Studio is a very similar approach: the editor only requests highlights around the user's viewport. The server only needs to dive into the subset of the tree within the requested range, and only resolves symbols in that range. That cuts out a large amount of computation.

So ultimately that's what we're asking about: can the API ask for less in an intelligent manner?

alexdima commented 4 years ago

@DanielRosenwasser Sure, then implement only a DocumentRangeSemanticTokensProvider. But then if I scroll quickly to a location, it will first render without semantic tokens and then semantic tokens will be fetched and the text will flash to the colors of semantic tokens.
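For completeness, the range-based provider is the smaller of the two surfaces; a minimal sketch (illustrative legend and selector):

```ts
import * as vscode from 'vscode';

const legend = new vscode.SemanticTokensLegend(['class', 'function', 'variable']);

// The provider is only ever asked about `range` (roughly the viewport),
// so it can restrict its AST walk and symbol resolution to that slice.
const rangeProvider: vscode.DocumentRangeSemanticTokensProvider = {
  provideDocumentRangeSemanticTokens(document, range) {
    const builder = new vscode.SemanticTokensBuilder(legend);
    // ... classify only the identifiers that intersect `range` ...
    return builder.build();
  },
};

vscode.languages.registerDocumentRangeSemanticTokensProvider(
  { language: 'typescript' },
  rangeProvider,
  legend
);
```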

I ask that you double-check that Visual Studio makes zero synchronous calls from the UI thread to tsserver. Is Visual Studio prepared for a latency of >50ms on the result of that call? Does Visual Studio test loading tsserver over the internet? How does Visual Studio not flash?

alexdima commented 4 years ago

Here is how TypeScript looks with only a DocumentRangeSemanticTokensProvider. Notice "the flash":

[screen recording: scrolling with only range-based semantic tokens; the text flashes as colors arrive]

alexdima commented 4 years ago

I can obviously request 20 lines above and below the viewport eagerly, but the flash will still be there given higher latency or faster scrolling.

RyanCavanaugh commented 4 years ago

> Sure, then implement only a DocumentRangeSemanticTokensProvider. But then if I scroll quickly to a location, it will first render without semantic tokens and then semantic tokens will be fetched and the text will flash to the colors of semantic tokens.

This is what we'd like the VS Code TS plugin to do, then.

The theme colors should be designed so that the flickering is not distracting. VS does this and it works very well. With good color choices it is, e.g., a shift from blue to a darker blue, or from blue to teal. When jumping between file locations it is certainly visible, but it is not at all bothersome; we don't receive complaints about it. Having SyntaxKind go from some dark color to bright red is just the theme designer not being aware of this constraint.

A good semantic highlighting scheme makes the semantic information subtly different from the nonsemantic coloring of the same identifier; I think people get tempted to add very bright "it's obviously working!"-style colors when they first add a new form of colorization, but users don't like this kind of highlighting as I think you've seen from feedback.

dbaeumer commented 4 years ago

@CyrusNajmabadi the reason we ask the server to do some delta encoding is to minimize the data that needs to go over the wire (e.g. from the server to any client), because semantic coloring for a whole file can be heavyweight. In a single-process world I do agree that the language smarts shouldn't be bothered with this. We tried to make it very easy for a server to compute the diff, since it is not a semantic diff: it is a number-array diff, which in most cases should be easy to compute.

I also want to point out that, from discussions I have had with other teams, semantically coloring only the current range is not always enough. Doing so means the minimap will show no semantic colors, or incorrect ones. For example, C++ uses semantic coloring to highlight active and inactive regions (rendered gray) in the minimap.

alexdima commented 4 years ago

In my role as an API provider, this is a great outcome. The vscode API comes with both DocumentRangeSemanticTokensProvider and DocumentSemanticTokensProvider. They have different tradeoffs; you understand those tradeoffs well and have chosen the one that is best for you. I can be happy that the API has you covered.

As a daily user of TypeScript, this is not the best outcome, because if I ever see a flicker I will know why, and I will know it could have been better. But I have hope that maybe one day you could implement an efficient DocumentSemanticTokensProvider, and from our discussion it doesn't appear to me that the API shape is what would hold this work back.

In conclusion, I think this is under the control of the TS extension and TS language server, so the API is good to ship in its current form this milestone.

RyanCavanaugh commented 4 years ago

> Perhaps something can be engineered by a TS developer with intimate knowledge of the compiler such that it is not necessary to type check the entire program when computing semantic tokens? Perhaps when looking at an identifier it could be decided more quickly whether it is a class type name, an enum type name, a variable, a constant or an argument, etc., without having to check all type constraints?

> But I agree, if the dev budget for this feature is 1 day, then it will take 5s+ on checker.ts.

> But I have hope that maybe one day you could implement an efficient DocumentSemanticTokensProvider

@alexdima the repeated implication that we just haven't thought about this problem very hard, and are unwilling to do so for more than a day, isn't useful here. I want to make sure you understand that we have indeed thought about this quite hard, so that you have realistic expectations going forward, and for other languages too. Consider this example:

[code screenshot: an example where classifying the two `f`s at the bottom requires full type information]

The amount of work required to color the two fs at the bottom is effectively the exact same amount of work we'd need to batch typecheck this file. It isn't that we're not clever, it's that semantic highlighting demands this sort of analysis at every position that isn't a direct reference to a provably-bound location.
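The original screenshot is not available, but a constructed example in the same spirit shows why classification can cost as much as checking:

```ts
function demo() {
  let f;
  if (Math.random() > 0.5) {
    f = function () { return 1; };
  } else {
    f = 42;
  }
  // Is `f` here a function (callable) or a plain variable holding a number?
  // Answering that requires the same control-flow analysis the checker
  // performs during a full typecheck of the enclosing code.
  return f;
}
```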

You mentioned earlier the idea of incremental semantic analysis that invalidates only certain things depending on what in the program changed. If we were capable of producing differential semantic outputs, we would have done so years ago. An earlier version of the TS checker tried to maintain meaningful semantic information between edits, and it failed miserably. The Flow team at Facebook is trying to do this, and it takes up so much of their time that they had to blog about the fact that they didn't appear to be doing much of anything else.

Other languages that have been designed with static analysis in mind will surely have an easier time of this, but JS is not a language designed with easy static analysis in mind. Maybe semantic highlighting is not a good fit for TS/JS given the efficiency constraints being implied here -- it seems like this feature has been designed with the idea that computing the coloring of a token is nearly free from the provider's perspective, but this just isn't the case for TS/JS. What do you think?

alexdima commented 4 years ago

@RyanCavanaugh You have a very good point. I apologize for speculating without knowing how tsserver actually works. So I spent some time poking at things to understand how they do work, made various tweaks, and profiled. In all cases I insert one space in checker.ts:

I only ran each test once, so the absolute times don't mean too much, since the variation is large, but it helps to get an idea of what is going on.


With semantic highlighting turned off:

| Action | Time |
| --- | --- |
| updateProjectIfDirty | 680ms |
| getSemanticDiagnostics | 3760ms |
| TOTAL | 4440ms |

CPU-20200406T174347-sem-off.zip

With semantic highlighting turned on:

| Action | Time |
| --- | --- |
| updateProjectIfDirty | 687ms |
| getEncodedSemanticClassifications | 4630ms |
| getSemanticDiagnostics | 2890ms |
| TOTAL | 8207ms |

CPU-20200406T174347-sem-on.zip

I find two things interesting: computing semantic tokens is more expensive than typechecking, and it looks like some things (but not all) get cached between calls on the same project state, because getSemanticDiagnostics is now cheaper.


I then made one change: delaying the server request from the TS extension so that semantic highlighting runs after the diagnostics pass, in the hope of benefiting from that caching:

| Action | Time |
| --- | --- |
| updateProjectIfDirty | 672ms |
| getSemanticDiagnostics | 3250ms |
| getEncodedSemanticClassifications | 3560ms |
| TOTAL | 7482ms |

CPU-20200406T174347-sem-on.zip

This confirms that some things are cached in the typechecker, but computing semantic tokens is still very expensive. I thought that there must be something wrong with the way semantic tokens are computed since even now computing them is slower than computing semantic diagnostics.


I then started looking at the plugin's implementation of getEncodedSemanticClassifications. It is written very similarly to the one in tsserver itself: it uses the utility ts.forEachChild to walk the AST, stops at identifiers, and then pokes around to find the symbols and types...

I took a peek at the typechecker itself, and that is written differently, in what I understood to be a more traditional "top-down, visit each node in a typed manner" approach, with functions that are far less polymorphic than the plugin's very generic visit method. So I started to tweak the plugin's implementation of getEncodedSemanticClassifications.

I began writing a more traditional visitor. I haven't finished yet (I still have nodes to visit whose handlers are //TODO), but I'm currently generating 61,768 of the 83,240 tokens produced by the initial implementation (around 75%) in around 50% of the time. So I have hope that this way of visiting the AST will be somewhat faster even when complete, though it doesn't look like it will be faster by a large factor.

| Action | Time |
| --- | --- |
| updateProjectIfDirty | 677ms |
| getSemanticDiagnostics | 3400ms |
| getEncodedSemanticClassifications* | 1640ms |
| TOTAL | 5717ms |

CPU-20200406T230716-sem-on-new.zip
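The kind-dispatched visitor being described might look roughly like this; a sketch, not the actual code in the branch linked below:

```ts
import * as ts from 'typescript';

// Dispatch on node.kind and only consult the checker where a
// classification is actually ambiguous, instead of poking at every
// identifier generically.
function visit(
  node: ts.Node,
  checker: ts.TypeChecker,
  push: (n: ts.Node, kind: string) => void
): void {
  switch (node.kind) {
    case ts.SyntaxKind.ClassDeclaration: {
      const decl = node as ts.ClassDeclaration;
      if (decl.name) push(decl.name, 'class');
      break;
    }
    case ts.SyntaxKind.Parameter: {
      const param = node as ts.ParameterDeclaration;
      if (ts.isIdentifier(param.name)) push(param.name, 'parameter');
      break;
    }
    // ... more SyntaxKinds, falling back to the checker only when the
    // syntax alone does not determine the classification ...
  }
  ts.forEachChild(node, child => visit(child, checker, push));
}
```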

And here is where things get interesting. A lot of time is actually spent not in the visitor but, again, in the typechecker. So while some things are cached, I have a strong suspicion (without really understanding the typechecker itself) that not a lot is cached, since my profiles show that a lot of time is still spent typechecking while generating the semantic tokens, on a project in the same state.

I am not finished, but I wanted to give a quick update on my current thinking. I believe that if more caching were done in the typechecker, semantic token generation would be significantly cheaper (maybe an optimistic 10x). This also makes me think of the C++ team, who AFAIK generate semantic tokens while reconciling (they call it the "IntelliSense pass") to avoid computing things twice.

When I mention a dev budget of 1 day in my comments above, I don't mean that someone from the TS team should invent a new kind of compiler or rewrite all the compiler infrastructure for this feature. I mean that someone should spend the time, because I consider this an engineering problem that can be solved with caching and better timing of things. I will continue with this visitor and let you know the results, but I still believe that someone with deep knowledge of the TS compiler/domain would be much better suited to the task.

alexdima commented 4 years ago

Here is a quick update:

In the end, I was able to reach an overall improvement of about 10x. Here are the average times to compute semantic tokens for checker.ts on my machine. I applied three conceptual changes to get the improved times:

[table: average times to compute semantic tokens for checker.ts before and after the three changes]

I have compared the semantic tokens generated by the initial and final implementations; minor differences remain in the output, which, IMHO, are worth the time saved by the symbol cache.
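A sketch of what such a symbol cache could look like (the actual change is in the diff linked below; this is only the general shape):

```ts
import * as ts from 'typescript';

// Memoize checker lookups so repeated classifications of the same
// identifier don't pay for symbol resolution twice.
const symbolCache = new Map<ts.Node, ts.Symbol | undefined>();

function getCachedSymbol(
  checker: ts.TypeChecker,
  node: ts.Node
): ts.Symbol | undefined {
  if (!symbolCache.has(node)) {
    symbolCache.set(node, checker.getSymbolAtLocation(node));
  }
  return symbolCache.get(node);
}
// The cache must be dropped whenever the program updates, since nodes and
// symbols are recreated across edits.
```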

The final change is available at https://github.com/alexdima/typescript-vscode-sh-plugin/tree/alex/faster and can be seen as a diff in https://github.com/aeschli/typescript-vscode-sh-plugin/pull/10

mjbvz commented 4 years ago

Outcome of discussion today:

@aeschli I'm going to unassign myself from this issue and tentatively move it to the next iteration. However if you need help coordinating this work with the TS team, just let me know!

aeschli commented 4 years ago

> My understanding is that @aeschli and @alexdima will be driving this work with some help from the TS team.

I was under the impression that someone from the TypeScript team would move the plugin code to the server and adapt it to their liking. Did I misunderstand?

I'm happy to maintain the plugin in the meantime and to adopt the TS server implementation when it's ready.

mjbvz commented 4 years ago

@aeschli Yes, I think the TS team would do the code changes. You would be driving the work by following up with the TS team and providing them with any support they need.

orta commented 4 years ago

The PR integrating these features into TS is https://github.com/microsoft/TypeScript/pull/39119 👍

aeschli commented 4 years ago

Great news, @orta! As the migration is now well under way, I'll close this issue.

I'm happy to help anywhere, just ping me!