microsoft / language-server-protocol

Defines a common protocol for language servers.
https://microsoft.github.io/language-server-protocol/
Creative Commons Attribution 4.0 International

Add a "data" field to CompletionList for round-tripping data from the server when resolving completion items #1008

Open DanTup opened 4 years ago

DanTup commented 4 years ago

One of the big issues that comes up periodically from LSP clients is how much data is in the textDocument/completion payload when auto-import completions is enabled (that is, we include symbols that are not imported into the current scope, and automatically insert imports for them if they are selected).

Part of the reason this is huge is that we need to round-trip some data to the server in order to be able to build everything we need in the resolve call. We do this using the data field on CompletionItem. Here's an example of some completion items:

    {
        "data": {
            "rLength": 7,
            "rOffset": 848,
            "displayUri": "package:flutter/material.dart",
            "libId": 598,
            "offset": 855,
            "file": "/Users/dannytuppeny/Projects/MyProjectThing/lib/view/page/setting.dart"
        },
        "sortText": "999993",
        "kind": 7,
        "label": "SliverAnimatedList"
    },
    {
        "data": {
            "rLength": 7,
            "rOffset": 848,
            "displayUri": "package:flutter/material.dart",
            "libId": 598,
            "offset": 855,
            "file": "/Users/dannytuppeny/Projects/MyProjectThing/lib/view/page/setting.dart"
        },
        "sortText": "999993",
        "kind": 13,
        "label": "GrowthDirection"
    },
    {
        "data": {
            "rLength": 7,
            "rOffset": 848,
            "displayUri": "package:flutter/material.dart",
            "libId": 598,
            "offset": 855,
            "file": "/Users/dannytuppeny/Projects/MyProjectThing/lib/view/page/setting.dart"
        },
        "sortText": "999993",
        "detail": "(BuildContext context, T value, Widget child) → Widget",
        "kind": 7,
        "label": "ValueWidgetBuilder"
    },
    {
        "data": {
            "rLength": 7,
            "rOffset": 848,
            "displayUri": "package:flutter/material.dart",
            "libId": 598,
            "offset": 855,
            "file": "/Users/dannytuppeny/Projects/MyProjectThing/lib/view/page/setting.dart"
        },
        "sortText": "999993",
        "kind": 7,
        "label": "Texture"
    },

You'll notice here that every completion item (of which there may be a few thousand) has the same data attached to it:

        "data": {
            "rLength": 7,
            "rOffset": 848,
            "displayUri": "package:flutter/material.dart",
            "libId": 598,
            "offset": 855,
            "file": "/Users/dannytuppeny/Projects/MyProjectThing/lib/view/page/setting.dart"
        },

The data here is much larger than the rest of the info for each completion item. This means over 50% of the payload is redundant.

It would be nice if we could include some data on the CompletionList which gets merged into each item's data (or supplied in addition to it) on a resolve call; then we could eliminate half of the payload (and along with it, some of the time spent deserialising all of this).
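A minimal sketch of the proposed shape, assuming the shared `data` is merged into each item's own `data` at resolve time (the field names mirror the example payload above; the merge semantics here are an assumption, not current spec):

```typescript
// Hypothetical extension: a shared `data` field on CompletionList that is
// combined with each item's own `data` when resolving, instead of being
// duplicated on every item over the wire.

interface CompletionItem {
  label: string;
  kind?: number;
  sortText?: string;
  data?: Record<string, unknown>;
}

interface CompletionList {
  isIncomplete: boolean;
  items: CompletionItem[];
  // Proposed: sent once for the whole list.
  data?: Record<string, unknown>;
}

// At resolve time, the server sees the shared data combined with any
// item-specific data (item-level fields winning on conflict).
function resolveData(
  list: CompletionList,
  item: CompletionItem
): Record<string, unknown> {
  return { ...list.data, ...item.data };
}

const list: CompletionList = {
  isIncomplete: false,
  data: { libId: 598, offset: 855, rLength: 7, rOffset: 848 },
  items: [{ label: "SliverAnimatedList", kind: 7, sortText: "999993" }],
};

// The resolved data contains libId/offset/etc. without each of the
// thousands of items repeating them.
const merged = resolveData(list, list.items[0]);
```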

Another option would be for us to stash this on the server and just send an ID back - however without making some assumptions about when it's safe to clean up, this could cause issues... If LSP stated that resolve can never be called for a completion item that wasn't in the last set of results, that might fix that.

matklad commented 4 years ago

My gut feeling is that having data: any in CompletionItem is a wrong pattern to begin with. We haven't implemented it for completions yet, but we have something similar for code actions.

The philosophical problem with data is that it is computed at the point where the completion item list is requested, but used later, when computing the edit. There are no guarantees that the state of the world at the two points is the same, so the server has to explicitly account for the fact that the old data may not make sense now. Granted, you can write the spec in such a way as to prohibit DidChangeTextDocument and other changes in between the completion and completionItem/resolve requests, but the code to handle the possibility of invalid data still has to be there on the server somewhere.

I pretty strongly believe that the best pattern here is stateless requests and responses, roughly like how GET/POST requests work for forms in HTTP. Here, completions would be GET, and completionItem/resolve would be POST.

Specifically, the completionItem/resolve should look like this:

{
  "params": CompletionParams,
  "item": CompletionItem,
}

That is, the client sends exactly the same data as for completion request, plus the selected item. Given the same params, the server, by definition, can reconstruct any state it used to compute the original completion list. It then can skip computing other completions, and compute the edit only for the selected one.

Now, even in this setup it might be helpful to round-trip some kind of completionItemId: string to make it easier for the server to compute which specific completion item it needs to resolve, but this should be restricted solely for identifying stuff. All the actual data should be re-derived by the server, using the up to date state.
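The stateless pattern described here could be sketched as follows (the types and the `computeCompletions` helper are illustrative stand-ins, not real LSP API):

```typescript
// Sketch of the stateless resolve pattern: the resolve request carries the
// original completion params plus the selected item, and the server
// re-derives everything from current state rather than from stored data.

interface Position { line: number; character: number; }
interface CompletionParams { uri: string; position: Position; }
interface CompletionItem { label: string; detail?: string; }

interface ResolveRequest {
  params: CompletionParams; // exactly what the client sent for completion
  item: CompletionItem;     // the one item the user selected
}

// Stand-in for the server's (re-)computation over up-to-date documents.
function computeCompletions(params: CompletionParams): CompletionItem[] {
  return [{ label: "SliverAnimatedList" }, { label: "GrowthDirection" }];
}

function handleResolve(req: ResolveRequest): CompletionItem {
  // Re-run the cheap part of completion, but fully resolve only the
  // selected item. Everything is derived from current state, so stale
  // data is impossible by construction; the cost is recomputation.
  const match = computeCompletions(req.params)
    .find(c => c.label === req.item.label);
  if (!match) throw new Error("item no longer exists in current state");
  return { ...match, detail: "resolved lazily" };
}
```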

DanTup commented 4 years ago

I don't really see a difference between your proposal and mine, except it seems more restrictive (you get the original params, even if you'd rather have some pre-computed state). You can implement yours using mine (or even today, but you'll hit the issue I have here - which is huge duplication).

I think it's better to remain flexible - if you want to round-trip the original request, you can do it. If you want to instead use some computed values that will make resolve faster, you can do it. This request is entirely about performance though - what we have today may require a ton of duplication (over 50% of the payload for me), so having a shared/global blob of data that's round tripped would cut that significantly.

matklad commented 4 years ago

Yeah, I agree that we are proposing essentially the same thing, I just have a different argument why it is a good idea :-)

Though, I do believe that "don't save state across LSP requests" is a good guide to encourage, and that's why I think defaulting to just the request params might be a better default. But I don't have too strong an opinion on this specific point.

mickaelistria commented 4 years ago

See also #1009 and related comments. What are the workflows requiring such extra data fields? Couldn't we build something more strongly spec'd and typed for those workflows, instead of adding more and more unspec'd, unclear, expensive-to-integrate placeholders for random server-specific content here and there?

matklad commented 4 years ago

more and more unspec'd, unclear, expensive to integrate in each clients placeholders for random server-specific content here and there?

This doesn't apply in this case, as data is just passed through from the server, via client, back to the server.

DanTup commented 4 years ago

Though, I do believe that "don't save state across LSP requests" is a good guide to encourage,

I agree - my request here isn't to support that (and we're not doing that). I did mention it in the last paragraph as a fallback if this wouldn't otherwise be implemented (but the spec would need to specify some terms for this, so it's not really viable now).

and that's why I think defaulting to just request params might be a better default

I'm not sure of the relationship between these. I think allowing any data supports this, but also supports round-tripping custom/pre-computed data if you want. I think specifying it as the original params would be an unnecessary (and unwanted, for me) restriction.

Couldn't we build something more strongly spec'd and typed for those workflows instead of adding more and more unspec'd, unclear, expensive to integrate in each clients placeholders for random server-specific content here and there?

The purpose of data is to allow the server to attach some additional info to a completion that it may need in order to resolve it efficiently. The client never does anything with it except send it back to the server, so I don't think there's any benefit to trying to type it (and it would place additional restrictions on the server).

mickaelistria commented 4 years ago

OK, so IMO, it'd be better to have this data be an integer serverDataToken, with the server storing the data on its side and resolving it on its side (IIRC, similar to what's implemented for data navigation in the Debug Adapter Protocol). With this approach, 1. there is no performance overhead from transferring unnecessary content and 2. it seems less open and less likely to be exploited as a way to host extra server<->client payload.
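The token approach could be sketched as below. Note that the invalidation policy (dropping tokens on document change, DAP-style) is one possible choice, not something the spec defines; the class and method names are hypothetical:

```typescript
// Sketch of a serverDataToken store: the server keeps resolve data keyed
// by an integer token and sends only the token over the wire. The open
// question raised in this thread is when entries may safely be dropped.

class CompletionDataStore {
  private next = 0;
  private store = new Map<number, unknown>();

  // Called while building the completion list; the returned token goes
  // into the item's `data` field instead of the full payload.
  put(data: unknown): number {
    const token = this.next++;
    this.store.set(token, data);
    return token;
  }

  // Called from completionItem/resolve; undefined means the token was
  // invalidated and the item can no longer be resolved.
  get(token: number): unknown | undefined {
    return this.store.get(token);
  }

  // One possible invalidation policy: clear everything when the document
  // changes, is deleted, or is closed (similar to how DAP invalidates
  // its references when execution state changes).
  invalidateAll(): void {
    this.store.clear();
  }
}
```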

matklad commented 4 years ago

my request here isn't to support that

I mean, "data": { "rLength": 7, "rOffset": 848, ... } is effectively extra state which is "stored" between requests. The rOffset looks like it might get invalidated by an interleaving change between the completion & resolve requests. To me, the approach where the server re-derives the rOffset itself seems more obviously correct. But I agree that this is mostly a theoretical concern; just sticking offsets or what not in data should work well enough.

DanTup commented 4 years ago

IMO, it'd be better to have this data be an integer serverDataToken, and server storing the data on its side and resolving it on its side

That's what the last paragraph in the original comment here suggested - though as a fallback rather than the ideal solution. The problem with that is invalidation - when can the server throw the data away? Can we assume when we get a new request for completion, that we can throw away all of the previous ones?

I mean, "data": { "rLength": 7, "rOffset": 848, ... } is effectively an extra state which is "stored" between reqeusts.

Sorry, I thought you meant the server being stateful. The point of round-tripping this is to avoid the server holding state. It's true that it could be used to pass around data that could become invalid, but that's entirely up to the server. I don't think LSP should impose unnecessary restrictions just because it's possible for servers to use the field to create bugs (for example, the displayUri in this example is never invalidated: it's the display URI for the library we will import from, which we use when building the documentation/tooltip in resolve).

mickaelistria commented 4 years ago

Can we assume when we get a new request for completion, that we can throw away all of the previous ones?

It could be spec'd that the data is supposed to be retained and query-able until the next file change, deletion or close action. That's roughly how DAP does it: when the state of what's under supervision changes (the execution stack for DAP; it would be the text buffer for LSP), tokens become invalid and can no longer be resolved.

rwols commented 4 years ago

You'll notice here that every completion item (of which there may be a few thousands)

You could also reduce wire traffic by sending ~200 items and setting isIncomplete: true. The user is not going to browse through thousands of items.
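This truncation approach could be sketched as below; the cap of 200 is the ballpark figure from the comment, not a spec value:

```typescript
// Sketch: cap the completion list and mark it incomplete so the client
// re-requests as the user types more characters, narrowing the result set.

interface CompletionItem { label: string; }
interface CompletionList { isIncomplete: boolean; items: CompletionItem[]; }

function truncate(items: CompletionItem[], cap = 200): CompletionList {
  return {
    // isIncomplete tells the client the list is partial, so it must
    // re-query the server on further typing instead of filtering locally.
    isIncomplete: items.length > cap,
    items: items.slice(0, cap),
  };
}
```

The trade-off, as discussed below, is a re-request (and therefore added latency) on every keystroke until the list converges.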

DanTup commented 4 years ago

You could also reduce wire traffic by sending about ~200 items and setting isIncomplete: true.

Yep, but there are some drawbacks to this - like the user not being able to scroll through the full list and higher latency while typing when the server is remote (Codespaces, GitPod, etc.). It seems like we could have the best of both worlds and not have to make this tradeoff if the API was improved a little.

rwols commented 4 years ago

I see what you mean. I agree with this proposal but it should be guarded by a client capability.

like the user not being able to scroll through the full list

I am not convinced this is actually a problem :) These large completion lists generally occur when typing a single character in some global context. The user usually has an idea about what he or she is about to type, so after re-requesting a few more times as more characters are typed, the list should "eventually" converge to a complete set of items. But I don't have any experience with GitPod.

Another option would be for us to stash this on the server and just send an ID back - however without making some assumptions about when it's save to clear up, this could cause issues... If LSP stated that resolve can never be called for a completion item that wasn't in the last set of results, that might fix that.

This would also be a solution. I haven't come across an editor that can show more than one auto-complete widget. It is perhaps possible that some client implementations don't drop old completion lists (in the widget) when new completion lists arrive. Can you give an example of an editor where this assumption doesn't hold? I suppose you could enable this behavior with a boolean in initializationOptions.

The problem with that is invalidation - when can the server throw the data away? Can we assume when we get a new request for completion, that we can throw away all of the previous ones?

I can only speak for Sublime Text. Because basically all servers have the behavior that new lists overwrite old lists (per document), we don't store/present old lists (per document). Because of this, the params for a completionItem/resolve request can only contain an item from the most recent textDocument/completion. Perhaps adding the sentence

The params for a completionItem/resolve request can only contain a completion item from the most recent textDocument/completion request (per document)

in the spec would guarantee your idea? And it would require little to no client code changes.

DanTup commented 4 years ago

I agree with this proposal but it should be guarded by a client capability.

For compatibility reasons it would have to be. It's in a client's interest to implement it for better performance, though.

like the user not being able to scroll through the full list

I am not convinced this is actually a problem :)

Maybe - but when we're sending symbols that haven't been imported yet (a much-desired feature) the list can be large, so picking how many to send is difficult (if you send too few, the user may have typed all they know but the item they want is still not in the list; if you send too many, there's lots of traffic on every keypress).

But latency was really the thing I cared about most. When a user is typing quickly, the longer it takes to filter the completion list as they're typing, the worse the experience is. Gitpod/Codespaces/etc. are growing popular - there are going to be many users where the completion list is coming over the internet (also LiveShare transfers the full completion list over the web when invoked by the client).

Another option would be for us to stash this on the server and just send an ID back

This would also be a solution. [...] Can you give an example of an editor where this assumption doesn't hold?

I can't - and it would probably work - but it feels like a workaround rather than a solution. If we're going to rely on state that's kept on the server, it should be explicitly part of the spec and not just some assumption we hope holds true for all clients.

Just off the top of my head, I could see LiveShare supporting multiple users invoking completion at different points of the document at the same time (each list may be rendered only on their own local editors, but they're both being serviced by the same LSP server at the same time).

"The params for a completionItem/resolve request can only contain a completion item from the most recent textDocument/completion request (per document)"

in the spec would guarantee your idea? And it would require little to no client code changes.

It would - but it may be overly restrictive (for example preventing the LiveShare example above).

rwols commented 4 years ago

Your LiveShare example sounds convincing! OK, I suppose we cannot rely on previous request state.

michaelpj commented 3 years ago

I strongly agree with @matklad. Without keeping lots of state on the server (with the aforementioned awkward lifetime issues), the obvious data to pass through is the entire request parameters. But then you run into precisely the issue described at the top: you repeat the entire, bulky, request parameters each time.

What @matklad suggests would solve this elegantly and generically: a resolve request says "please run this again (here are the parameters), but this time I can tell you I'm only interested in something that looks like this, and I want all the details".

DanTup commented 3 years ago

Without keeping lots of state on the server (with the aforementioned awkward lifetime issues), the obvious data to pass through is the entire request parameters

In my case, that forces the server to re-compute the completion items, rather than just being able to use the data it's already computed. My solution does not require storing any state on the server - that's the whole point - I want to round trip it. I'm already round-tripping it, I just want to avoid duplicating it on every single completion item. My payload is more than double the size it needs to be.

Supporting data on CompletionList supports all of the things suggested above (you can manually put request params back in it, you can put a "server data token", or you can put the data in directly - as I'd like to). I'm not sure I really understand the reasons to make it more restrictive - the "generic" implementation allows you to use it however works best for your server.

dbaeumer commented 3 years ago

Providing the params has another issue: if the list was complete, the user could have typed since. However, the params used would be the ones from when the completion list was requested, hence not reflecting the current state. This is currently expressed by passing the item and not the params.