microsoft / language-server-protocol

Defines a common protocol for language servers.
https://microsoft.github.io/language-server-protocol/
Creative Commons Attribution 4.0 International
10.91k stars 763 forks source link

Provide range or bias direction for go-to-definition etc #1029

Open sam-mccall opened 4 years ago

sam-mccall commented 4 years ago

Go-to-definition (and others) is triggered on a cursor position. In VSCode, the cursor lies between characters, so ambiguity: the thing on the left or the thing on the right?

A couple of factors make this worse:

To illustrate, these screenshots produce the same textDocument/definition request, but want different results.

In vim. This should go to Foo::operator->. image

In VSCode. This should go to object. image

It'd be nice to have some more information so language servers can make better decisions. Some ideas:

I guess this is mostly an issue with languages that have overloaded operators. constructor and functor calls are common examples in C++.

sam-mccall commented 4 years ago

(Just noticed that the highlighting in the second screenshot shows the problem nicely. VSCode's built-in identifier highlighting marks object on lines 2 and 3, while clangd's textDocument/highlight marks operator on line 2 and -> on line 3. I think not including the -> on line 2 is a clangd bug...)

rcjsuen commented 4 years ago

Also see https://github.com/microsoft/vscode-languageserver-node/issues/593.

dbaeumer commented 4 years ago

As mentioned in microsoft/vscode-languageserver-node#593 we need to change the protocol if we need to let servers know about this. Adding a selection to the params might not help since VS Code doesn't have the information right now in the API and other editors might behave the same. So having a capability looks like a good approach to have more control over this.

puremourning commented 4 years ago

In VSCode, the cursor lies between characters, so ambiguity: the thing on the left or the thing on the right?

That's a VSCode issue, not a protocol issue, though right ? The protocol is explicit about which "character" is being requested (it's a row/column position, not a position bettween two columns). Why would we need to change the protocol for this? Surely this change can be made in VSCode only.

Or am i missing something important?

sam-mccall commented 4 years ago

Adding a selection to the params might not help since VS Code doesn't have the information right now in the API and other editors might behave the same

Makes sense to me. A capability seems less complex overall, with most of the complexity lying in the server rather than client or protocol.

The protocol is explicit about which "character" is being requested (it's a row/column position, not a position bettween two columns).

I think it's the opposite:

A position is between two characters like an ‘insert’ cursor in a editor

Whatever the spec text currently says, both "line-cursor" and "block-cursor" editors exist, and it would be nice to specify something here to improve interop.

puremourning commented 4 years ago

I think it's the opposite:

Quite right, ugh.

value represents the gap between the character and character + 1.

Oh boy, i had not noticed that. So, how do you specify the start of the line ? 0 would mean the gap between 0 (the first character) and 1 (the next character), where it's really the gap between the "theoretical" start-of-line and the 0th (first) character.

On reflection, this model seems kinda wonky. For completion requests the character position is usually requested as the position immediately following the last char on the line, but this would indicate that the server should interpret is therefore as to between the "one-past-the-last-char-in-line" and the "two-past-the-last-char-in-line". Fortunately, either we (and most other clients) are not actually not sending the 'one-past-the-last-char-in-line" position or servers are already interpreting the completion position as if it were an actual position (not a virtual thing between 2 positions). I suspect the later.

Whatever the spec text currently says, both "line-cursor" and "block-cursor" editors exist, and it would be nice to specify something here to improve interop.

I see your point. The "between characters" model works for an insert-mode-like situation, but doesn't exist in an equivalent of "normal-mode", or where the cursor has an explicit location.

Aerijo commented 4 years ago

So, how do you specify the start of the line ?

@puremourning Index 0 is the left of the first character

puremourning commented 4 years ago

But offset 0 means the first character in any 0-based offset system i've ever seen.

rcjsuen commented 4 years ago

But offset 0 means the first character in any 0-based offset system i've ever seen.

@puremourning In this case the character does not matter because Position is referring to the space between two characters (or in the extreme case, the start of the line or the end of the line excluding newline characters).

Or perhaps I'm misinterpreting something here?

puremourning commented 4 years ago

The Position in the spec is defined as a 0-based offset of characters in the line:

0123456789
This is a line

The 0th character is T. If servers interpret offset of 0 as the space between character 0 and character 1 then offset 0 is marked here:

0 123456789
T his is a line
 ^

Whereas you're saying it should be interpreted as here:

 0123456789
 This is a line
^

Which i would largely agree is the only reasonable way this could work. But it makes the specification itself ambiguous.

Agree with @sam-mccall that editors where the cursor position is an integral actual-character offset don't have any ambiguity here. in the above cases, position 0 is clearly:

0123456789
This is a line
^
Aerijo commented 4 years ago

Ah, that might be other experience leaking. I've only used line cursors, so think of it in terms of cursor position. The character and the cursor position to the left of it are interchangeable with this interpretation, giving a right bias on the cursor -> character transform. I agree, there doesn't seem to be any issue here then; Position's are specified as being character offsets, so it's up to the client to decide which character (as a Position) it wants to request for. It also makes sense for the client to decide this, to make the experience consistent across servers.

sam-mccall commented 4 years ago

We're drifting a little bit from the original issue :-) I believe the intention is that character is the number of characters to the left of the cursor. So the start of the line is 0 and the end of the line is len(line). AFAIK all editors (of both types) follow this, so while the spec text could be tighter, I don't think there's an actual interop problem here. I'll send a PR for the text.

Aerijo commented 4 years ago

@sam-mccall I'm not sure what's off topic here: the spec says that a Position is the character offset, i.e., the character itself, and that the location for Goto Definition, etc., is given in terms of such a Position. Your original question about ambiguity is resolved with this understanding.

I think the question of how to resolve a given kind of cursor to a character is a concern for the LSP client, not the protocol or the server. IMO the protocol should be view / edit model agnostic as much as possible.

puremourning commented 4 years ago

@sam-mccall you're almost certainly right.

i do think this interpretation is material to this issue though.

The reason i say that is that if the specification states that a position lies between characters, even for ad-hoc requests like go-to, etc. then it's material to the discussion how that position should be interpreted. I think this issue is about helping servers 'guess' what the user's intention is in the model between-characters model, whereas actually if the model is applied consistently, then it should be possible for all clients and servers to interop without new capabilities or protocol changes. I guess i'm exploring if that's possible.

I realise that in the model where you specify the actual character position, there is no ambiguity. i'm thinking that it's the clients responsibility to remove the ambiguity from the user's perspective and the specification's responsibility to remove it from the client/server interpretation perspective.

sam-mccall commented 4 years ago

@Aerijo

the spec says that a Position is the character offset

No, as previously quoted, the spec says: "A position is between two characters like an ‘insert’ cursor in a editor". This is in the section defining the Position structure specifically.

Your original question about ambiguity is resolved with this understanding

That's a perfectly self-consistent theory, but it doesn't match VSCode's selection behavior (see the OP). In practice this basically means the interpretation is wrong - the spec ~always follows VSCode's model, which is also perfectly self-consistent.

@puremourning

if the model is applied consistently, then it should be possible for all clients and servers to interop without new capabilities or protocol changes. I guess i'm exploring if that's possible.

Concretely this might look like "Position now represents a character" and VSCode has to sometimes subtract 1 compared to what it currently sends (assuming no dramatic changes to its internal model, which I think are unlikely). Apart from the question of whether VSCode is willing to make such a potentially-breaking behavior change, this necessarily involves the editor guessing about the significance of the characters on the left vs right of the cursor. This is heuristic, and I've had nothing but trouble where editors have needed to do this (can dig up examples if needed). So my preference is to avoid trying to patch this up.

HighCommander4 commented 4 years ago

Perhaps I'm missing something, but can't this be solved in a general way by having the protocol take a range rather than a position? A between-characters cursor can be modelled as a zero-length range, and an on-a-character cursor as a length-one range.

Using ranges instead of positions also makes the protocol richer, e.g. see the existing discussion in #377 about hovers showing the type of a selected expression that's larger than just an identifier.

Aerijo commented 4 years ago

OK, to be clear, I was basing my reasoning on the spec. I searched "position" in the specification, and the first result was as follows:

A position inside a document (see Position definition below) is expressed as a zero-based line and character offset.

followed by an example explicitly using "character offsets" to refer to characters

So a string of the form a𐐀b the character offset of the character a is 0, the character offset of 𐐀 is 1 and the character offset of b is 3 since 𐐀 is represented using two code units in UTF-16.

The leading sentence of Position confirms this

Position in a text document expressed as zero-based line and zero-based character offset

... and then the second sentence throws everything established so far out completely by saying they measure the space between characters

A position is between two characters like an ‘insert’ cursor in a editor.

...but wait, there's more: in the Range specification:

A range in a text document expressed as (zero-based) start and end positions. A range is comparable to a selection in an editor. Therefore the end position is exclusive.

it says that an end position (represented with a Position) is exclusive. Consider the range [[0, 0], [0, 1]]; do we agree that this is the first character of the first line? If so, then it is exclusive under the character model (doesn't touch the index 1 character), but not exclusive under the between-character model (does touch the index 1 space between characters).

So I hope you understand why I was quite confident about Positions being specified as character offsets, and not spaces between characters :slightly_smiling_face:. Though I do feel better about my original comment now.

It seems then that it is the spec itself that is inconsistent then, probably because it is natural to think about Positions in terms of between characters (to represent an I cursor location / inclusive selection range), and also because they are used to represent characters for things like Goto Definition. I feel the spec is still overwhelmingly in favour of them being character offsets, so I feel my conclusion above is still relevant, but there is definitely inconsistency there.

However your point about needing a heuristic in the editor to decide which position to ask for is interesting; a simpler example might be

  f o o ) b a r
       ^ ^

If using an I beam cursor, the editor would need to pick which way to bias it to a character offset. The user would expect a particular one to be resolved, but there is no way to communicate which to the server without an editor heuristic (e.g., favour word characters over punctuation). Otherwise, if the editor always biases the I beam cursor in a particular direction, it may ask for the ) character (which is probably not useful), despite the user thinking it was part of the end of foo / start of bar.

DavidGoldman commented 3 years ago

Any update here? Seems like we could update the protocol to optionally include a range for selections?

rcjsuen commented 3 years ago

b56a40f8667f54dfda8a180dab52495da4dfb52f seems to imply it's up to the client to decide what to send back if there's a selection.

HighCommander4 commented 3 years ago

b56a40f seems to imply it's up to the client to decide what to send back if there's a selection.

Right, but if the protocol only specifies a point, then the information channel is lossy: the client can decide to send the starting point of the selection or the ending point, but not both.

For some use cases, it would be useful if the protocol allowed the client to send both endpoints of the selection.

dbaeumer commented 3 years ago

https://github.com/microsoft/language-server-protocol/commit/b56a40f8667f54dfda8a180dab52495da4dfb52f is to clarify the current situation. That doesn't mean that we can't extend the protocol with the selection if we think that is necessary.