Make positions offset-based

dragos commented 8 years ago

I don't know if this is the right place to discuss it, but I couldn't find another channel.

The current representation of positions as line/column pairs is cumbersome for some implementations. For example, the Scala compiler uses offsets to refer to positions, and I'm sure this is not the only compiler doing so. It has many advantages: it's just one number, ordering is natural, slicing source code is a simple operation on a char array, etc. Line and column representation is suited for the UI but not so much in the implementation.

Conversions are cumbersome too, because one needs the source file contents at the point of conversion. Moreover, taking an offset position back to line/col can be expensive and requires linear search in the naive implementation (or some caching and a binary search through a line array to be efficient). I imagine this is already handled efficiently inside the client, so why inflict this on servers (and risk suboptimal behavior)?

I understand this would be a breaking change and probably won't happen, but I'd like to understand the reasoning behind this representation and how others are dealing with it.

EDIT: ~~What about tab characters? The column will reflect the number set in the UI, but the offset representation counts just one position for \t.~~. The protocol uses character, which is 1 for \t. Might be worth clarification.

For these reasons, should the protocol support an offset-based position as well?

nadako commented 8 years ago

I second, this is a bit annoying to work with. For example, Haxe compiler also works with offsets, so we have this thing to provide conversion between single offset and character/line. I think this was originally copied from microsoft's language-server-node implementation and adapted to our needs.

PS What's even worse is that Haxe expects byte offset, so we have to convert between utf-8 and bytes, but I can't blame lang protocol for that and I personally think it's fine to require character-based positions instead of byte-based and hope this will be changed in Haxe itself.

mickaelistria commented 8 years ago

Eclipse also use offset-based by default. The number of times we invoke utilities to convert those line/colums to offset and vice-versa are countless. But it's not really a perceivable issue in the code nor in the performance.

hedefalk commented 8 years ago

+1

Asked about this on SO, but :tumbleweed: so I guess this is a better channel :)

http://stackoverflow.com/questions/39149525/how-do-i-best-resolve-an-offset-position-into-a-line-col-position-in-a-vscode-d

Atom has the same problem too: https://discuss.atom.io/t/how-to-get-cursor-position-as-buffer-offset-one-dimensional-in-chars/15892

The standard API:s should be able to handle one-dimensional offsets for performance reasons. I mean, many API points take position for doing stuff (like the mentioned DefinitionProvider api), but it shouldn't be up to the provider to resolve this since it means they have to separately open files from disk just to do the calculation. Better to leave this is up vscode to do as late as possible and only once.

felixfbecker commented 8 years ago

Just wanted to share our implementation as well: https://github.com/felixfbecker/php-language-server/blob/master/src/NodeVisitor/ColumnCalculator.php If anyone has some hints whether this could be improved performance/complexity-wise...

I don't feel this is such a big issue. We calculate the columns once after parsing. I imagine that lines/columns is semantically closer to how the user perceives it and might be easier when working with TextEdits etc.

egamma commented 8 years ago

@felixfbecker VS Code's editor is using prefix sums, pls see https://github.com/Microsoft/vscode/blob/master/src/vs/editor/common/viewModel/prefixSumComputer.ts for the implementation.

CC @alexandrudima

felixfbecker commented 8 years ago

I guess the reasoning is imagine a range from line 1 - 3. If the user is typing in line 2, end column will stay the same. The file end pos would constantly update.

dragos commented 8 years ago

A small clarification: I understand the two representations are (roughly) equivalent, but this becomes an issue when you have a compiler front-end you want to reuse for a given language and the representation in that compiler is offset-based. It leads to keeping around three copies of each source file under edit:

one in the editor
one in the compiler
one in the shim that transforms positions

felixfbecker commented 8 years ago

@dragos Not necessarily. In my case we read the content, parse it, then calculate and save all the columns. After that we don't need the content anymore and neither the "shim"

dragos commented 8 years ago

@felixfbecker I may be missing something, but if you have an already existing compiler that is offset-based, and you want to ask for, say, completions as a given line/col. How do you transform that in the language server to an offset-based position to pass to the compiler?

felixfbecker commented 8 years ago

I don't pass offsets to the compiler. I compile the whole AST once and calculate all columns of all nodes once. Then I can work with those values.

bruno-medeiros commented 8 years ago

I may be missing something, but if you have an already existing compiler that is offset-based, and you want to ask for, say, completions as a given line/col. How do you transform that in the language server to an offset-based position to pass to the compiler?

@dragos Whenever the source is updated, you can create an auxiliary data structure that maps from lines to offsets, see for example: https://github.com/bruno-medeiros/MelnormeEclipse/blob/master/plugin_tooling/src-lang/melnorme/lang/tooling/parser/SourceLinesInfo.java

SourceLinesInfo provides essentially the same info as the prefixsum of VSCode (https://github.com/Microsoft/vscode/blob/master/src/vs/editor/common/viewModel/prefixSumComputer.ts), but with the shortcoming it can't be updated dynamically, it is immutable once constructed. (Note: essentially, prefixsum = offset for a given line index)

dragos commented 8 years ago

@bruno-medeiros that was not a general question, as I mentioned in the original post, I know this can be solved. I just don't think this kind of position handling should be done on both the client and the language server, for every language server that's using offsets.

dragos commented 8 years ago

Another tricky question: how many columns do you count for a tab character? In the offset representation \t is just one position. However, columns will reflect the editor setting for tab spaces.

felixfbecker commented 8 years ago

@dragos The protocol uses the term character and I am pretty sure it uses 1 character for a tab.

dragos commented 8 years ago

@felixfbecker you are right, my misunderstanding.

nicot commented 8 years ago

What if an editor runs into a situation like this: http://stackoverflow.com/questions/30727515/why-is-executing-java-code-in-comments-with-certain-unicode-characters-allowed? The rest of the file will not have (useful) code intelligence.

If we use row/col addressing, an editor must have the exact same interpretation of the file as the language server. This includes character encoding and understanding of newlines.

The way I see it, there a few downsides of byte offset:

Would need to recompute every following position in the AST on every modification.
Harder to understand and debug.

The benefits of using byte offset would include:

Incremental reads at the position.
Universal and unambiguous interpretation.

dbaeumer commented 8 years ago

Most of the arguments given would apply to the client side if the protocol speaks offset. If for example you execute 'find all references' and want to show that in the user interface showing a position inside the document would be very confusing. A user would expect line/character (or column).

So at the end of the day the conversion has to happen somewhere. Historically the LSP is driven by the 'tool' / UI side and not from a language / AST / symbol side (and therefore from a compiler view). This is why we have something like textDocument/documentSymbols and not a symbol representation for a text document which we can then walk for its children on the client side and extract thing we want to show in the UI. (for example a JavaSource file symbol with children for the import statements and the primary top level class).

So I am reluctant to change this since it will not make the conversion go away. However I agree that we need to specify a couple of more things:

what are line termination sequences
information about the file encoding. This is necessary independent of the position / line/character discussion since a server is allowed to read file contents from disk for files which aren't open in the editor. And for these files we to be able to let the server know the encoding to use during reading.

jinmingjian commented 7 years ago

I meet the same problem. The LS creators may not control the compiler. For this case, the conversion logic is highly undesired.

Because they need to access the file system. For performance, they may need to cache the documents. But all of these logics have existed in the client. All IDE-like projects put much effort at this. It should not reinvent one bad wheel.

even thoough "Historically the LSP is driven by the 'tool' / UI side" LS, it may be better to minimize the coupling if the LS or parts of LS can be done without UI effort as possible.

So, I suggest add an option for "offset". Let the client help the LS if it want.

dragos commented 7 years ago

Most of the arguments given would apply to the client side if the protocol speaks offset. If for example you execute 'find all references' and want to show that in the user interface showing a position inside the document would be very confusing. A user would expect line/character (or column).

@dbaeumer That would be true only if the client relied exclusively on line/col and never on offsets. If I understand @egamma's comment, even VS Code is using offsets too, and uses prefix sums to perform conversions efficiently. However, you can perfectly fine have a language server using exclusively an offset-based representation, as testified by several participants in this thread. So, given all this, my take is that the logical choice is to keep the conversion on the side that already has to do those conversions rather than require it on both sides of the fence.

Sure, conversions can be done everywhere, but as for any protocol we should worry about the most efficient way of achieving its goals.

dbaeumer commented 7 years ago

VS Code relies on line/col for all its API and internal state. We use prefix sums to effectively update the data structures if the user types in the editor.

I agree that an offset based language server would work as well as a line/col based does. As said we made the decision from a UI perspective (e.g. like a compiler outputs problem information line/col based and not offset based :-)).

And for most of the features the server reads the file from disk to compute results. If the client would need to do the conversion it would need to reread the file to do the conversion to present the result properly (line/col based) in the user interface.

I am still reluctant to add a capability saying that the both ends support offset since it will make the overall code more complicated (a server and client would still need to support both).

JayantNayak commented 7 years ago

Currently I am facing this offset based conversion issue. We already had a client based Web Ide for which we had written the client side intellisense code. And we are currently migrating it to a LSP based implementation ,where we are trying to reuse the already availabe javascript code. As all of our coding was based on single dimension offset base I have to now write my own custom offset converter , so that I am able to reuse the code.

It would have been nice if some library or API was available in to convert the 2D offset to 1D character based offset and vice versa as this is a basic need.

gritzko commented 6 years ago

Pardon my ignorance, is a language server supposed to run locally or on the same LAN? In case of a wide-area network, offsets may change while the server is responding. The row/column approach is even more vulnerable to that.

felixfbecker commented 6 years ago

@gritzko that is irrelevant, LSP requires messages to be delivered in order, so the request is always valid to the current state of the server.

gritzko commented 6 years ago

@felixfbecker I meant, by the time the client receives a response, the offsets have already changed several times over. I know that cloud-hosted spell checkers had to address this issue.

felixfbecker commented 6 years ago

If you request the definition or references on a symbol, it doesn't matter if that file changes before the response arrives. A completion result can also still be used if the client wishes. If the result is invalid because of edits, the client can ignore the response and/or cancel the request and resubmit it (after updating the server with the latest content).

gritzko commented 6 years ago

If I understand it correctly... while the user is typing, SymbolInformation ranges will be off most of the time. Naturally, the next step is to implement some limited version of operational transforms to compensate for that gap. Which might be easier to do with offsets.

felixfbecker commented 6 years ago

Symbol requests return a snapshot of the symbols at the time of the request. The client is expected to repeat the request to update. Yes, in a high-latency system it might make sense to implement OT on the client. But the 99% use case for LSP is to run on the same machine over STDIN/STDOUT. FWIW, LSP is powering all our code intelligence on https://sourcegraph.com and we never ran into a problem with latency that would be solved by implementing OT.

gritzko commented 6 years ago

FWIW, LSP is powering all our code intelligence on https://sourcegraph.com and we never ran into a problem with latency that would be solved by implementing OT.

Naturally, that's on-prem with no mobile clients. Either way, 99% is the answer to my question.

felixfbecker commented 6 years ago

Naturally, that's on-prem with no mobile clients.

That's not entirely true. LSP is used all the way from the browser over an LSP backend proxy to language servers running in containers. Most developers use laptops, many tether while on the train or have bad WiFi at some conference or Starbucks. Performance in these cases is still important.

mattacosta commented 6 years ago

Just a bit of cross-referencing: Microsoft/vscode#2811 (all the way back from Feb 2016!)

I just started updating some code from back then, and can't believe that this is still a thing. Long story short, offsets are an absolute position (i.e. relative to an origin point) while a line position is relative to every single previous line break. This means that you can always use an offset under any condition, while a line position will always require some previous point of reference.

As previously mentioned, from a client/server perspective, both methods need encoding information, and line positions require both sides to agree on what a line break is.

However, you can perfectly fine have a language server using exclusively an offset-based representation, as testified by several participants in this thread. So, given all this, my take is that the logical choice is to keep the conversion on the side that already has to do those conversions rather than require it on both sides of the fence.

I also want to reiterate this point. It seems to me that while you guys have done a great job protecting the client from language stuff, you have done a poor job protecting the server from what is essentially a UI component. If we can't get this for existing features any time soon, can we at least get it for new features?

Edit:

@dbaeumer

And for most of the features the server reads the file from disk to compute results. If the client would need to do the conversion it would need to reread the file to do the conversion to present the result properly (line/col based) in the user interface.

I keep seeing comments like this and I believe in another issue you also mentioned an example of this being references in files not opened by the client, but are there any examples of incoming data requiring a line position format? I ask because there may need to be a distinction for features that a server has opted-in to; outgoing data in a line position format is perfectly fine, if necessary.

dbaeumer commented 6 years ago

I fully agree that from a technical perspective it doesn't matter whether we use offset-based or line/column based positions. That being said I am against starting to mix these models in any way, even for new features. We should keep the LSP in this regards consistent.

Another example are diagnostics (errors and warnings). All compilers I know issue diagnostics in line / column format so every language scanner / parser more or less has this information (and some knowledge about line endings). Since most LSP features build on compiler technologies it seems OK for me to let the server compute the results in line / column format.

Avi-D-coder commented 5 years ago

From a technical perspective if document offsets were used all servers and clients would need to agree on the encoding unit. Right now when servers and clients disagree on the unit of column offsets only the line where the disagreement takes place has incorrect offsets. If the spec used an offset into the whole document then one astral or non ascii char would cause offsets in the rest of the document to be misunderstood. The majority of non MS lsp implementations currently disagree on encoding/char offset unit, so global offsets seem untenable.

Additionally I don't know of a single editor that provides a document offset based API. Many clients try to be as thin as possible due to being written slower in languages and heavily utilizing editor APIs. This is true to the point where they don't support incremental sync, and don't convert column offsets to UTF-16, since doing so would require storing an extra copy of the document. Servers are in contrast, usually written in real non editor specific languages and often hold a copy of the document already. Even if most compilers used document offsets given the thick server thin client model it would make more sense that the conversions be handled in the server.

see: https://github.com/Microsoft/language-server-protocol/issues/376

puremourning commented 5 years ago

That being said I am against starting to mix these models in any way, even for new features. We should keep the LSP in this regards consistent.

👍 yep, like the debate on encoding, just pick one and everyone should conform to the spec. There are always trade-offs in engineering, but having multiple competing specifications is usually the worst (but sadly the most common) result.

mattacosta commented 5 years ago

@dbaeumer Better late than never, but when I say incoming and outgoing, I mean relative to the server. Diagnostics are NOT an example of incoming data in this case.

As far as I am aware, there are no examples where an incoming request must be made using line-based positions because all requests that need position data also require the client (or its surrogates) to have opened the file at some point, thereby making offset-based positions available.

Outgoing responses/notifications (like diagnostics or folding ranges) on the other hand, do not require the client to have opened the file, and may still use line-based positions.

@puremourning Make no mistake, one has already been picked, which is why this issue is over 3 years old with no sign of a change ever happening. Realistically, the workaround is so easy that this issue will never get enough +1s to ever be worth implementing and might as well just be closed. :(

felixfbecker commented 5 years ago

As far as I am aware, there are no examples where an incoming request must be made using line-based positions because all requests that need position data also require the client (or its surrogates) to have opened the file at some point, thereby making offset-based positions available.

One counter example: The Sourcegraph browser extension provides hovers and go to definition on pull requests on GitHub and other code hosts through LSP WebSocket connections to language servers. It is easy to find out the line that was hovered from the line number in the DOM and the column by counting the characters from the beginning of the line. But finding out the offset from the beginning of the file however would not be trivial at all because the browser extension doesn't have easy access to the full file contents (it would have to do an API request, which means it would need API authentication). Line and column are much simpler for this use case.

rcjsuen commented 5 years ago

Additionally I don't know of a single editor that provides a document offset based API.

Eclipse's IDocument interface is based on document offsets (unless I'm misunderstanding something here).

bruno-medeiros commented 5 years ago

Additionally I don't know of a single editor that provides a document offset based API.

Eclipse's IDocument interface is based on document offsets (unless I'm misunderstanding something here).

Yes but it also provides API's to work with line offsets: getLineOffset, getLineOfOffset, getLineLength etc. , so you can convert either way. The primary API is still document-based offset (so ranges, deltas, etc, are expressed in that format)

mickaelistria commented 5 years ago

From Eclipse IDE perspective, yes, most APIs are offset relative and it's pretty nice in the vast majority of cases. And the existing IDocument and StyledText API offer easy methods for conversion between offset and line/column pairs. It's not that hard to go from one to the other, and LSP4E has actually implemented some trivial methods doing that. I think the LSP is now mature enough to avoid changing this kind of things. Many clients (including Eclipse IDE which is naturally offset-based) have managed to work flawlessly despite the convention chosen in LSP being different. To me, this issue is not relevant any more and there are many examples and legacy showing that 1. it's not necessary and 2. changing it now would require an effort. So ROI of changing this would be negative for the servers/clients. I'd be in favor of closing this ticket as "Won't fix".

dbaeumer commented 5 years ago

I agree with @mickaelistria conclusion. Will close the issue as won't fix.

microsoft / language-server-protocol

Make positions offset-based #96