microsoft / language-server-protocol

Defines a common protocol for language servers.
https://microsoft.github.io/language-server-protocol/
Creative Commons Attribution 4.0 International

Change character units from UTF-16 code unit to Unicode codepoint #376

Closed MaskRay closed 2 years ago

MaskRay commented 6 years ago

Text document offsets are based on a UTF-16 string representation. This is odd, given that the text contents themselves are transmitted in UTF-8.

Text Documents
......... The offsets are based on a UTF-16 string representation.

Here in TextDocumentContentChangeEvent, range is specified in UTF-16 column offsets while text is transmitted in UTF-8.

interface TextDocumentContentChangeEvent {
    /** The range of the document that changed. */
    range?: Range;
    /** The length of the range that got replaced. */
    rangeLength?: number;
    /** The new text of the range/document. */
    text: string;
}

Is it more reasonable to unify these, remove UTF-16 from the wording, and use UTF-8 as the sole encoding? Line/character can be measured in units of Unicode codepoints instead of UTF-16 code units. Lines are rarely very long, so the extra computation needed to find the Nth Unicode codepoint would not place much of a burden on editors or language servers.
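As a rough illustration of that computation (a sketch only, not code from any client or server), mapping a codepoint column to the corresponding UTF-16 code unit column is a single linear scan over one line:

// Sketch only: map a column measured in Unicode codepoints to the
// equivalent UTF-16 code unit column for a single line of text.
function codepointColToUtf16Col(line: string, codepointCol: number): number {
    let utf16Col = 0;
    let seen = 0;
    for (const cp of line) {          // for..of iterates Unicode codepoints
        if (seen === codepointCol) break;
        utf16Col += cp.length;        // 1 for BMP characters, 2 for astral ones
        seen++;
    }
    return utf16Col;
}

// "a𐐀b": the "b" is codepoint column 2, but UTF-16 code unit column 3.
codepointColToUtf16Col("a𐐀b", 2); // 3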

https://github.com/jacobdufault/cquery/issues/57

Survey: counting method of Position.character offsets supported by language servers/clients https://docs.google.com/spreadsheets/d/168jSz68po0R09lO0xFK4OmDsQukLzSPCXqB6-728PXQ/edit#gid=0

szatanjl commented 6 years ago

I would suggest going even one step further. Why should editors and servers have to know which bytes form a Unicode codepoint at all? Right now the specification states it supports only UTF-8 encoding, but with the Content-Type header I guess there is an idea of supporting other encodings in the future too. I think it would then be even better to use a number of bytes instead of UTF-16 code units or Unicode codepoints.

dbaeumer commented 6 years ago

@MaskRay we need to distinguish this from the encoding used to transfer the JSON-RPC message. We currently use UTF-8 here, but as the header indicates, this can be changed to any encoding, assuming that the encoding is supported in all libraries (for example, Node by default supports only a limited set of encodings).

The column offset in a document assumes that, after the JSON-RPC message has been decoded and the string parsed, the document content is stored in UTF-16 encoding. We chose UTF-16 encoding here since most languages store strings in memory as UTF-16, not UTF-8. To save one encoding pass, we could transfer the JSON-RPC message in UTF-16 instead, which is easy to support.

If we want to support UTF-8 for the internal text document representation and line offsets, this would be a breaking change or needs to be a capability the client announces.

Regarding byte offsets: there was another discussion about whether the protocol should be offset based. However, the protocol was designed to support tools and their UIs; for example, a reference match in a file could not be rendered in a list using byte offsets. The client would need to read the content of the file and convert the offset into line/column. We decided to let the server do this since the server has very likely read the file before anyway.
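To make the UTF-16 code unit counting above concrete, here is a small illustration (illustration only, not spec text). JavaScript/TypeScript strings are themselves UTF-16, so .length counts code units, while spreading a string iterates codepoints:

const line = 'let s = "🙂";';   // 🙂 lies outside the Basic Multilingual Plane

line.length;        // 13 UTF-16 code units (🙂 counts as 2)
[...line].length;   // 12 Unicode codepoints (🙂 counts as 1)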

Marwes commented 6 years ago

We chose UTF-16 encoding here since most languages store strings in memory as UTF-16, not UTF-8.

Source? Isn't the only reason for this that Java/JavaScript/C# use UTF-16 as their string representation? I'd say there is a good case to be made that (in hindsight) UTF-16 was a poor choice of string type in those languages as well, which makes it dubious to optimize for that case. The source code itself is usually UTF-8 (or just ASCII) and, as has been said, this is also the case when transferring over JSON-RPC, so I'd say the case is pretty strong for assuming UTF-8 instead of UTF-16.

puremourning commented 6 years ago

We chose UTF-16 encoding here since most languages store strings in memory as UTF-16, not UTF-8. To save one encoding pass, we could transfer the JSON-RPC message in UTF-16 instead, which is easy to support.

Citation needed? ;)

Of the 7 downstream language completers we support in ycmd:

* full disclosure, I think these use code points, else we have a bug!

The last is a bit of a fib, because we're integrating Language Server API for java.

However, as we receive byte offsets from the client, and internally use Unicode code points, we have to re-encode the file as UTF-16, do a bunch of hackery to count the code units, then send the file, encoded as UTF-8, over to the language server, with offsets in UTF-16 code units.
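A sketch of that conversion (a hypothetical helper, not the actual ycmd code): given one line of text and a 0-based UTF-8 byte column, compute the UTF-16 code unit column that LSP expects.

function utf8ByteColToUtf16Col(line: string, byteCol: number): number {
    const encoder = new TextEncoder();
    let bytes = 0;
    let utf16Col = 0;
    for (const cp of line) {                  // iterate Unicode codepoints
        if (bytes >= byteCol) break;
        bytes += encoder.encode(cp).length;   // 1-4 UTF-8 bytes per codepoint
        utf16Col += cp.length;                // 1 or 2 UTF-16 code units
    }
    return utf16Col;
}

utf8ByteColToUtf16Col("aä𐐀b", 7); // the "b" starts at byte 7 -> UTF-16 column 4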

Of the client implementations of ycmd (there are about 8 I think), all of them are able to provide line-byte offsets. I don't know for certain about all of them, but certainly the main one (Vim) is not able to provide UTF-16 code units; they would have to be calculated.

Anyway, the point is that it might not be as simple as originally thought :D Though I appreciate that a specification is a specification, and changing it would be a breaking change. Just my 2p.

puremourning commented 6 years ago

Not that SO is particularly reliable, but it happens to support my point, so I'm shamelessly going to quote from: https://stackoverflow.com/questions/30775689/python-length-of-unicode-string-confusion

You have 5 codepoints. One of those codepoints is outside of the Basic Multilingual Plane which means the UTF-16 encoding for those codepoints has to use two code units for the character.

In other words, the client is relying on an implementation detail, and is doing something wrong. They should be counting codepoints, not code units. There are several platforms where this happens quite regularly; Python 2 UCS-2 builds are one such, but Java developers often forget about the difference, as do Windows APIs.

MaskRay commented 6 years ago

Emacs uses some extended UTF-8 and its functions return numbers in units of Unicode codepoints.

https://github.com/emacs-lsp/lsp-mode/blob/master/lsp-methods.el#L657

@vibhavp for Emacs lsp-mode internal representation

szatanjl commented 6 years ago

I am sorry in advance if I am saying something stupid right now. I have a question for you guys.

My thought process is this: if there is a file in an encoding other than any UTF, and we use an encoding other than UTF in JSON-RPC (which can happen in the future), then why would there be any need for the client and server to know what Unicode is at all?

Of the client implementations of ycmd (there are about 8 I think), all of them are able to provide line-byte offsets.

That's it. It is easy to provide a line-byte offset. So why would it be better to use Unicode codepoints instead of bytes?

Let's say for example we have a file encoded in ISO-8859-1 and we use the same encoding for the JSON-RPC communication. The character ä (0xE4) can be represented in at least two ways in Unicode: U+00E4 (ä) or U+0061 (a) followed by U+0308 (combining diaeresis). The former is one Unicode codepoint, the latter is two, and both are equally good and correct. If the client uses one and the server the other, we have a problem. Simply using line-byte offsets here, we would avoid these problems.
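For illustration (assuming nothing beyond standard string methods), the two Unicode forms of ä really do have different codepoint counts even though they normalize to the same text:

const precomposed = "\u00E4";   // ä as one codepoint
const decomposed = "a\u0308";   // a + combining diaeresis: two codepoints

[...precomposed].length;                        // 1
[...decomposed].length;                         // 2
precomposed === decomposed;                     // false
precomposed.normalize("NFD") === decomposed;    // true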

@dbaeumer I think we misunderstood each other, or at least I did. I didn't mean to use a byte offset from the beginning of the file, which would require the client to convert it, but to still use a {line, column} pair, only counting the column in bytes instead of UTF-16 code units or Unicode codepoints.

eonil commented 6 years ago

We chose UTF-16 encoding here since most languages store strings in memory as UTF-16, not UTF-8.

If we want to support UTF-8 for the internal text document representation and line offsets, this would be a breaking change or needs to be a capability the client announces.

Are you serious? UTF-16 is one of the worst choices of the old days, made for lack of alternatives. Now we have UTF-8, and to choose UTF-16 you need a really good reason, not just a very brave assumption about the implementation details of every piece of software in the world, especially if we consider future software.

This assumption is very much true on Microsoft platforms, which will never consider UTF-8. I think some bias toward Microsoft is unavoidable as the leadership of this project is from Microsoft, but this is too much. It reminds me of the Embrace, extend, and extinguish strategy. If this is the case, it is reason enough for me to boycott LSP, because we are going to see this kind of Microsoft-ish nonsense decision-making forever.

kdvolder commented 6 years ago

Just to be clear, I don't work for Microsoft, and generally haven't been a big fan of them (being a Linux user myself). But I feel compelled to defend the LSP / vscode team here. I really don't think there's a big conspiracy here. From where I stand, it looks to me like the vscode and LSP teams are doing their very best to be inclusive and open.

The UTF-8 vs UTF-16 choice may seem like a big and important point to some, but to others, including myself, the choice probably seems somewhat arbitrary. For decisions like these, it is natural to write into the spec something that conforms to your current prototype implementation, and I think this is perfectly reasonable.

Some may think that was a mistake. As this is an open spec and subject to change / revision / discussion, everyone is free to voice their opinion and argue about which choice is right and whether it should be changed... but I think such discussions should stick to technical arguments; there's no need to resort to insinuations of a Microsoft conspiracy (moreover, these insinuations are really unwarranted here, in my opinion).

eonil commented 6 years ago

I apologize for involving my political views in my comment. I was over-sensitive due to traumatic memories of Microsoft in the old days. Now I see this spec is in progress and subject to change.

~I didn't mention technical reasons because they are mainly repetitions of other people's opinions or well known. Anyway, I list my technical reasons here.~


... or needs to be a capability the client announces

I think this is fine: an optional field that designates the encoding mode of the indices, alongside the index numbers. If the encoding mode is set to utf-8, interpret the numbers as UTF-8 code units; if it is utf-16, interpret them as UTF-16 code units. If the field is missing, fall back to UTF-16 for legacy compatibility.
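A hypothetical sketch of such a field, in the TypeScript style the spec uses (the name positionEncoding and its values are illustrative only, not part of LSP 3):

interface HypotheticalClientCapabilities {
    // Encodings the client can interpret Position.character in, most
    // preferred first. Missing => fall back to UTF-16 for compatibility.
    positionEncoding?: ("utf-16" | "utf-8" | "codepoint")[];
}

interface HypotheticalServerCapabilities {
    // The single encoding the server picked from the client's list.
    positionEncoding?: "utf-16" | "utf-8" | "codepoint";
}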

sam-mccall commented 6 years ago

This is causing us some implementation difficulty in clangd, which needs to interop with external indexes. Using UTF-16 on network protocols is rare, so requiring indexes to provide UTF-16 column numbers is a major yak-shave and breaks abstractions.

udoprog commented 6 years ago

Yup, same problem here working on reproto/reproto#34.

This would be straightforward if "Line/character can be measured in units of Unicode codepoints", as stated in the original description.

dbaeumer commented 6 years ago

As mentioned in one of my first comments, this needs to be backwards compatible if introduced. An idea would be:

If no common encoding can be found, the server will not function with the client. So in the end such a change will force clients to support the union of commonly used encodings. Given this, I am actually not sure the LSP server ecosystem will profit from such a change (a server using an encoding not widely adopted by clients is of limited use from an ecosystem perspective). On the other hand, we only have a limited number of clients compared to a large number of servers, so it might not be too difficult for the clients to do the adoption.

I would appreciate a PR for this that, for example, does the following:

jclc commented 6 years ago

What about using byte indices directly? Using codepoints still requires going through every single character.

udoprog commented 6 years ago

@jclc using byte indices is not a bad idea, but I want to outline the implications of such a choice:

Either servers or clients need to communicate which encoding ranges are sent in, and one of them needs to adapt to the other's requirements. Since clients are less numerous, it would seem the more economical choice for this responsibility to fall on them. In order to be backwards compatible, the exchange has to be bi-directional. All servers have to support UTF-16 and fall back to it when the client indicates that this is its only capability, at least until a new major revision of the LSP has been widely adopted and the old one deprecated.

Using codepoints still requires going through every single character.

This depends a bit on the language, but rows are generally unambiguous. They can be stored in such a way that we don't have to decode all characters up to that row (e.g. when using a specialized rope structure). With this approach we only have to decode the content of the addressed rows. Some transcoding work will happen unless the internal encodings of both server and client match.
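A minimal sketch of that idea (illustrative only): record where each line starts once, so resolving a {line, character} position only touches the addressed line rather than everything before it.

class LineIndex {
    private lineStarts: number[] = [0];   // offset of each line start

    constructor(private text: string) {
        for (let i = 0; i < text.length; i++) {
            if (text[i] === "\n") this.lineStarts.push(i + 1);
        }
    }

    // Text of a single line, without the trailing "\n".
    lineText(line: number): string {
        const start = this.lineStarts[line];
        const end = line + 1 < this.lineStarts.length
            ? this.lineStarts[line + 1] - 1
            : this.text.length;
        return this.text.slice(start, end);
    }
}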

Edit: The reason I have a preference for codepoints over bytes is that they are inherently unambiguous. All languages dealing with Unicode must have ways of traversing strings and relating the number of codepoints to indexes, regardless of which specific encodings are well supported.

eonil commented 6 years ago

I think all of these problems arise from the lack of a precise definition of "character" in the LSP spec. The term "character" is used everywhere in the spec, but it is not actually well defined on its own.

Anyway, the LSP3 spec defines "character offset" in terms of UTF-16 code units, which means it implicitly defines the term "character" as a UTF-16 code unit as well. This is (1) nonsense, as a UTF-16 code unit is not intended to be a character, and (2) inconsistent with the other parts that are UTF-8 based.

In my opinion, the first thing we have to do is define the term "character" precisely, or replace it with something else. The lack of a precise definition of "character" increases ambiguity and potential bugs.


As far as I know, Unicode defines three notable units of text assembly.

The closest concept to a human's perceived "character" is the grapheme cluster, as it counts glyphs rather than code.

As @udoprog pointed out, the transcoding cost is negligible, so accept the cost and choose the logically ideal one: grapheme cluster counting. This is better than code point counting and less ambiguous, in my opinion.

Furthermore, grapheme cluster counts are very likely already tracked by code editors to provide precise line/column (or character offset) information to end users. Tracking grapheme cluster counts wouldn't be a problem for them.

There would be two distinct position/offset counting modes: (1) legacy-compatible and (2) grapheme-cluster counting.

In LSP3, servers should support both the legacy-compatible (deprecated but default) and grapheme-cluster-counting modes. In LSP4, grapheme-cluster counting would be the only counting method.

If grapheme cluster counting is unacceptable, UTF-8 code unit counting (i.e. encoded byte count) could be considered instead. The character offset becomes an irregular index, but it would be consistent with the rest of the spec.

udoprog commented 6 years ago

@eonil Regarding grapheme clusters:

The exact composition of clusters is permitted to vary across (human) languages and locales (tailored grapheme clusters). They also naturally vary from one revision of the Unicode spec to another as new clusters are added. Finally, iterating over grapheme clusters is not commonly found in standard libraries, in my experience.
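For what it's worth, here is how the three counts being discussed diverge on one short string (Intl.Segmenter is only available in fairly recent JS runtimes, which rather underlines the standard library point):

const s = "e\u0301🇦🇺";   // "é" (e + combining acute) followed by a flag emoji

s.length;          // 6 UTF-16 code units
[...s].length;     // 4 Unicode codepoints

const seg = new Intl.Segmenter("en", { granularity: "grapheme" });
[...seg.segment(s)].length;   // 2 grapheme clusters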

eonil commented 6 years ago

@udoprog I see. If grapheme clusters are unstable across implementations, I think they should not be used.

oblitum commented 5 years ago

Sorry to say but this is some kind of WCHAR legacy at full force.

godunko commented 5 years ago

From my point of view, using grapheme clusters is much better than relying on any specific encoding, even if clusters are not perfectly stable:

We are talking about source code; the use of "advanced" or "unstable" grapheme clusters is rare in source code.

Conversion from UTF-16 code unit indices to grapheme clusters has to be done by editors to report messages anyway. It requires having the source code loaded and processing it to compute the user-visible position of the grapheme cluster. This complicates client code.

Some compilers already have basic support for reporting grapheme clusters rather than any kind of representation-specific indices.

KamilaBorowska commented 5 years ago

Grapheme clusters are impractical. They depend on Unicode version, and I can imagine things would get very confusing when client and server support different Unicode versions (in fact, the behaviour of grapheme clusters did change in Unicode 12.0, the most recent version at the time I wrote this comment). Additionally, chances are most server implementations would get lazy and simply not bother to support grapheme clusters, as most programming languages don't make it easy to support grapheme clusters, and not supporting those won't cause issues in 99.999% of the cases.

soc commented 5 years ago

Concerning those LSP implementers who are unhappy with the situation:

The most practical solution is to vote with your feet, and send document offsets based on the UTF-8 representation.

bstaletic commented 5 years ago

That would be a direct violation of the protocol unless both servers and clients implement custom ways to negotiate UTF-8.

soc commented 5 years ago

That would be a direct violation of the protocol

Yes, that's basically the point. :-)

I think it's completely unrealistic to believe that any LSP implementation will get easier from having to support both encodings, implementing the encoding negotiation, and having to support the UTF-16 offsets until the end of time because of some odd editor that can't be bothered.

The LSP implementers complaining here can resolve this issue by migrating their implementation to codepoints and declaring that they won't support the legacy UTF-16 offsets.

It's either this, status quo, or the worst option: supporting both.

prabirshrestha commented 5 years ago

We definitely need a way to negotiate UTF-8. For vim-lsp the performance was really bad. Benchmarks are in the PR https://github.com/prabirshrestha/vim-lsp/pull/284

Vim is one of the most popular editors, but Vimscript is also one of the slowest languages in the world.

bstaletic commented 5 years ago

That would be a direct violation of the protocol

Yes, that's basically the point. :-)

That also disregards the fact that most servers only test against vscode. That's going to be a problem for every client that is not vscode.

For vim-lsp the performance was really bad.

Ycmd didn't measure the performance impact of counting UTF-16 offsets. We just went with it. Though I doubt it's as drastic in python as it is in vimscript.

soc commented 5 years ago

@bstaletic I appreciate your concern. To make clear where I am coming from:

I found this issue because I was investigating how to write an LSP implementation for a language in the coming months. As I'm doing this on my own time, my budget for accommodating legacy cruft is roughly zero.

Therefore, my definitive statement on this matter: My implementation will use codepoints, and will support neither UTF-16 codeunits nor any kind of encoding negotiation; and I invite everyone who is interested in resolving the current situation to do the same.

kdvolder commented 5 years ago

@soc I understand where you are coming from, but I really doubt that you are going to get anywhere with this kind of obstructionist approach.

For example, as a server author I really don't much care what you do. As long as our server works with vscode, Eclipse, Atom and maybe soon IntelliJ... we are happy. These are the LSP clients that we care about, pretty much. They implement the standard (or at least they try to :-). And you are making it hard for our servers to work correctly with your client. If you think that this way you can force the issue... you are wrong. Your client is not on the list that we care about. It does not affect us. And on the off chance that somebody actually raises a bug with us to say our server doesn't work properly with your client... guess what... we will just point the finger right back at you and move on with supporting the clients we actually care about.

oblitum commented 5 years ago

@soc I've commented on the theme of forking before; there's some discussion related to it in other issues here.

soc commented 5 years ago

@kdvolder Thanks for your kind words. I have to disagree with the characterization of the approach as "obstructionist" though; I would consider it to be a results-driven approach.

One of 4 things will happen:

  1. Nothing will change, people are unhappy.
  2. Negotiation will be introduced, forcing developers to implement support for both Unicode codepoints and UTF-16 code units. Lots of implementation complexity, lots of finger-pointing and arguing about whose job it is to implement it. People will be even more unhappy.
  3. LSP implementers migrate to codepoints on their own. Problem gets resolved within weeks.
  4. After a long discussion, everyone agrees to switch to codepoints. Problem gets resolved in years.

Due to my limited time, I'm forced to pick number 3. I could wait for number 4, but that would incur some delays which aren't strictly necessary.

From my point of view, number 3 is the best approach to resolve this issue, especially as clangd and rls are already considering this too.

If there are other approaches I may have missed, I would be happy to learn about them. Thanks!

szatanjl commented 5 years ago

It is obvious that in a perfect world option 4 would be the best, but apparently it is not an option that will ever happen.

This issue is over 1 year old. Since the day this issue was created, the number of issues on this project has gone from ~70 to 154 today. The ideas behind a universal protocol are great on paper, but it looks like the execution of those ideas is happening without universal thought. Instead it appears that the driving force is ease of implementation for Microsoft tools, and since they are happy enough with the protocol, its development has slowed down.

Option 2 is IMO the worst. It is better to do nothing than to introduce negotiation and in the end need to support both UTF-8 and UTF-16.

And so, in the imperfect world we live in, it looks like the best option we have is option 3 (or forking), and if enough people follow it, it will work out better for everyone.

puremourning commented 5 years ago

Therefore, my definitive statement on this matter: My implementation will use UTF-8, and will support neither UTF-16 nor any kind of encoding negotiation; and I invite everyone who is interested in resolving the current situation to do the same.

I suspect, then, that you'll just have a server that nobody uses, or one that occasionally breaks in unfortunate ways. Client implementers (like me) are not going to write to a non-conforming server implementation, for the same reasons: we're doing this in our own time, and we don't have the time to write and test server-specific code for a technically broken implementation. We just won't support it, or our users will just have a bad experience and blame us.

Having a specification that is clear, albeit not ideal, is better than having 2 competing specifications. I agree that UTF-16 is unfortunate, and that in ycmd we had to write a bunch of fiddly code to support it and a bunch of fiddly tests to test it. But at least we only had to do that once. That's the real power of LSP (with the caveat that most implementations are somewhat nonstandard and the protocol itself includes the requirement for server-specific knowledge in the client: commands).

kdvolder commented 5 years ago
  1. LSP implementers migrate to UTF-8 on their own. Problem gets resolved within weeks.

It sounds a tad optimistic to assume that within weeks... all existing clients and servers will adopt UTF-8, especially considering that it goes against the standard. Maybe it helps you get on with things, so I can understand you might just do that (and hey, it probably doesn't matter unless the user starts typing their code with some bizarre Unicode characters rather than typical plain ASCII), but it hardly 'resolves' the issue, does it now.

Avi-D-coder commented 5 years ago

@puremourning Many clients use UTF-8 or codepoints already. Most people don't notice because astral characters are uncommon in source code, and ranges are only incorrect when an astral character is on the line you're using.

It would be interesting to survey known clients and servers to see what they are actually using. Edit: I'm making a survey at lsp-range-unit-survey.

puremourning commented 5 years ago

You said "uncommon", I said "occasionally". I think they are equivalent.

However you interpret them, the result is still that you get bad user experience when it happens. The user doesn't care that their code contains "uncommon" symbols, just that their experience with the product was bad.

Moreover, we have the test cases and bug reports that prove that "uncommon" is not the same as "never".

Avi-D-coder commented 5 years ago

@puremourning Absolutely, an "occasional"/uncommon issue is still a problem. That's why I think this issue should be resolved in relatively short order. I emphasized "uncommon" to suggest that changing units is not a very bad breaking change compared to the situation I have observed in the implementations I've used (~4 UTF-16, ~4 UTF-8, ~2 codepoints). Hence I am making a survey to know definitively where we currently stand with compliance.

soc commented 5 years ago

You said "uncommon", I said "occasionally". I think they are equivalent. However you interpret them, the result is still that you get bad user experience when it happens.

This is already the case: as @Avi-D-coder mentioned, more than half of the implementations he checked ignore the spec in this regard.

Having a specification that is clear, albeit not ideal, is better than having 2 competing specifications.

I think then the solution that makes everyone happy is clear: Update the specification!

Avi-D-coder commented 5 years ago

@soc My little count is anecdotal and from memory; that's why I made github.com/Avi-D-coder/lsp-range-unit-survey. Please help by sending PRs.

MaskRay commented 5 years ago

My implementation will use UTF-8, and will support neither UTF-16 nor any kind of encoding negotiation; and I invite everyone who is interested in resolving the current situation to do the same.

I've done the same in my language server ccls: it only implements UTF-8. This is not a big issue in practice because people rarely use non-ASCII characters in C/C++ code. When they do (in string literals, which don't affect characters on other lines, and almost never in identifiers), it is not a problem: the existing Emacs/Vim language clients support UTF-8.

dbaeumer commented 5 years ago

@Avi-D-coder thanks for doing the survey. It definitely helps to make a more informed decision.

Like others, I am not a fan of making this negotiable on both ends since it doesn't help in any way. I am neither a fan of simply starting to break things. We tried hard so far to avoid any breakage in the protocol.

If we really come to the conclusion that an additional format is necessary, then the only reasonable way forward for me would be the following:

dbaeumer commented 5 years ago

Actually, I am not so sure anymore about that approach. It would force the client to open the document to do the conversion even if it is not presented in the editor. The server usually has already read the content of the files for which it reports results.

sam-mccall commented 5 years ago

I'm about to land support for UTF-8 in clangd.

clangd is UTF-8 internally, and abides by the protocol by transcoding. However, many clients only support UTF-8, and we want to work with them.

We've got a backwards-compatible protocol extension for negotiating encoding: https://clangd.github.io/extensions.html?#utf-8-offsets

For clients/servers that only support one encoding, this is very simple to implement: just drop in a single static property on ClientCapabilities/InitializeResponse. I'd suggest clients/servers that care about this problem also implement this extension.
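Roughly, the negotiation looks like the following sketch (see the linked extension page for the authoritative shape; this is only an illustration):

// Client -> server, as part of ClientCapabilities:
const clientCapabilities = {
    offsetEncoding: ["utf-8", "utf-16"],   // supported encodings, preferred first
};

// Server -> client, as part of the capabilities in the InitializeResult:
const serverCapabilities = {
    offsetEncoding: "utf-8",   // the encoding the server picked for all offsets
};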

clangd will also support an -offset-encoding=utf-8 flag as a user-accessible workaround for clients that only support UTF-8 and don't implement this extension.

@dbaeumer I'm happy to send a pull request for a protocol change if that seems useful to you. I've implemented this in a server and will likely also add it to clients. I'm not likely to send a PR to the nodejs client/server though (others are of course free to do so).

EDIT: the clangd implementation for reference: https://reviews.llvm.org/D58275 (This is nontrivial because clangd will support both utf-16 and utf-8)

dbaeumer commented 5 years ago

@sam-mccall thanks for offering your help here. But what we need (if we want to do this at all) is support for a different encoding in clients. Updating the protocol spec is trivial in comparison :-)

bstaletic commented 5 years ago

As far as ycmd (a client) is concerned, we're in the same boat as clangd. We do everything in UTF-8 and then have some piece of code to calculate UTF-16 offsets, so going back to UTF-8 would be easy.

Perhaps having the extension in the protocol specification would help clients adopt it.

sam-mccall commented 5 years ago

@dbaeumer Agreed we need implementations, though specifying this may encourage them as @bstaletic says.

Server implementations of multi-encoding support might be just as valuable as client ones: if servers support multiple encodings, we get concrete interop wins (with UTF-8-only clients) by having the clients blindly request UTF-8 (which is a trivial change).

While servers outnumber clients, they are also often written in fast (or fast-ish) languages with good library access for transcoding, whereas clients are often written in slower languages with limited libraries.

soc commented 5 years ago

@dbaeumer

we come to a conclusion which other formats/encoding should be supported. I am in absolute favor to only add one. So we only allow servers to pick an encoding / format. Clients need to support ALL encodings / formats. This limits the implementation effort.

Isn't that pretty much the worst case scenario detailed above?

I am neither a fan of simply starting to break things. We tried hard so far to avoid any breakage in the protocol.

Me too, if I can avoid it. So what exactly is preventing us from keeping LSP 3 as-is (UTF-16 offsets) and releasing LSP 4 with, say, UTF-8 offsets?

Compared to the approach of implementing negotiation machinery and forcing servers to support both, releasing an updated spec version means that there is a definite EOL for UTF-16 offsets, instead of having to support them forever:

As soon as server implementers feel that the clients they care about all implement LSP 4, they can drop whatever workarounds they have for UTF-16 and move on, without being weighed down by legacy baggage in the protocol or the implementation.

puremourning commented 5 years ago

Isn't the version of the protocol arbitrary? There's no version identifier in the init exchanges.

puremourning commented 5 years ago

As far as ycmd (a client) is concerned, we're in the same boat as clangd. We do everything in UTF-8 and then have some piece of code to calculate UTF-16 offsets, so going back to UTF-8 would be easy.

Perhaps having the extension in the protocol specification would help clients adopt it.

https://github.com/Valloric/ycmd/blob/master/ycmd/completers/language_server/language_server_protocol.py#L496-L532 is the code.

soc commented 5 years ago

I am neither a fan of simply starting to break things. We tried hard so far to avoid any breakage in the protocol.

While I agree with you that not sending the protocol version during initialization is a major oversight, the comment regarding LSP 3 vs. LSP 4 is less about mechanical protocol negotiation and more about client devs declaring "we upgraded to LSP 4" and server devs changing their implementations accordingly.

It's largely a matter of having something like "LSP 4" as a short marker for the change, instead of "we changed the encoding of the values inside some nested structure", especially when it comes to communicating the fix to users, who might inquire about the status of this fix in the client they are using.

szatanjl commented 5 years ago

I would like to point out that there is a difference between a byte and a UTF-8 code unit. Whenever you guys are talking about UTF-8, I am not sure which one you mean.

If a source file were encoded in ISO-8859-1, for example, would all of those LSP implementations using "UTF-8" actually convert the encoding to UTF-8 and use UTF-8 code units, or would they actually use bytes? For example, the letter ä (0xE4) encoded in ISO-8859-1 is one byte and one codepoint, but actually two UTF-8 code units.
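A small illustration of the distinction (illustration only), counting the same ä three ways:

const a = "\u00E4";                        // ä as one precomposed codepoint

[...a].length;                             // 1 Unicode codepoint
new TextEncoder().encode(a).length;        // 2 UTF-8 code units (bytes of the UTF-8 encoding)
// In the ISO-8859-1 file itself, the same character is the single byte 0xE4.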

@Avi-D-coder Maybe it would be a good idea to distinguish between the two in the survey?