Change character units from UTF-16 code unit to Unicode codepoint

MaskRay commented 6 years ago

Text document offsets are based on a UTF-16 string representation. This is strange enough in that text contents are transmitted in UTF-8.

Text Documents
......... The offsets are based on a UTF-16 string representation.

Here in TextDocumentContentChangeEvent, range is specified in UTF-16 column offsets while text is transmitted in UTF-8.

interface TextDocumentContentChangeEvent {
    range?: Range;
    rangeLength?: number;
    text: string;
}

Is it more reasonable to unify these, remove UTF-16 from the wording, and use UTF-8 as the solely used encoding? Line/character can be measured in units of Unicode codepoints, instead of UTF-16 code units. A line cannot be too long and thus doing extra computing to get the N'th Unicode codepoint would not lay too much burden on editors and language servers.

https://github.com/jacobdufault/cquery/issues/57

Survey: counting method of Position.character offsets supported by language servers/clients https://docs.google.com/spreadsheets/d/168jSz68po0R09lO0xFK4OmDsQukLzSPCXqB6-728PXQ/edit#gid=0

bstaletic commented 5 years ago

That's a great point. Like I said, ycmd does all internal work in UTF-8 and only converts offsets to/from UTF-16 when talking to a LSP server. That means an ISO-8859-1 encoded file will result in UnicodeDecodeError exception and ycmd will stop working.

soc commented 5 years ago

I'm referring to UTF-8 codepoints. The advantage compared to counting bytes is that – except with characters above the BMP– it doesn't break existing users.

We know this, because according to the survey half the implementations we know about already do this. The world hasn't ended.

Let's move the other half over and be done.

szatanjl commented 5 years ago

There is no such thing as UTF-8 codepoint. There is Unicode codepoint, UTF-16 codeunit (currently LSP uses this), UTF-8 codeunit and byte. Unless I am mistaken - I am not a member of Unicode standard committee.

And since survey says UTF-8 it is not at all clear, at least for me, if it means UTF-8 codeunit or byte.

Avi-D-coder commented 5 years ago

I have a separate category for Unicode codepoint. I was assuming people understood UTF-8 to means UTF-8 codeunit. RLS for instance uses codepoints (rust's char). It maybe a good idea clarify this.

soc commented 5 years ago

@szatanjl A codeunit in UTF-8 is 8 bits, it is equivalent to counting the bytes. A codepoint is 21bits, only its encoding differs between transfer formats.

You are correct that "UTF-8 codepoints" doesn't make much sense because a "UTF-8" codepoint == "UTF-16" codepoint == ...

Unicode codepoints is what I failed to express properly. They have the benefit that the result of queries to the language server returns the same value as before.

szatanjl commented 5 years ago

@szatanjl A codeunit in UTF-8 is 8 bits, it is equivalent to counting the bytes.

Maybe I didn't make myself clear. By "bytes" I meant bytes in the encoded file, not bytes of the file in memory that at this point might be converted to UTF-8.

@soc And with above in mind you are mistaken. UTF-8 codeunit is equivalent to counting bytes if and only if file is encoded in UTF-8. If the file is encoded in ISO-8859-1 for example then counting bytes and UTF-8 codeunits is not the same. You can look at my above comment for example.

soc commented 5 years ago

@szatanjl The offset values refer to data sent as UTF-8, and that data is sent as UTF-8 regardless of the encoding of the original file.

szatanjl commented 5 years ago

@soc The offset values refer to document that doesn't have to be UTF-8 encoded.

Citation from specification, section "Text Documents" > "Position":

> interface Position {
> ...
>   /**
>    * Character offset on a line in a document (zero-based). Assuming that the line is
>    * represented as a string, the `character` value represents the gap between the
>    * `character` and `character + 1`.
> ...

It states "Character offset on a line in a document". Not in a sent data.

soc commented 5 years ago

@szatanjl Sorry, I got confused. You are correct.

Do we have any understanding whether or how a switch from UTF-16 codeunits to Unicode codepoints would impact other legacy encodings such as ISO-8859-1?

Avi-D-coder commented 5 years ago

Do we have any understanding whether or how a switch from UTF-16 codeunits to Unicode codepoints would impact other legacy encodings such as ISO-8859-1?

@soc As far I know it would depend on clients handling the text. All the servers I know of reincode internally.

The biggest problem with the current UTF-16 codeunit is that many clients and editors can't or won't conform. Mandating UTF-16 is the equivalent requiring the client of keep an extra copy of the text just to handle astral chars. Most non vscode based clients that I know of will never do this. While there are certainly more servers then clients servers are in a better position to handle complexity. Clients want to be as thin as possible.

I will survey the clients today.

Avi-D-coder commented 5 years ago

At this point I will not be opening any more issues for the survey, if people want to add more data points, open a issue or ask a question on a lsp repo using the template in all those issues and send the survey a PR. I will continue to update the survey as results come in.

eonil commented 5 years ago

If LSP spec is fully based on UTF-8, everything can be far more simpler. Therefore, there must be a clear and strong benefit to introduce UTF-16 to justify extra abstraction and implementation cost.

The only known benefit of involving UTF-16 is eliminating extra offset transformation in some systems based on UTF-16. I still don't understand why these UTF-16 based systems deserve such subtle level of extra optimization involving unnecessary extra dependencies at protocol level. Isn't it best to hide such dependencies into local machine? Why do we need to make protocol far more complex to benefit some platforms? LSP is already mostly based on UTF-8 and involving UTF-16 requires far more details to deal with such as endianness, BOM, UCS2, surrogate pairs, and etc.. Why everyone has to pay these cost for those special platforms?

In my opinion, if grapheme clusters are not acceptable, next best option would be something with UTF-8 whatever. I really can't find any reason to deal with UTF-16 at protocol level.

And encoding negotiation would make situation worse, because every implementation has to implement extra abstraction/transformation layer or afford less compatibility.

godunko commented 5 years ago

Use of Unicode Code Points may be good starting point for all clients and server. Grapheme clusters are much better, but... they are little bit more complicated. In this case it is not important which encoding is used by document. It can be UTF-8 or KOI-8R or CP 1251. The first one use 1/2/3 byte sequences to represent characters of last two.

Most client/server developers wants to don't support all complexity of character encoding because it provides no value in their eyes. It is fine till it is sufficient to use ASCII in source code. Outside of ASCII characters set computations is more complicated and, most important for developers, it requires to rewrite a lot of code to add such support. Each developer can select own way, but LSP specification should be useful for processing of documents in any encodings, thus needs to use some encoding independent way to address positions of user visible characters.

sam-mccall commented 5 years ago

Here's my attempt to summarize the technical issues and alternatives. I'd encourage people to add the offsetEncoding extension to their client/server, especially if it uses a fixed encoding that's not UTF-16 (in which case it's trivial).

Servers get open file content via LSP, and non-open files (e.g. imported) from disk. So servers always need to be aware of on-disk file encodings, in order to communicate about offsets consistently.

Everyone's using unicode, encodings used internally and on-disk vary. Unicode codepoints are the common representation (everything else is one hop away).

The alternatives we've discussed:

byte offset in file (regardless of encoding) - I think this basically doesn't work: we don't have this info for content sent over LSP. Also in practice, this doesn't fit well with the APIs available to editor plugins (clients) and parser libraries (servers).
codepoint (i.e. UTF-32 code unit) - neutral option that ensures just two conversions in the worst case (one on client and one on server). Easy to understand and reasonably consistent with the LSP protocol. Fewer illegal cases to consider (e.g. splitting surrogate pairs).
UTF-16 code units - status quo. Easy for UTF-16-native clients/servers, very hard for others. Worst case is 4 conversions needed (client-native -> codepoints -> utf-16 -> codepoints -> server-native). Inconsistent with rest of protocol, which uses unicode and UTF-8.
UTF-8 code units (bytes) - in the abstract similar to UTF-16: easy for some clients/servers, hard for others. More common interchange encoding, consistent with the rest of LSP.
grapheme clusters - Compatibility issues across unicode versions. Hard to implement without libraries. No illegal cases to consider.
dynamic negotiation - Allows correctness when one side is multi-encoding aware and the other side doesn't support UTF-16. (Most commonly, UTF-8 only clients or servers). Improves performance when client/server share a native representation that is not the standard UTF-16.

Based on this I'm coming around to the idea that codepoints (i.e. "utf-32") might be a sensible compromise. UTF-16-native clients are numerous (JS, Java and C# are everywhere) and dealing with UTF-8 is almost as annoying for them as UTF-16 is for UTF-8-native clients. Counting codepoints is pretty easy in both representations.

I would encourage the use and eventual standardization of the negotiation extension to help get out of the current mess.

soc commented 5 years ago

@sam-mccall I'm not really seeing how introducing negotiation and forcing implementers to support multiple encoding variations can be considered getting out of this mess.

From my point of view, this is making the mess even bigger than just giving up and accepting status quo.

sam-mccall commented 5 years ago

forcing implementers to support multiple encoding variations ... is making the mess even bigger

Thanks for raising that, I agree!

I'm not proposing anyone be required to support multiple encodings, instead:

the proposal allows implementations that only support one encoding to trivially specify that in their capabilities request/response
it allows implementers that want to support multiple encodings to choose the appropriate one
it doesn't change the current half-working behavior when the client and server use different encodings, but allows either side (or a viewer of the logs) to detect that scenario.

soc commented 5 years ago

Isn't the end result

lots of finger-pointing and arguing whose job it is to implement

in the end, as predicted in https://github.com/Microsoft/language-server-protocol/issues/376#issuecomment-476804700?

The collective time people here spent discussing whose-job-it-is and complicating things with encoding negotiation is probably already close to the time it would have taken to simply fix all clients and servers.

Let's not turn this into some multi-year design-by-concerned-middle-management project, please.

We have already a pretty good list of clients and servers, so let's get this done.

As @dbaeumer said, "we vote by PR and not by feet":

I offer creating a PR against the LSP spec and against one or two random LSP implementations to migrate them from $randomThing they are doing to Unicode codepoints.
If everyone chips in to at least notify implementers of the fix, I'm sure we can largely be done by next week.

puremourning commented 5 years ago

If everyone chips in to at least notify implementers of the fix, I'm sure we can largely be done by next week.

I completely disagree. I think you are hugely underestimating and trivialising the work being created here for compliant implementations.

Avi-D-coder commented 5 years ago

@puremourning By changing the spec from characters to UTF-16 codeunit a lot of work was already created. At this point I am not at all confident that we can consider compliant implementations to be the majority.

Implementations Count

UTF-8: 11
UTF-16: 10
Codepoints: 6
grapheme clusters: 0 note: Multiple implementations in the same repo or derived from a shared dependency are counted once. Several compliant implementers would prefer UTF-8.

Given the data so far Code points seem like a decent compromise.

Only astral chars break.
Codepoints are available in just about every language.
Are basically made for bridging encodings purpose.
Many people originally interpreted "characters" from the spec as codepoints.

puremourning commented 5 years ago

Ok but the quotes comment claims that we confirming implementations would change in a week, which is a baseless claim as we like most people here do it in our spare time and we don’t abide random baseless deadlines set so that others can avoid implementing the specification.

Avi-D-coder commented 5 years ago

@puremourning A week is being extremely optimistic, but this issue has been open for a year and the longer it takes to switch the more implentions switch to UTF-16. At least 2 are in the process right now.

If the spec is going to be reverted it should be reverted now. If the survey data is representative, the spec is broken, not the majority of implementation using something else. Thus changing to codepoints should not be considered a breaking change. It should be considered fixing and erroneous edit to the spec, or a clarification. The word "character" is not synonymous with UTF-16 codeunit. If the majority of implementations after a year are still non conformant, what does that say about how people originally interpreted meaning of character.

With that being said, I don't know that the majority is non conformant. It could be that the sample is not representative. I picked it based on stars and easy of inquiry.

dbaeumer commented 5 years ago

Thanks all for the lively discussion. I read through the new posts. I have a couple of comments and clarifications:

Regarding versioning

LSP has feature version support (this is why we don't send a global version number). If we add new capabilities to the protocol they are guarded by a client capability and / or server capability. If we would add another client encoding this would be done as follows in LSP:

we add a client capability encoding
we add a server capability acceptedEncoding.

Assuming we have a client has utf-8 encoding only encoding would be set to utf-8. If the server accepts it it signals this through acceptedEncoding. If acceptedEncoding is unset then the client knows it is a standard utf-16 server to which he can't talk.

Regarding protocol using UTF-8 & UTF-16

The only part that is UTF-8 specific in the protocol is how the JSON structure should be be encoded to bytes when sent over the wire. This encoding is customizable in principle (see Content-Type in the header). Adding another one would be trivial since it is comparable to converting files from disks into strings using an encoding. Almost all programming languages I know of have support for various encodings. The reason why we choose utf-8 is simply transfer size. It has IMO no extra cost neither on the client nor on the server.

Things are complete different with positions, since they denote a character in a string. The reason why we choose utf-16 when starting with LSP was that the programming languages and editors we looked at and used were using utf-16 to represent strings internally. Things have changed since then and all new programming language usually represent strings internally using utf-8.

@sam-mccall Thanks for this great encoding summary

I am also not sure anymore if it is a smart idea asking clients to support n encodings. The reason is that to actually convert positions from one encoding to another the content of that line must be available. That for example means for a find all reference result to open all files, read them into memory, do the position conversion and forget them again. This might have a bad performance impact especially when files come from remote.

I do agree with @sam-mccall now that doing the conversion on the server although there are more servers than clients is smarter for the following reasons:

as @sam-mccall mentioned they are also often written in fast (or fast-ish) languages.
server usually have the file content in question already in memory.

I also tried to look at this from a different angle. Instead of focusing on the clients and server it might be better to focus on the programming languages used to implement them. These usually determine how strings are encoded in memory and how they are indexed (if indexable at all). I came up with the following table so far:

Language	Encoding
JavaScript	UTF-16
TypeScript	UTF-16
.Net (C#)	UTF-16
Java	UTF-16
C/C++	byte (UTF-8, UTF-16)
Go	byte (UTF-8)
Python	byte (own format)
Rust	UTF-8 (no indexing)
Ruby	UTF-8 & UTF-16
Lisp	unknown
Haxe	platform
vimscript	UTF-8

So may be an approach would be the following: instead of helping clients to support an additional encoding besides utf-16 the LSP community invests into libraries for the common programming language to do the position conversion into another encoding. Then many servers (or even clients) could simply reuse these libraries.

soc commented 5 years ago

@dbaeumer I'm not sure I understand the exact intention of the approach:

do the position conversion into another encoding

Could you clarify what conversions you have in mind here?

LSP community invests into libraries [...] Then many servers (or even clients) could simply reuse these libraries.

Isn't such functionality (at least if I understood you correctly) something that can usually be accomplished by single method call to a method that most likely already lives in most standard libraries?

sam-mccall commented 5 years ago

Weeks is not a realistic estimate even if all fixes landed today. For our server, the next release is in 6 months and I wouldn't expect ~all users to be on it for 2 years.

@dbaeumer The protocol you suggest looks OK to me, if we are assuming clients will advertise support for only a single encoding, and therefore ~all servers should implement multi-encoding support. In this case I think servers should be strongly encouraged to support UTF-32 as well as 16 and 8, as some clients will need this. Happy to write a patch to update clangd's negotiation once there's a PR for spec text.

clangd has a decent implementation of the length conversions for a UTF-8-native implementation, in case anyone wants to port them to other languages. measureUnits and lspLength are the key functions.

soc commented 5 years ago

Weeks is not a realistic estimate even if all fixes landed today. For our server, the next release is in 6 months and I wouldn't expect ~all users to be on it for 2 years.

How is this an issue?

It will not cause trouble to anyone. Otherwise we would have that trouble today. Because almost 60% of the implementations don't follow the spec as we speak. If the world hasn't ended yet, fixing this mess will not end it either.

sam-mccall commented 5 years ago

How is this an issue? It will not cause trouble to anyone.

Your proposal is IIUC to simply change the spec to say unicode codepoints instead of UTF-16 code units, and start fixing clients/servers.

Clients/servers that are spec-compliant (UTF-16) work together today. e.g. clangd 8 and vscode N. If you change the spec and commit fixes, and those are released in clangd 9 and vscode N+1, then e.g. vscode N+1 won't work properly with clangd 8. This situation will persist for years.

I understand some clients/servers are broken today, but your proposal will break some that work today. (I'm going to leave it at that, because I don't think there's any prospect a change without back-compat will be accepted for the spec)

soc commented 5 years ago

This situation will persist for years.

The situation has already persisted for years. It's fine. Literally nothing happened. It's such a non-story that even most developers of clients and servers have not realized that something was wrong.

but your proposal will break some that work today

If you are not trying to argue that every (non-)broken server implementation only happened to be used by a client implementation that was (non-)broken in exactly the same way – by pure chance or magic – then it's absolutely clear that clients and servers who implemented the spec differently have already been used together for years, largely without anyone even realizing it.

dbaeumer commented 5 years ago

I agree with @sam-mccall that we can't simply change the spec. This is very unfriendly to eveyone that adhered to it until now and something I really try to avoid.

As I outlined as well we should try to avoid that clients need to do position transformations to its internal string representation since this requires to have the content of the file loaded into memory. This is why I tend to agree that servers should support more than one encoding.

IMO to move this forward we need to do the following:

agree that putting the burden onto the server is the right thing to do. If we do
agree which encodings a server should support
provide corresponding helper libraries to do this for the programming languages commonly used in servers. @soc converting the content is trivial and most language have libraries for this. But converting an index into the string is non trivial as @sam-mccall showed with his code.
define how a client can pick an encoding and add this to the protocol in a non breaking way.
fix the servers.

Avi-D-coder commented 5 years ago

@dbaeumer

agree that putting the burden onto the server is the right thing to do. If we do

I think we can all support this.

agree which encodings a server should support

While I would prefer a single codepoint api, since all editors must deal with emoji they all have codepoint indexing capabilities built in. If multiple encodings are going to be supported, UTF-8, UTF-16 and Codepoints should be available. Grapheme Cluster are not presently used by any known implementation and are not unicode stable, so should be excluded.

soc commented 5 years ago

@Avi-D-coder I have trouble to understand how expecting servers to provide and maintain two additional implementations in addition to the existing one is an improvement over the status quo.

kdvolder commented 5 years ago

agree that putting the burden onto the server is the right thing to do. If we do

I think we can all support this.

No! You don't speak for all of us. As a server author I really don't want to be dealing with trying to support multiple encodings. For crying out loud. Just please pick one already! UTF16 works fine for us. But if you really must than change it to UTF8 or whatever... do it. But please ... just don't make server implementors support an array of different encodings.

kdvolder commented 5 years ago

I agree with @sam-mccall that we can't simply change the spec.

I think we can. It's a choice we can make.

This is very unfriendly to eveyone that adhered to it until now and something I really try to avoid.

Right, I would sort of agree. On the other hand making us support multiple encodings isn't exactly 'friendly' either.

Personally, our language servers are implemented in Java, so I think they 'accidentally' adhere to the spec. So the status quo works fine for us. But I think I can honestly say that I'd rather deal with changing our lanuage servers to support UTF8 (or whatever is chosen) over supporting both UFT8, unicode points and UTF16. One encoding is really enough.

Avi-D-coder commented 5 years ago

@soc I don't expect most servers will maintain multiple formats. I am for Codepoints being the official and only format. UTF-16 should be deprecated and phased out slowly on the clients that support it.

However if UTF-16 is not being removed I fail to see why UTF-8 implementers would choose to sacrifice performance and convenience in return for no compatibility guarantee. I see supporting all three units as recognizing the broken state of the spec in this lucky relatively insignificant aspect.

If you take a look at who conforms to UTF-16, and who won't conform; it goes mostly along the traditional line: "enterprise" vs "modern" languages and editors. Editors/Clients almost all have the ability to use codepoints. Codepoints are the clear compromise, but the compromise seems to have been rejected, so let darwin figure it out. Without the incentive of compatibility with all servers, I am not confident many UTF-8 devs will switch.

As for codepoints I still have hope people will slowly adopt it, if it's an option.

@kdvolder fair enough, but putting the burden onto the server does not necessarily mean multiple competing formats, I would consider it to be a general principle. Clients are often in slow or crippled languages like vimScript or elisp and even some sh in one. If the spec demands a heavy client, the spec will not be followed (this is already happening).

puremourning commented 5 years ago

it goes mostly along the traditional line: "enterprise" vs "modern" languages and editors.

I really didn't understand this.

Clients are often in slow or crippled languages like vimScript or elisp

OK .. easy! Many servers are written in old crippled languages like javascript too, but I don't really see why that is relevant to the conformance or otherwise with a specification.

But seriously, this has nothing to do with age or crippledness of language implementation, other than precisely what @dbaeumer already said. A number of implementations internally used UTF 16 code units for the "string length" operation and for "string indexing".

The language-du-jour is likely not the bottleneck to he performance issue which, as has been accurately reported, is that to convert offsets, you require the line in the file in a known encoding. This is an I/O operation which is more likely to be a performance determining issue than any reasonable runtime language, even an interpreted one (or old and crippled if you prefer to throw around such terms),.

For what it's worth, I don't even subscribe personally to the argument that servers are more likely to have the file in memory. If the client is sending a request for a URI, it probably has that file in memory. It probably sent the contents the the server in the first place. If not it's probably just about to open the file to do something with the offset anyway. But that's neither here nor there in my opinion.

Aaaand finally, I do agree that just making a decision (rather than decision by committee) is probably better. I think any option that doesn't involve multiple encodings is a good enough option, i.e. either:

Do nothing and call any currently non-conforming implementation broken
Change the spec (per the PR), apologise to the good citizens who implemented the spec, and move on.

Both options will require some servers in some languages or runtimes to have to do conversion, but at least we have a clear and unambiguous specification (like we do now), and a likelihood that it will be followed (like we don't now, allegedly).

haferburg commented 5 years ago

That for example means for a find all reference result to open all files, read them into memory, do the position conversion and forget them again. This might have a bad performance impact especially when files come from remote.

@dbaeumer But isn't it necessary to open the files anyways for "find all references"? What does a client do with the result? Typically you need to display some context, so at least the line that contains the symbol. So I don't see how "open up files unnecessarily" is an issue.

bmewburn commented 5 years ago

I say keep it as is, it's done. Though if it is changed then make it code points.

the protocol was design to support tools and their UI

Which is why I don't really understand the benefit of bringing encodings into this at all. It just increases complexity of the protocol and favours one language over another.

Change it to another encoding and it benefits client/server implementation language X at the expense of client/server implementation language Y. Language Y devs and CPU have to do more work.

Change it to many different encodings and it benefits client language X at the expense of server implementation language X, Y, Z. All server devs and CPU have to do more work.

So if it must change, keep it fair -- code points -- so that everyone suffers equally ;)

dbaeumer commented 5 years ago

@haferburg most lsp clients are implemented against extension API (this is even the case for the VS Code LSP client). So the conversion usually happens before the data actually reaches the editor / tool. This extension API normally only surfaces position API tailored to the string representation used in the editor. So if the editor internally uses UTF-16 the API is UTF-16 based. So even if the editor later on opens the file the LSP client has to do the same. Things might be different if we can convince editor API owners to support multiple encodings (which I doubt will happen)

dbaeumer commented 5 years ago

Here is an implementation to convert position between utf8 and utf16 in TS. https://github.com/NeekSandhu/onigasm/blob/master/src/OnigString.ts @aechli thanks!

And here one in C++: https://github.com/atom/node-oniguruma/blob/9e3334b4fbe50752ec672fed29c48fc583e44485/src/onig-string.cc#L32-L85 @alexandrudima thanks!

soc commented 5 years ago

@dbaeumer I think the crucial point is not that people don't know how to do it, but they don't want to.

micbou commented 5 years ago

Which is why I don't really understand the benefit of bringing encodings into this at all. It just increases complexity of the protocol and favours one language over another.

Except that there are good reasons to pick UTF-8 over UTF-16:

UTF-8 is the most popular Unicode encoding;
the misconception that UTF-16 is a fixed-length encoding. Lots of software using that encoding don't properly support Unicode because of that assumption. I wouldn't be surprised if VS Code is one of them. UTF-8 is well known to be a variable-length encoding;
UTF-8 is endianness independent. By the way, the specs don't mention which endianness is used for UTF-16: little or big endian?
UTF-8 is taking less space than UTF-16 in this context since code is mostly written in ASCII;
UTF-8 is already used to transmit the data.

If we agree on changing the encoding to UTF-8 in the spec then (and only then) we should discuss on the offset to use. There are two reasonable choices with that encoding (contrarily to three with UTF-16; another reason to use UTF-8):

code point offset;
byte offset (or code unit offset if you prefer).

I am in favor of using byte offsets because they directly represent the index of the encoded string while a Unicode representation is needed for code point offsets.

dbaeumer commented 5 years ago

@soc Hmm, but I don't see how this is feasible since different programming languages use different encodings. So someone needs to convert (see https://github.com/Microsoft/language-server-protocol/issues/376#issuecomment-477983442)

@micbou Some comments regarding your post:

UTF-8 is the most popular Unicode encoding;

I agree when it comes to storing text in files but not representing text in memory in programming languages (see https://github.com/Microsoft/language-server-protocol/issues/376#issuecomment-477983442)

the misconception that UTF-16 is a fixed-length encoding

VS Code does handle surrugate pairs correctly :-)

UTF-8 is endianness independent.

whether the programming language uses LE or BE to store the string in UTF-16 in memory has no impact on the position information (character index). This is why it is not mentioned in the spec.

UTF-8 is taking less space

Agree in regards to space. But programming language usually come with on fixed internal representation. I disagree with the statement that code is mostly written in ASCII. Especially if we take Asia into account.

UTF-8 is already used to transmit the data

Yes, and this is for size reasons. As I tried to explain here https://github.com/Microsoft/language-server-protocol/issues/376#issuecomment-477983442 these are two orthogonal issues. It is like Java can read the content of a file in memory that is encoded in UTF-8 although its internal string representation is UTF-16.

bstaletic commented 5 years ago

UTF-8 is the most popular Unicode encoding;

I agree when it comes to storing text in files but not representing text in memory in programming languages (see #376 (comment))

Except that comment is biased towards "javascript-like" (TypeScript and JavaScript) and "java-like" (C# and java). Also calling C/C++ "UTF-8/UTF16" is wrong, because they are both completely encoding agnostic and any non-ascii encoding needs to be handled by a library or hand written code, not to mention that C doesn't really have a string type.

I'm sure the bias was just a result of familiarity with different languages, but every language I have actually had a chance to work with used bytes (I'm not counting java, because I have barely touched that language). Also, the list is missing vimscript which, I believe, has at least 4 clients written in it, which uses UTF-8.

UTF-8 is taking less space

Agree in regards to space. But programming language usually come with on fixed internal representation. I disagree with the statement that code is mostly written in ASCII. Especially if we take Asia into account.

If ASCII were not the vast majority using LSP would have been a complete mess, because, according to the above survey, if you pick a server and a client at random, chances are that they won't be "talking" about the same encoding offsets. And yet people are largely unaware of the mess we actually have today.

Not only are users of clients and servers unaware of this mess, but some client/server implementers are also unaware.

kdvolder commented 5 years ago

If ASCII were not the vast majority using LSP would have been a complete mess

I think that's really a bit of an exageration. Even if you did use some esoteric characters in your code, you will most likely not experience a complete breakdown of the tooling. Instead, more or less the worst thing that happens is that things like the positions of error markers will be of by a few characters occasionally. The tools will, for the most part, be perfectly usable.

soc commented 5 years ago

dbaeumer: Hmm, but I don't see how this is feasible since different programming languages use different encodings. So someone needs to convert.

This is correct, but converting to UTF-16 codeunits is not going to happen.

micbou commented 5 years ago

@dbaeumer

I agree when it comes to storing text in files but not representing text in memory in programming languages (see #376 (comment))

The issue is that you are considering languages that chose UTF-16 because people thought that 16 bits would be enough to store a Unicode code point (Java and JavaScript), languages that are/were targeting a platform using UTF-16 (C#), and extensions of a language using UTF-16 (TypeScript). As @bstaletic said, you can't consider that C and C++ are using UTF-8, UTF-16, or any other encoding. In the Ruby case, according to this article, UTF-8 is more popular than other encodings supported by the language, in particular UTF-16. If we look at what recent languages are doing, we see that they tend to pick UTF-8 (e.g. Go and Rust) or UTF-32 (e.g. Python 3). Anyway, I don't think any of this is relevant to the discussion. We are talking about the encoding to use in a protocol, not the best way to represent internally a string in a programming language.

VS Code does handle surrugate pairs correctly :-)

I am not convinced when I see issues like https://github.com/Microsoft/vscode/issues/62286.

whether the programming language uses LE or BE to store the string in UTF-16 in memory has no impact on the position information (character index). This is why it is not mentioned in the spec.

Sure but that's only because the encoding used to transfer the data is not consistent with the offset one.

But programming language usually come with on fixed internal representation.

That's not the case of recent languages like Go and Rust. More importantly, language developers are choosing a fixed internal representation like UTF-32 (UTF-16 when it was still enough to store a Unicode character) to efficiently do operations like computing the length of a string or going through a string character by character without realizing that a character is not necessarily on one Unicode code point which make the optimization worthless (shout-out to the Python developers).

I disagree with the statement that code is mostly written in ASCII. Especially if we take Asia into account.

I'd be interested to see a code base of a popular software (Asian or not) with more than 50% of non-ASCII characters.

Yes, and this is for size reasons.

But why is UTF-8 better than UTF-16 in that regard? Because code is mostly written in ASCII.

As I tried to explain here #376 (comment) these are two orthogonal issues. It is like Java can read the content of a file in memory that is encoded in UTF-8 although its internal string representation is UTF-16.

They are not orthogonal issues. Text is still text whether it's stored in file or in memory (or the internal string representation of a programming language which is the same as storing the data in memory). Using two different encodings for the same data is inconsistent.

XVilka commented 5 years ago

Rather than writing more, I would just leave the good article on why UTF-16 should be abandoned everywhere, even in Windows world: http://utf8everywhere.org/

natebosch commented 5 years ago

Even if you did use some esoteric characters in your code, you will most likely not experience a complete breakdown of the tooling. Instead, more or less the worst thing that happens is that things like the positions of error markers will be of by a few characters occasionally.

With TextDocumentSyncKind.Incremental you can end up with the server having the wrong idea of the content, and then reporting completely inaccurate diagnostics rather than just putting them at the wrong position.

With TextEdit you can insert incorrect content.

kdvolder commented 5 years ago

@natebosch

You are of course right, in theory it is possible. But I think you'd be hard pressed to come up with a real example where you can make the tooling go completely of the rails this way. There is kind of a 'limit' on how wrong the positions can be as the 'errors' reset at the beginning of every new line. They don't accumulate throughout the file.

Anyhow... clearly this is a real issue and needs to be settled/specced properly somehow, but I hardly think its as big a deal as the size of this thread would make one beleave.

ghost commented 4 years ago

This thread is hard to read. Big yak. So perhaps we need a new thread to discuss how and/or which implementations support UTF8. The 'why' has already been done; it is the only format that works for everyone. Looks like clangd already supports UTF8 LSP. I suggest a list. Dont ask me to do it, just here to give advice.

PS. BTW, this is hilarious, UTF16 in this day and age. LOL! Too true to be funny! Like @micbou said, UTF32 would make sense, but this!? Ahahahahahah! I literally burst out laughing when I found out! Zombie land!

XVilka commented 4 years ago

Totally agree, making a list with various LSP servers and clients, split them on those supporting UTF-8, and those who doesn't. Something like I did for TrueColors support in various console emulators: https://github.com/termstandard/colors

microsoft / language-server-protocol

Change character units from UTF-16 code unit to Unicode codepoint #376

Implementations Count

Given the data so far Code points seem like a decent compromise.