Closed: rillig closed this issue 2 years ago
The base should be determined by the initial handshake.
I believe column numbers, here and throughout the protocol, should be treated as UTF-8 code units. @weinand @roblourens thoughts?
should be treated as UTF-8 code points
There is no such concept as "UTF-8 code points".
There's only a Unicode code point, which is an integer number from the range 0 to 0x10_FFFF. Such a code point has a name and a plethora of other properties. It does not have an encoding though.
UTF-8 is one possible encoding that encodes a single Unicode code point as a sequence of numbers from the range 0 to 0xFF. Each Unicode encoding is based on a code unit, which is:
uint8_t for UTF-8
uint16_t for UTF-16
uint32_t for UTF-32
Mixing the abstract numbers and their encoding doesn't make sense.
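To make the code point / code unit distinction concrete, here is a small Python sketch using 𐐀 (U+10400, a character outside the Basic Multilingual Plane) to show how one and the same code point becomes a different code unit sequence under each encoding:

```python
import struct

ch = chr(0x10400)  # 𐐀: a single Unicode code point, one "character"

# The same code point, expressed in each encoding's code units:
utf8_units = list(ch.encode("utf-8"))                        # four uint8_t units
utf16_units = struct.unpack("<2H", ch.encode("utf-16-le"))   # a uint16_t surrogate pair
utf32_units = struct.unpack("<I", ch.encode("utf-32-le"))    # one uint32_t unit

print([hex(u) for u in utf8_units])   # ['0xf0', '0x90', '0x90', '0x80']
print([hex(u) for u in utf16_units])  # ['0xd801', '0xdc00']
print([hex(u) for u in utf32_units])  # ['0x10400']
```

Only in UTF-32 does the code unit count coincide with the code point count, which is why "UTF-8 code points" is a category error.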
Instead, choose a measurement unit from the below list:
Thanks for the correction: code units, not code points.
@connor4312 we should use the same as LSP. Encodings are irrelevant here. DAP returns completion proposals as strings, columns denote the individual characters within a string. Renaming "column" is not an option.
On LSP:
Prior to 3.17 the offsets were always based on a UTF-16 string representation. So a string of the form a𐐀b the character offset of the character a is 0, the character offset of 𐐀 is 1 and the character offset of b is 3 since 𐐀 is represented using two code units in UTF-16.
Since 3.17 clients and servers can agree on a different string encoding representation (e.g. UTF-8). The client announces it’s supported encoding via the client capability general.positionEncodings. The value is an array of position encodings the client supports, with decreasing preference (e.g. the encoding at index 0 is the most preferred one).
To stay backwards compatible the only mandatory encoding is UTF-16 represented via the string utf-16. The server can pick one of the encodings offered by the client and signals that encoding back to the client via the initialize result’s property capabilities.positionEncoding. If the string value utf-16 is missing from the client’s capability general.positionEncodings servers can safely assume that the client supports UTF-16. If the server omits the position encoding in its initialize result the encoding defaults to the string value utf-16. Implementation considerations: since the conversion from one encoding into another requires the content of the file / line the conversion is best done where the file is read which is usually on the server side.
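The offsets quoted from the LSP spec can be reproduced with a short Python sketch (`utf16_offset` is a hypothetical helper name, not part of any LSP or DAP library):

```python
def utf16_offset(s: str, codepoint_index: int) -> int:
    """Offset of the codepoint_index-th character, counted in UTF-16 code units."""
    return len(s[:codepoint_index].encode("utf-16-le")) // 2

text = "a\U00010400b"  # the string "a𐐀b" from the LSP spec example

print(utf16_offset(text, 0))  # a -> 0
print(utf16_offset(text, 1))  # 𐐀 -> 1
print(utf16_offset(text, 2))  # b -> 3, since 𐐀 occupies two UTF-16 code units
```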
I suspect that implementors are doing things differently here. It looks like VS Code is using UTF-16 columns in the REPL at least, as completing on 𐐀 results in a completions request at the 1-based column 3.
Encoding is relevant since, as in this case, an individual character might span more than a single "column" depending on its width and the encoding.
@connor4312 we should use the same as LSP. Encodings are irrelevant here. DAP returns completion proposals as strings, columns denote the individual characters within a string. Renaming "column" is not an option.
Doing what LSP does is super controversial. And not ideal.
https://github.com/microsoft/language-server-protocol/issues/376
vscode's implementation would be however the editor describes columns, which I think would be using UTF-16 code units since that's what you get in JS. Maybe we need this in the spec.
If we're okay spec'ing something different than what VS Code currently implements in DAP, I would prefer to normalize on UTF-8 units as a more 'modern' and conceptually simpler standard.
If we're going to change it, we would have to add the same infrastructure for a client/server agreeing on what to use, and that doesn't seem worth it.
In the context of the completions request it only makes sense that the column argument represents "Unicode code points", since there is no need to address individual bytes or words within the various UTF encodings.
Gedankenexperiment: instead of using a column argument we could use another string prefixText argument to denominate a location within the text. In that case there would be no need to talk about encodings, right?
Gedankenexperiment: instead of using a column argument we could use another string prefixText argument to denominate a location within the text. In that case there would be no need to talk about encodings, right?
That's an elegant approach. For further symmetry, it might be possible to split the text into prefixText and suffixText, thereby avoiding repeating part of the text. If the suffixText is expected to be empty in most cases, it can also be made optional.
That could help for completions, but every request that talks about columns could still have the confusion, right?
Of course, the "thought experiment" was not meant as a suggestion to change "column" to "prefixText" (because that would be a breaking change and we would have to introduce new "capabilities").
I tried to explain that if a string prefixText would be an acceptable replacement for column and the equality prefixText == text.substring(0, column) holds, then we know the measurement of column.
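That equality does pin down the unit: for a string containing a non-BMP character, a substring measured in UTF-16 code units and one measured in code points produce different prefixText values for the same column. A Python sketch of the comparison:

```python
text = "a\U00010400b"  # "a𐐀b"
column = 3

# substring measured in code points (Python slicing counts code points)
by_code_points = text[:column]  # "a𐐀b" - the whole string

# substring measured in UTF-16 code units (two bytes per unit in UTF-16-LE)
by_utf16_units = text.encode("utf-16-le")[: column * 2].decode("utf-16-le")  # "a𐐀"

print(by_code_points == by_utf16_units)  # False - the measurement unit matters
```

So observing which prefixText a client sends for a given column would reveal which unit it uses.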
Long story short, we've discussed this before and it was stated that it was UTF-16 code units: @weinand said it's the same as LSP, which is UTF-16 code units: https://github.com/microsoft/debug-adapter-protocol/issues/91#issuecomment-577136145. I raised https://github.com/puremourning/vimspector/issues/96 at the time.
Fortunately, I haven't actually implemented vimspector/issue/96 so I'm open to changing it, but obviously I can't speak for other clients.
then we know the measurement of column
Kinda, if and only if we can define what len( string ) means and thus what substring_of( string, 0, column ) means. This being the crux of the challenge: in C, offsets and strlen() are going to work on bytes. IIRC it somewhat differs depending on implementation language, even for languages that "natively" use Unicode strings. Examples:
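Concretely, the three common answers for the "length" of the same three-character string can be sketched in Python (the JavaScript and C values are noted in comments, since both can be computed from the corresponding encodings):

```python
s = "a\U00010400b"  # "a𐐀b" - three characters on screen

print(len(s))                           # 3: Python 3 str counts code points
print(len(s.encode("utf-16-le")) // 2)  # 4: what a JS string's .length would report
print(len(s.encode("utf-8")))           # 6: what C's strlen() sees on a UTF-8 buffer
```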
A practical approach is indeed to use the number of codepoints in some specific encoding. As the LSP maintainers have found, this quickly becomes a religious debate. The challenge of specifying it in terms of codepoints is that the whole thing gets complicated when you look at combining marks, grapheme clusters and oh my.
My personal preference is codepoints because my client happens to be written in python, but "utf-8 code units" (byte offset into utf-8 encoded version of the line) seems a popular choice (modulo loud dissenters).
@puremourning, as pointed out by @rillig above, there is no concept of an "encoding specific code point". Do you mean "code units"?
Yes, 'Number of code units in some specific encoding' is what I should have written.
Since the issue at hand asks for "Clarify measurement unit in CompletionsRequest.column", I first did an assessment of the status quo:
The description of CompletionsRequest.column says:
The character position for which to determine the completion proposals.
Since there is no mention of "encodings" or "code units" anywhere in the DAP, we can assume that the original intent of "character" was a high-level concept like "visible character". In Unicode terminology this would probably translate to "code points" or to the even more abstract "grapheme clusters".
However, real world DAP implementations (both clients and debug adapters) don't implement this high level concept. Instead many of these DAP implementations use programming languages (JS, TS, C#, Java) where in-memory strings are ordered sequences of 16-bit unsigned integer values (UTF-16 code units). It is therefore natural that these implementations interpret "character positions" as offsets into the sequences of UTF-16 code units.
Please note that UTF-16 is only the in-memory encoding (which is relevant when implementing DAP). Sending DAP JSON over a communication mechanism (or storing it on disk) will most likely use UTF-8 encoding, but the decoding is typically done on a different layer and results in in-memory strings that have the encoding of the underlying programming language (e.g. UTF-16).
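The layering can be sketched like this (the field names are illustrative, not the exact DAP message shape):

```python
import json

# On the wire, the JSON payload travels as UTF-8 bytes.
wire = json.dumps({"text": "a\U00010400b", "column": 4}).encode("utf-8")

# A transport layer decodes it back into the language's in-memory strings.
args = json.loads(wire.decode("utf-8"))

# "column" is then interpreted against the decoded in-memory string,
# never against the UTF-8 bytes that were on the wire.
print(type(args["text"]))  # <class 'str'>
```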
Today the "de-facto" measurement unit of CompletionsRequest.column is UTF-16 code units, and whether it is 0- or 1-based is determined by the initial handshake (i.e. the columnsStartAt1 argument of the initialize request).
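Under that de-facto rule, a client or adapter written in a language whose strings are not UTF-16 has to convert DAP columns into its native string indices. A sketch in Python (`dap_column_to_index` is a hypothetical helper, not part of any DAP library):

```python
def dap_column_to_index(line: str, column: int, columns_start_at_1: bool = True) -> int:
    """Translate a DAP column (counted in UTF-16 code units, per the de-facto
    rule, and 0- or 1-based per the initialize handshake) into a Python
    code-point index into `line`."""
    target = column - 1 if columns_start_at_1 else column
    units = 0
    for i, ch in enumerate(line):
        if units >= target:
            return i
        units += 2 if ord(ch) > 0xFFFF else 1  # non-BMP chars need a surrogate pair
    return len(line)

# Completing on "𐐀" at the 1-based column 3 (as VS Code's REPL does, see above)
# lands after the character, i.e. at Python index 1:
print(dap_column_to_index("\U00010400", 3))  # 1
```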
The clarification need is not confined to CompletionsRequest.column but applies to the following DAP elements as well:
Events:
Output
Requests:
BreakpointLocations
Completions
GotoTargets
Types:
Breakpoint
BreakpointLocation
CompletionItem
Scope
SourceBreakpoint
StackFrame
StepInTarget
The documentation for these DAP elements will be updated to reflect the "Bottom line" statement from above.
Since the measurement unit of "column" properties was not clearly specified before, it is unclear whether existing clients or debug adapters need to update their implementations. No implementation changes are planned for VS Code and its embedded js-debug.
Whether there is a need for introducing a configurable measurement unit for "column" properties can be discussed in an independent feature request.
What do you think?
Is the column measured in:
Is 'column' even a good word, or should it rather be 'index'? Is this column 1-based (depending on the initial handshake), or is it always 0-based?