Closed: rillig closed this issue 2 years ago
The base should be determined by the initial handshake.
I believe column numbers, here and throughout the protocol, should be treated as UTF-8 code units. @weinand @roblourens thoughts?
should be treated as UTF-8 code points
There is no such concept as "UTF-8 code points".
There's only a Unicode code point, which is an integer number from the range 0 to 0x10_FFFF. Such a code point has a name and a plethora of other properties. It does not have an encoding though.
UTF-8 is one possible encoding that encodes a single Unicode code point as a sequence of numbers from the range 0 to 0xFF. Each Unicode encoding is based on a code unit, which is:
uint8_t for UTF-8
uint16_t for UTF-16
uint32_t for UTF-32
Mixing the abstract numbers and their encoding doesn't make sense.
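To make the code point / code unit distinction concrete, here is a small Python sketch using 𐐀 (U+10400, a character outside the Basic Multilingual Plane) to show how one and the same code point becomes a different code unit sequence under each encoding:

```python
import struct

ch = chr(0x10400)  # 𐐀: a single Unicode code point, one "character"

# The same code point, expressed in each encoding's code units:
utf8_units = list(ch.encode("utf-8"))                        # four uint8_t units
utf16_units = struct.unpack("<2H", ch.encode("utf-16-le"))   # a uint16_t surrogate pair
utf32_units = struct.unpack("<I", ch.encode("utf-32-le"))    # one uint32_t unit

print([hex(u) for u in utf8_units])   # ['0xf0', '0x90', '0x90', '0x80']
print([hex(u) for u in utf16_units])  # ['0xd801', '0xdc00']
print([hex(u) for u in utf32_units])  # ['0x10400']
```

Only in UTF-32 does the code unit count coincide with the code point count, which is why "UTF-8 code points" is a category error.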
Instead, choose a measurement unit from the below list:
Thanks for the correction: code units, not code points.
@connor4312 we should use the same as LSP. Encodings are irrelevant here. DAP returns completion proposals as strings, columns denote the individual characters within a string. Renaming "column" is not an option.
On LSP:
Prior to 3.17 the offsets were always based on a UTF-16 string representation. So a string of the form a𐐀b the character offset of the character a is 0, the character offset of 𐐀 is 1 and the character offset of b is 3 since 𐐀 is represented using two code units in UTF-16.
Since 3.17 clients and servers can agree on a different string encoding representation (e.g. UTF-8). The client announces it’s supported encoding via the client capability general.positionEncodings. The value is an array of position encodings the client supports, with decreasing preference (e.g. the encoding at index 0 is the most preferred one).
To stay backwards compatible the only mandatory encoding is UTF-16 represented via the string utf-16. The server can pick one of the encodings offered by the client and signals that encoding back to the client via the initialize result’s property capabilities.positionEncoding. If the string value utf-16 is missing from the client’s capability general.positionEncodings servers can safely assume that the client supports UTF-16. If the server omits the position encoding in its initialize result the encoding defaults to the string value utf-16. Implementation considerations: since the conversion from one encoding into another requires the content of the file / line the conversion is best done where the file is read which is usually on the server side.
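The offsets quoted from the LSP spec can be reproduced with a short Python sketch (`utf16_offset` is a hypothetical helper name, not part of any LSP or DAP library):

```python
def utf16_offset(s: str, codepoint_index: int) -> int:
    """Offset of the codepoint_index-th character, counted in UTF-16 code units."""
    return len(s[:codepoint_index].encode("utf-16-le")) // 2

text = "a\U00010400b"  # the string "a𐐀b" from the LSP spec example

print(utf16_offset(text, 0))  # a -> 0
print(utf16_offset(text, 1))  # 𐐀 -> 1
print(utf16_offset(text, 2))  # b -> 3, since 𐐀 occupies two UTF-16 code units
```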
I suspect that implementors are doing things differently here. It looks like VS Code is using UTF-16 columns in the REPL at least, as completing on 𐐀 results in a completions request at the 1-based column 3.
Encoding is relevant since, as in this case, an individual character might span more than a single "column" depending on its width and the encoding.
@connor4312 we should use the same as LSP. Encodings are irrelevant here. DAP returns completion proposals as strings, columns denote the individual characters within a string. Renaming "column" is not an option.
Doing what LSP does is super controversial. And not ideal.
https://github.com/microsoft/language-server-protocol/issues/376
vscode's implementation would be however the editor describes columns, which I think would be using UTF-16 code units since that's what you get in JS. Maybe we need this in the spec.
If we're okay spec'ing something different than what VS Code currently implements in DAP, I would prefer to normalize on UTF-8 units as a more 'modern' and conceptually simpler standard.
If we're going to change it, we would have to add the same infrastructure for a client/server agreeing on what to use, and that doesn't seem worth it.
In the context of the completions request it only makes sense that the column argument represents "Unicode code points", since there is no need to address individual bytes or words within the various UTF encodings.
Gedankenexperiment: instead of using a column argument we could use another string prefixText argument to denominate a location within the text. In that case there would be no need to talk about encodings, right?
Gedankenexperiment: instead of using a column argument we could use another string prefixText argument to denominate a location within the text. In that case there would be no need to talk about encodings, right?
That's an elegant approach. For further symmetry, it might be possible to split the text into prefixText and suffixText, thereby avoiding repeating part of the text. If the suffixText is expected to be empty in most cases, it can also be made optional.
That could help for completions, but every request that talks about columns could still have the confusion, right?
Of course, the "thought experiment" was not meant as a suggestion to change "column" to "prefixText" (because that would be a breaking change and we would have to introduce new "capabilities").
I tried to explain that if a string prefixText would be an acceptable replacement for column and the equality prefixText == text.substring(0, column) holds, then we know the measurement of column.
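That equality does pin down the unit: for a string containing a non-BMP character, a substring measured in UTF-16 code units and one measured in code points produce different prefixText values for the same column. A Python sketch of the comparison:

```python
text = "a\U00010400b"  # "a𐐀b"
column = 3

# substring measured in code points (Python slicing counts code points)
by_code_points = text[:column]  # "a𐐀b" - the whole string

# substring measured in UTF-16 code units (two bytes per unit in UTF-16-LE)
by_utf16_units = text.encode("utf-16-le")[: column * 2].decode("utf-16-le")  # "a𐐀"

print(by_code_points == by_utf16_units)  # False - the measurement unit matters
```

So observing which prefixText a client sends for a given column would reveal which unit it uses.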
Long story short, we've discussed this before and it was stated that it was UTF-16 code units: @weinand said it's the same as LSP, which is UTF-16 code units: https://github.com/microsoft/debug-adapter-protocol/issues/91#issuecomment-577136145. I raised https://github.com/puremourning/vimspector/issues/96 at the time.
Fortunately, I haven't actually implemented vimspector/issue/96 so I'm open to changing it, but obviously I can't speak for other clients.
then we know the measurement of column
Kinda, if and only if we can define what len( string ) means and thus what substring_of( string, 0, column ) means. This being the crux of the challenge: in C, offsets and strlen() are going to work on bytes. IIRC it somewhat differs depending on implementation language, even for languages that "natively" use Unicode strings. Examples:
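Concretely, the three common answers for the "length" of the same three-character string can be sketched in Python (the JavaScript and C values are noted in comments, since both can be computed from the corresponding encodings):

```python
s = "a\U00010400b"  # "a𐐀b" - three characters on screen

print(len(s))                           # 3: Python 3 str counts code points
print(len(s.encode("utf-16-le")) // 2)  # 4: what a JS string's .length would report
print(len(s.encode("utf-8")))           # 6: what C's strlen() sees on a UTF-8 buffer
```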
A practical approach is indeed to use the number of codepoints in some specific encoding. As the LSP maintainers have found, this quickly becomes a religious debate. The challenge of specifying it in terms of codepoints is that the whole thing gets complicated when you look at combining marks, grapheme clusters and oh my.
My personal preference is codepoints because my client happens to be written in python, but "utf-8 code units" (byte offset into utf-8 encoded version of the line) seems a popular choice (modulo loud dissenters).
@puremourning, as pointed out by @rillig above, there is no concept of an "encoding specific code point". Do you mean "code units"?
Yes, 'Number of code units in some specific encoding' is what I should have written.
Since the issue at hand asks for "Clarify measurement unit in CompletionsRequest.column", I first did an assessment of the status quo:
The description of CompletionsRequest.column says:
The character position for which to determine the completion proposals.
Since there is no mention of "encodings" or "code units" anywhere in the DAP, we can assume that the original intent of "character" was a high-level concept like "visible character". In Unicode terminology this would probably translate to "code points" or to the even more abstract "grapheme clusters".
However, real world DAP implementations (both clients and debug adapters) don't implement this high level concept. Instead many of these DAP implementations use programming languages (JS, TS, C#, Java) where in-memory strings are ordered sequences of 16-bit unsigned integer values (UTF-16 code units). It is therefore natural that these implementations interpret "character positions" as offsets into the sequences of UTF-16 code units.
Please note that UTF-16 is only the in-memory encoding (which is relevant when implementing DAP). Sending DAP JSON over a communication mechanism (or storing it on disk) will most likely use UTF-8 encoding, but the decoding is typically done on a different layer and results in in-memory strings that have the encoding of the underlying programming language (e.g. UTF-16).
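The layering can be sketched like this (the field names are illustrative, not the exact DAP message shape):

```python
import json

# On the wire, the JSON payload travels as UTF-8 bytes.
wire = json.dumps({"text": "a\U00010400b", "column": 4}).encode("utf-8")

# A transport layer decodes it back into the language's in-memory strings.
args = json.loads(wire.decode("utf-8"))

# "column" is then interpreted against the decoded in-memory string,
# never against the UTF-8 bytes that were on the wire.
print(type(args["text"]))  # <class 'str'>
```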
Today the "de-facto" measurement unit of CompletionsRequest.column is UTF-16 code units, and whether it is 0- or 1-based is determined by the initial handshake (i.e. the columnsStartAt1 argument of the initialize request).
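Under that de-facto rule, a client or adapter written in a language whose strings are not UTF-16 has to convert DAP columns into its native string indices. A sketch in Python (`dap_column_to_index` is a hypothetical helper, not part of any DAP library):

```python
def dap_column_to_index(line: str, column: int, columns_start_at_1: bool = True) -> int:
    """Translate a DAP column (counted in UTF-16 code units, per the de-facto
    rule, and 0- or 1-based per the initialize handshake) into a Python
    code-point index into `line`."""
    target = column - 1 if columns_start_at_1 else column
    units = 0
    for i, ch in enumerate(line):
        if units >= target:
            return i
        units += 2 if ord(ch) > 0xFFFF else 1  # non-BMP chars need a surrogate pair
    return len(line)

# Completing on "𐐀" at the 1-based column 3 (as VS Code's REPL does, see above)
# lands after the character, i.e. at Python index 1:
print(dap_column_to_index("\U00010400", 3))  # 1
```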
The clarification need is not confined to CompletionsRequest.column but applies to the following DAP elements as well:
Events:
Output
Requests:
BreakpointLocations
Completions
GotoTargets
Types:
Breakpoint
BreakpointLocation
CompletionItem
Scope
SourceBreakpoint
StackFrame
StepInTarget
The documentation for these DAP elements will be updated to reflect the "Bottom line" statement from above.
Since the measurement unit of "column" properties was not clearly specified before, it is unclear whether existing clients or debug adapters need to update their implementations. No implementation changes are planned for VS Code and its embedded js-debug.
Whether there is a need for introducing a configurable measurement unit for "column" properties can be discussed in an independent feature request.
What do you think?
Is the column measured in:
Is 'column' even a good word, or should it rather be 'index'? Is this column 1-based (depending on the initial handshake), or is it always 0-based?