palantir / python-language-server

An implementation of the Language Server Protocol for Python
MIT License
2.61k stars 283 forks source link

UTF-16 column offsets #770

Open rwols opened 4 years ago

rwols commented 4 years ago

While working on https://github.com/sublimelsp/LSP/pull/932, I learned that pyls does not handle UTF-16. It assumes that (row, col) pairs are in terms of UTF-32 points.

For the row it doesn't matter whether we agree on UTF-8, UTF-16 or UTF-32. For the col it does matter. Consider this line in a python file:

x="😋"

The closing " has a column offset of 5, not 4:

   x  =  "  😋  "
0  1  2  3  45  6      (UTF-16)

If we add some characters to the string, say:

x="😋x"

then an incremental text sync notification will notify pyls that an x has been added at offset 5. This greatly confuses and breaks pyls.

The relevant code for handling incremental text sync looks to be here: https://github.com/palantir/python-language-server/blob/21833eafd4e254a79d35ca539967f3a010e8c2a1/pyls/workspace.py#L150-L191

Looks like

new.write(line[:start_col])

and

new.write(line[end_col:])

need more attention for UTF-16 to work.

ccordoba12 commented 4 years ago

Thanks for noticing this problem. You are welcome to submit a pull request to fix this.

rwols commented 4 years ago

I've checked out this repository, but there are more problems on the horizon. Everywhere where pyls takes a position or a range in the request method, we should do the UTF-16 conversion.

From what I understand all "hooks" expect the position and range to be in UTF-32. So there needs to be a translation layer somewhere. I don't have enough of an overview of this codebase to make that decision.

I might be able to hack in a fix for the incremental text sync, but larger changes are needed.

For reference, clangd handles it like this:

https://github.com/llvm/llvm-project/blob/b632bd88a633c84eb2ce8f999119bc4e6c1ee98c/clang-tools-extra/clangd/SourceCode.cpp#L51-L82

rwols commented 4 years ago

By the way, if all source code is ASCII (most common case), then the problem won't show itself.

rwols commented 4 years ago

Correction: if the source code contains characters from the basic multilingual plane, then UTF-16 can encode that in one code point, so those characters will work by accident too.