Closed DavideCanton closed 9 months ago
This may be related to the fact that in python assert len("😊") == 1
while in javascript (that uses utf-16) "😊".length == 2
.
Javascript recommends converting the string into an array to retrieve the number of characters (see https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/length#strings_with_length_not_equal_to_the_number_of_characters).
I don't know it it's easier to return from the python server the adjusted offsets by encoding the string into utf-16 (also taking the correct endianness into account) or to make vscode position API unicode aware (probably the first).
@DavideCanton Can you share the logs you captured? Also, is the file encoding UTF-8?
@DavideCanton I think this has to be handled in python itself. Looks like the general specification is to handle it based on the positionEncoding
field passed in the initial configuration : https://microsoft.github.io/language-server-protocol/specifications/lsp/3.17/specification/#textDocuments
We calculate text edits here: https://github.com/microsoft/vscode-black-formatter/blob/main/bundled/tool/lsp_edit_utils.py
The update would be to pass in value of positionEncoding
, encode the string based on that, then calculate the diffs.
Tests can be added here: https://github.com/microsoft/vscode-black-formatter/blob/main/src/test/python_tests/test_edit_utils.py
Hi @karthiknadig , thanks for the response! I can confirm that the encoding of the document was set on UTF-8 in all the examples.
I didn't capture any log, but I can definitely do if you can give me an hint on what do you need.
In my previous comment I have tried to summarize the problem (you identified the core issue correctly). Basically, we need to use UTF-16 to calculate the range offsets but return UTF-8 strings.
/cc @dbaeumer is that accurate?
What needs to be utf-8
encoded is the stringified result JSON return from the request. So the strings that you have when you build up the result JSON can be encoded in the way your programming language wants.
/cc @charliermarsh This is something to note for ruff
diff ranges.
I was going to open a separate bug, but I feel it's possible this is related. I had added a comment following a line printing an f string, which happened to also have emojis. Running python3 -m black file.py
works correctly
@crwilcox yeah, it seems kinda the same
I was about to file similar bug. The formatter introduces invalid strings when formatting lines with emojis.
Here is screencast of it in action in vscode:
My initial thought was that it might be related to https://github.com/psf/black/issues/1197
My initial thought was that it might be related to psf/black#1197
Have you tried running black on the source code via cli? It shouldn't happen
Have you tried running black on the source code via cli? It shouldn't happen
Just tried it and can confirm the issue does not exist when calling from the cli.
Given this, as a workaround for VS Code until this bug is fixed we could use a keyboard shortcut to run black from the cli. (via this SO)
Open Preferences: Open Keyboard Shortcuts (JSON)) and paste in this:
// Place your key bindings in this file to override the defaults
[
{
"key": "ctrl+shift+f",
"command": "workbench.action.terminal.sendSequence",
"args": { "text": "black '${file}'\u000D" }
}
]
I've experienced this emoji problem multiple times too, would really be helpful to get a fix for this
I have a potential fix for this, please give this a try: https://github.com/microsoft/vscode-black-formatter/actions/runs/7426466173/artifacts/1150857594
@karthiknadig it seems to be working fine now. Thanks!
Verification steps:
something.py
:
s = '😊'
s = "😊"
So when will this fix be released to the extension marketplace?
@JokerQyou This is in pre-release version.
It seems that this extension does not handle correctly formatting strings containing emojis when formatting from vscode. Formatting the file with the black command via cli works correctly.
Examples:
before
after
or
after
Investigating a bit, it seems that the positions sent back from the language server to the client count the emojis as a single character while the visual studio code editor doesn't, as for the first example the line edits relative to the line 0 are
0:4-0:5, new_text='"'
and0:6-0:7, new_text='"'
, but the second one seems to replace something other than the second'
. Tampering with the server code and setting0:7-0:8
to the second edit fixes the problem.Tested on Windows 10.