Black does not format correctly strings containing emojis

microsoft / vscode-black-formatter

Formatting support for Python using the Black formatter

https://marketplace.visualstudio.com/items?itemName=ms-python.black-formatter

MIT License

150 stars 35 forks source link

Black does not format correctly strings containing emojis #395

Closed DavideCanton closed 9 months ago

DavideCanton commented 10 months ago

It seems that this extension does not handle correctly formatting strings containing emojis when formatting from vscode. Formatting the file with the black command via cli works correctly.

Examples:

before

s = '😊'
x = 'foo'

after

s = ""'
x = "foo"

s = '🤵 hi!'

after

s = "🤵 hi"'

Investigating a bit, it seems that the positions sent back from the language server to the client count the emojis as a single character while the visual studio code editor doesn't, as for the first example the line edits relative to the line 0 are 0:4-0:5, new_text='"' and 0:6-0:7, new_text='"', but the second one seems to replace something other than the second '. Tampering with the server code and setting 0:7-0:8 to the second edit fixes the problem.

Tested on Windows 10.

DavideCanton commented 10 months ago

This may be related to the fact that in python assert len("😊") == 1 while in javascript (that uses utf-16) "😊".length == 2.

Javascript recommends converting the string into an array to retrieve the number of characters (see https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/length#strings_with_length_not_equal_to_the_number_of_characters).

I don't know it it's easier to return from the python server the adjusted offsets by encoding the string into utf-16 (also taking the correct endianness into account) or to make vscode position API unicode aware (probably the first).

karthiknadig commented 9 months ago

@DavideCanton Can you share the logs you captured? Also, is the file encoding UTF-8?

karthiknadig commented 9 months ago

@DavideCanton I think this has to be handled in python itself. Looks like the general specification is to handle it based on the positionEncoding field passed in the initial configuration : https://microsoft.github.io/language-server-protocol/specifications/lsp/3.17/specification/#textDocuments

We calculate text edits here: https://github.com/microsoft/vscode-black-formatter/blob/main/bundled/tool/lsp_edit_utils.py The update would be to pass in value of positionEncoding, encode the string based on that, then calculate the diffs.

Tests can be added here: https://github.com/microsoft/vscode-black-formatter/blob/main/src/test/python_tests/test_edit_utils.py

DavideCanton commented 9 months ago

Hi @karthiknadig , thanks for the response! I can confirm that the encoding of the document was set on UTF-8 in all the examples.

I didn't capture any log, but I can definitely do if you can give me an hint on what do you need.

karthiknadig commented 9 months ago

In my previous comment I have tried to summarize the problem (you identified the core issue correctly). Basically, we need to use UTF-16 to calculate the range offsets but return UTF-8 strings.

/cc @dbaeumer is that accurate?

dbaeumer commented 9 months ago

What needs to be utf-8 encoded is the stringified result JSON return from the request. So the strings that you have when you build up the result JSON can be encoded in the way your programming language wants.

karthiknadig commented 9 months ago

/cc @charliermarsh This is something to note for ruff diff ranges.

crwilcox commented 9 months ago

I was going to open a separate bug, but I feel it's possible this is related. I had added a comment following a line printing an f string, which happened to also have emojis. Running python3 -m black file.py works correctly

Untitled video

DavideCanton commented 9 months ago

@crwilcox yeah, it seems kinda the same

wanderingstan commented 9 months ago

I was about to file similar bug. The formatter introduces invalid strings when formatting lines with emojis.

Here is screencast of it in action in vscode:

python format fail

My initial thought was that it might be related to https://github.com/psf/black/issues/1197

DavideCanton commented 9 months ago

My initial thought was that it might be related to psf/black#1197

Have you tried running black on the source code via cli? It shouldn't happen

wanderingstan commented 9 months ago

Have you tried running black on the source code via cli? It shouldn't happen

Just tried it and can confirm the issue does not exist when calling from the cli.

Given this, as a workaround for VS Code until this bug is fixed we could use a keyboard shortcut to run black from the cli. (via this SO)

Open Preferences: Open Keyboard Shortcuts (JSON)) and paste in this:

// Place your key bindings in this file to override the defaults
[

    {
        "key": "ctrl+shift+f",
        "command": "workbench.action.terminal.sendSequence",
        "args": { "text": "black '${file}'\u000D" }
      }

]

Ahmet-Dedeler commented 9 months ago

I've experienced this emoji problem multiple times too, would really be helpful to get a fix for this

karthiknadig commented 9 months ago

I have a potential fix for this, please give this a try: https://github.com/microsoft/vscode-black-formatter/actions/runs/7426466173/artifacts/1150857594

DavideCanton commented 9 months ago

@karthiknadig it seems to be working fine now. Thanks!

karthiknadig commented 8 months ago

Verification steps:

Install the latest pre-release
Create something.py:
```
s = '😊'
```
Trigger formatting with black using Format document with.
It should update to:
```
s = "😊"
```

JokerQyou commented 8 months ago

So when will this fix be released to the extension marketplace?

karthiknadig commented 8 months ago

@JokerQyou This is in pre-release version.