sourcegraph / sourcegraph-public-snapshot

Code AI platform with Code Search & Cody
https://sourcegraph.com

Emoji in comment makes syntax highlight break for the last character #46068

Open indradhanush opened 1 year ago

indradhanush commented 1 year ago

Emoji in one line

The last character of the line with an emoji in the comment renders in white.

[screenshot]

No emoji

Renders correctly.

[screenshot]

Emoji in all lines

All lines with at least one character after // have their last character rendered in white.

[screenshot]

Two emojis make the last two characters go white

Thanks to @pjlast!

[screenshot]

One more to verify @pjlast's hunch

Two emojis at the start make the last three characters go white. This is because the first emoji has a skin tone and takes double the bytes of a regular emoji.

[screenshot]
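The byte-size claim is easy to check in Rust: a skin-tone modifier is a separate code point attached to the base emoji, doubling its UTF-8 footprint (the emoji below is chosen for illustration, not taken from the screenshots):

```rust
fn main() {
    let plain = "👋"; // U+1F44B: one code point, 4 UTF-8 bytes
    let toned = "👋🏽"; // U+1F44B + U+1F3FD skin-tone modifier: 8 UTF-8 bytes
    assert_eq!(plain.len(), 4);
    assert_eq!(toned.len(), 8);
    println!("plain = {} bytes, toned = {} bytes", plain.len(), toned.len());
}
```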

pjlast commented 1 year ago

My hunch is that the line length is calculated with a method that treats an emoji as a single character, but when the highlight is applied the emojis count as multiple characters (since each one is multiple u8s in size). This leads to the emojis "swallowing" some of the highlighting at the end of the line.
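This mismatch can be demonstrated directly (hypothetical line contents; the 🙂 emoji is one char but four UTF-8 bytes):

```rust
fn main() {
    let line = "// fix 🙂 done";

    // A char-oriented count sees 13 "characters"...
    let char_len = line.chars().count();
    // ...but the UTF-8 byte length is 16, since the emoji is 4 bytes.
    let byte_len = line.len();
    assert_eq!((char_len, byte_len), (13, 16));

    // If a highlight span is computed from the char count but applied
    // as a byte range, it stops 3 bytes short of the end of the line,
    // leaving the trailing characters unstyled ("swallowed").
    assert_eq!(&line[..char_len], "// fix 🙂 d");
}
```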

Just a hunch

varungandhi-src commented 6 months ago

Related PR that was closed: https://github.com/sourcegraph/sourcegraph/pull/49753

kritzcreek commented 6 months ago

Here's a link to dotcom that easily reproduces the issue. It's much easier to see when using the dark theme: https://sourcegraph.com/github.com/sourcegraph/sourcegraph@67b2c665f1746793d658f43bbfb1b03fca41cbaa/-/blob/internal/scim/init.go?L72:6-72:7

varungandhi-src commented 6 months ago

Rough implementation sketch to avoid regressing perf:

- Make sure to set the position_encoding field on the Document.
- Keep the bulk of the changes out of the core highlighting logic.
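A minimal sketch of the offset conversion this implies, assuming a hypothetical helper (utf8_to_utf16_offset is not from the codebase); the point is that ASCII-only lines take a trivial fast path, so the common case stays cheap:

```rust
/// Convert a UTF-8 byte offset within `line` to a UTF-16 code-unit offset.
/// `byte_offset` must lie on a char boundary.
fn utf8_to_utf16_offset(line: &str, byte_offset: usize) -> usize {
    if line.is_ascii() {
        // Fast path: for ASCII, UTF-8 byte offsets and UTF-16
        // code-unit offsets coincide, so no re-encoding is needed.
        return byte_offset;
    }
    // Slow path: count the UTF-16 code units of the prefix.
    line[..byte_offset].encode_utf16().count()
}

fn main() {
    let line = "// hi 🙂!";
    // The '!' starts at UTF-8 byte 10 (6 ASCII bytes + 4-byte emoji),
    // but at UTF-16 offset 8 (6 units + a surrogate pair for the emoji).
    assert_eq!(utf8_to_utf16_offset(line, 10), 8);
    // Pure-ASCII input hits the fast path and is returned unchanged.
    assert_eq!(utf8_to_utf16_offset("// hello", 5), 5);
}
```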

kritzcreek commented 6 months ago

Just for clarification: you're proposing to reencode all document offsets into UTF-16 code units because these are easier to consume from JS, right? I'm asking because LineIndex from rust-analyzer is generic in its encoding, and to_wide can also emit code points (UTF-32). I think we could make an argument for exposing both UTF-16 and UTF-32 code units via the API.

That won't be free, so some perf regression is unavoidable, but I agree with your implementation sketch in terms of minimizing the impact.

varungandhi-src commented 6 months ago

You're proposing to reencode all document offsets into UTF-16 code units, because these are easier to consume from JS, right?

Yes. Though in the common case it will be mostly "free" (only the is_ascii checks).

I think we could make an argument for exposing both UTF-16 and UTF-32 code units via the API.

-1, we shouldn't expose stuff for which we don't have downstream users. We already have too much of that in other places.