sourcegraph / sourcegraph-public-snapshot

Code AI platform with Code Search & Cody
https://sourcegraph.com

Emoji in comment makes syntax highlight break for the last character #46068

Open indradhanush opened 1 year ago

indradhanush commented 1 year ago

Emoji in one line

The last character of the line with an emoji in the comment renders in white.

[screenshot]

No emoji

Renders correctly.

[screenshot]

Emoji in all lines

All lines with at least one character after // have their last character rendered in white.

[screenshot]

Two emojis make the last two characters go white

Thanks to @pjlast!

[screenshot]

One more to verify @pjlast's hunch

Two emojis at the start make the last three characters go white. This is because the first emoji has a skin tone and takes double the bytes of a regular emoji.

[screenshot]
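The byte-size claim is easy to check in Rust: a skin-tone modifier is a separate code point attached to the base emoji, doubling its UTF-8 footprint (the emoji below is chosen for illustration, not taken from the screenshots):

```rust
fn main() {
    let plain = "👋"; // U+1F44B: one code point, 4 UTF-8 bytes
    let toned = "👋🏽"; // U+1F44B + U+1F3FD skin-tone modifier: 8 UTF-8 bytes
    assert_eq!(plain.len(), 4);
    assert_eq!(toned.len(), 8);
    println!("plain = {} bytes, toned = {} bytes", plain.len(), toned.len());
}
```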

pjlast commented 1 year ago

My hunch is that the line length is calculated with a method that treats an emoji as a single character, but when the highlight is applied the emojis count as multiple characters (since each one is multiple u8s in size). This leads to the emojis "swallowing" some of the highlighting at the end of the line.
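This mismatch can be demonstrated directly (hypothetical line contents; the 🙂 emoji is one char but four UTF-8 bytes):

```rust
fn main() {
    let line = "// fix 🙂 done";

    // A char-oriented count sees 13 "characters"...
    let char_len = line.chars().count();
    // ...but the UTF-8 byte length is 16, since the emoji is 4 bytes.
    let byte_len = line.len();
    assert_eq!((char_len, byte_len), (13, 16));

    // If a highlight span is computed from the char count but applied
    // as a byte range, it stops 3 bytes short of the end of the line,
    // leaving the trailing characters unstyled ("swallowed").
    assert_eq!(&line[..char_len], "// fix 🙂 d");
}
```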

Just a hunch

varungandhi-src commented 6 months ago

Related PR that was closed: https://github.com/sourcegraph/sourcegraph/pull/49753

kritzcreek commented 6 months ago

Here's a link to dotcom that easily reproduces the issue. It's much easier to see when using the dark theme: https://sourcegraph.com/github.com/sourcegraph/sourcegraph@67b2c665f1746793d658f43bbfb1b03fca41cbaa/-/blob/internal/scim/init.go?L72:6-72:7

varungandhi-src commented 6 months ago

Rough implementation sketch to avoid regressing perf:

- Make sure to set the position_encoding field on the Document.
- Keep the bulk of the changes out of the core highlighting logic.
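A minimal sketch of the offset conversion this implies, assuming a hypothetical helper (utf8_to_utf16_offset is not from the codebase); the point is that ASCII-only lines take a trivial fast path, so the common case stays cheap:

```rust
/// Convert a UTF-8 byte offset within `line` to a UTF-16 code-unit offset.
/// `byte_offset` must lie on a char boundary.
fn utf8_to_utf16_offset(line: &str, byte_offset: usize) -> usize {
    if line.is_ascii() {
        // Fast path: for ASCII, UTF-8 byte offsets and UTF-16
        // code-unit offsets coincide, so no re-encoding is needed.
        return byte_offset;
    }
    // Slow path: count the UTF-16 code units of the prefix.
    line[..byte_offset].encode_utf16().count()
}

fn main() {
    let line = "// hi 🙂!";
    // The '!' starts at UTF-8 byte 10 (6 ASCII bytes + 4-byte emoji),
    // but at UTF-16 offset 8 (6 units + a surrogate pair for the emoji).
    assert_eq!(utf8_to_utf16_offset(line, 10), 8);
    // Pure-ASCII input hits the fast path and is returned unchanged.
    assert_eq!(utf8_to_utf16_offset("// hello", 5), 5);
}
```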

kritzcreek commented 6 months ago

Just for clarification: you're proposing to reencode all document offsets into UTF-16 code units because these are easier to consume from JS, right? I'm asking because LineIndex from rust-analyzer is generic in its encoding, and to_wide can also emit code points (UTF-32). I think we could make an argument for exposing both UTF-16 and UTF-32 code units via the API.

That won't be free, so some perf regression is unavoidable, but I agree with your implementation sketch in terms of minimizing the impact.

varungandhi-src commented 6 months ago

You're proposing to reencode all document offsets into UTF-16 code units, because these are easier to consume from JS, right?

Yes. Though in the common case it will be mostly "free" (only the is_ascii checks).

I think we could make an argument for exposing both UTF-16 and UTF-32 code units via the API.

-1, we shouldn't expose stuff for which we don't have downstream users. We already have too much of that in other places.