odin-lang / Odin

Odin Programming Language
https://odin-lang.org
BSD 3-Clause "New" or "Revised" License
6.13k stars 551 forks source link

Odin diagnostics get messed up by non-ASCII codepoints #3743

Closed Mango0x45 closed 3 weeks ago

Mango0x45 commented 3 weeks ago

Context

$ odin version
odin version dev-2024-02-nightly:539cec74
$ odin report
Where to find more information and get into contact when you encounter a bug:

    Website: https://odin-lang.org
    GitHub:  https://github.com/odin-lang/Odin/issues

Useful information to add to a bug report:

    Odin: dev-2024-02-nightly:539cec74
    OS:   Arch Linux, Linux 6.9.3-arch1-1
    CPU:  11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz
    RAM:  15779 MiB

Expected Behavior

When diagnostics are emit, I expect the ^~~~^ range to match the token the compiler is complaining about.

Current Behavior

Currently, the range ASCII-art will overshoot the token if it’s preceded on the line by any non-ASCII codepoint.

Steps to Reproduce

Run odin build foo.odin -file on the following:

package main

main :: proc() {
        City :: struct { population: int }
        växjö := City { "foo" }
}

Failure Logs

$ odin run test.odin -file
/home/thomas/test.odin(5:18) Error: Cannot convert '"foo"' to 'int' from 'untyped string', got "foo"
    växjö := City { "foo" }
                      ^~~~^  Off by two columns!
Mango0x45 commented 3 weeks ago

Looking at the compiler, it seems it uses utf8proc? That means that the correct amount of left-padding should be easy to determine with utf8proc_charwidth()

Feoramund commented 3 weeks ago

Unicode visual width is a tough, potentially unsolvable problem. See the discussion in #3432, particularly these comments https://github.com/odin-lang/Odin/issues/3432#issuecomment-2057519705 and https://github.com/odin-lang/Odin/issues/3432#issuecomment-2057724712.

It's unfortunately not something that can be solved simply by counting codepoints/runes.

Mango0x45 commented 3 weeks ago

It's unfortunately not something that can be solved simply by counting codepoints/runes.

Yes, that's why I mentioned the utf8proc_charwidth() function, which does more than simply count codepoints. While it's impossible to know how wide a codepoint is for sure without having font information, you can reliably get it right about 99% of the time in practice by simply detecting if a codepoint is a combining mark, control character, full width CJK, emoji, etc. It's pretty easy to generate a lookup table to do this, and luckily utf8proc already does it for us.

This is the same approach used by editors like Vim to tell you what screen column you're on, and compilers like GCC to tell you what column an error is at (and to format the error message properly).

Kelimion commented 3 weeks ago

Closed via #3744.