clang: caret and squiggles are sometimes off by one character with -fdiagnostics-print-source-range-info and TABS for indentation

bfrg commented 2 years ago

The caret and squiggles are printed at the wrong position when the file uses tabs for indentation and -fdiagnostics-print-source-range-info is enabled.

Example:

int main(int argc, char* argv[])
{
    int arr[] = {1,2,3,4,5};
    double sum = 0;

    for (int i = 0; i < 5u; ++i)
        sum += i;

    return sum;
}

$ clang -Wall -Wextra -fdiagnostics-print-source-range-info test.c
test.c:3:6: warning: unused variable 'arr' [-Wunused-variable]
        int arr[] = {1,2,3,4,5};
             ^
test.c:1:14: warning: unused parameter 'argc' [-Wunused-parameter]
 int main(int argc, char* argv[])
              ^
test.c:1:26: warning: unused parameter 'argv' [-Wunused-parameter]
 int main(int argc, char* argv[])
                          ^
test.c:6:20:{6:18-6:19}{6:22-6:24}: warning: comparison of integers of different signs: 'int' and 'unsigned int' [-Wsign-compare]
        for (int i = 0; i < 5u; ++i)
                         ~ ^ ~~
4 warnings generated.

The byte indexes seem to be correct, it's just the caret and squiggles that are off by one character. It doesn't matter how many levels of indentation, their position is always off by one character. Converting tabs to spaces or omitting the -fdiagnostics-print-source-range-info option prints them at the correct position.

Tested with latest clang 15: https://wandbox.org/permlink/yMyfaUii4aqbqmLM

erichkeane commented 2 years ago

So the problem here is that TextDiagnostic.cpp (~line 1240) intentionally introduces an additional 'space' at the beginning of the line for the purposes of trying to make it distinguishable from the diagnostic. However, since 'tabs' work on 'tab stops', it doesn't actually change anything in the line-emit, it only changes the arrow line. I don't think this matters WHERE on the line the tab is (that is, it doesn't have to be 'first', just before the diagnosed characters).

I suspect the solution is going to be to do a 'replace' on the SourceLIne to replace tabs with spaces.

erichkeane commented 2 years ago

@tahonermann has some ideas of a bunch of OTHER similar issues we probably already have here in addition (that don't just apply with this flag!) thanks to unicode 'full-width' chars, etc. I'll leave it to him to explain.

tahonermann commented 2 years ago

Thanks for tagging me, Erich. The other issues I had in mind involve combining characters and characters that are historically not displayed as a single width character in a monospace font. I played around a bit and it looks like Clang handles these cases pretty well. I only found one case that caused an alignment problem, though I am sure there are more. Unfortunately, there is currently no specification available to consult, so these cases have to be handled on a best effort basis. The case I found involves U+FDFD (ARABIC LIGATURE BISMILLAH AR-RAHMAN AR-RAHEEM) and it breaks alignment both with and without -fdiagnostics-print-source-range-info. https://godbolt.org/z/6xnT539fx

The code that Erich referred to is here. I haven't tested, but I suspect inserting a character at the beginning of SourceLine before the construction of sourceColMap will suffice to resolve this issue.

Alternatively, another approach would be to, rather than calculating tab stops, ensure a tab is inserted in the caret line for each tab that appears in the source line. Presentation would then be relegated to the terminal, editor, etc... that displays the source and caret lines which, presumably, would then be rendered consistently.

mordante commented 2 years ago

Thanks for tagging me, Erich. The other issues I had in mind involve combining characters and characters that are historically not displayed as a single width character in a monospace font. I played around a bit and it looks like Clang handles these cases pretty well. I only found one case that caused an alignment problem, though I am sure there are more. Unfortunately, there is currently no specification available to consult, so these cases have to be handled on a best effort basis. The case I found involves U+FDFD (ARABIC LIGATURE BISMILLAH AR-RAHMAN AR-RAHEEM) and it breaks alignment both with and without -fdiagnostics-print-source-range-info. https://godbolt.org/z/6xnT539fx

I think the width estimation in Clang (actually LLVM) can be improved. For libc++ I'm working on implementing the extended grapheme clusters as specified in [format.string.std]/11. After I've implemented it for libc++ I want to look at improving LLVM's llvm::sys::unicode::columnWidthUTF8 in a similar fashion.

(Note I don't intend to look at this issue regarding the TABS.)

llvm / llvm-project

clang: caret and squiggles are sometimes off by one character with -fdiagnostics-print-source-range-info and TABS for indentation #54551