Open bfrg opened 2 years ago
So the problem here is that TextDiagnostic.cpp (~line 1240) intentionally introduces an additional 'space' at the beginning of the line for the purposes of trying to make it distinguishable from the diagnostic. However, since 'tabs' work on 'tab stops', it doesn't actually change anything in the line-emit, it only changes the arrow line. I don't think this matters WHERE on the line the tab is (that is, it doesn't have to be 'first', just before the diagnosed characters).
I suspect the solution is going to be to do a 'replace' on the SourceLIne to replace tabs with spaces.
@tahonermann has some ideas of a bunch of OTHER similar issues we probably already have here in addition (that don't just apply with this flag!) thanks to unicode 'full-width' chars, etc. I'll leave it to him to explain.
Thanks for tagging me, Erich. The other issues I had in mind involve combining characters and characters that are historically not displayed as a single width character in a monospace font. I played around a bit and it looks like Clang handles these cases pretty well. I only found one case that caused an alignment problem, though I am sure there are more. Unfortunately, there is currently no specification available to consult, so these cases have to be handled on a best effort basis. The case I found involves U+FDFD (ARABIC LIGATURE BISMILLAH AR-RAHMAN AR-RAHEEM) and it breaks alignment both with and without -fdiagnostics-print-source-range-info
. https://godbolt.org/z/6xnT539fx
The code that Erich referred to is here. I haven't tested, but I suspect inserting a character at the beginning of SourceLine
before the construction of sourceColMap
will suffice to resolve this issue.
Alternatively, another approach would be to, rather than calculating tab stops, ensure a tab is inserted in the caret line for each tab that appears in the source line. Presentation would then be relegated to the terminal, editor, etc... that displays the source and caret lines which, presumably, would then be rendered consistently.
Thanks for tagging me, Erich. The other issues I had in mind involve combining characters and characters that are historically not displayed as a single width character in a monospace font. I played around a bit and it looks like Clang handles these cases pretty well. I only found one case that caused an alignment problem, though I am sure there are more. Unfortunately, there is currently no specification available to consult, so these cases have to be handled on a best effort basis. The case I found involves U+FDFD (ARABIC LIGATURE BISMILLAH AR-RAHMAN AR-RAHEEM) and it breaks alignment both with and without
-fdiagnostics-print-source-range-info
. https://godbolt.org/z/6xnT539fx
I think the width estimation in Clang (actually LLVM) can be improved. For libc++ I'm working on implementing the extended grapheme clusters as specified in [format.string.std]/11. After I've implemented it for libc++ I want to look at improving LLVM's
llvm::sys::unicode::columnWidthUTF8
in a similar fashion.
(Note I don't intend to look at this issue regarding the TABS.)
The caret and squiggles are printed at the wrong position when the file uses tabs for indentation and
-fdiagnostics-print-source-range-info
is enabled.Example:
The byte indexes seem to be correct, it's just the caret and squiggles that are off by one character. It doesn't matter how many levels of indentation, their position is always off by one character. Converting tabs to spaces or omitting the
-fdiagnostics-print-source-range-info
option prints them at the correct position.Tested with latest clang 15: https://wandbox.org/permlink/yMyfaUii4aqbqmLM