`misc-misleading-bidirectional` warns on encoded bidirectional characters

llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.

http://llvm.org

Other

29.43k stars 12.16k forks source link

`misc-misleading-bidirectional` warns on encoded bidirectional characters #118498

Open aaronpuchert opened 15 hours ago

aaronpuchert commented 15 hours ago

The following source has no bidirectional characters in the text:

const char bidi[] = "\xe2\x80\xaa";

However, misc-misleading-bidirectional warns that:

> clang-tidy --checks=misc-misleading-bidirectional test.cpp --
1 warning generated.
[...]/test.cpp:1:21: warning: string literal contains misleading bidirectional Unicode characters [misc-misleading-bidirectional]
    1 | const char bidi[] = "\xe2\x80\xaa";
      |                     ^

This is likely because the check runs on the preprocessed source, where the \x escapes have been transformed. However, I don't think this is where we should warn, since the actual source as written is not misleading at all.

aaronpuchert commented 15 hours ago

CC @serge-sans-paille.

llvmbot commented 15 hours ago

@llvm/issue-subscribers-clang-tidy

Author: Aaron Puchert (aaronpuchert)

The following source has no bidirectional characters in the text: ```c++ const char bidi[] = "\xe2\x80\xaa"; ``` However, `misc-misleading-bidirectional` warns that: ``` > clang-tidy --checks=misc-misleading-bidirectional test.cpp -- 1 warning generated. [...]/test.cpp:1:21: warning: string literal contains misleading bidirectional Unicode characters [misc-misleading-bidirectional] 1 | const char bidi[] = "\xe2\x80\xaa"; | ^ ``` This is likely because the check runs on the preprocessed source, where the `\x` escapes have been transformed. However, I don't think this is where we should warn, since the actual source as written is not misleading at all.

aaronpuchert commented 14 hours ago

This is probably not easy to fix. As a pointer, StringLiteral::getLocationOfByte uses Lexer::LexFromRawLexer to re-lex the token. This could be used here as well: instead of StringLiteral::getBytes iterate over the constituent tokens via tokloc_{begin,end} and re-lex them to get their length. Then work on the original source as provided by SourceManager::getBufferData.

However, I wonder if this warning wouldn't be better implemented right in the Lexer itself. We already emit several Unicode warnings.