llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

clang::SourceRange of clang::RawComment #91311

Open T-Gruber opened 2 months ago

T-Gruber commented 2 months ago

While working with the LibTooling library (llvm-project release 17.x), I noticed strange behaviour in the SourceRange reported for RawComments. I have written a small clang-based tool that detects all comments in a given C file and removes them. To do this, I take the SourceRange of each RawComment and remove it with the Rewriter.

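#include "clang/AST/ASTContext.h"
#include "clang/AST/RawCommentList.h"
#include "clang/Lex/Lexer.h"
#include "clang/Rewrite/Core/Rewriter.h"
#include <iostream>
#include <map>

inline void removeAllComments(clang::ASTContext &Context, clang::Rewriter &R) {
  const clang::SourceManager &SrcMgr = Context.getSourceManager();

  // Comments of the main file, keyed by their offset in the file.
  if (const std::map<unsigned int, clang::RawComment *> *CommentMap =
          Context.Comments.getCommentsInFile(SrcMgr.getMainFileID()))
    for (auto [LineInfo, Comment] : *CommentMap) {
      // Remove the comment using the SourceRange it reports.
      R.RemoveText(Comment->getSourceRange());
      // Print the same range as seen by the Lexer, next to the raw comment text.
      std::cout << "SourceRange via Lexer: "
                << clang::Lexer::getSourceText(
                       clang::CharSourceRange::getTokenRange(Comment->getSourceRange()),
                       SrcMgr, Context.getLangOpts(), 0).str()
                << "\nRawTest: " << Comment->getRawText(SrcMgr).str() << "\n\n";
    }
}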

To run the standalone tool I use the following command:

$ bin/comment_tool test.c --extra-arg=-fparse-all-comments --

I tested a short code snippet:

int a = 1 /*comment*/;
extern int /*comment*/ b;

And got the following rewritten result:

int a = 1 
extern int  b;

As you can see, all comments are removed, but the semicolon that ends the first VarDecl is deleted as well. The terminal output shows why: the SourceRange of the first comment, when read back via the Lexer, includes the semicolon. The RawTest output, however, matches my expectation:

SourceRange via Lexer: /*comment*/;
RawTest: /*comment*/

SourceRange via Lexer: /*comment*/
RawTest: /*comment*/

Is this intentional behaviour? I would be grateful for any advice!
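
One possible explanation, offered as an assumption rather than a confirmed answer: the range returned by `Comment->getSourceRange()` seems to end one character past the comment (a character range), while both `Rewriter::RemoveText(SourceRange)` and `CharSourceRange::getTokenRange` treat the end location as the start of the last token and extend the range over that token. The toy model below does not use clang at all. `toyTokenLength`, `textOfCharRange` and `textOfTokenRange` are hypothetical helpers, and the rule that whitespace measures as length 0 is an assumption chosen to match the output above.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical stand-in for measuring the token that starts at `pos`.
// Assumptions: whitespace (or end of buffer) measures as 0, a run of
// identifier/number characters is one token, anything else is one character.
inline std::size_t toyTokenLength(const std::string &buf, std::size_t pos) {
  auto isWord = [](char c) {
    return std::isalnum(static_cast<unsigned char>(c)) || c == '_';
  };
  if (pos >= buf.size() ||
      std::isspace(static_cast<unsigned char>(buf[pos])))
    return 0;
  if (isWord(buf[pos])) {
    std::size_t len = 0;
    while (pos + len < buf.size() && isWord(buf[pos + len]))
      ++len;
    return len;
  }
  return 1; // punctuation such as ';'
}

// Character range: covers exactly [begin, end).
inline std::string textOfCharRange(const std::string &buf, std::size_t begin,
                                   std::size_t end) {
  return buf.substr(begin, end - begin);
}

// Token range: `end` is read as the START of the last token, so the
// covered text is extended by the length of the token found there.
inline std::string textOfTokenRange(const std::string &buf, std::size_t begin,
                                    std::size_t end) {
  return buf.substr(begin, end + toyTokenLength(buf, end) - begin);
}
```

For the first test line, the token-range reading yields `/*comment*/;` and the character-range reading yields `/*comment*/`. For the second line both readings agree, because the comment is followed by a space. If this reading is right, passing `clang::CharSourceRange::getCharRange(Comment->getSourceRange())` to `R.RemoveText` might leave the semicolon in place; this is untested against release 17.x.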

llvmbot commented 2 months ago

@llvm/issue-subscribers-clang-frontend
