llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org
Other
28.43k stars 11.75k forks source link

Assertion in CommentParser.cpp with -Wdocumentation #21628

Closed wjristow closed 9 years ago

wjristow commented 10 years ago
Bugzilla Link 21254
Resolution FIXED
Resolved on Oct 16, 2014 06:20
Version unspecified
OS Windows NT
CC @majnemer,@DougGregor,@gribozavr

Extended Description

The test-case, "test.c" appended below, hits an assertion-failure when '-Weverything' is used (more specifically, it's the warning '-Wdocumentation' that needs to be enabled to cause the assertion-failure.)

Appears to happen irrespective of which target is specified or other switches (e.g., -O, -g, etc.) Tested with:

$ clang --version clang version 3.6.0 (trunk 219524) Target: x86_64-unknown-linux-gnu Thread model: posix $ clang -c -Wdocumentation test.c test.c:30:27: warning: unknown command tag name 'in'; did you mean 'fn'? [-Wdocumentation]

"test.c" below


// ============================================================================ /**

llvmbot commented 9 years ago

Fixed in r219802. The buildbots seem happy with the change.

llvmbot commented 9 years ago

Updating NumInlineCommandCommentBits accordingly should be sufficient.

I have now sent a patch for review.

I've updated the Num... entries and I've also provided a definition of the number of bits in the ID field (via an enum, like in Comment) so that we don't have to use the same literal integer in more than one place.

gribozavr commented 9 years ago

Updating NumInlineCommandCommentBits accordingly should be sufficient.

llvmbot commented 9 years ago

I wanted to use it for the test, so I could dump the registered commands, but now I guess I have to find the bug in that tool too. :-)

And I found it, the problem is actually in clang/include/AST/Comment.h.

Class Comment has an anonymous union of structs that are only containing bitfields. Some of these have a CommandID bitfield, whose width is 8. Of course I've tried to extend it... ... but then the structs are embedded in other structs, and the entire bitfield package is embedded in another bitfield whose type is unsigned int, and of course the size goes over the maximum size of an unsigned int.

This means that we can't arbitrarily change the size of that bitfield without re-engineering those structs. Why are we using bitfields for this? Do we really need to pack those structs in as few bits as possible? I don't think we'll have millions of Comment instances, maybe we don't need to optimize memory that much...

Anyway, I can try to change CommandID to a reasonable size that doesn't go over the budget, but this might only solve our issue for small numbers. We'll still have an upper limit for the number of commands.

llvmbot commented 9 years ago

Awesome! Thank you.

In the process of writing a test, I have also found out that although the clang functionality now works, "c-index-test" still prints out the 'wrapped around' command codes if I run it on the provided test. So there must be another narrow data field somewhere in that tool.

I wanted to use it for the test, so I could dump the registered commands, but now I guess I have to find the bug in that tool too. :-)

gribozavr commented 9 years ago

Awesome! Thank you.

llvmbot commented 9 years ago

Thank you for investigating this. The idea behind this handling of unknown commands is to allow third-party clients of Clang as a library to handle their own commands, possibly without even defining anything or modifying Clang source.

Hm... yes, that makes sense. In which case it's even more important to deal with the bitfield problem, because a Clang client could very well add more than 256 commands.

It would be interesting to understand why increasing the width of the bitfield causes failures somewhere else. There is TableGen code that knows the layout of this struct, but the bitfield width should not matter

It turns out I had made a silly mistake and extended the bitfield too much... I just wanted to quickly test my hypothesis so I set the width to a large number which was too much for an unsigned int. It made the test case work, but yeah, silly me.

If I actually remove the bitfield specifier, and leave the ID field as an unsigned int (like all the arguments of those functions using CommandIDs) everything seems to work.

I'll post a patch for review soon.

gribozavr commented 9 years ago

Dario,

Thank you for investigating this. The idea behind this handling of unknown commands is to allow third-party clients of Clang as a library to handle their own commands, possibly without even defining anything or modifying Clang source.

It would be interesting to understand why increasing the width of the bitfield causes failures somewhere else. There is TableGen code that knows the layout of this struct, but the bitfield width should not matter:

/// When reordering, adding or removing members please update the corresponding /// TableGen backend.

llvmbot commented 9 years ago

I think I found what the problem is, and I may have a fix, but it needs a bit more work and double-checking before I can submit a patch.

TL;DR explanation:

The fix would be to increase the width of the CommandInfo::ID field to something sensible. However, if I do so, a few tests start to fail.

--- DETAILED EXPLANATION ---

The CommentLexer can recognize Doxygen-style documentation comments and create tokens for those; the CommentParser then acts on them to perform checks such as identifying Doxygen syntax errors or unknown commands. It is also possible to define custom commands via a clang option, so that the lexer and the parser will recognize them.

This "custom commands" system however is also used for something which (IMHO) it shouldn't be used for. The parser, upon encountering an unknown command, outputs a warning but also registers the command as a new custom command (see clang/lib/AST/CommentParser.cpp lines 564:569, the call to actOnUnknownCommand performs the registration, see clang/lib/AST/CommentSema.cpp lines 399 to 404). I can only guess that the intention is that the second time the same unknown command is encountered it is processed normally and no second warning is produced.

And that's indeed what happens: the lexer is quite ignorant of the meaning of these commands (as it should be). It has to ask a traits class what they are (see clang/lib/AST/CommentLexer.cpp lines 365 to 395), but the actOnUnknownCommand call has changed the traits... so upon encountering a repeated unknown command it processes it as if it was a custom command and ends up creating a tok::at_command Token instead of another tok::unknown_command Token.

When the parser then processes the tok::at_command Token, it tries to determine the CommandInfo structure that describes it. However, the lookup is performed by passing an unsigned int as a CommandID.

This would in principle work... but the problem is that the ID field in the CommandInfo struct is declared as such: unsigned ID : 8; So it's a bitfield with 8 bits. That implies a maximum of 256 commands.

You can see where this is going. If the source file is pathologically wrong and contains a vast number of unique unknown Doxygen-style commands, new custom commands are registered and at some point the new IDs allocated for these new commands will be stored in a 1-byte field that wraps around.

At this point, when the parser looks up a command by ID, it will get a CommandInfo struct with an ID field which is different than the one it should have!

This causes the parser to misinterpret the meaning of the token, and eventually the assertion fails. We are lucky that the assertion catches this case; if it didn't, we would likely get spurious behaviour.

The obvious solution to this is to increase the width of the ID field. I have tried to do so but I get several test failures:

clang/test/Index/annotate-comments.cpp clang/test/Index/comment-custom-block-command.cpp clang/test/Index/comment-to-html-xml-conversion.cpp clang/test/Index/headerfile-comment-to-html.m clang/test/Index/retain-comments-from-system-headers.c clang/test/Misc/ast-dump-comment.cpp

I will have to have a proper look at them before I submit a patch. Probably I'll have something ready tomorrow.

991901f3-cc14-4404-b340-165691b62a58 commented 10 years ago

This assertion can be triggered under very peculiar circumstances... I can't seem to reduce beyond this: /* @​s @​tr @​y @​tt @​tg @​alu @​U @​I @​r @​t0 @​t1 @​ur @​S @​E @​pb @​f @​pe @​lue @​re @​oa @​l @​x @​R @​ute @​am @​ei @​oun @​ou @​nl @ @​ien @​fr @​en @​tet @​le @​L @​os @​A @​ro @​o @​ho @​ca @​Tie @​tl @​g @​hr @​et @​fro @​ast @​ae @​nN @​pc @​tae @​ws @​ia @​N @​lc @​psg @​ta @​t2 @​D @​str @​ra @​t3 @​t @​xt @​eN @​fe @​rU @​ar @​eD @​iE @​se @​st1 @​rr @​ime @​ft @​lm @​wD @​wne @​h @​otn @​use @​roi @​ldc @​ln @​d @​ee @​ep @​us @​ut @​u @​n @​Nme @​min @​ma @​pct @​hd @​be @​It @​id @​cm @​ua @​fs @​Al @​axn @​rt @​to @​is @​fo @​i @​an @​de @​tel @​nd @​dic @​Lo @​il @​tle @​axt @​ba @​ust @​ac @​tpe @​tpl @​ctG @​ru @​m @​tG @​it @​rh @​G @​rpc @​el @​er @​w @​eo @​tx @​oo @​dD @​dD / void f();