Assertion in CommentParser.cpp with -Wdocumentation

wjristow commented 10 years ago


Bugzilla Link	21254
Resolution	FIXED
Resolved on	Oct 16, 2014 06:20
Version	unspecified
OS	Windows NT
CC	@majnemer,@DougGregor,@gribozavr

Extended Description

The test-case, "test.c" appended below, hits an assertion-failure when '-Weverything' is used (more specifically, it's the warning '-Wdocumentation' that needs to be enabled to cause the assertion-failure.)

Appears to happen irrespective of which target is specified or other switches (e.g., -O, -g, etc.) Tested with:

$ clang --version clang version 3.6.0 (trunk 219524) Target: x86_64-unknown-linux-gnu Thread model: posix $ clang -c -Wdocumentation test.c test.c:30:27: warning: unknown command tag name 'in'; did you mean 'fn'? [-Wdocumentation]

@inlen: the length of @in ^~~ fn clang: ..../llvm/tools/clang/lib/AST/CommentParser.cpp:595: clang::comments::BlockContentComment* clang::comments::Parser::parseParagraphOrBlockCommand(): Assertion `Info->IsInlineCommand' failed. ... $

"test.c" below

// ============================================================================ /**

@size: the size requested in bytes
@str: a zero terminated string
@systemId: The system ID of the entity
@atts: An array of name/value attributes pairs, NULL terminated
@target: the target name
@value: the comment content
@URL: The System ID of the resource requested
@ID: The Public ID of the resource requested
@userData: user provided data for the error callback
@data0: the first data
@data1: the second data
@user: extra user provided data to the walker
@SystemID: the SYSTEM ID (e.g. filename or URL)
@ExternalID: the external ID
@publicId: The public ID of the entity
@fullname: the attribute name
@type: the attribute type
@defaultValue: the attribute default value
@tree: the tree of enumerated value set
@notationName: the name of the notation
@loc: A SAX Locator
@ctx: an XML parser context
@URI: the element namespace name if available
@attributes: pointer to the array of (localname/prefix/URI/value/end)
@localname: the local name of the element
@prefix: the element namespace prefix if available
@outlen: the length of @out
@inlen: the length of @in
@filename: the filename or URI
@buffer: the buffer to store data read
@len: the length of the buffer in bytes
@context: an Output context
@role: the role string
@nbLocators: the number of locators detected on the link
@roles: pointer to the array of locator roles
@nbArcs: the number of arcs detected on the link
@from: pointer to the array of source roles found on the arcs
@to: pointer to the array of target roles found on the arcs
@show: array of values for the show attributes found on the arcs
@actuate: array of values for the actuate attributes found on the arcs
@nbTitles: the number of titles detected on the link
@title: array of titles detected on the link
@langs: array of xml:lang values for the titles
@hrefs: pointer to the array of locator hrefs
@ent: the entity
@firstNode: the fist node in the chunk
@lastNode: the last nod in the chunk
@localName: the local name
Returns the localName if @namespaceName is NULL, a formatted
@pc: the type of processContents
@retValue: the returned value
@ws: the whitespace type of the value
@itemName: the name of the item
@itemNode: the node of the item
@local: the local name
@parsing: if the function is used during the parse
@extra: extra informations
@str2: extra data
@strData1: extra data
@strData2: extra data
@strData3: extra data
@str3: extra parameter for the message display
@str4: extra parameter for the message display
@ctxt: the schema validation context
@ownerName: the name of the owner
@refName: the referenced local name
@refURI: the referenced namespace URI
@attr: the illegal attribute node
@ownerDes: the designation of the attribute's owner
@itemElem: the node of the schema item
@message: the error message
@str1: an optional param for the error message
@error: the error code
@itemDes: the designation of the schema type
@facet: the illegal facet
@elemDes: the designation of the parent element node
@onwerDes: the designation of the holder of the content
@ownerElem: the node of the holder of the content
@child: the invalid child node
@content: the optional string describing the correct content
@use: an attribute use
@prohib: an attribute use prohibition
@wildcard: a wildcard structure
@alink: a type link
@idc: a identity-constraint definition
@elem: an element
@depth: the depth used for intentation
@uses: attribute uses list
@output: the file output
a namespace URI of @uri.
@ns: the target namespace of the attribute
@nsName: the target namespace
@min: the minOccurs
@max: the maxOccurs
@pctxt: a schema parser context
@head: the head of the substitution group
@member: the new member of the substitution group
@ownerItem: the schema object owner if existent
@wildc: the wildcard, already created
@schema: the schema being built
@flagQualified: the specific flag for "qualified"
@flags: the flags to be modified
@flagAll: the specific flag for "#all"
@flagExtension: the specific flag for "extension"
@flagRestriction: the specific flag for "restriction"
@flagSubstitution: the specific flag for "substitution"
@flagList: the specific flag for "list"
@flagUnion: the specific flag for "union"
@item: the schema component
@annot: the annotation
@node: a subtree containing XML Schema informations
@topLevel: indicates if this is global declaration
@nodes: the list of top level nodes
@dict: the dictionary to be used
@schemaLocation, NULL otherwise.
@final: the final
@particle: the particle
@actxt: a context
@baseType: the base type definition
@subset: the subset of ('restriction', ect.)
@ancestor: an ancestor of @ctxtType
@typeDecl: the schema type definition
@ctxtMGroup: the searched model group
@selfMGroup: the second searched model group
@mg: the model group
@ctxtGr: the searched attribute group
@list: the attribute uses
and returned via the @prohibs list.
@attrGr: the attribute group definition
@attrgrpDecl: the schema attribute definition
@decl: the element declaration
@err: the error callback
@warn: the warning callback
@serror: the structured error function
@vctxt: a schema validation context
@location: the location of the schema
@idcDef: the IDC definition
@idcDef: the IDC definition of this binding */ void foo(){} // ============================================================================

llvmbot commented 9 years ago

Fixed in r219802. The buildbots seem happy with the change.

llvmbot commented 9 years ago

Updating NumInlineCommandCommentBits accordingly should be sufficient.

I have now sent a patch for review.

I've updated the Num... entries and I've also provided a definition of the number of bits in the ID field (via an enum, like in Comment) so that we don't have to use the same literal integer in more than one place.

gribozavr commented 9 years ago

Updating NumInlineCommandCommentBits accordingly should be sufficient.

llvmbot commented 9 years ago

I wanted to use it for the test, so I could dump the registered commands, but now I guess I have to find the bug in that tool too. :-)

And I found it, the problem is actually in clang/include/AST/Comment.h.

Class Comment has an anonymous union of structs that are only containing bitfields. Some of these have a CommandID bitfield, whose width is 8. Of course I've tried to extend it... ... but then the structs are embedded in other structs, and the entire bitfield package is embedded in another bitfield whose type is unsigned int, and of course the size goes over the maximum size of an unsigned int.

This means that we can't arbitrarily change the size of that bitfield without re-engineering those structs. Why are we using bitfields for this? Do we really need to pack those structs in as few bits as possible? I don't think we'll have millions of Comment instances, maybe we don't need to optimize memory that much...

Anyway, I can try to change CommandID to a reasonable size that doesn't go over the budget, but this might only solve our issue for small numbers. We'll still have an upper limit for the number of commands.

llvmbot commented 9 years ago

Awesome! Thank you.

In the process of writing a test, I have also found out that although the clang functionality now works, "c-index-test" still prints out the 'wrapped around' command codes if I run it on the provided test. So there must be another narrow data field somewhere in that tool.

I wanted to use it for the test, so I could dump the registered commands, but now I guess I have to find the bug in that tool too. :-)

gribozavr commented 9 years ago

Awesome! Thank you.

llvmbot commented 9 years ago

Thank you for investigating this. The idea behind this handling of unknown commands is to allow third-party clients of Clang as a library to handle their own commands, possibly without even defining anything or modifying Clang source.

Hm... yes, that makes sense. In which case it's even more important to deal with the bitfield problem, because a Clang client could very well add more than 256 commands.

It would be interesting to understand why increasing the width of the bitfield causes failures somewhere else. There is TableGen code that knows the layout of this struct, but the bitfield width should not matter

It turns out I had made a silly mistake and extended the bitfield too much... I just wanted to quickly test my hypothesis so I set the width to a large number which was too much for an unsigned int. It made the test case work, but yeah, silly me.

If I actually remove the bitfield specifier, and leave the ID field as an unsigned int (like all the arguments of those functions using CommandIDs) everything seems to work.

I'll post a patch for review soon.

gribozavr commented 9 years ago

Dario,

Thank you for investigating this. The idea behind this handling of unknown commands is to allow third-party clients of Clang as a library to handle their own commands, possibly without even defining anything or modifying Clang source.

It would be interesting to understand why increasing the width of the bitfield causes failures somewhere else. There is TableGen code that knows the layout of this struct, but the bitfield width should not matter:

/// When reordering, adding or removing members please update the corresponding /// TableGen backend.

llvmbot commented 9 years ago

I think I found what the problem is, and I may have a fix, but it needs a bit more work and double-checking before I can submit a patch.

TL;DR explanation:

The CommandInfo::ID field is an 8-bit bitfield (see clang/include/clang/AST/CommentCommandTraits.h).
There exists a situation where enough new CommandInfo::IDs are created by the CommentParser so that the number wraps around.
The CommentParser then receives a token that should have an ID greater than 256 but has wrapped around, and therefore looks like a different ID.
The assertion fails because the parser is misinterpreting the token.

The fix would be to increase the width of the CommandInfo::ID field to something sensible. However, if I do so, a few tests start to fail.

--- DETAILED EXPLANATION ---

The CommentLexer can recognize Doxygen-style documentation comments and create tokens for those; the CommentParser then acts on them to perform checks such as identifying Doxygen syntax errors or unknown commands. It is also possible to define custom commands via a clang option, so that the lexer and the parser will recognize them.

This "custom commands" system however is also used for something which (IMHO) it shouldn't be used for. The parser, upon encountering an unknown command, outputs a warning but also registers the command as a new custom command (see clang/lib/AST/CommentParser.cpp lines 564:569, the call to actOnUnknownCommand performs the registration, see clang/lib/AST/CommentSema.cpp lines 399 to 404). I can only guess that the intention is that the second time the same unknown command is encountered it is processed normally and no second warning is produced.

And that's indeed what happens: the lexer is quite ignorant of the meaning of these commands (as it should be). It has to ask a traits class what they are (see clang/lib/AST/CommentLexer.cpp lines 365 to 395), but the actOnUnknownCommand call has changed the traits... so upon encountering a repeated unknown command it processes it as if it was a custom command and ends up creating a tok::at_command Token instead of another tok::unknown_command Token.

When the parser then processes the tok::at_command Token, it tries to determine the CommandInfo structure that describes it. However, the lookup is performed by passing an unsigned int as a CommandID.

This would in principle work... but the problem is that the ID field in the CommandInfo struct is declared as such: unsigned ID : 8; So it's a bitfield with 8 bits. That implies a maximum of 256 commands.

You can see where this is going. If the source file is pathologically wrong and contains a vast number of unique unknown Doxygen-style commands, new custom commands are registered and at some point the new IDs allocated for these new commands will be stored in a 1-byte field that wraps around.

At this point, when the parser looks up a command by ID, it will get a CommandInfo struct with an ID field which is different than the one it should have!

This causes the parser to misinterpret the meaning of the token, and eventually the assertion fails. We are lucky that the assertion catches this case; if it didn't, we would likely get spurious behaviour.

The obvious solution to this is to increase the width of the ID field. I have tried to do so but I get several test failures:

clang/test/Index/annotate-comments.cpp clang/test/Index/comment-custom-block-command.cpp clang/test/Index/comment-to-html-xml-conversion.cpp clang/test/Index/headerfile-comment-to-html.m clang/test/Index/retain-comments-from-system-headers.c clang/test/Misc/ast-dump-comment.cpp

I will have to have a proper look at them before I submit a patch. Probably I'll have something ready tomorrow.

991901f3-cc14-4404-b340-165691b62a58 commented 10 years ago

This assertion can be triggered under very peculiar circumstances... I can't seem to reduce beyond this: /* @s @tr @y @tt @tg @alu @U @I @r @t0 @t1 @ur @S @E @pb @f @pe @lue @re @oa @l @x @R @ute @am @ei @oun @ou @nl @ @ien @fr @en @tet @le @L @os @A @ro @o @ho @ca @Tie @tl @g @hr @et @fro @ast @ae @nN @pc @tae @ws @ia @N @lc @psg @ta @t2 @D @str @ra @t3 @t @xt @eN @fe @rU @ar @eD @iE @se @st1 @rr @ime @ft @lm @wD @wne @h @otn @use @roi @ldc @ln @d @ee @ep @us @ut @u @n @Nme @min @ma @pct @hd @be @It @id @cm @ua @fs @Al @axn @rt @to @is @fo @i @an @de @tel @nd @dic @Lo @il @tle @axt @ba @ust @ac @tpe @tpl @ctG @ru @m @tG @it @rh @G @rpc @el @er @w @eo @tx @oo @dD @dD / void f();

llvm / llvm-project

Assertion in CommentParser.cpp with -Wdocumentation #21628

Extended Description