No custom C++11 attributes are parsed

JhnW commented 2 years ago

Hi. The C ++ 17 standard specifies that attributes not defined in the compiler implementation will be ignored without error. I am using the C ++ 17 clang parser module to extract information (being strict, python bins but it doesn't matter now). Clang ignores unfamiliar attributes at the stage of parsing the code. This is quite a problem for my project, libclang is a great parser, and information about unfamiliar attributes (even as UNEXPOSED_ATTR) along with parsing their arguments would make it easier to use it as part of an advanced processor (something I do in my own project). Is it possible to expose this information in clang lib and ignore it at another stage?

AaronBallman commented 2 years ago

The C ++ 17 standard specifies that attributes not defined in the compiler implementation will be ignored without error.

The C++ standard doesn't specify anything about diagnostic strength. What it does say in [dcl.attr]p6 (http://eel.is/c++draft/dcl.attr#grammar-6): For an attribute-token (including an attribute-scoped-token) not specified in this document, the behavior is implementation-defined. Any attribute-token that is not recognized by the implementation is ignored.

"Ignored" doesn't mean "not diagnosed" (our implementation-defined behavior is to diagnose unknown attributes by default because of the potential for hard-to-debug problems stemming from vendor-specified attributes that we don't support).

Clang ignores unfamiliar attributes at the stage of parsing the code.

Correct.

This is quite a problem for my project, libclang is a great parser, and information about unfamiliar attributes (even as UNEXPOSED_ATTR) along with parsing their arguments would make it easier to use it as part of an advanced processor (something I do in my own project). Is it possible to expose this information in clang lib and ignore it at another stage?

Yes, but it's a qualified yes. What we'd like to do is introduce a new UnknownAttr semantic attribute class. When the parser finds an attribute it doesn't recognize, it would create one of these so that the information is retained in the AST. This is helpful for all sorts of use cases, such as pretty printing the AST back to source or doing AST matchers from clang-tidy. However, the challenge is with attribute arguments. Because those can be arbitrary token soup (with balanced delimiters), we can't make AST nodes for any of the arguments, or really do anything more than store the tokens that comprise the arguments. We had this same issue when dealing with plugin attributes and I don't know that we've found a better approach there.

That said, I'm not aware of anyone actively working on this.

llvmbot commented 2 years ago

@llvm/issue-subscribers-clang-frontend

JhnW commented 2 years ago

The C++ standard doesn't specify anything about diagnostic strength. What it does say in [dcl.attr]p6 (http://eel.is/c++draft/dcl.attr#grammar-6): For an attribute-token (including an attribute-scoped-token) not specified in this document, the behavior is implementation-defined. Any attribute-token that is not recognized by the implementation is ignored.

"Ignored" doesn't mean "not diagnosed" (our implementation-defined behavior is to diagnose unknown attributes by default because of the potential for hard-to-debug problems stemming from vendor-specified attributes that we don't support).

I didn't mean no diagnosis but no error. From what I can see, most interpretations of the C ++ standard (and implementation) agree with this concept. Before C ++ 17, there was no ignore condition, making the whole thing dependent on the implementation (allowing you to throw a hard bug in any case, for example). But I don't think that should matter to a clang parser. While the compiler is "implementation-defined", a parser is a much more universal tool used for code that gets compiled with other tools.

Thank you very much for a comprehensive answer. From my point of view, the list of tokens would be sufficient (almost even the mere presence of the element in children along with the location in the code would be amazing).

That said, I'm not aware of anyone actively working on this.

Pity. It seems to me that this is not a good issue to get into a clang (PR) project on the other hand as I think about parsing that regex (not for the first time) I get seizures. While external parsing things in a "preamble" of type is okay for arguments etc. is monstrous.

AaronBallman commented 2 years ago

I didn't mean no diagnosis but no error.

We don't diagnose as an error (unless you opt-in to doing that with -Werror, so I might be misunderstanding you: https://godbolt.org/z/brr7xfhhe

But I don't think that should matter to a clang parser. While the compiler is "implementation-defined", a parser is a much more universal tool used for code that gets compiled with other tools.

As far as the standard is concerned, the implementation is... everything. It's the compiler, the linker, and the standard library (everything needed to meet the requirements of the abstract machine). That said, I think it's quite reasonable for this to not be handled at the parsing level for [[]] style attributes only, because we know how to parse unknown attributes without causing problems. The same is not true of the other attribute styles, but we might (hopefully) still be able to support them the same way as well. They're a bit trickier though because some attributes (like GNU-style attributes) can change parsing behavior in terms of whether something is parsed as a type or not.

Thank you very much for a comprehensive answer. From my point of view, the list of tokens would be sufficient (almost even the mere presence of the element in children along with the location in the code would be amazing).

That's good to know, thanks! If we're storing actual Token objects, then we'd also have all the source location information, but I'm not even certain we'll be able to do that much. For example, consider a hypothetical vendor attribute of [[vendor::foo(identifiers are easy, but what about § which is a punctuator but not a token and " which is not terminated and also isn't a token? Oops, there is also an unterminated single quote in there too and these commas do not separate arguments.)]]

Pity. It seems to me that this is not a good issue to get into a clang (PR) project on the other hand as I think about parsing that regex (not for the first time) I get seizures. While external parsing things in a "preamble" of type is okay for arguments etc. is monstrous.

I think I agree that this isn't a good first issue for getting into Clang. But I also think it wouldn't be the hardest thing to support either. For example, an initial pass at this could ignore the problem with the attribute arguments by instead exposing the left paren and right paren location for the argument list. At least with that amount of information, an enterprising person can go back through the source manager to get the source text for the arguments (so you could still pretty print from the AST back to source, for example), but we've made a good step towards at least retaining the unknown attribute in the AST.

JhnW commented 2 years ago

We don't diagnose as an error (unless you opt-in to doing that with -Werror, so I might be misunderstanding you:

I know. Maybe I didn't express myself clearly / I fell into digressions too much;)

They're a bit trickier though because some attributes (like GNU-style attributes) can change parsing behavior in terms of whether something is parsed as a type or not.

Good point. I haven't thought of any attributes that might change the parsing context.

My biggest disappointment was the complete absence of unknown attributes in the element's children's list. With this, perhaps without even a token list, just knowing the loops in the code, the use of attribute information is now within reach. Seems to me a bare minimum option?

AaronBallman commented 2 years ago

My biggest disappointment was the complete absence of unknown attributes in the element's children's list. With this, perhaps without even a token list, just knowing the loops in the code, the use of attribute information is now within reach. Seems to me a bare minimum option?

That's my thinking as well.

llvm / llvm-project

No custom C++11 attributes are parsed #57748