Open bbannier opened 2 years ago
Here's a typical scenario where we come across this problem.
#ifdef __cplusplus
extern "C" {
#endif

#ifdef __cplusplus
}
#endif
AST: https://tree-sitter.github.io/tree-sitter/playground#
translation_unit [0, 0] - [8, 0]
  preproc_ifdef [0, 0] - [6, 6]
    name: identifier [0, 7] - [0, 18]
    linkage_specification [1, 0] - [5, 1]
      value: string_literal [1, 7] - [1, 10]
        string_content [1, 8] - [1, 9]
      body: declaration_list [1, 11] - [5, 1]
        preproc_call [2, 0] - [3, 0] <<<<<<<<< 🧐
          directive: preproc_directive [2, 0] - [2, 6]
        preproc_ifdef [4, 0] - [4, 18]
          name: identifier [4, 7] - [4, 18]
          MISSING #endif [4, 18] - [4, 18] <<<<<<<<< 🧐
Another example, from the CMakeCXXCompilerId.cpp file that CMake generates during a build (yes, a C++ source file, but also valid C):
char const info_version[] = {
  'I', 'N', 'F', 'O', ':',
  'c','o','m','p','i','l','e','r','_','v','e','r','s','i','o','n','[',
  COMPILER_VERSION_MAJOR,
# ifdef COMPILER_VERSION_MINOR
  '.', COMPILER_VERSION_MINOR,
#  ifdef COMPILER_VERSION_PATCH
   '.', COMPILER_VERSION_PATCH,
#   ifdef COMPILER_VERSION_TWEAK
    '.', COMPILER_VERSION_TWEAK,
#   endif
#  endif
# endif
  ']','\0'};
which gives error nodes:
translation_unit [0, 0] - [15, 0]
  declaration [0, 0] - [13, 12]
    type: primitive_type [0, 0] - [0, 4]
    type_qualifier [0, 5] - [0, 10]
    declarator: init_declarator [0, 11] - [13, 11]
      declarator: array_declarator [0, 11] - [0, 25]
        declarator: identifier [0, 11] - [0, 23]
      value: initializer_list [0, 28] - [13, 11]
        char_literal [1, 2] - [1, 5]
          character [1, 3] - [1, 4]
        char_literal [1, 7] - [1, 10]
          character [1, 8] - [1, 9]
        char_literal [1, 12] - [1, 15]
          character [1, 13] - [1, 14]
        char_literal [1, 17] - [1, 20]
          character [1, 18] - [1, 19]
        char_literal [1, 22] - [1, 25]
          character [1, 23] - [1, 24]
        char_literal [2, 2] - [2, 5]
          character [2, 3] - [2, 4]
        char_literal [2, 6] - [2, 9]
          character [2, 7] - [2, 8]
        char_literal [2, 10] - [2, 13]
          character [2, 11] - [2, 12]
        char_literal [2, 14] - [2, 17]
          character [2, 15] - [2, 16]
        char_literal [2, 18] - [2, 21]
          character [2, 19] - [2, 20]
        char_literal [2, 22] - [2, 25]
          character [2, 23] - [2, 24]
        char_literal [2, 26] - [2, 29]
          character [2, 27] - [2, 28]
        char_literal [2, 30] - [2, 33]
          character [2, 31] - [2, 32]
        char_literal [2, 34] - [2, 37]
          character [2, 35] - [2, 36]
        char_literal [2, 38] - [2, 41]
          character [2, 39] - [2, 40]
        char_literal [2, 42] - [2, 45]
          character [2, 43] - [2, 44]
        char_literal [2, 46] - [2, 49]
          character [2, 47] - [2, 48]
        char_literal [2, 50] - [2, 53]
          character [2, 51] - [2, 52]
        char_literal [2, 54] - [2, 57]
          character [2, 55] - [2, 56]
        char_literal [2, 58] - [2, 61]
          character [2, 59] - [2, 60]
        char_literal [2, 62] - [2, 65]
          character [2, 63] - [2, 64]
        char_literal [2, 66] - [2, 69]
          character [2, 67] - [2, 68]
        identifier [3, 2] - [3, 24]
        ERROR [4, 0] - [4, 30]
          identifier [4, 8] - [4, 30]
        char_literal [5, 2] - [5, 5]
          character [5, 3] - [5, 4]
        identifier [5, 7] - [5, 29]
        ERROR [6, 0] - [6, 31]
          identifier [6, 9] - [6, 31]
        char_literal [7, 3] - [7, 6]
          character [7, 4] - [7, 5]
        identifier [7, 8] - [7, 30]
        ERROR [8, 0] - [8, 32]
          identifier [8, 10] - [8, 32]
        char_literal [9, 4] - [9, 7]
          character [9, 5] - [9, 6]
        identifier [9, 9] - [9, 31]
        ERROR [10, 0] - [12, 7]
          preproc_directive [10, 0] - [10, 9]
        char_literal [13, 2] - [13, 5]
          character [13, 3] - [13, 4]
        char_literal [13, 6] - [13, 10]
          escape_sequence [13, 7] - [13, 9]
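For comparison, this is what a compiler makes of such an initializer. A minimal sketch, assuming made-up macro names (`DEMO_VERSION_MAJOR` and friends are not CMake's) and a much shorter array: the directives sit between initializer elements, and after preprocessing only the elements from the taken branches remain.

```c
#include <string.h>

/* Hypothetical version macros, for illustration only (not CMake's names). */
#define DEMO_VERSION_MAJOR '3'
#define DEMO_VERSION_MINOR '1'
/* DEMO_VERSION_PATCH is deliberately left undefined. */

/* Nested #ifdef blocks in the middle of an initializer list. */
static const char demo_version[] = {
  'v', DEMO_VERSION_MAJOR,
#ifdef DEMO_VERSION_MINOR
  '.', DEMO_VERSION_MINOR,
# ifdef DEMO_VERSION_PATCH
  '.', DEMO_VERSION_PATCH,
# endif
#endif
  '\0'};

/* Returns 1 if the array holds exactly the string "v3.1". */
static int demo_version_is_v3_1(void) {
  return strcmp(demo_version, "v3.1") == 0;
}
```

With the macros as defined here, the inner `DEMO_VERSION_PATCH` branch is dropped, so I would expect the tree for this to contain the conditionals as structured children of the initializer list rather than ERROR nodes.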
Here is another example in the same vein. This code
if (true)
    #define BLAH
    return;
produces
(translation_unit [0, 0] - [3, 0]
  (if_statement [0, 0] - [0, 9]
    condition: (parenthesized_expression [0, 3] - [0, 9]
      (true [0, 4] - [0, 8]))
    consequence: (expression_statement [0, 9] - [0, 9]))
  (preproc_def [1, 4] - [2, 0]
    name: (identifier [1, 12] - [1, 16]))
  (return_statement [2, 4] - [2, 11]))
But both the preproc_def and the return_statement should be children of the if_statement.
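That expectation matches how a C compiler treats the code: directives are removed during preprocessing, so the `return` ends up as the body of the `if`. A small sketch that makes this observable (the function name `after_directive` is made up for illustration):

```c
/* Hypothetical helper: a #define sits between the `if` and its body. */
static int after_directive(int cond) {
  if (cond)
    #define BLAH
    return 1;  /* still the consequence of the `if` */
  return 0;    /* reached only when cond is false */
}
```

If the `return 1;` were a sibling of the `if` rather than its consequence, `after_directive(0)` would return 1 instead of 0.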
While looking into how one could tackle zeek/tree-sitter-zeek#6, I looked into this grammar for inspiration and noticed that it has similar issues. In C or C++, preprocessor directives can appear around pretty much any token of the language, while this grammar only allows them in a couple of places. I wonder what the best approach to this would be.
As an example, the following source file
produces this AST
One could come up with nastier examples where, e.g., an opening parenthesis is inside a preprocessor block. I am not even sure what the resulting AST should look like, but I feel like I might want something that can support preprocessor directives anywhere, but with more structure than what extras is typically used for. Would there be a way to support this with an external scanner?
There is also already #13, but it seems to be more focused on improving the handling of currently supported special cases.
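To make the nastier case concrete, here is a minimal sketch (all names are hypothetical) in which each branch of the conditional opens braces that are only closed after the `#endif`. Neither branch is balanced on its own, so a grammar that expects complete declarations or statements inside a directive block has no obvious tree shape for it.

```c
/* Hypothetical switch; set it to 0 to take the other branch. */
#define DEMO_WITH_ABS 1

static int demo_magnitude(int x)
#if DEMO_WITH_ABS
{
  if (x < 0) {
    x = -x;
#else
{
  {
#endif
  }          /* closes the inner block opened in whichever branch was taken */
  return x;
}
```

Both settings of `DEMO_WITH_ABS` give valid C after preprocessing, but the text between `#if` and `#else` is not a well-formed sequence of statements by itself.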