tree-sitter / tree-sitter-c

C grammar for tree-sitter
MIT License

Handling of preprocessor macros is not general enough #108

Open bbannier opened 2 years ago

bbannier commented 2 years ago

While looking into how one could tackle zeek/tree-sitter-zeek#6, I looked into this grammar for inspiration and noticed that it has similar issues. In C or C++, preprocessor macros can appear around pretty much any token of the language, while this grammar only allows for them in a couple of places. I wonder what the best approach to this would be.

As an example, the following source file

int
#if 0
foo
#else
main
#endif
(void) {}

produces this AST

(translation_unit
  (ERROR
    (primitive_type))
  (preproc_if
    (number_literal)
    (ERROR
      (identifier))
    (preproc_else)
    (ERROR
      (identifier)))
  (expression_statement
    (compound_literal_expression
      (type_descriptor

One could come up with nastier examples where, e.g., an opening parenthesis is inside a preprocessor block. I am not even sure what the resulting AST should look like, but I feel like I might want something that can support preprocessor directives anywhere, but with more structure than what extras is typically used for. Would there be a way to support this with an external scanner?
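A minimal sketch of such a nastier case, assuming a hypothetical switch `USE_TWICE` and hypothetical functions `twice` and `combine` (none of them come from the grammar or from a real code base). The opening parenthesis sits inside the conditional block while its closing parenthesis comes after `#endif`, so neither branch is a complete expression on its own, yet the file is valid C:

```c
#define USE_TWICE 1  /* hypothetical switch, for illustration only */

static int twice(int x) { return 2 * x; }

/* The opening parenthesis lives inside the conditional block,
 * while the matching closing parenthesis sits after #endif. */
int combine(int a, int b)
{
    return
#if USE_TWICE
        twice(a +
#else
        (a -
#endif
        b);
}
```

With `USE_TWICE` set to 1 the preprocessor leaves `return twice(a + b);`, and with 0 it leaves `return (a - b);`.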

There is also already #13, but it seems to be more focussed on improving the handling of currently supported special cases.
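On the external scanner question, one property that might help is that directives are line-oriented: a line is a directive when its first non-blank character is `#`, regardless of which C tokens surround it. The sketch below is plain C over a string, not the tree-sitter external scanner interface, and the names `classify_line` and `conditional_depths` are hypothetical. It ignores backslash line continuations and comments, and only illustrates the kind of structure (conditional nesting depth per line) that could be tracked independently of the C grammar:

```c
#include <string.h>

/* Kinds of line this sketch distinguishes. */
enum line_kind { LINE_CODE, LINE_IF, LINE_ELSE, LINE_ENDIF, LINE_OTHER_DIRECTIVE };

/* Classify the line starting at `line`. A directive is a line whose first
 * non-blank character is '#'; blanks may also follow the '#'.
 * Directive names are matched by prefix only. */
static enum line_kind classify_line(const char *line)
{
    while (*line == ' ' || *line == '\t') line++;
    if (*line != '#') return LINE_CODE;
    line++;
    while (*line == ' ' || *line == '\t') line++;
    if (strncmp(line, "if", 2) == 0) return LINE_IF;       /* if, ifdef, ifndef */
    if (strncmp(line, "el", 2) == 0) return LINE_ELSE;     /* else, elif, ... */
    if (strncmp(line, "endif", 5) == 0) return LINE_ENDIF;
    return LINE_OTHER_DIRECTIVE;
}

/* Store in depth[i] the conditional nesting depth of line i.
 * Opening, alternative, and closing directives are reported at the depth
 * of the code that encloses their group. Returns the number of lines seen. */
size_t conditional_depths(const char *src, int *depth, size_t max_lines)
{
    size_t n = 0;
    int current = 0;
    while (*src != '\0' && n < max_lines) {
        switch (classify_line(src)) {
        case LINE_IF:    depth[n] = current++; break;
        case LINE_ELSE:  depth[n] = current > 0 ? current - 1 : 0; break;
        case LINE_ENDIF: if (current > 0) current--; depth[n] = current; break;
        default:         depth[n] = current; break;
        }
        n++;
        const char *nl = strchr(src, '\n');
        if (nl == NULL) break;   /* last line without trailing newline */
        src = nl + 1;
    }
    return n;
}
```

For the seven-line `int` / `#if 0` / `foo` / `#else` / `main` / `#endif` / `(void) {}` file above, this yields the depths 0, 0, 1, 0, 1, 0, 0. Deciding what tree shape to build from that information is still the open question.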

tr-intel commented 8 months ago

Here's a typical scenario where we come across this problem.

#ifdef __cplusplus
extern "C" {
#endif

#ifdef __cplusplus
}
#endif

AST: https://tree-sitter.github.io/tree-sitter/playground#

translation_unit [0, 0] - [8, 0]
  preproc_ifdef [0, 0] - [6, 6]
    name: identifier [0, 7] - [0, 18]
    linkage_specification [1, 0] - [5, 1]
      value: string_literal [1, 7] - [1, 10]
        string_content [1, 8] - [1, 9]
      body: declaration_list [1, 11] - [5, 1]
        preproc_call [2, 0] - [3, 0] <<<<<<<<< 🧐
          directive: preproc_directive [2, 0] - [2, 6]
        preproc_ifdef [4, 0] - [4, 18]
          name: identifier [4, 7] - [4, 18]
          MISSING #endif [4, 18] - [4, 18]  <<<<<<<<< 🧐
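One way to see why this case is hard for a grammar that nests declarations inside `preproc_ifdef`: each conditional group in the snippet is unbalanced on its own. The sketch below, with the hypothetical helper names `directive_name` and `group_brace_balance`, counts `{` minus `}` on the non-directive lines of each outermost conditional group. It is a rough illustration only: braces inside string literals or comments are counted too, and line continuations are ignored.

```c
#include <string.h>

/* Returns a pointer to the directive name if the line starts with '#'
 * (after optional blanks), or NULL for an ordinary line. */
static const char *directive_name(const char *line)
{
    while (*line == ' ' || *line == '\t') line++;
    if (*line != '#') return NULL;
    line++;
    while (*line == ' ' || *line == '\t') line++;
    return line;
}

/* For each outermost #if/#ifdef/#ifndef ... #endif group in `src`, store
 * the net count of '{' minus '}' found on its non-directive lines.
 * Returns the number of groups recorded (at most max_groups). */
size_t group_brace_balance(const char *src, int *balance, size_t max_groups)
{
    size_t groups = 0;
    int depth = 0;
    while (*src != '\0') {
        const char *end = strchr(src, '\n');
        size_t len = end ? (size_t)(end - src) : strlen(src);
        const char *name = directive_name(src);
        if (name != NULL) {
            if (strncmp(name, "if", 2) == 0) {
                if (depth == 0) {
                    if (groups == max_groups) break;
                    balance[groups++] = 0;   /* a new outermost group starts */
                }
                depth++;
            } else if (strncmp(name, "endif", 5) == 0 && depth > 0) {
                depth--;
            }
        } else if (depth > 0) {
            for (size_t i = 0; i < len; i++) {
                if (src[i] == '{') balance[groups - 1]++;
                if (src[i] == '}') balance[groups - 1]--;
            }
        }
        if (end == NULL) break;   /* last line without trailing newline */
        src = end + 1;
    }
    return groups;
}
```

On the `extern "C"` snippet above it reports two groups with balances +1 and -1. The first group opens a brace that only the second one closes, which seems consistent with the `declaration_list` in the tree swallowing the first `#endif` and the final `preproc_ifdef` ending up with a MISSING `#endif`.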

lawmurray commented 1 week ago

Another example from the CMakeCXXCompilerId.cpp file that CMake generates during a build (yes, C++ source file, but also valid C):

char const info_version[] = {
  'I', 'N', 'F', 'O', ':',
  'c','o','m','p','i','l','e','r','_','v','e','r','s','i','o','n','[',
  COMPILER_VERSION_MAJOR,
# ifdef COMPILER_VERSION_MINOR
  '.', COMPILER_VERSION_MINOR,
#  ifdef COMPILER_VERSION_PATCH
   '.', COMPILER_VERSION_PATCH,
#   ifdef COMPILER_VERSION_TWEAK
    '.', COMPILER_VERSION_TWEAK,
#   endif
#  endif
# endif
  ']','\0'};

which gives error nodes:

translation_unit [0, 0] - [15, 0]
  declaration [0, 0] - [13, 12]
    type: primitive_type [0, 0] - [0, 4]
    type_qualifier [0, 5] - [0, 10]
    declarator: init_declarator [0, 11] - [13, 11]
      declarator: array_declarator [0, 11] - [0, 25]
        declarator: identifier [0, 11] - [0, 23]
      value: initializer_list [0, 28] - [13, 11]
        char_literal [1, 2] - [1, 5]
          character [1, 3] - [1, 4]
        char_literal [1, 7] - [1, 10]
          character [1, 8] - [1, 9]
        char_literal [1, 12] - [1, 15]
          character [1, 13] - [1, 14]
        char_literal [1, 17] - [1, 20]
          character [1, 18] - [1, 19]
        char_literal [1, 22] - [1, 25]
          character [1, 23] - [1, 24]
        char_literal [2, 2] - [2, 5]
          character [2, 3] - [2, 4]
        char_literal [2, 6] - [2, 9]
          character [2, 7] - [2, 8]
        char_literal [2, 10] - [2, 13]
          character [2, 11] - [2, 12]
        char_literal [2, 14] - [2, 17]
          character [2, 15] - [2, 16]
        char_literal [2, 18] - [2, 21]
          character [2, 19] - [2, 20]
        char_literal [2, 22] - [2, 25]
          character [2, 23] - [2, 24]
        char_literal [2, 26] - [2, 29]
          character [2, 27] - [2, 28]
        char_literal [2, 30] - [2, 33]
          character [2, 31] - [2, 32]
        char_literal [2, 34] - [2, 37]
          character [2, 35] - [2, 36]
        char_literal [2, 38] - [2, 41]
          character [2, 39] - [2, 40]
        char_literal [2, 42] - [2, 45]
          character [2, 43] - [2, 44]
        char_literal [2, 46] - [2, 49]
          character [2, 47] - [2, 48]
        char_literal [2, 50] - [2, 53]
          character [2, 51] - [2, 52]
        char_literal [2, 54] - [2, 57]
          character [2, 55] - [2, 56]
        char_literal [2, 58] - [2, 61]
          character [2, 59] - [2, 60]
        char_literal [2, 62] - [2, 65]
          character [2, 63] - [2, 64]
        char_literal [2, 66] - [2, 69]
          character [2, 67] - [2, 68]
        identifier [3, 2] - [3, 24]
        ERROR [4, 0] - [4, 30]
          identifier [4, 8] - [4, 30]
        char_literal [5, 2] - [5, 5]
          character [5, 3] - [5, 4]
        identifier [5, 7] - [5, 29]
        ERROR [6, 0] - [6, 31]
          identifier [6, 9] - [6, 31]
        char_literal [7, 3] - [7, 6]
          character [7, 4] - [7, 5]
        identifier [7, 8] - [7, 30]
        ERROR [8, 0] - [8, 32]
          identifier [8, 10] - [8, 32]
        char_literal [9, 4] - [9, 7]
          character [9, 5] - [9, 6]
        identifier [9, 9] - [9, 31]
        ERROR [10, 0] - [12, 7]
          preproc_directive [10, 0] - [10, 9]
        char_literal [13, 2] - [13, 5]
          character [13, 3] - [13, 4]
        char_literal [13, 6] - [13, 10]
          escape_sequence [13, 7] - [13, 9]
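A reduced version of the same pattern that can be compiled and checked, as a sketch: the macro names `VERSION_MAJOR`, `VERSION_MINOR`, and `VERSION_PATCH` and the array name `version_info` are hypothetical stand-ins, not what CMake generates. It shows that conditional groups nested inside an initializer list are valid C and simply add or drop elements:

```c
#define VERSION_MAJOR '1'
#define VERSION_MINOR '2'
/* VERSION_PATCH is deliberately left undefined. */

static const char version_info[] = {
  'v', '[',
  VERSION_MAJOR,
#ifdef VERSION_MINOR
  '.', VERSION_MINOR,
# ifdef VERSION_PATCH
  '.', VERSION_PATCH,
# endif
#endif
  ']', '\0'};
```

With these definitions the array holds the seven characters of `v[1.2]` plus the terminator, so a parser that reports ERROR nodes here is rejecting code that a compiler accepts.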

bjourne commented 2 days ago

Here is another example in the same vein. This code

if (true)
    #define BLAH
    return;

produces

(translation_unit [0, 0] - [3, 0]
  (if_statement [0, 0] - [0, 9]
    condition: (parenthesized_expression [0, 3] - [0, 9]
      (true [0, 4] - [0, 8]))
    consequence: (expression_statement [0, 9] - [0, 9]))
  (preproc_def [1, 4] - [2, 0]
    name: (identifier [1, 12] - [1, 16]))
  (return_statement [2, 4] - [2, 11]))

But both the preproc_def and the return_statement should be children of the if_statement.
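A compilable variant of this snippet, as a sketch: the function name `one_if_set` is hypothetical, and `BLAH` is given a value here so that the effect is observable. Because the preprocessor removes the `#define` line before the C parser sees the tokens, the `return` that follows it is the body of the `if`:

```c
/* The directive sits between the condition and the statement it guards. */
int one_if_set(int flag)
{
    if (flag)
        #define BLAH 1
        return BLAH;
    return 0;
}
```

Calling `one_if_set(1)` takes the guarded `return`, and `one_if_set(0)` falls through to `return 0;`, which matches the expectation that the `return_statement` belongs to the `if_statement`.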