tree-sitter / tree-sitter-c

C grammar for tree-sitter
MIT License

fix: implement Unicode identifiers #127

Closed lf- closed 1 year ago

lf- commented 1 year ago

According to cppreference, an identifier has the form (XID_Start | '_') XID_Continue* as of C++23 and C2x. I have confirmed this myself against the drafts of both standards.

https://en.cppreference.com/w/cpp/language/identifiers

Clang indeed implements identifiers as (XID_Start | '_') XID_Continue* in C++ mode and C2x mode, with a slight extension to the character set to include some extra math characters: https://github.com/llvm/llvm-project/blob/231992d9b88fe4e0b4aa0f55ed64d7ba88b231ce/clang/lib/Lex/Lexer.cpp#L1517-L1530
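As a rough illustration of the rule (not the grammar change in this PR), here is a minimal Python sketch. It assumes that Python's str.isidentifier() classifies characters with the same XID_Start / XID_Continue properties, using whatever Unicode version the interpreter ships with. The helper names are hypothetical, and Clang's extra math characters are not covered.

```python
def is_identifier_start(ch):
    # A one-character string is a valid identifier only if the character
    # is XID_Start or '_' (assumption: str.isidentifier() uses these tables).
    return ch.isidentifier()


def is_identifier_continue(ch):
    # Prefixing '_' supplies a valid start, so the result depends only on
    # whether ch is XID_Continue.
    return ("_" + ch).isidentifier()


def is_c_identifier(text):
    # (XID_Start | '_') XID_Continue*
    if not text:
        return False
    return is_identifier_start(text[0]) and all(
        is_identifier_continue(ch) for ch in text[1:]
    )


for sample in ["foo", "_x1", "naïve", "变量", "1abc", "a-b"]:
    print(repr(sample), is_c_identifier(sample))
```

Keywords are not rejected here, since the rule above only describes the lexical shape of an identifier.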

I measured the performance impact of this change by timing tree-sitter parse on all the C files in llvm-project before and after the change. The difference appears negligible to nonexistent: the total runtime stayed within 0.1 seconds of the previous total of roughly 6 seconds.
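A comparison like this could be scripted roughly as follows. This is a sketch, not the exact procedure used for the numbers above: the function names, the batch size, and the use of the CLI's --quiet flag are assumptions.

```python
import subprocess
import time
from pathlib import Path


def collect_c_files(root):
    """Return every *.c file under root, sorted for a stable order."""
    return sorted(str(p) for p in Path(root).rglob("*.c"))


def time_parse(command, files, batch_size=500):
    """Run command over files in batches; return total wall-clock seconds."""
    start = time.perf_counter()
    for i in range(0, len(files), batch_size):
        batch = files[i:i + batch_size]
        # A non-zero exit status (e.g. parse errors) is ignored on purpose:
        # only the elapsed time matters for this comparison.
        subprocess.run(command + batch, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL, check=False)
    return time.perf_counter() - start


# Hypothetical usage, run once per grammar build (before and after the change):
#   files = collect_c_files("llvm-project")
#   print(time_parse(["tree-sitter", "parse", "--quiet"], files))
```

Batching keeps each command line below the operating system's argument-length limit. Repeating the run a few times helps separate a real difference from noise.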

XVilka commented 1 year ago

@aryx Any chance to get this merged?