Open jyn514 opened 5 years ago
This probably deserves its own pass.
OK, I'm actually thinking about this seriously now. The first step is to figure out how to preserve locations for error messages. I think I can start with fn preprocess(&str) -> Vec<Locatable<Token>> and go from there. It would be nice to also support a single String as output, but that shouldn't be too hard to reconstruct from the tokens later.
There are lots of things shared between the lexer and the preprocessor; it's a weird mix. I'd like to reuse the code, but I'd also like to be able to preprocess things on their own. I guess I don't actually need to make them separate to reuse the functions, as long as I can go from Vec<Token> to String.
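A minimal sketch of that shape, assuming stand-in types: the `Location`, `Locatable` and `Token` definitions below are hypothetical and far simpler than whatever rcc actually uses, and `tokens_to_string` is an invented helper name. It only illustrates that each token can carry its source position and that a String can be rebuilt from the token list afterwards.

```rust
// Hypothetical stand-ins: the real rcc `Token`, `Location` and `Locatable`
// types are not shown in this thread.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Location {
    line: u32,
    column: u32,
}

#[derive(Debug, Clone, PartialEq)]
struct Locatable<T> {
    data: T,
    location: Location,
}

#[derive(Debug, Clone, PartialEq)]
enum Token {
    Id(String),
    Int(i64),
    Punct(char),
}

/// Split `src` into tokens, recording the line and column where each starts.
fn preprocess(src: &str) -> Vec<Locatable<Token>> {
    let mut tokens = Vec::new();
    let (mut line, mut column) = (1u32, 1u32);
    let mut chars = src.chars().peekable();
    while let Some(&c) = chars.peek() {
        let location = Location { line, column };
        if c == '\n' {
            chars.next();
            line += 1;
            column = 1;
        } else if c.is_whitespace() {
            chars.next();
            column += 1;
        } else if c.is_ascii_alphabetic() || c == '_' {
            let mut id = String::new();
            while let Some(&c) = chars.peek() {
                if !(c.is_ascii_alphanumeric() || c == '_') {
                    break;
                }
                id.push(c);
                chars.next();
                column += 1;
            }
            tokens.push(Locatable { data: Token::Id(id), location });
        } else if c.is_ascii_digit() {
            let mut n = 0i64;
            while let Some(d) = chars.peek().and_then(|c| c.to_digit(10)) {
                // Saturate instead of overflowing on very long literals.
                n = n.saturating_mul(10).saturating_add(i64::from(d));
                chars.next();
                column += 1;
            }
            tokens.push(Locatable { data: Token::Int(n), location });
        } else {
            chars.next();
            column += 1;
            tokens.push(Locatable { data: Token::Punct(c), location });
        }
    }
    tokens
}

/// Rebuild a single String from the tokens, separated by single spaces.
/// The original whitespace is not preserved.
fn tokens_to_string(tokens: &[Locatable<Token>]) -> String {
    let parts: Vec<String> = tokens
        .iter()
        .map(|t| match &t.data {
            Token::Id(s) => s.clone(),
            Token::Int(n) => n.to_string(),
            Token::Punct(c) => c.to_string(),
        })
        .collect();
    parts.join(" ")
}
```

Because the locations live on the tokens, the String output can be derived at the very end without a second code path through the lexer.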
Working on this in https://github.com/jyn514/rcc/tree/cpp
OK, how about this - the preprocessor runs after the lexer, and the lexer just collects the tokens and puts them in a mini AST. That leaves a clean break between syntax and semantics.
The only issues I see are #line directives (I can hack around this, but I'd have to change every location, which would be ugly), and I'd have to rework the lexer to return an enum { Token(...), CppToken(...) }.
I could special-case #line in the lexer; that shouldn't be too hacky. Something like this:
if let CppToken::Line(n) = token {
    self.location.line = n;
}
and then everything else still gets done in the preprocessor. I like that idea, I think I'll do it.
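To make the idea concrete, here is a rough sketch under heavy assumptions: it works on whole lines rather than real tokens, and `LexItem`, `CppToken::Other` and `lex` are hypothetical names invented for illustration. Only the `CppToken::Line(n)` special case comes from the snippet above. The lexer consumes #line itself to update its location and hands every other directive through untouched.

```rust
// Hypothetical types; only `CppToken::Line` appears in the discussion above.
#[derive(Debug, PartialEq)]
enum CppToken {
    Line(u32),
    Other(String),
}

#[derive(Debug, PartialEq)]
enum LexItem {
    Token(String),
    CppToken(CppToken),
}

/// Line-based toy lexer: returns each item with the line number it reports.
fn lex(src: &str) -> Vec<(u32, LexItem)> {
    let mut line = 1u32;
    let mut out = Vec::new();
    for text in src.lines() {
        let trimmed = text.trim();
        if let Some(rest) = trimmed.strip_prefix('#') {
            let rest = rest.trim_start();
            // Only the bare `#line N` form is recognized in this sketch.
            let token = match rest
                .strip_prefix("line")
                .and_then(|n| n.trim().parse::<u32>().ok())
            {
                Some(n) => CppToken::Line(n),
                None => CppToken::Other(rest.to_string()),
            };
            if let CppToken::Line(n) = token {
                // `#line N` means the *following* line is numbered N.
                line = n;
                continue;
            }
            out.push((line, LexItem::CppToken(token)));
        } else if !trimmed.is_empty() {
            out.push((line, LexItem::Token(trimmed.to_string())));
        }
        line += 1;
    }
    out
}
```

Everything other than #line still reaches the preprocessor as a `CppToken`, so the lexer stays ignorant of macro semantics.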
Actually this can't go after the lexer because it's affected by whitespace :( These have different meanings:
#define f(a) a
f(a) // emits a
#define f (a) a
f(a) // emits (a) a(a)
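The whitespace dependence can be sketched as follows. This is an illustrative parser for the text after `#define`, not rcc's code; the `Macro` enum and `parse_define` are hypothetical names. The only rule it demonstrates is that a `(` immediately after the macro name starts a parameter list, while a `(` after whitespace belongs to the replacement text.

```rust
// Hypothetical representation of a macro definition.
#[derive(Debug, PartialEq)]
enum Macro {
    Object { body: String },
    Function { params: Vec<String>, body: String },
}

/// Parse the text following `#define`. Returns the macro name and definition,
/// or `None` if there is no name or the parameter list is unterminated.
fn parse_define(rest: &str) -> Option<(String, Macro)> {
    let rest = rest.trim_start();
    let name_len = rest
        .find(|c: char| !(c.is_ascii_alphanumeric() || c == '_'))
        .unwrap_or(rest.len());
    if name_len == 0 {
        return None;
    }
    let (name, after) = rest.split_at(name_len);
    // No trimming here: whether `(` directly follows the name is the whole point.
    if let Some(after_paren) = after.strip_prefix('(') {
        let close = after_paren.find(')')?;
        let params = after_paren[..close]
            .split(',')
            .map(str::trim)
            .filter(|p| !p.is_empty())
            .map(String::from)
            .collect();
        let body = after_paren[close + 1..].trim().to_string();
        Some((name.to_string(), Macro::Function { params, body }))
    } else {
        let body = after.trim().to_string();
        Some((name.to_string(), Macro::Object { body }))
    }
}
```

A lexer that discards whitespace before this decision is made cannot tell the two definitions apart, which is why the preprocessor can't simply run on the lexer's output.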
Crazy idea: substitute self.lexer with the contents of #if or #include sections, which allows doing basically everything in place without changing existing code.
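For the #include half of that idea, one possible shape is a stack of lexers where the innermost file is always drained first. The sketch below is line-based and reads "files" from an in-memory map; `IncludeStack` and its fields are hypothetical names, unknown headers are silently dropped, and there is no guard against recursive includes.

```rust
use std::collections::HashMap;
use std::str::Lines;

/// Hypothetical stand-in for swapping out `self.lexer`: a stack of line
/// iterators, with the innermost included file on top.
struct IncludeStack<'a> {
    files: &'a HashMap<&'a str, &'a str>,
    stack: Vec<Lines<'a>>,
}

impl<'a> IncludeStack<'a> {
    fn new(src: &'a str, files: &'a HashMap<&'a str, &'a str>) -> Self {
        IncludeStack { files, stack: vec![src.lines()] }
    }
}

impl<'a> Iterator for IncludeStack<'a> {
    type Item = &'a str;

    fn next(&mut self) -> Option<&'a str> {
        loop {
            // Finished once every file on the stack has been drained.
            let top = self.stack.last_mut()?;
            let line = match top.next() {
                Some(line) => line,
                None => {
                    // End of the current file: resume the includer.
                    self.stack.pop();
                    continue;
                }
            };
            if let Some(name) = line.trim().strip_prefix("#include ") {
                let name = name.trim().trim_matches('"');
                let files: &'a HashMap<&'a str, &'a str> = self.files;
                if let Some(&src) = files.get(name) {
                    self.stack.push(src.lines());
                }
            } else {
                return Some(line);
            }
        }
    }
}
```

Callers keep pulling items from one iterator and never see where one file ends and the next begins.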
That works well for #includes, but I don't think it's a good idea for #if directives. I think a state machine would be a better fit there, so the preprocessor can keep being an external iterator instead of an internal one.
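A state machine along those lines might look like the sketch below. It is an assumption-laden toy: it is line-based, `IfState` and `Preprocessor` are hypothetical names, only #if, #else and #endif are handled, and the condition is treated as false only when it is the literal `0` (real expression evaluation is a separate problem). What it shows is that the caller still drives it with `next()`.

```rust
use std::str::Lines;

// One entry per open `#if`.
#[derive(Clone, Copy, PartialEq)]
enum IfState {
    /// This branch is being emitted.
    Active,
    /// This branch is skipped, but a later `#else` may still be taken.
    Skipping,
    /// A branch was already taken (or the parent is inactive); skip the rest.
    Done,
}

struct Preprocessor<'a> {
    lines: Lines<'a>,
    stack: Vec<IfState>,
}

impl<'a> Preprocessor<'a> {
    fn new(src: &'a str) -> Self {
        Preprocessor { lines: src.lines(), stack: Vec::new() }
    }

    /// Lines are emitted only when every enclosing conditional is active.
    fn active(&self) -> bool {
        self.stack.iter().all(|s| *s == IfState::Active)
    }
}

impl<'a> Iterator for Preprocessor<'a> {
    type Item = &'a str;

    fn next(&mut self) -> Option<&'a str> {
        loop {
            let line = self.lines.next()?;
            let trimmed = line.trim();
            if let Some(cond) = trimmed.strip_prefix("#if") {
                let state = if !self.active() {
                    IfState::Done
                } else if cond.trim() == "0" {
                    IfState::Skipping
                } else {
                    IfState::Active
                };
                self.stack.push(state);
            } else if trimmed == "#else" {
                if let Some(top) = self.stack.last_mut() {
                    *top = match *top {
                        IfState::Skipping => IfState::Active,
                        _ => IfState::Done,
                    };
                }
            } else if trimmed == "#endif" {
                self.stack.pop();
            } else if self.active() {
                return Some(line);
            }
        }
    }
}
```

All the conditional bookkeeping lives in `stack`, so nothing has to be swapped out or recursed into when an #if is encountered.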
Update on #if:
It has to support arbitrary C expressions in an #if directive, as well as preprocessing those tokens. I could make a new Parser and call parser.expr(), but there could be #define'd ids inside the expression, so it also has to be preprocessed first. I think I could preprocess all the tokens between #if and the end of the line, change all the remaining Id(...) tokens to Literal(Int(0)), and then pass that into a new parser instance.
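The token rewriting step could be sketched like this. The `Token` and `Literal` enums are simplified stand-ins that borrow only the `Id(...)` and `Literal(Int(0))` names from the comment above, and `prepare_if_tokens` is a hypothetical helper. It assumes object-like macros only, expands them once without rescanning, and does not handle the `defined` operator.

```rust
use std::collections::HashMap;

// Simplified stand-ins for the real token types.
#[derive(Debug, Clone, PartialEq)]
enum Literal {
    Int(i64),
}

#[derive(Debug, Clone, PartialEq)]
enum Token {
    Id(String),
    Literal(Literal),
    Plus,
}

/// Expand known macros, then turn every identifier that is left into 0,
/// so the result can be handed to an ordinary expression parser.
fn prepare_if_tokens(
    tokens: Vec<Token>,
    defines: &HashMap<String, Vec<Token>>,
) -> Vec<Token> {
    tokens
        .into_iter()
        // Step 1: substitute the replacement list of each defined macro.
        .flat_map(|tok| match tok {
            Token::Id(name) => match defines.get(&name) {
                Some(body) => body.clone(),
                None => vec![Token::Id(name)],
            },
            other => vec![other],
        })
        // Step 2: any identifier that survives expansion evaluates as 0.
        .map(|tok| match tok {
            Token::Id(_) => Token::Literal(Literal::Int(0)),
            other => other,
        })
        .collect()
}
```

After this pass the token stream contains no identifiers at all, so the expression parser never needs to know that it is running inside the preprocessor.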
This gets even worse: not all valid C expressions are valid preprocessor expressions. For example:
$ run_clang
#if (int)1
#endif
<stdin>:1:10: error: token is not a valid binary operator in a preprocessor subexpression
$ run_clang
#if 1 = 1
#endif
<stdin>:2:7: error: token is not a valid binary operator in a preprocessor subexpression
$ run_clang
#if 1.31 + 1
#endif
<stdin>:1:5: error: floating point literal in preprocessor expression
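One way to cope with this is to run a check over the #if tokens before they reach the normal expression parser. The sketch below is only a guess at that shape: the `Token` enum and `check_if_tokens` are hypothetical, the error strings are made up rather than copied from clang, and it treats any remaining keyword as an error on the assumption that casts are the only place one could appear.

```rust
// Simplified, hypothetical token type for `#if` expressions.
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Int(i64),
    Float(f64),
    Keyword(&'static str),
    Assign,
    Plus,
    LeftParen,
    RightParen,
}

/// Reject tokens that are fine in C expressions but not in `#if` expressions.
fn check_if_tokens(tokens: &[Token]) -> Result<(), String> {
    for tok in tokens {
        match tok {
            Token::Float(_) => {
                return Err("floating point literal in preprocessor expression".to_string());
            }
            Token::Assign => {
                return Err("assignment in preprocessor expression".to_string());
            }
            Token::Keyword(k) => {
                // Covers casts such as `(int)1`.
                return Err(format!("keyword `{}` in preprocessor expression", k));
            }
            _ => {}
        }
    }
    Ok(())
}
```

This keeps the parser itself unchanged; the three failing inputs above would all be caught before `parser.expr()` is ever called.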
Test cases: https://github.com/pfultz2/Cloak
Section 6.10:
- #if/#ifdef/#ifndef/#elif/#else/defined conditional compilation (added in https://github.com/jyn514/rcc/pull/184)
- #include headers
- #define macros and substitutions
- __VA_ARGS__
- # and ## operators
- #undef
- #line control (can be ignored for now, waiting on #152, https://github.com/brendanzab/codespan/issues/157)
- #error directives
- #pragma
- _Pragma ()
- # on its own (ignored)
- __STDC_NO_ATOMICS__
- __STDC_NO_COMPLEX__
- __STDC_NO_THREADS__
- __STDC_NO_VLA__