jyn514 / saltwater

A C compiler written in Rust, with a focus on good error messages.
BSD 3-Clause "New" or "Revised" License

Support preprocessing #5

Open jyn514 opened 5 years ago

jyn514 commented 5 years ago

Section 6.10:

jyn514 commented 5 years ago

This probably deserves its own pass.

jyn514 commented 4 years ago

Ok I'm actually thinking about this seriously now. The first step is to figure out how to preserve locations for error messages. I think I can start with fn preprocess(&str) -> Vec<Locatable<Token>> and go from there. It would be nice to also support a single String as output but that shouldn't be too terribly hard to reconstruct from the tokens later.
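
A minimal sketch of what that signature could look like, and of rebuilding a String from the tokens afterwards. The Location layout, the toy Token variants, and to_source are assumptions made for illustration, not the types saltwater actually uses; the stand-in preprocess only tokenizes ASCII input and does no directive handling.

```rust
/// Byte range of a token in the original source (hypothetical layout).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Location {
    pub start: usize,
    pub end: usize,
}

/// A value paired with the place it came from.
#[derive(Debug, Clone, PartialEq)]
pub struct Locatable<T> {
    pub data: T,
    pub location: Location,
}

/// Toy token set; each variant keeps its source text.
#[derive(Debug, Clone, PartialEq)]
pub enum Token {
    Id(String),
    Number(String),
    Punct(char),
}

/// Stand-in for the real preprocessor: splits ASCII input into tokens
/// and records where each one started and ended.
pub fn preprocess(src: &str) -> Vec<Locatable<Token>> {
    let bytes = src.as_bytes();
    let is_word = |b: u8| b.is_ascii_alphanumeric() || b == b'_';
    let mut tokens = Vec::new();
    let mut i = 0;
    while i < bytes.len() {
        let start = i;
        let b = bytes[i];
        if b.is_ascii_whitespace() {
            i += 1;
            continue;
        }
        let data = if b.is_ascii_digit() {
            while i < bytes.len() && bytes[i].is_ascii_digit() {
                i += 1;
            }
            Token::Number(src[start..i].to_string())
        } else if is_word(b) {
            while i < bytes.len() && is_word(bytes[i]) {
                i += 1;
            }
            Token::Id(src[start..i].to_string())
        } else {
            i += 1;
            Token::Punct(b as char)
        };
        tokens.push(Locatable { data, location: Location { start, end: i } });
    }
    tokens
}

/// Rebuilds text from tokens. Original whitespace is not preserved;
/// a single space between tokens keeps them from being pasted together.
pub fn to_source(tokens: &[Locatable<Token>]) -> String {
    let mut out = String::new();
    for (i, tok) in tokens.iter().enumerate() {
        if i > 0 {
            out.push(' ');
        }
        match &tok.data {
            Token::Id(s) | Token::Number(s) => out.push_str(s),
            Token::Punct(c) => out.push(*c),
        }
    }
    out
}
```

Because every token carries its own Location, an error raised later can still point at the original text even though the reconstructed String has different spacing.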

jyn514 commented 4 years ago

There are lots of things shared between the lexer and the preprocessor; it's a weird mix. I'd like to reuse the code, but I'd also like to be able to preprocess things on their own. I guess I don't actually need to make them separate to reuse the functions, as long as I can go from Vec<Token> to String.

jyn514 commented 4 years ago

Working on this in https://github.com/jyn514/rcc/tree/cpp

jyn514 commented 4 years ago

Ok how about this - the preprocessor runs after the lexer, and the lexer just collects the tokens and puts them in a mini AST. That leaves a clean break between syntax and semantics.

The only issues I see are #line directives (can hack around this but I'd have to change every location, it'd be ugly) and I'd have to rework the lexer to return an enum { Token(...), CppToken(...) }.

jyn514 commented 4 years ago

I could special case line in the lexer, that shouldn't be too hacky. Something like this:

if let CppToken::Line(n) = token {
    self.location.line = n;
}

and then everything else still gets done in the preprocessor. I like that idea, I think I'll do it.

jyn514 commented 4 years ago

Actually this can't go after the lexer because it's affected by whitespace :( These have different meanings:

#define f(a) a
f(a) // emits a
#define f (a) a
f(a) // emits (a) a(a)
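
A rough sketch of why this needs the raw text: the only thing separating a function-like macro from an object-like one is whether the ( touches the macro name. The Macro enum and parse_define are hypothetical names, and the parser assumes a well-formed definition with no nested parentheses in the parameter list and no variadic parameters.

```rust
#[derive(Debug, PartialEq)]
pub enum Macro {
    /// `#define NAME body`
    Object { name: String, body: String },
    /// `#define NAME(params) body`
    Function { name: String, params: Vec<String>, body: String },
}

/// Parses the text that follows `#define`. Returns `None` when there is
/// no valid macro name or the parameter list is never closed.
pub fn parse_define(rest: &str) -> Option<Macro> {
    let rest = rest.trim_start();
    if rest.starts_with(|c: char| c.is_ascii_digit()) {
        return None;
    }
    let name_len = rest
        .find(|c: char| !(c.is_ascii_alphanumeric() || c == '_'))
        .unwrap_or(rest.len());
    if name_len == 0 {
        return None;
    }
    let (name, after) = rest.split_at(name_len);

    // A `(` immediately after the name, with no whitespace in between,
    // makes this a function-like macro. Anything else is the body.
    if let Some(after_paren) = after.strip_prefix('(') {
        let close = after_paren.find(')')?;
        let params = after_paren[..close]
            .split(',')
            .map(str::trim)
            .filter(|p| !p.is_empty())
            .map(String::from)
            .collect();
        let body = after_paren[close + 1..].trim().to_string();
        Some(Macro::Function { name: name.to_string(), params, body })
    } else {
        Some(Macro::Object { name: name.to_string(), body: after.trim().to_string() })
    }
}
```

A lexer that throws whitespace away before this point would hand both definitions over as the same token sequence, which is the problem described above.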

jyn514 commented 4 years ago

Crazy idea: substitute self.lexer with the contents of #if or #include sections, which allows doing basically everything in place without changing existing code.
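
One way to picture the #include half of this idea is a stack of line sources, where an include pushes the new file on top and reading continues there until it runs out. This is only a sketch: IncludeStack is a hypothetical name, files live in an in-memory map instead of on disk, it works on lines rather than on the real lexer, and there is no guard against include cycles.

```rust
use std::collections::HashMap;
use std::str::Lines;

/// Yields the lines of `main` with every `#include "name"` replaced
/// by the lines of the named file (looked up in an in-memory map).
pub struct IncludeStack<'a> {
    files: &'a HashMap<&'a str, &'a str>,
    stack: Vec<Lines<'a>>,
}

impl<'a> IncludeStack<'a> {
    pub fn new(main: &'a str, files: &'a HashMap<&'a str, &'a str>) -> Self {
        Self { files, stack: vec![main.lines()] }
    }
}

impl<'a> Iterator for IncludeStack<'a> {
    type Item = &'a str;

    fn next(&mut self) -> Option<&'a str> {
        let files = self.files;
        loop {
            // Read from the innermost source; when it is exhausted,
            // pop it and resume the file that included it.
            let top = self.stack.last_mut()?;
            let line = match top.next() {
                Some(line) => line,
                None => {
                    self.stack.pop();
                    continue;
                }
            };
            match line.trim().strip_prefix("#include") {
                Some(rest) => {
                    let name = rest.trim().trim_matches('"');
                    // Unknown files are skipped here; a real
                    // implementation would report an error instead.
                    if let Some(&contents) = files.get(name) {
                        self.stack.push(contents.lines());
                    }
                }
                None => return Some(line),
            }
        }
    }
}
```

The caller keeps pulling items exactly as before and never sees that the underlying source was swapped.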

jyn514 commented 4 years ago

That works well for #includes, but I don't think it's a good idea for #if directives. I think a state machine would be a good idea instead so it can keep being an external iterator instead of an internal one.
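
A small sketch of such a state machine, written as an external iterator so the caller still drives it with next(). The names IfState and Conditional are hypothetical, and the condition handling is deliberately naive: only #if followed by a single integer literal is understood, so #ifdef, #elif, and real expressions are out of scope.

```rust
#[derive(Clone, Copy, PartialEq)]
enum IfState {
    /// Lines in this branch are emitted.
    Active,
    /// Lines are skipped, but a later `#else` may become active.
    Skipping,
    /// A branch was already taken, or the enclosing block is inactive.
    Done,
}

/// Wraps a line iterator and drops lines inside inactive branches.
pub struct Conditional<I> {
    lines: I,
    stack: Vec<IfState>,
}

impl<I> Conditional<I> {
    pub fn new(lines: I) -> Self {
        Self { lines, stack: Vec::new() }
    }

    /// True when every enclosing conditional is in its active branch.
    fn active(&self) -> bool {
        self.stack.iter().all(|s| *s == IfState::Active)
    }
}

impl<'a, I: Iterator<Item = &'a str>> Iterator for Conditional<I> {
    type Item = I::Item;

    fn next(&mut self) -> Option<I::Item> {
        loop {
            let line = self.lines.next()?;
            let trimmed = line.trim();
            if let Some(cond) = trimmed.strip_prefix("#if") {
                let state = if !self.active() {
                    IfState::Done
                } else if cond.trim().parse::<i64>().map_or(false, |n| n != 0) {
                    IfState::Active
                } else {
                    IfState::Skipping
                };
                self.stack.push(state);
            } else if trimmed.starts_with("#else") {
                if let Some(top) = self.stack.last_mut() {
                    *top = if *top == IfState::Skipping {
                        IfState::Active
                    } else {
                        IfState::Done
                    };
                }
            } else if trimmed.starts_with("#endif") {
                self.stack.pop();
            } else if self.active() {
                return Some(line);
            }
        }
    }
}
```

All the state lives in the stack field, so nothing has to recurse into a nested block and the iterator can be paused between any two lines.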

jyn514 commented 4 years ago

Update on #if:

It has to support arbitrary C expressions in an #if directive, as well as preprocessing those tokens. I could make a new Parser and call parser.expr(), but there could be #defined ids inside the expression, so it also has to be preprocessed first. I think I could preprocess all the tokens between #if and the end of the line, change all the Id(...) tokens to Literal(Int(0)), and then pass that into a new parser instance.
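
The token rewrite described here could be sketched roughly as follows. The Token and Literal enums are toy stand-ins, prepare_if_tokens is a hypothetical name, and only object-like macros are handled; the defined operator and function-like macros are left out.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
pub enum Literal {
    Int(i64),
}

#[derive(Debug, Clone, PartialEq)]
pub enum Token {
    Id(String),
    Literal(Literal),
    Plus,
}

/// Expands one token. `active` lists the macros currently being
/// expanded so that a macro naming itself is not expanded again.
fn expand(
    tok: Token,
    defines: &HashMap<String, Vec<Token>>,
    active: &mut Vec<String>,
    out: &mut Vec<Token>,
) {
    match tok {
        Token::Id(name) => match defines.get(&name) {
            Some(body) if !active.contains(&name) => {
                active.push(name);
                for t in body.iter().cloned() {
                    expand(t, defines, active, out);
                }
                active.pop();
            }
            // Undefined (or self-referential) identifiers become 0.
            _ => out.push(Token::Literal(Literal::Int(0))),
        },
        other => out.push(other),
    }
}

/// Rewrites the tokens of an `#if` line so that no `Id` remains and
/// the result can be handed to an ordinary expression parser.
pub fn prepare_if_tokens(
    tokens: Vec<Token>,
    defines: &HashMap<String, Vec<Token>>,
) -> Vec<Token> {
    let mut out = Vec::new();
    let mut active = Vec::new();
    for tok in tokens {
        expand(tok, defines, &mut active, &mut out);
    }
    out
}
```

After this pass the parser never sees an identifier, so it needs no symbol table to evaluate the condition.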

jyn514 commented 4 years ago

This gets even worse: not all valid C expressions are valid preprocessor expressions. For example:

$ run_clang 
#if (int)1
#endif
<stdin>:1:10: error: token is not a valid binary operator in a preprocessor subexpression
$ run_clang
#if 1 = 1
#endif
<stdin>:2:7: error: token is not a valid binary operator in a preprocessor subexpression
$ run_clang
#if 1.31 + 1 
#endif
<stdin>:1:5: error: floating point literal in preprocessor expression
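
A possible pre-pass that catches two of these cases before the tokens reach the expression parser, sketched with a toy Token enum and hypothetical names (CppExprError, check_if_tokens). It assumes the lexer has already classified keywords. Since the preprocessor treats a keyword such as int like any other identifier and replaces it with 0, the cast case is not rejected here; (int)1 turns into (0)1, which the parser should then refuse as a syntax error.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum Token {
    Int(i64),
    Float(f64),
    Keyword(&'static str),
    Assign,
    Plus,
    LeftParen,
    RightParen,
}

#[derive(Debug, PartialEq)]
pub enum CppExprError {
    /// e.g. `#if 1.31 + 1`
    FloatLiteral,
    /// e.g. `#if 1 = 1`
    Assignment,
}

/// Rejects tokens that can never appear in an `#if` expression and
/// rewrites keywords to 0, as is done for other identifiers.
pub fn check_if_tokens(tokens: Vec<Token>) -> Result<Vec<Token>, CppExprError> {
    tokens
        .into_iter()
        .map(|tok| match tok {
            Token::Float(_) => Err(CppExprError::FloatLiteral),
            Token::Assign => Err(CppExprError::Assignment),
            Token::Keyword(_) => Ok(Token::Int(0)),
            other => Ok(other),
        })
        .collect()
}
```

Collecting into a Result stops at the first offending token, which is also the one an error message would want to point at.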

jyn514 commented 4 years ago

Test cases: https://github.com/pfultz2/Cloak