rui314 / 8cc

A Small C Compiler
MIT License
6.13k stars 740 forks source link

CPP expression parser vs C const expressions. #31

Closed andrewchambers closed 9 years ago

andrewchambers commented 9 years ago

Currently I believe these are handled by the same code. I think it is best to have two different expression parsers, one for CPP expressions in #if directives, and a completely different one for constant C expressions.

The C constant expression evaluator is going to need to handle a lot more cases including casts and labels correctly iirc. The CPP one does not really need to operate on an AST at all.

I think this splitting needs to be done before #5 can be fixed.

Thoughts?

As a side note:

It is actually really easy to use this algorithm for CPP expressions: http://en.wikipedia.org/wiki/Operator-precedence_parser#Precedence_climbing_method The lack of right associative operators after ternary in CPP expressions means the algorithm is simplified even more than the wiki article.

I actually implemented this in Go for my own preprocessor. To get an idea of what it looks like:

https://github.com/andrewchambers/cc/blob/master/src/cc/cpp/cppexpr.go https://github.com/andrewchambers/cc/blob/master/src/cc/cpp/cppexpr.go#L261 - actual algorithm. https://github.com/andrewchambers/cc/blob/master/src/cc/cpp/cppexpr_test.go

rui314 commented 9 years ago

Funny you should mention the operator precedence algorithm. I used to use that in parse.c to parse an expression. I rewrote it as a recursive-descent parser at some point.

The expression allowed after #if is not different from C expression. It's evaluated in the same way as described in C11 6.5. The difference is that all the identifiers are replaced with "0" before evaluation. As a result, a seemingly legitimate expression "#if sizeof(int)" is, for example, replaced with "#if 0(0)", which cause the compiler to fail. Although the grammar allows you to use more complex expression like casts after #if, you cannot access them because of the token replacement.

CPP expression is a subset of C constant expression. I don't think we should have another evaluator for CPP.

andrewchambers commented 9 years ago

Ahh, I understand. How does 8cc treat things like ++ -- & inside cpp directives? just the same as identifiers?

andrewchambers commented 9 years ago

And I agree that recursive descent is much easier to understand and debug. Even if its less efficient. 8cc is easily fast enough.

rui314 commented 9 years ago

Operators like + or & (or ++, which has no use in CPP) are classified as punctuator, not identifier, so they will remain the same after identifier replacement and just passed to the evaluator.

andrewchambers commented 9 years ago

Ah ok - I think you are mostly correct.

One note:

quoting https://gcc.gnu.org/onlinedocs/cpp/If.html

"It carries out all calculations in the widest integer type known to the compiler; on most machines supported by GCC this is 64 bits. This is not the same rule as the compiler uses to calculate the value of a constant expression, and may give different results in some cases."

Doesn't this means the cpp evaluator needs to use 64 bits, where normal constant folding may be 32 bits after integer promotion? I think overflow on #if is incredibly rare anyway so it shouldn't matter at all in practice.

On a side note, I've setup an automatic bug finder using the gcc test suite and creduce. It focuses only on code that both gcc and tcc can correctly compile. In a few minutes I found at least 8 bugs that can be reduced to a few lines of code. I'll hold off until you are done reworking the backend, but i'll setup a vagrant vm image + instructions so you can easily reproduce what I am doing for yourself when you are ready.

rui314 commented 9 years ago

Maybe the easiest way to mimic the GCC behavior is to provide a way to set default integer type. By default, a pp-number, say "0", is converted to 0 of type int. We could define a flag to make the parser convert a token to of type long. While we are reading an expression from CPP, we would turn the flag on, so that all integers are automatically read as if they had L prefix. That forces the evaluator to compute the expression in 64 bit precision.