joydo / CompilerLearning

Self-Learning(LLVM,GCC,TCC)
3 stars 0 forks source link

handwriting one compiler[wip] #8

Open joydo opened 3 years ago

joydo commented 3 years ago

Discussions For https://briancallahan.net

The author forgot default cases in various switch (tok) statements: static void factor(void) { switch (type) { case TOK_IDENT: case TOK_NUMBER: next(); break; case TOK_LPAREN: expect(TOK_LPAREN); expression(); expect(TOK_RPAREN); } } The TOK_ enumeration has a good many more members than just those three. If any other one occurs in the code, or an outright corrupt type value occurs, it silently falls through. If the error is handled at all, it will result in some strange diagnostic, in some higher level code that thinks it has seen an expression with a factor in it.

Adding a break after the last block in a switch is a good idea; someone adding a new case might forget. (Wit diagnostic support, like GCC requiring a fallthrough comment, this is less important nowadays).

If a switch deliberately falls through for unmatched cases, have default: break; to express that explicitly.

This whole approach of the parser consisting of void functions is poor for 2021. It's as if the author is deliberately trolling everyone who has ever taken any level of interest in functional programming.

The author will have a hard time fitting the necessary semantic actions into this code, like constructing an abstract syntax tree, and might succumb to the temptation to do it with some hacks which build the structure procedurally using global variables.

The parser would be a more future proof if it anticipated the need for backtracking. So that is to say, a function such as factor() above could be considered to be a pattern matching construct, and return a useful value indicating whether it matched or failed to match. Furthermore, if it it failed to match, then it would backtrack, leaving the state of the input stream as it was prior to entry.

You can then do interesting things, because you're essentially LL(k) now, like speculatively parse the input using several possible constructs in priority order, until one matches.

The code inconsistently mixes the use of concrete character constants like '.' and '{' and #define symbols that expand to character constants like #define TOK_PLUS '+'.

Using the character constants in the code as in expect('+') or whatever is clearer. The Yacc approach of using constants for abstract tokens, (which are enumerated starting at some value above 256), and concrete character constants for single-character tokens that denote themselves, is worth mimicking.

You're not going to redefine TOK_PLUS to something other than '+' in the future, and doing so would be like like changing #define FOUR 4 to #define FOUR 5. TOK_PLUS doesn't inform any better than '+', and isn't any easier to find with grep.