jyn514 / saltwater

A C compiler written in Rust, with a focus on good error messages.
BSD 3-Clause "New" or "Revised" License

Get rid of unput in lexer #475

Closed (jyn514 closed 4 years ago)

jyn514 commented 4 years ago

Currently, unput exists only for this hack:

https://github.com/jyn514/saltwater/blob/546ed7de472c2be7b57c3e44fba628afb856b9ae/src/lex/mod.rs#L662

If we got rid of that, we could get rid of unput() altogether, which I've been trying to do for ages. This will make #474 significantly easier.

Why does this exist?

The use case for seen_line_token is that #warning x is a preprocessor directive, but 1 + #warning x is not: a # only starts a directive when it is the first token on its line. This is part of the reason I tied the preprocessor to the lexer. Say you have

"a"
"b"

Those two tokens should be concatenated into a single string, in which case everything is fine. However, if you have

"a"
# warning

then the preprocessor needs to know that no token has been seen yet on the second line. But seen_line_token gets set after parse_string returns, by which point the newline has already been consumed! So the input is treated as if you wrote "a" #warning, which is not correct. The current hack is to put back a newline character if we saw a newline in consume_whitespace, which requires unput.
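To make the rule concrete, here is a minimal sketch of the "first token on the line" check. It is not saltwater's code: directive_starts is a hypothetical helper, and it deliberately ignores string contents and comments, so a # inside a literal would be misjudged. It only shows how a flag like seen_line_token has to be reset by the newline before the # is examined.

```rust
/// Hypothetical helper (not part of saltwater): returns the byte offset of
/// every `#` that is the first non-whitespace character on its line.
/// Simplification: string literals and comments are not recognized.
fn directive_starts(src: &str) -> Vec<usize> {
    let mut seen_line_token = false;
    let mut starts = Vec::new();
    for (i, c) in src.char_indices() {
        match c {
            // A newline begins a fresh line, so nothing has been seen yet.
            '\n' => seen_line_token = false,
            // Other whitespace does not count as a token.
            c if c.is_whitespace() => {}
            // A `#` before any other token on the line starts a directive.
            '#' if !seen_line_token => {
                starts.push(i);
                seen_line_token = true;
            }
            // Anything else (including a later `#`) is an ordinary token.
            _ => seen_line_token = true,
        }
    }
    starts
}
```

If the newline after "a" is swallowed before the flag is reset, the second input above behaves like "a" #warning and the directive is missed, which is exactly the bug the unput hack papers over.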

What is the fix?

We can't use consume_whitespace_no_newline because we need to know about following strings, and there's no way to conditionally consume newlines.

Instead, we can change the algorithm: concatenate the strings in the preprocessor (or parser) instead of the lexer. The main difference there is that the preprocessor can store a pending token, whereas the lexer can only store a pending character.
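A rough sketch of what token-level concatenation with a pending token could look like. Everything here is an assumption made for illustration: Token, TokenKind, the starts_line flag, and ConcatStrings are hypothetical names, not saltwater's actual types, and the fix that eventually landed may be structured differently.

```rust
/// Hypothetical token kinds; the real lexer has many more.
#[derive(Debug, Clone, PartialEq)]
enum TokenKind {
    Str(String),
    Hash,
    Ident(String),
}

/// Hypothetical token that remembers whether it was first on its line,
/// so no newline character ever needs to be put back.
#[derive(Debug, Clone, PartialEq)]
struct Token {
    kind: TokenKind,
    starts_line: bool,
}

/// Wraps a token stream and merges adjacent string literals.
struct ConcatStrings<I: Iterator<Item = Token>> {
    tokens: I,
    /// The one token read while looking ahead that was not a string.
    pending: Option<Token>,
}

impl<I: Iterator<Item = Token>> ConcatStrings<I> {
    fn new(tokens: I) -> Self {
        ConcatStrings { tokens, pending: None }
    }
}

impl<I: Iterator<Item = Token>> Iterator for ConcatStrings<I> {
    type Item = Token;

    fn next(&mut self) -> Option<Token> {
        // Hand out the stored lookahead token first, if there is one.
        let mut current = self.pending.take().or_else(|| self.tokens.next())?;
        if let TokenKind::Str(text) = &mut current.kind {
            // Keep absorbing string literals, even across line breaks.
            while let Some(next) = self.tokens.next() {
                if let TokenKind::Str(more) = &next.kind {
                    text.push_str(more);
                } else {
                    // Not a string: save it untouched for the next call.
                    self.pending = Some(next);
                    break;
                }
            }
        }
        Some(current)
    }
}
```

Because the lookahead token is stored whole, a # that follows a string on a new line keeps its starts_line flag and can still be recognized as the start of a directive.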

This should also solve https://github.com/jyn514/saltwater/issues/361.

jyn514 commented 4 years ago

Fixed in #478