Dynamically changing how tokens are recognized

maciejhirsz / logos

Create ridiculously fast Lexers

Apache License 2.0

2.81k stars 112 forks source link

I am writing parser generators and would like to experiment with using logos to create lexical scanners, which should be faster than the ones that I've coded. Before I proceed, however, I'd like to know if it's possible to change the way in which tokens are recognized at runtime. I have two examples of where I would need this kind of ability:

When parsing C using the published ANSI C grammar, if you have:
```
  typedef unsigned int uint;
  uint x = 2;
```
The first occurrence of "unit" should be recognized as an IDENTIFIER token but the second should be recognized as a TYPE_NAME. Confusing the two would lead to conflicts within the grammar. This can be done by constructing a symbol table as we parse, and writing "semantic actions" that can affect how the lexer recognizes certain alphanumeric sequences.
Another example is when parsing a generic type expression such as

HashMap<Keytype,HashSet<Valuetype>>

The problem here is that the ">>" symbol also represents the right-shift operator in many languages. But here, it should be recognized as two separate tokens. Most lexers, including logos as far as I can tell, will give priority to the longer match, so unless there's a space in between the two ">" symbols the expression would not be parsed properly. Here again, the parser needs to be able to instruct the lexer on how to recognize tokens dynamically, maybe even telling it to "back up a bit".

Are there solutions to these kinds of problems within logos?

Generally any time you want "x but context aware" the answer almost always is "do it in the parser".

Couple things that can help here:

For things like changing the token variant pending on previous token, you could try to hack something using extras.
For things like recognizing >> as either two separate angle brackets or right shift you can have a separate token enum defined for just generic types, and then use the morph method to switch between the two lexers when parsing generics or not. This currently is a bit unwieldy since morph requires you to take the previous lexer by owned value, once the rewrite is done it should be much nicer (see last release notes and #291).

That said for 2 I don't see why a parser when building AST for generics can't just accept > and >> as closing brackets, and for 1. you could also make the parser accept certain keywords as identifiers (or vice versa) depending on context (AFAIK in rustc all keywords are just idents and it's the parser that rejects idents that are keywords based on their value).

maciejhirsz / logos

Dynamically changing how tokens are recognized #296