Closed: pmetzger closed this 5 years ago
I do agree with the small change.
As for the big change, I actually don't know what the best thing to do is: it is common practice (at least when using ocamllex) to use a hash table to avoid making the automaton too large (there is a limit of 32768 states in an ocamllex automaton).
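That hash-table idiom, roughly as it appears in many ocamllex-based lexers, can be sketched as follows (the token constructors and table contents here are illustrative, not taken from this parser):

```ocaml
(* Sketch of the hash-table approach: one identifier rule in the
   automaton; keywords are resolved by a table lookup afterwards,
   so adding keywords does not grow the automaton at all. *)
{
  let keyword_table = Hashtbl.create 64
  let () =
    List.iter (fun (kw, tok) -> Hashtbl.add keyword_table kw tok)
      [ "if", IF; "else", ELSE; "while", WHILE; "return", RETURN ]
}

rule token = parse
  | ['a'-'z' 'A'-'Z' '_'] ['a'-'z' 'A'-'Z' '0'-'9' '_']* as id
      { try Hashtbl.find keyword_table id with Not_found -> IDENT id }
  | eof { EOF }
```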
However, I am not fully convinced by that argument, because the automaton size should not grow more than linearly with respect to the sum of the sizes of keywords, which is not much, even for large languages...
Another argument would be the size of the transition table: if it gets too large, then it does not fit in the cache, and lexing suddenly becomes slower. Do you have benchmarks showing it is not slower, even for languages with a fair amount of comments?
@fpottier, any thought?
Just to be clear: I do agree with you that the change you are proposing is valid for this parser. However, if the goal of this parser is to somehow provide a demo (for writing a compliant C parser, but also for ocamllex/menhir), then I would prefer using methodologies that scale well.
It seems to me that hardcoding the keywords in the lexer (as opposed to listing them in a separate hash table) is in principle the right thing to do. It should in theory be faster.
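Hard-coding the keywords means listing each one as a literal pattern, so the automaton recognizes them directly. A minimal ocamllex sketch (again with illustrative token names; ocamllex picks the longest match, with ties broken in favor of the earlier rule, so `"if"` yields `IF` while `"ifx"` still lexes as an identifier):

```ocaml
(* Sketch of the hard-coded approach: keywords are part of the
   automaton itself, with no lookup table at lexing time. *)
rule token = parse
  | "if"     { IF }
  | "else"   { ELSE }
  | "while"  { WHILE }
  | "return" { RETURN }
  | ['a'-'z' 'A'-'Z' '_'] ['a'-'z' 'A'-'Z' '0'-'9' '_']* as id
      { IDENT id }
  | eof { EOF }
```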
The cache argument cuts both ways: a large transition table may not fit in the data cache, but a large hash table won't fit either.
Besides, ocamllex has two modes: in the normal mode, it produces code that interprets a transition table, and in -ml mode, it compiles the automaton directly to OCaml code. It would be worth testing both modes to see which one produces the smallest code and which one seems fastest.
Alright. Then I'll merge this and wait for a real, concrete reason for not hard-coding keywords in the lexer. Things just seem simpler that way.
Big change: convert from using a hash table to recognize keywords to simply handing them to the regular expression engine. This is conceptually clearer, and given that there are fairly few keywords, probably as efficient or more so.
Small change: make a trivial modification to keep lines under 80 columns.
I'm probably going to make a few more largely cosmetic changes to the lexer for readability as well. I'm gearing up to add a lexer for preprocessor tokens in the same file in preparation for constructing a (hopefully correct) implementation of the preprocessor, and I'm trying to get things pretty. (If this isn't of interest, I can do the work in my own fork.)
[Edited to add: I'm not 100% sure that keeping the cpp tokenizer in the same file is a good idea. I'll figure that bit out as I try it...]