Closed: pmetzger closed this 5 years ago
I do agree with the small change.
As for the big change, I actually don't know what the best thing to do is: it is common practice (at least when using ocamllex) to use a hash table to avoid making the automaton too large (there is a limit of 32768 states in an ocamllex automaton).
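That hash-table idiom, roughly as it appears in many ocamllex-based lexers, can be sketched as follows (the token constructors and table contents here are illustrative, not taken from this parser):

```ocaml
(* Sketch of the hash-table approach: one identifier rule in the
   automaton; keywords are resolved by a table lookup afterwards,
   so adding keywords does not grow the automaton at all. *)
{
  let keyword_table = Hashtbl.create 64
  let () =
    List.iter (fun (kw, tok) -> Hashtbl.add keyword_table kw tok)
      [ "if", IF; "else", ELSE; "while", WHILE; "return", RETURN ]
}

rule token = parse
  | ['a'-'z' 'A'-'Z' '_'] ['a'-'z' 'A'-'Z' '0'-'9' '_']* as id
      { try Hashtbl.find keyword_table id with Not_found -> IDENT id }
  | eof { EOF }
```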
However, I am not fully convinced by that argument, because the automaton size should not grow more than linearly with respect to the sum of the sizes of keywords, which is not much, even for large languages...
Another argument would be the size of the transition table: if it gets too large, then it does not fit in the cache, and lexing suddenly becomes slower. Do you have benchmarks showing it is not slower, even for languages with a fair amount of comments?
@fpottier, any thought?
Just to be clear: I do agree with you that the change you are proposing is valid for this parser. However, if the goal of this parser is to somehow provide a demo (for writing a compliant C parser, but also for ocamllex/menhir), then I would prefer using methodologies that scale well.
It seems to me that hardcoding the keywords in the lexer (as opposed to listing them in a separate hash table) is in principle the right thing to do. It should in theory be faster.
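Hard-coding the keywords means listing each one as a literal pattern, so the automaton recognizes them directly. A minimal ocamllex sketch (again with illustrative token names; ocamllex picks the longest match, with ties broken in favor of the earlier rule, so `"if"` yields `IF` while `"ifx"` still lexes as an identifier):

```ocaml
(* Sketch of the hard-coded approach: keywords are part of the
   automaton itself, with no lookup table at lexing time. *)
rule token = parse
  | "if"     { IF }
  | "else"   { ELSE }
  | "while"  { WHILE }
  | "return" { RETURN }
  | ['a'-'z' 'A'-'Z' '_'] ['a'-'z' 'A'-'Z' '0'-'9' '_']* as id
      { IDENT id }
  | eof { EOF }
```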
The cache argument cuts both ways: a large transition table may not fit in the data cache, but a large hash table won't fit either.
Besides, ocamllex has two modes: in the normal mode, it produces code that interprets a transition table, and in -ml mode, it compiles the automaton directly to OCaml code. It would be worth testing both modes to see which one produces the smallest code and which one seems fastest.
Alright. Then I'll merge this and wait for a real, concrete reason for not hard-coding keywords in the lexer. Things just seem simpler that way.
Big change: convert from using a hash table to recognize keywords to simply handing them to the regular expression engine. This is conceptually clearer, and given that there are fairly few keywords, probably as efficient or more so.
Small change: make a trivial modification to keep lines under 80 columns.
I'm probably going to make a few more largely cosmetic changes to the lexer for readability as well. I'm gearing up to add a lexer for preprocessor tokens in the same file in preparation for constructing a (hopefully correct) implementation of the preprocessor, and I'm trying to get things pretty. (If this isn't of interest, I can do the work in my own fork.)
[Edited to add: I'm not 100% sure that keeping the cpp tokenizer in the same file is a good idea. I'll figure that bit out as I try it...]