Small engineering issue in the front-end

[Sorry if this is a duplicate; I swear I posted this a while back, but now I don't see it.]

As background: P4_16's Bison parser relies on the "lexer hack." That is, the parser maintains a symbol table that records, for each identifier, whether it is the name of a type or just an ordinary name. The lexer consults this symbol table to produce two distinct tokens, IDENTIFIER and TYPE_IDENTIFIER, and the parser treats these tokens differently.
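For anyone who hasn't run into the lexer hack before, here is a minimal sketch in Python. It is not the actual Bison/Flex code in the compiler: the tokenize function, the type_names set, and the PUNCT token kind are made up for illustration, and keywords are not handled at all.

```python
import re

# Token kinds named after the two tokens described above; PUNCT is a
# hypothetical catch-all for everything that is not an identifier.
IDENTIFIER = "IDENTIFIER"
TYPE_IDENTIFIER = "TYPE_IDENTIFIER"
PUNCT = "PUNCT"

# Skip whitespace, then match either an identifier or one other character.
TOKEN_RE = re.compile(r"\s*(?:([A-Za-z_][A-Za-z0-9_]*)|(\S))")


def tokenize(source, type_names):
    """Classify each identifier by consulting the set of known type names."""
    tokens = []
    pos = 0
    while pos < len(source):
        m = TOKEN_RE.match(source, pos)
        if m is None:  # only trailing whitespace is left
            break
        pos = m.end()
        if m.group(1) is not None:
            name = m.group(1)
            # This lookup is the "hack": the lexer depends on parser state.
            kind = TYPE_IDENTIFIER if name in type_names else IDENTIFIER
            tokens.append((kind, name))
        else:
            tokens.append((PUNCT, m.group(2)))
    return tokens
```

The same text produces different token streams depending on what has been declared so far: with type_names = {"packet_in"}, the input "packet_in pkt" comes out as a TYPE_IDENTIFIER followed by an IDENTIFIER, while with an empty set both come out as IDENTIFIER.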

Currently, the P4_16 front-end receives the entire program, i.e., the output of the C preprocessor, and processes the declarations from start to end. So, per this language in the spec,

The order of declarations is important; with the exception of parser states, all uses of a symbol must follow the symbol's declaration. (This is a departure from P4_14, which allows declarations in any order. This requirement significantly simplifies the implementation of compilers for P4, allowing compilers to use additional information about declared identifiers to resolve ambiguities.)

the front-end can determine which identifiers denote types and which ones do not.

If we are developing a system where smaller program pieces are processed separately, we may need to refactor the front-end. Currently, if you point the parser at a file that starts like this:

parser MyParser(packet_in...)

you'll get a syntax error, because packet_in, which is declared in core.p4, is not known to be a type.

The problem may persist even if we imagine a syntax like this:

import core;
parser MyParser(core.packet_in...)

unless parsing the tokens "import core" has the side effect of recording, in the parser's symbol table, that core.packet_in is now a type.
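To make that side effect concrete, here is a rough Python sketch. Everything in it is hypothetical: the SymbolTable class, the on_import hook, and the MODULE_TYPES table, which stands in for whatever the front-end would learn by actually processing the imported module.

```python
# Hypothetical table of type names exported by each module. In a real
# front-end this would come from having already processed the module.
MODULE_TYPES = {"core": {"packet_in", "packet_out"}}


class SymbolTable:
    """Tracks which (qualified) names are currently known to be types."""

    def __init__(self):
        self.type_names = set()

    def on_import(self, module):
        """Side effect of parsing `import <module>`: register its types."""
        for name in MODULE_TYPES.get(module, ()):
            self.type_names.add(f"{module}.{name}")

    def is_type(self, name):
        return name in self.type_names
```

Before on_import("core") runs, is_type("core.packet_in") is false, so the lexer would still hand the parser an ordinary identifier. Afterwards it is true. The hard part is not shown here: filling in MODULE_TYPES requires having parsed core first.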

This is not a show-stopping issue, but there are some details to work out. For instance, one approach could be to write a lighter-weight front-end to parse and analyze just the import statements, and topologically sort the modules so that references are known when they are encountered. (But there are questions about when the C preprocessor runs.) Alternatively, we could interrupt parsing when we get to an import and go off and actually parse and load the referenced module. (But that makes parsing even more effectful, and we also need to be super careful about introducing loops!) And probably there are other solutions...
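A sketch of the first approach in Python, under some loud assumptions: the "import name;" syntax is the hypothetical one from above, imports are found with a plain regex rather than a real lighter-weight front-end, and the preprocessor question is ignored entirely. The scan_imports and load_order names are made up. The sorting itself uses graphlib from the Python standard library (3.9+).

```python
import re
from graphlib import CycleError, TopologicalSorter

# Hypothetical syntax: `import name;` at the start of a line.
IMPORT_RE = re.compile(
    r"^\s*import\s+([A-Za-z_][A-Za-z0-9_]*)\s*;", re.MULTILINE
)


def scan_imports(source):
    """Return the module names imported by one source text."""
    return IMPORT_RE.findall(source)


def load_order(modules):
    """Given {module name: source text}, return names with dependencies first."""
    # Each module maps to the modules that must be loaded before it.
    graph = {name: scan_imports(text) for name, text in modules.items()}
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError as err:
        # err.args[1] is the list of modules forming the cycle.
        raise ValueError(f"import cycle: {err.args[1]}") from err
```

If main imports core and util, and util imports core, then core is ordered before util and util before main, so the full parser can process them in that order with a complete symbol table at each step. Two modules that import each other raise an error instead of looping.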

What we can't do is load files using the existing parser (or a small extension to it), because the "lexer hack" means we don't even get an AST for program pieces like the snippet above.

There is another way of doing this: rewrite the parser as a top-down parser, which would also give better control over error messages. I don't think any modern C/C++ front-end even uses Bison these days. I am actually working on my own front-end to parse P4-16, which could be reused for this. I think Marvell is willing to open source it and contribute it if wanted. The AST is different from the reference compiler's and is still rough around the edges right now. The tokenizer just emits identifiers, and the parser then looks up whether each one is a type or not. I have been thinking about how to add namespace/module support (it should not be hard; there is one case where there is an issue, the parsing of "(type." vs "(type<>)", which will need to be extended for modules, but that should not be hard either).
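A very rough Python sketch of that split, to show the shape of the idea. This is not the front-end mentioned above, and split_tokens, parse_parameter, and ParseError are made-up names. The tokenizer knows nothing about types, and the parser does the lookup at the point where the grammar expects a type, which is also where it can produce a specific error message.

```python
import re


class ParseError(Exception):
    pass


def split_tokens(source):
    """Split into identifiers and single-character punctuation; no type info."""
    return re.findall(r"[A-Za-z_][A-Za-z0-9_]*|\S", source)


def parse_parameter(tokens, pos, known_types):
    """Parse `<type> <name>` at tokens[pos]; return (node, next position)."""
    if pos + 1 >= len(tokens):
        raise ParseError("expected '<type> <name>'")
    type_name, param_name = tokens[pos], tokens[pos + 1]
    # The type check happens here, in the parser, not in the tokenizer.
    if type_name not in known_types:
        raise ParseError(
            f"unknown type '{type_name}' for parameter '{param_name}'"
        )
    return {"type": type_name, "name": param_name}, pos + 2
```

With known_types = {"packet_in"}, parsing "packet_in pkt" gives a parameter node. With an empty set, the tokens are still produced, and the parser reports an unknown type rather than a generic syntax error.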

jafingerhut / p4-namespaces

Small engineering issue in the front-end #22