Open AlgorithMan-de opened 8 years ago
Here, I translated it to php pflex.txt
With the generated tokenizer, the main calc.php becomes as simple as this calc.txt
I don't understand the advantage of selecting the longest match.
My own tokenize function evaluates the regular expressions in a logical order without the chance of getting the wrong regex because of different lengths.
It is also possible to split functions calls into smaller tokens.
E.g. I implemented the syntax
IFERROR( exp ; exp )
in the grammar like this:
IF ERROR '(' exp ';' exp ')'
This way you run only as many regexes as necessary because the first hit is always the right one.
Are there ambiguous regex combinations which would need the longest match? (even with lookaround?)
Specified order would seem to be the more useful approach to me. In PHP, for example, reserved words (if
, class
) would otherwise be interpreted as function or constant names if the lexer didn't respect the order, I think. The regexes for both would match the same code.
In which "logical order" do you test for variable names and keywords? Variable names first will cause "class", "if", "while" etc. to be counted as variable names. Keywords first causes variable names that happen to start with something that is also a keyword to be counted as the keyword, followed by a variable name. So then you'd have to bake parts of the syntax into the regexps (like "if" is only a keyword if it is followed by a space... or a tab or linefeed or parenthesis... you'll never see the end of these cases.)
The issue of “Keywords first causes variable names that happen to start with something that is also a keyword to be counted as the keyword” can be avoided by only matching on word boundaries with \b
.
But, yes, come to think of it, simply matching the longest token would be simpler.
You'd still need to handle the case of keywords, though. if
would be matched as both an identifier and a keyword. Would the latter take priority if it was specified earlier?
Yes, because it chooses the FIRST longest match - because a regexp is chosen if its token is > than the previous one - not >= . That means it selects the first regexp that has a longest match - later regexps can only be chosen if they give you a longer match. So you only need to put keywords before identifiers. That is the commonly used approach at writing tokenizers (see lex, flex, jflex, etc)
Alright, that seems reasonable.
keywords
^((class|if|while)(?![_a-zA-Z])|(else(?=[^_a-zA-Z]|if[^_a-zA-Z])))
comment till the end of line
^(\/\/[\S \r\t\f ]*)
inline comment
^(\/\*[\S\s]*?\*\/)
variable identifier, operators, parentheses
^([$\.;()+\-\/*=<>!:;])
item name (variable name, constant name, function name)
^([_a-zA-Z]*)
I use a grammar with numbers, {variable name}
, functions( exp [; ... ] )
and mathematical operations so I won't need the complexity of the PHP language.
I suppose, that there are better and more effective ways of lexing a complex language like PHP.
Selecting the first longest match is the most useful way to go. It precludes a whole bunch of hard-to-debug issues.
I'll look into integrating something easy like this. Thanks!
combining the two, I have already made a relatively strong parser for a programming language :-)
I wrote a very very cheap lexer/tokenizer generator in BASH to accompany lime (took the manual one from your calc example as reference). Maybe you want to put it in? If you preferred it to be written in php - it should be translatable easily - it mostly consists of echos. The main advantage is that it selects first longest matches, not first matches.
pflex.txt calc.txt